This is a bit of a change of pace, but I got some inquiries about this and thought I would offer my own two cents on something that often confuses people. My qualifications for this are two-fold:
- In my past life I was a professor who taught classes in Statistics;
- I have worked for a political consulting company that among other things performed polling for clients.
So you can use this in deciding if you want to pay any attention to what I have to say on the subject.
To get started, the basic question of epistemology: How do we know what we say we know? In the case of statistics, the basic mathematics began to be developed as a way of analyzing gambling. When you play poker, a hand with three of a kind beats a hand with two pair because two pair (which shows up 4.75% of the time) is more likely than three of a kind (which shows up 2.11% of the time). But after its start in gambling, statistics took a big step during the Napoleonic wars, when for the first time large armies met and the casualties mounted up. Some doctors realized that gathering evidence about wounds and their treatment would lead them to select the best treatments.
But the key factor is that this is all based on probability. And the best way to think about probability is to think about what would happen if you did the same thing over and over. You might well get a range of outcomes, but some outcomes would show up more often. And this is the first thing that throws a lot of people, because they often have this sense that if something is unlikely, it won’t happen at all. And that is simply untrue. Unlikely things will happen, just not as often. As a joke has it, if you are one in a million, there are 1,500 people in China exactly like you.
The heritage of gambling persists in the technique called Monte Carlo simulation, which runs an experiment many, many times, often via a computer algorithm that generates random data, to test theories. John von Neumann understood the significance of this approach, and programmed one of the first computers, ENIAC, to carry out Monte Carlo simulations.
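If you want to check the poker percentages quoted earlier, you can count the hands directly. This is just a sketch of the standard counting argument for five-card hands dealt from a 52-card deck; the variable names are my own.

```python
from math import comb

TOTAL_HANDS = comb(52, 5)  # number of distinct five-card hands

# Two pair: pick 2 ranks for the pairs, 2 of the 4 suits for each pair,
# then one of the 11 remaining ranks (in any of 4 suits) for the fifth card.
two_pair = comb(13, 2) * comb(4, 2) ** 2 * 11 * 4

# Three of a kind: pick the rank and 3 of its 4 suits, then 2 different
# ranks from the other 12, each in any of 4 suits.
three_of_a_kind = 13 * comb(4, 3) * comb(12, 2) * 4 ** 2

p_two_pair = two_pair / TOTAL_HANDS
p_three = three_of_a_kind / TOTAL_HANDS

print(f"two pair:        {p_two_pair:.2%}")
print(f"three of a kind: {p_three:.2%}")
```

The rarer hand wins, which is exactly the ordering the rules of poker use.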
The next key concept is called the Law of Large Numbers, which in layman’s terms says that if you repeat an experiment many times, the average of the results will get closer and closer to the expected result. Now this is the average we are talking about here. Any particular experiment could give weird results that are nothing like the expected result, and that is to be expected in a distribution of results. But when you average across all of the experiments, the occasional high ones are offset by the occasional low ones, and the average result is pretty good. But to get this you need to do it many, many times. The more times you repeat the experiment, the closer your average should be to the expected result.
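Here is a small sketch of the Law of Large Numbers using a fair six-sided die, whose expected result is 3.5. The die, the seed, and the roll counts are my own choices for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
EXPECTED = 3.5           # expected value of one roll of a fair die

def average_roll(n_rolls):
    """Average of n_rolls throws of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls

for n in (10, 100, 10_000, 200_000):
    avg = average_roll(n)
    print(f"{n:>7} rolls: average {avg:.3f}, off by {abs(avg - EXPECTED):.3f}")
```

The short runs can wander a fair distance from 3.5, while the long runs should sit very close to it.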
Our third key concept is Random Sampling. This means that every member of a population has an equal chance of being selected for a sample. And the population is whatever group you want to make a claim about. If you want to make a claim about left-handed Mormons, your sample should exclude any right-handed people and any Lutherans, but it should afford an equal chance of selection to all left-handed Mormons. This is where a lot of problems can arise. For instance, many medical studies in the 20th century included all or mostly men, but the results were applied to all adults. This is now recognized as a big problem in medicine. When this happens we call the problem sampling bias.
So, with these basic concepts (and see, I did not use any math yet!) we can start to look at polling, and just how good it is or isn’t as the case may be. And it is often very good, but history does show some big blunders along the way.
The first thing to get out of the way is that sampling, done properly, works. This is a mathematical fact and has been proven many times over. You may have trouble believing that 1,000 people are an accurate measure of what a million people, or even 100 million people, will do, but in fact it does work. When there are problems it is usually because someone made a mistake, such as drawing a sample that is not truly an unbiased sample from the population in question. This does happen and you need to be careful about this in examining polling results. In the earlier part of the twentieth century there were some polls done via telephone surveys, but because telephones were not universally available at that time these polls overstated the views of more affluent people who were more likely to have phones. By the latter part of the century, however, telephone surveys were perfectly valid because almost everyone had a phone (and the few who didn’t were not likely to be voters anyway). But now we have a different problem, in that many people (myself included) have gone to using mobile phones exclusively, and the sampling methods in many cases relied solely on landline telephones. Polling outfits are beginning to adjust for this, so it should not be a problem.
But you need to watch out for ways pollsters will limit the sample. A big issue is whether you should include all registered voters (in the U.S., you need to be registered before you can vote; I am not familiar with how other countries handle this), or if you want to limit it to “likely voters”. Deciding who is a “likely voter” is a place where some serious bias can creep in, since it is purely a judgement call by the pollster.
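The early telephone-poll problem can be sketched with a made-up population. Every number here (the share of people with phones, and the support levels in each group) is invented purely for illustration and is not taken from any real poll.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Made-up numbers, for illustration only.
PHONE_RATE = 0.40             # share of the population with a telephone
SUPPORT_WITH_PHONE = 0.60     # support for candidate A among phone owners
SUPPORT_WITHOUT_PHONE = 0.40  # support for candidate A among everyone else

# Build the population as (has_phone, supports_a) pairs.
population = []
for _ in range(100_000):
    has_phone = rng.random() < PHONE_RATE
    p_support = SUPPORT_WITH_PHONE if has_phone else SUPPORT_WITHOUT_PHONE
    population.append((has_phone, rng.random() < p_support))

def support(people):
    """Fraction of the given people who support candidate A."""
    return sum(supports for _, supports in people) / len(people)

true_support = support(population)

# Unbiased: every member of the population has an equal chance of selection.
random_sample = rng.sample(population, 1000)

# Biased: only people with telephones can be reached.
phone_owners = [person for person in population if person[0]]
phone_sample = rng.sample(phone_owners, 1000)

print(f"whole population:  {true_support:.1%}")
print(f"random sample:     {support(random_sample):.1%}")
print(f"phone-only sample: {support(phone_sample):.1%}")
```

Both samples have 1,000 people, so the sample size is not the issue. The phone-only sample is off because it was drawn from the wrong population.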
So how do we know that samples work? We have two strong pieces of evidence. First, we know from Monte Carlo simulations how well samples compare to the underlying populations in controlled experiments. You create a population with known parameters, pull a bunch of samples, and see how well they match up to the known population. Second, we have the results of many surveys which we can compare to what actually happens when an election (for instance) is held. Both of these give us confidence that we understand the fundamental mathematics involved.
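The first kind of evidence can be sketched in a few lines. I assume a very large population in which exactly 52% hold some opinion (a number I made up), run 2,000 simulated polls of 1,000 people each, and look at how the poll results scatter around the known truth.

```python
import random
import statistics

rng = random.Random(2013)  # fixed seed so the run is repeatable

TRUE_SHARE = 0.52   # assumed "known" share of the population holding the opinion
SAMPLE_SIZE = 1000  # people per simulated poll
N_POLLS = 2000      # number of simulated polls

def run_poll():
    """Ask SAMPLE_SIZE randomly chosen people and return the share saying yes."""
    yes = sum(rng.random() < TRUE_SHARE for _ in range(SAMPLE_SIZE))
    return yes / SAMPLE_SIZE

results = [run_poll() for _ in range(N_POLLS)]
mean_result = statistics.mean(results)
spread = statistics.stdev(results)

print(f"true share:          {TRUE_SHARE:.1%}")
print(f"average poll result: {mean_result:.1%}")
print(f"typical miss (standard deviation): {spread * 100:.1f} points")
print(f"lowest and highest poll: {min(results):.1%}, {max(results):.1%}")
```

Individual polls miss by a point or two, but they cluster tightly around the true value, and none of them needed more than 1,000 respondents.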
The next concept to understand is Confidence Interval. This comes from the fact that even an unbiased sample will not match the population exactly. To see what I mean, consider what happens if you toss a fair (unbiased) coin. If it is a truly fair coin, you should get heads 50% of the time, on average, and tails 50% of the time. But the key here is “on average”. If you tossed this coin 100 times, would you always get exactly 50 heads and 50 tails? Of course not. You might get 48 heads and 52 tails the first time, 53 heads and 47 tails the second time, etc. If you did this a whole bunch of times and averaged your results, you would get ever closer to that 50/50 split, but probably not hit it exactly. And what this means is that your results will be close to what is in the population most of the time, but terms like “close” and “most of the time” are very imprecise. How close, and how often, really should be specified more precisely.
And we can do that with the Confidence Interval. This starts with the “how often” question, and the standard usually used is 95% of the time. This is called a 95% confidence interval, but sometimes the complement is used and it gets referred to as “accurate to the .05 level”. These are essentially the same thing for our purposes. And if you are a real statistician, please remember that this podcast is not intended to be a graduate-level statistics course, but rather a guide for the intelligent lay person who wants to understand the subject. The 95% level of confidence is kind of arbitrary, and in some scientific applications this can be raised or lowered, but in polling you can think of this as the “best practice” industry standard.
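To put numbers on “how close” and “how often” for the coin example, you can work out the exact chances for 100 tosses of a fair coin. This is a sketch using the ordinary binomial count; the function name is my own.

```python
from math import comb

def prob_heads_between(lo, hi, n=100):
    """Exact probability that n fair coin tosses give between lo and hi heads."""
    return sum(comb(n, k) for k in range(lo, hi + 1)) / 2 ** n

p_exactly_50 = prob_heads_between(50, 50)
p_40_to_60 = prob_heads_between(40, 60)

print(f"exactly 50 heads:  {p_exactly_50:.1%}")
print(f"40 to 60 heads:    {p_40_to_60:.1%}")
```

A perfect 50/50 split turns up only about 8% of the time, but a result within ten heads of 50 turns up roughly 96% of the time. That pairing of a range with a "how often" is the whole idea.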
The other part, the “how close” question, is not at all arbitrary. It is formally called the Margin of Error, and once you have chosen the level of confidence, it is a pretty straightforward function of the sample size. In other words, if you toss a coin ten times, getting six heads and four tails is quite likely. But if you toss it 100 times, getting 60 heads and 40 tails is much less likely. So the bigger the sample size, the closer it should match the population. You might think that pollsters would therefore use very large sample sizes to get better accuracy, but you run into a problem. Sampling has a roughly linear cost. If you double the sample size, you double the cost of the survey. If that resulted in double the accuracy it might be worth it, but in fact it won’t. The margin of error shrinks only with the square root of the sample size, so doubling the sample cuts the margin of error by only about 30%. Going from 1,000 to 2,000 respondents, for instance, trims it from about 3.1 percentage points to about 2.2. Is that worth spending twice the money? Usually not. So you are looking for a sweet spot where the cost of the survey is not too much, but the accuracy is acceptable.
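The relationship between sample size and margin of error can be sketched with the usual textbook approximation. I am assuming a simple random sample, a 95% level of confidence (a multiplier of 1.96), and the worst case of a 50/50 split; real pollsters may adjust this for their weighting.

```python
from math import sqrt

Z_95 = 1.96  # standard normal multiplier for a 95% level of confidence

def margin_of_error(n, p=0.5):
    """Approximate 95% margin of error (as a proportion) for a simple random sample of n."""
    return Z_95 * sqrt(p * (1 - p) / n)

for n in (250, 500, 1000, 2000, 4000):
    print(f"sample of {n:>5}: plus-or-minus {margin_of_error(n) * 100:.1f} points")
```

Each doubling of the sample buys a smaller improvement in percentage points than the one before, and it takes four times the sample to cut the margin of error in half.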
Any reputable poll should make available some basic information about the survey. The facts that should be reported include:
- When the poll was taken. Timing can mean a lot. If one candidate was caught having sex with a live man or a dead woman, as the joke has it, it matters a lot whether the poll was taken before or after that fact came out in the news.
- How big a sample was it?
- What kinds of people were sampled? Was there an attempt to limit it to likely voters?
- What is the margin of error?
- What is the confidence interval?
Now a reputable pollster will make these available, but that does not mean they will be reported in a newspaper or television story about the poll. Or they may be buried in a footnote. But these factors all affect how you should interpret the poll.
In this brief news report we don’t get everything, but we do get a lot of it. This story is about two polls just done (as I write this) on people’s opinions regarding “Obamacare”.
The Pew survey of 2,001 adults was conducted Dec. 3 to Dec. 8 and has a margin of error of plus-or-minus 2.6 percentage points.
The Quinnipiac survey of 2,692 voters was conducted from Dec. 3 to Dec. 9 and has a margin of error of plus-or-minus 1.9 percentage points.
What I would note is that the first poll says it was a poll of “adults”, while the second poll was one of “voters”. That makes me wonder about any differences in the results (and the polls did indeed have different results). They were sampling different populations, so the results are not directly comparable. If the purpose of the survey is to look at how people in general feel, a survey of adults would probably make sense. If the purpose was to forecast how this will affect candidates in the 2014 elections, the second poll may be more relevant.
Second, note that the survey with the larger sample size had a slightly smaller margin of error. That is what we should expect to see.
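As a rough check, you can compare the reported margins with the textbook approximation for a simple random sample. The formula and the 1.96 multiplier are my assumptions here; neither pollster's actual calculation is described in the news report.

```python
from math import sqrt

def simple_margin(n, z=1.96, p=0.5):
    """Textbook 95% margin of error, in percentage points, for a simple random sample."""
    return 100 * z * sqrt(p * (1 - p) / n)

# (sample size, margin of error reported in the news story)
polls = {
    "Pew": (2001, 2.6),
    "Quinnipiac": (2692, 1.9),
}

for name, (n, reported) in polls.items():
    print(f"{name}: reported {reported}, textbook {simple_margin(n):.1f}")
```

The Quinnipiac figure lines up with the textbook value. The Pew figure is larger than the textbook value for its sample size, which would be consistent with the margin being widened to allow for weighting of the sample, though that is my guess and not something stated in the report.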
Third, note that the second poll was “in the field” as we say for one more day than the first poll. Does that matter? It might if some very significant news event happened on the 9th of December that might affect the results.
What I don’t see in this report is any explanation of how the people were contacted, but when I went to the Quinnipiac web site, here is what I found:
From December 3 – 9, Quinnipiac University surveyed 2,692 registered voters nationwide with a margin of error of +/- 1.9 percentage points. Live interviewers call land lines and cell phones.
So if you dig you can get all of this. And note that they specifically mentioned calling cellphones as part of their sample.
One final thing to point out is that if you accept a 95% confidence level, that means that by definition approximately one out of every 20 polls will miss by more than its margin of error and be, to use the technical term, “Batcrap crazy”. That is why you should never assign too much significance to any one poll, particularly if it gives you results different from all other polls. You are probably looking at that one out of twenty polls that should be ignored. There is a human tendency to seize on it if it tells you what you want to hear, but that is usually a mistake. It is when a number of pollsters do a number of polls and get roughly the same result that you should start to believe it. That does not mean they will agree exactly; there is still the usual margin of error. That is why a poll that shows one candidate getting 51% of the vote and her opponent getting 49% will be described as a “dead heat”. With a margin of error of, say, two points, the candidate could be getting anywhere between 49% and 53%, assuming the poll is accurate and unbiased.
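The "one in twenty" figure can be checked with a little arithmetic. As a sketch, I assume a race that is truly tied at 50/50 and perfectly unbiased polls of 1,000 people each, and then compute exactly how often such a poll lands outside its own textbook margin of error.

```python
from math import comb, sqrt

N = 1000                          # people per poll (my assumption)
MARGIN = 1.96 * sqrt(0.25 / N)    # textbook 95% margin of error, about 0.031

# Probability that the poll's share for a candidate lands within MARGIN of the
# true 50%, summed over every possible count of supporters in the sample.
p_inside = sum(
    comb(N, k) for k in range(N + 1) if abs(k / N - 0.5) <= MARGIN
) / 2 ** N
p_outside = 1 - p_inside

print(f"margin of error: {MARGIN * 100:.1f} points")
print(f"polls outside the margin: {p_outside:.1%}")
print(f"that is about 1 poll in {1 / p_outside:.0f}")
```

So even when everything is done right, a stray poll well outside the pack is expected now and then, which is the reason to look at many polls and not just one.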