
Statistics and Polling

This is a bit of a change of pace, but I got some inquiries about this and thought I would offer my own two cents on something that often confuses people. My qualifications for this are two-fold:

  1. In my past life I was a professor who taught classes in Statistics;
  2. I have worked for a political consulting company that among other things performed polling for clients.

So you can use this in deciding if you want to pay any attention to what I have to say on the subject. 🙂

To get started, the basic question of epistemology: how do we know what we say we know? In the case of statistics, the basic mathematics began to be developed as a way of analyzing gambling. When you play poker, and a hand with three of a kind beats a hand with two pair, that is because two pair (which shows up 4.75% of the time) is more likely than three of a kind (which shows up 2.11% of the time). But after its start in gambling, statistics took a big step during the Napoleonic wars, when for the first time large armies met and the casualties mounted up. Some doctors realized that gathering evidence about wounds and their treatment would lead them to select the best treatments. But the key factor is that this is all based on probability. And the best way to think about probability is to think about what would happen if you did the same thing over and over. You might well get a range of outcomes, but some outcomes would show up more often. And this is the first thing that throws a lot of people, because they often have this sense that if something is unlikely, it won’t happen at all. And that is simply untrue. Unlikely things will happen, just not as often. As a joke has it, if you are one in a million, there are 1,500 people in China exactly like you. But the heritage of gambling persists in the technique called Monte Carlo simulation, which runs an experiment many, many times, often via a computer algorithm that generates random data, to test theories. John von Neumann understood the significance of this approach and programmed one of the first computers, ENIAC, to carry out Monte Carlo simulations.
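If you want to see what a Monte Carlo simulation looks like in practice, here is a small Python sketch of my own (an illustration, not part of the original argument): it deals a few hundred thousand random five-card hands and counts how often two pair and three of a kind turn up.

    # Monte Carlo sketch: estimate poker hand frequencies by dealing many random hands.
    import random
    from collections import Counter

    RANKS = list(range(13)) * 4          # a 52-card deck; suits do not matter for pairs
    N_HANDS = 200_000                    # number of simulated five-card deals

    def classify(hand):
        """Label a hand by its rank pattern; everything else is 'other'."""
        counts = sorted(Counter(hand).values(), reverse=True)
        if counts[:2] == [2, 2]:
            return "two pair"
        if counts[:2] == [3, 1]:
            return "three of a kind"
        return "other"

    tally = Counter(classify(random.sample(RANKS, 5)) for _ in range(N_HANDS))

    for name in ("two pair", "three of a kind"):
        print(f"{name}: {100 * tally[name] / N_HANDS:.2f}%")
    # With enough deals the estimates land near the exact 4.75% and 2.11% figures above.

Run it a few times and the estimates will wobble slightly around the exact values, which is exactly the point: each run is just a large pile of random experiments.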

The next key concept is the Law of Large Numbers, which in layman’s terms says that if you repeat an experiment many times, the average result should come out close to the expected result. Note that it is the average we are talking about here. Any particular experiment could give weird results that are nothing like the expected result, and that is to be expected in a distribution of results. But when you average across many experiments, the occasional high ones are offset by the occasional low ones, and the average result is pretty good. But to get this you need to do it many, many times. The more times you repeat the experiment, the closer your average should come to the expected value.
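As a quick illustration (again my own sketch, with made-up numbers), here is the Law of Large Numbers at work on a simulated fair coin: the running share of heads wanders early on and settles toward 50% as the tosses pile up.

    # Law of Large Numbers sketch: track the running share of heads as tosses accumulate.
    import random

    random.seed(1)                               # fixed seed so the run is repeatable
    heads = 0
    for n in range(1, 100_001):
        heads += random.random() < 0.5           # one toss of a fair coin
        if n in (10, 100, 1_000, 10_000, 100_000):
            print(f"{n:>7} tosses: {heads / n:.4f} heads")
    # The early averages bounce around; by 100,000 tosses the share of heads
    # sits very close to the expected 0.5000.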

Our third key concept is Random Sampling. This says that every member of a population has an equal chance of being selected for a sample. And the population is whatever group you want to make a claim about. If you want to make a claim about left-handed Mormons, your sample should exclude any right-handed people or any Lutherans, but it should give every left-handed Mormon an equal chance of being selected. This is where a lot of problems can arise. For instance, many medical studies in the 20th century included all or mostly men, but the results were applied to all adults. This is now recognized as a big problem in medicine. When this happens we call the problem sampling bias.

So, with these basic concepts (and see, I did not use any math yet!) we can start to look at polling, and just how good it is or isn’t as the case may be. And it is often very good, but history does show some big blunders along the way.

The first thing to get out of the way is that sampling, done properly, works. This is a mathematical fact and has been proven many times over. You may have trouble believing that 1000 people are an accurate measure of what a million people, or even 100 million people, will do, but in fact it does work. When there are problems it is usually because someone made a mistake, such as drawing a sample that is not truly an unbiased sample from the population in question. This does happen and you need to be careful about this in examining polling results. In the earlier part of the twentieth century there were some polls done via telephone surveys, but because telephones were not universally available at that time these polls overstated the views of more affluent people who were more likely to have phones. By the latter part of the century, however, telephone surveys were perfectly valid because almost everyone had a phone (and the few who didn’t were not likely to be voters anyway). But now we have a different problem, in that many people (myself included) have gone to using mobile phones exclusively, and the sampling methods in many cases relied solely on landline telephones. Polling outfits are beginning to adjust for this, so it should become less of a problem. But you need to watch out for ways pollsters will limit the sample. A big issue is whether you should include all registered voters (in the U.S. you need to be registered before you can vote; I am not familiar with how other countries handle this), or whether you want to limit it to “likely voters”. Deciding who is a “likely voter” is a place where some serious bias can creep in, since it is purely a judgment call by the pollster.
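To make the landline problem concrete, here is a toy sketch with entirely made-up numbers: a population in which cell-only voters feel differently from landline households, so a sample drawn only from landlines misses the mark while a properly random sample does not.

    # Sampling-frame bias sketch (all numbers invented for illustration).
    import random

    random.seed(2)
    # 100,000 voters: 40% are cell-only and support a measure at 60%;
    # the other 60% have landlines and support it at 45% (true support is about 51%).
    population = ([("cell", random.random() < 0.60) for _ in range(40_000)] +
                  [("landline", random.random() < 0.45) for _ in range(60_000)])

    def support(sample):
        return sum(favors for _, favors in sample) / len(sample)

    full_frame_sample = random.sample(population, 1000)
    landline_frame = [v for v in population if v[0] == "landline"]
    landline_sample = random.sample(landline_frame, 1000)

    print(f"true support:          {support(population):.1%}")
    print(f"full-frame sample:     {support(full_frame_sample):.1%}")
    print(f"landline-only sample:  {support(landline_sample):.1%}")
    # The full-frame sample lands near 51%; the landline-only sample comes in
    # around 45%, because it never reaches the cell-only group at all.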

So how do we know that samples work? We have two strong pieces of evidence. First, we know from Monte Carlo simulations how well samples compare to the underlying populations in controlled experiments. You create a population with known parameters, pull a bunch of samples, and see how well they match up to the known population. Second, we have the results of many surveys which we can compare to what actually happens when an election (for instance) is held. Both of these give us confidence that we understand the fundamental mathematics involved.
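Here is roughly what that first kind of check looks like, as a Python sketch with invented numbers: build a population of a million in which exactly 52% hold some view, draw a couple of thousand samples of 1,000, and see how far the sample estimates stray from the truth.

    # Controlled-experiment sketch: how well do samples of 1,000 track a known population?
    import random

    random.seed(3)
    # A population of one million "voters", exactly 52% of whom hold some view.
    population = [1] * 520_000 + [0] * 480_000
    TRUE_SHARE = 0.52
    SAMPLE_SIZE, N_SAMPLES = 1000, 2000

    estimates = [sum(random.sample(population, SAMPLE_SIZE)) / SAMPLE_SIZE
                 for _ in range(N_SAMPLES)]

    within_3_points = sum(abs(e - TRUE_SHARE) <= 0.03 for e in estimates) / N_SAMPLES
    print(f"lowest estimate:  {min(estimates):.3f}")
    print(f"highest estimate: {max(estimates):.3f}")
    print(f"within 3 points of the true 52%: {within_3_points:.1%}")
    # Nearly every sample of 1,000 lands within about three points of the truth,
    # even though each sample sees only 0.1% of the population.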

The next concept to understand is the Confidence Interval. This comes from the fact that even an unbiased sample will not match the population exactly. To see what I mean, consider what happens if you toss a fair (unbiased) coin. If it is a truly fair coin, you should get heads 50% of the time, on average, and tails 50% of the time. But the key here is “on average”. If you tossed this coin 100 times, would you always get exactly 50 heads and 50 tails? Of course not. You might get 48 heads and 52 tails the first time, 53 heads and 47 tails the second time, etc. If you did this a whole bunch of times and averaged your results, you would get ever closer to that 50/50 split, but probably not hit it exactly. And what this means is that your results will be close to what is in the population most of the time, but terms like “close” and “most of the time” are very imprecise. How close, and how often, really should be specified more precisely. And we can do that with the Confidence Interval. This starts with the “how often” question, and the standard usually used is 95% of the time. This is called a 95% confidence interval, but sometimes the complement is used and it gets referred to as “accurate to the .05 level”. These are essentially the same thing for our purposes. And if you are a real statistician, please remember that this podcast is not intended to be a graduate-level statistics course, but rather a guide for the intelligent lay person who wants to understand the subject. The 95% level of confidence is kind of arbitrary, and in some scientific applications it can be raised or lowered, but in polling you can think of this as the “best practice” industry standard.
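To put some numbers on the coin example (again, just my own illustration), this sketch tosses a fair coin 100 times, repeats that whole experiment 10,000 times, and reports where the middle 95% of the head counts fall.

    # Confidence-interval sketch: where do 95% of 100-toss experiments end up?
    import random

    random.seed(4)
    N_EXPERIMENTS, TOSSES = 10_000, 100

    head_counts = sorted(sum(random.random() < 0.5 for _ in range(TOSSES))
                         for _ in range(N_EXPERIMENTS))

    low = head_counts[int(0.025 * N_EXPERIMENTS)]
    high = head_counts[int(0.975 * N_EXPERIMENTS)]
    print(f"95% of runs produced between {low} and {high} heads out of {TOSSES}")
    # Typically roughly 40 to 60 heads: individual runs wobble around the 50/50
    # ideal, and the confidence interval just puts numbers on that wobble.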

The other part, the “how close” question, is not at all arbitrary. It is formally called the Margin of Error, and once you have chosen the level of confidence, it is a pretty straightforward function of the sample size. In other words, if you toss a coin ten times, getting six heads and four tails is very likely. But if you toss it 100 times, getting 60 heads and 40 tails is much less likely. So the bigger the sample size, the more closely the sample should match the population. You might think that pollsters would therefore use very large sample sizes to get better accuracy, but you run into a problem. Sampling has a linear cost. If you double the sample size, you double the cost of the survey. If that resulted in double the accuracy it might be worth it, but in fact it won’t: doubling the sample size only shrinks the margin of error by a factor of about 1.4 (the square root of two), and is that worth spending twice the money? Not really. So you are looking for a sweet spot where the cost of the survey is not too much, but the accuracy is acceptable.
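For the curious, the textbook margin of error for a simple random sample at 95% confidence is 1.96 × √(p(1−p)/n), taking the worst case p = 0.5. A quick sketch of my own makes the diminishing returns visible:

    # Margin of error for a proportion at 95% confidence (simple random sample).
    import math

    def margin_of_error(n, p=0.5, z=1.96):
        return z * math.sqrt(p * (1 - p) / n)

    for n in (500, 1000, 2000, 4000):
        print(f"n = {n:>4}: MOE = ±{100 * margin_of_error(n):.1f} points")
    # n = 500 gives about ±4.4 points, n = 1000 about ±3.1, n = 2000 about ±2.2,
    # and n = 4000 about ±1.5: each doubling of cost buys only a factor of
    # roughly 1.4 in precision.

Real polls often report a slightly larger margin than this simple formula gives, because weighting adjustments add some extra uncertainty.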

Any reputable poll should make available some basic information about the survey. The facts that should be reported include:

  • When the poll was taken. Timing can mean a lot. If one candidate was caught having sex with a live man or a dead woman, as the joke has it, it matters a lot whether the poll was taken before or after that fact came out in the news.
  • How big a sample was it?
  • What kinds of people were sampled? Was there an attempt to limit it to likely voters?
  • What is the margin of error?
  • What is the confidence interval?

Now a reputable pollster will make these available, but that does not mean they will be reported in a newspaper or television story about the poll. Or they may be buried in a footnote. But these factors all affect how you should interpret the poll.

Example: http://www.politico.com/story/2013/12/polls-obamacare-100967.html

In this brief news report we don’t get everything, but we get a lot of it. This story is about two polls just done (as I write this) on people’s opinions regarding “Obamacare”.

The Pew survey of 2,001 adults was conducted Dec. 3 to Dec. 8 and has a margin of error of plus-or-minus 2.6 percentage points.

The Quinnipiac survey of 2,692 voters was conducted from Dec. 3 to Dec. 9 and has a margin of error of plus-or-minus 1.9 percentage points.

What I would note is that the first poll says it was a poll of “adults”, while the second poll was one of “voters”. That makes me wonder about any differences in the results (and the polls did indeed have different results). They were sampling different populations, so the results are not directly comparable. If the purpose of the survey is to look at how people in general feel, a survey of adults would probably make sense. If the purpose is to forecast how this will affect candidates in the 2014 elections, the second poll may be more relevant.

Second, note that the survey with the larger sample size had a slightly smaller margin of error. That is what we should expect to see.

Third, note that the second poll was “in the field” as we say for one more day than the first poll. Does that matter? It might if some very significant news event happened on the 9th of December that might affect the results.

What I don’t see in this report is any explanation of how the people were contacted, but when I went to the Quinnipiac web site, here is what I found:

From December 3 – 9, Quinnipiac University surveyed 2,692 registered voters nationwide with a margin of error of +/- 1.9 percentage points. Live interviewers call land lines and cell phones.

So if you dig you can get all of this. And note that they specifically mentioned calling cellphones as part of their sample.

One final thing to point out is that if you accept a 95% confidence level, that means that by definition approximately one out of every 20 polls will be, to use the technical term, “Batcrap crazy”. That is why you should never assign too much significance to any one poll, particularly if it gives you results different from all other polls. You are probably looking at that one-out-of-twenty poll that should be ignored. There is a human tendency to seize on it if it tells you what you want to hear, but that is usually a mistake. It is when a number of pollsters do a number of polls and get roughly the same result that you should start to believe it. That does not mean they will agree exactly; there is still the usual margin of error. That is why a poll that shows one candidate getting 51% of the vote and her opponent getting 49% will be described as a “dead heat”. With the margin of error, the leading candidate could be getting anywhere between 49% and 53%, assuming the poll is accurate and unbiased.
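To see why 51% to 49% counts as a dead heat, here is a tiny check with hypothetical numbers, assuming a ±2 point margin of error on each figure:

    # Dead-heat arithmetic: do the two candidates' plausible ranges overlap?
    candidate_a, candidate_b, moe = 0.51, 0.49, 0.02

    range_a = (candidate_a - moe, candidate_a + moe)   # 49% to 53%
    range_b = (candidate_b - moe, candidate_b + moe)   # 47% to 51%

    overlap = range_a[0] <= range_b[1]
    print(f"A: {range_a[0]:.0%}-{range_a[1]:.0%}, "
          f"B: {range_b[0]:.0%}-{range_b[1]:.0%}, overlap: {overlap}")
    # Because the two ranges overlap, the poll cannot say who is really ahead.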

Listen to the audio version of this post on Hacker Public Radio!

Data-Driven Objectivity

I recently had an exchange online with someone I tend to like, and it was about self-driving cars. My correspondent said that he would never, under any circumstances, get into a self-driven car. In fact, he seemed to think that self-driven cars would lead to carnage on the roads. My own opinion is that human-driven cars have already led to very demonstrable carnage, and that in all likelihood computers would do a better job. As you might imagine, this impressed my correspondent not in the least. When I observed that his objections were irrational, he said I should choose my words more carefully, but that he would overlook the insult this time.

Possibly that is a bad way to phrase my objection, but it is also, in the strict sense of the term, precisely the proper word to use. What I was saying is that his view had no basis in data or facts, and was purely an emotional response. We all have those, and I’m not claiming any superiority on that ground. But when the Enlightenment philosophers talked of reason it was in contrast to religion and superstition, and it really did mean thinking in terms of data, facts, and logic. It is my own view that this type of thinking bears the major responsibility for the progress the human race has made in science and technology over the last few centuries. And it is also my view that this type of thinking is being severely attacked these days.

The hallmark of rational thinking is that it starts from a basis in observed facts, but always keeps a willingness to revise the conclusion if new facts come to light. If that seems reasonable to you, good. Now think of how the worst insult you can pin on a politician is flip-flopping. The great 20th century economist John Maynard Keynes was accused of this and responded “When my information changes, I alter my conclusions. What do you do, sir?” That is how a rational person thinks. There are people who attack science for being of no use because occasionally scientists change their minds about what is going on. But that is an uninformed (to be most charitable about it) view. Science is a process of deriving the best possible explanations for the data we have, while always being ready to discard them in favor of other explanations when new data comes in. That may bother people who insist on iron-clad certainty in everything, but in fact it does work. If it didn’t work you wouldn’t be reading this. (Did you ever notice the irony of television commentators attacking scientists? You might think the plans for television were found in the Bible/Koran/etc.)

One of the biggest obstacles to clear, rational thinking is what is termed confirmation bias. This is the tendency of people to see the evidence that supports their view, while simultaneously ignoring any evidence that does not support it. This is why the only studies that are given real credibility are what we call “double-blind” studies. An example is a drug trial. We know there is a tendency for people to get better because they believe they are being given a new drug. In addition, we know that just being shown attention helps. So we take great care (in a good study) to divide the sample into two groups, with one group getting the great new drug, and the other group getting something that looks exactly like it but has no active ingredient. It may be a sugar pill, or an injection of saline solution, just so long as the patient cannot tell which group they are in. But the bias can also be on the experimenter side. If a team of doctors has devoted years to developing a new drug, they will naturally have some investment in wanting it to succeed. And that can lead to seeing results that are not there, or even to “suggesting” in subconscious ways to the patient that they are or are not getting the drug. So none of those doctors can be a part of it either. Clinicians are recruited who only know that they have two groups, A & B, and have no idea which is which. This is the classic double-blind study: neither the patient nor the experimenter has any idea who is getting the drug and who isn’t.

The reason we need to be this careful is that people are, by and large, irrational. People will be afraid of flying in an airplane but think nothing of getting into a car and driving, even though every bit of data says that driving is far more dangerous. People are far more afraid of sharks than they are of the food they eat, though more people die every year from food poisoning than are ever killed by sharks. And we all suffer from a massive case of the Lake Wobegon effect, in that we all tend to think we are above average, even though roughly half of us must fall below the median on any given characteristic. We just are not good judges of our own capabilities in most cases. In fact, the Dunning-Kruger effect suggests that we are frequently wrong in our self-assessments.

But the worst case is the person who is absolutely certain, no matter what he is certain of. Certainty is the great enemy of rationality. Years ago, Jacob Bronowski filmed a series called The Ascent of Man. In one scene, he stood in a puddle at Auschwitz and talked about people who had certainty, and said “I beg of you to consider the possibility that you may be wrong.” This is the hallmark of a rational person; this is the standard by which every scientist is judged. If you know anyone who can say “This is what I think, but I might be wrong,” you will have found the rarest kind of person, and you should cultivate their acquaintance. This type of wisdom is all too rare. And if you ever find a politician who says that, please vote for them, no matter what their party affiliation. They are worth infinitely more than a hundred of the kind who have never changed their minds about anything.