You have probably heard the old saying that there are lies, damned lies, and statistics.
There are several reasons why statistics are often misinterpreted. One of the biggest is the confusion between the two concepts of correlation and causation.
This confusion is made not only by laypeople but also by members of the media and scientists.
Learn more about correlation and causation and why one doesn’t necessarily imply the other on this episode of Everything Everywhere Daily.
I’ve had several episodes where I dealt with logical and statistical fallacies.
Today I want to focus on what is perhaps the most frequent logical and statistical error that people succumb to: the confusion of correlation and causation, or as it is known in Latin, cum hoc ergo propter hoc, which translates to “with this, therefore because of this.”
Let’s say you are doing a study and you are looking at two different things. These things can quite literally be anything.
Once you’ve collected the data, you then plot the data on a chart to see how they correlate.
Roughly speaking, there are three ways two different variables can correlate.
There can be a positive correlation. When one variable goes up, the other goes up, and when one goes down, the other goes down.
There can be a negative correlation. When one variable goes up, the other goes down, and vice versa.
The third possibility is that the two variables don’t correlate whatsoever. When one variable goes up, the other might go up or might go down.
I’m oversimplifying this because there are gradations within these. Two variables could be strongly correlated or weakly correlated.
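For anyone who wants to see what "strongly" or "weakly" correlated means as a number, here is a minimal sketch in Python of the standard Pearson correlation coefficient, which runs from -1 (perfect negative correlation) through 0 (no linear correlation) to +1 (perfect positive correlation). The function name and all of the numbers below are made up purely for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How much the two variables move together around their averages.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Hypothetical data, invented for this example.
hours = [1, 2, 3, 4, 5]
going_up = [2, 4, 5, 4, 6]      # tends to rise as hours rise
going_down = [9, 7, 6, 4, 1]    # tends to fall as hours rise
no_pattern = [3, 1, 4, 1, 3]    # no linear relationship with hours

print("positive:", round(pearson(hours, going_up), 2))
print("negative:", round(pearson(hours, going_down), 2))
print("none:    ", round(pearson(hours, no_pattern), 2))
```

The closer the result is to +1 or -1, the stronger the correlation. Values near 0 mean there is little or no linear relationship.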
Let me give some examples.
What do you think the relationship is between the number of doctorates awarded in civil engineering and the per capita consumption of mozzarella cheese?
Or how about the divorce rate in the state of Maine and the per capita consumption of margarine?
Or what about the annual number of people who drowned by falling into a swimming pool and the annual number of movies released starring Nicolas Cage?
All three of these things you might be thinking have absolutely nothing to do with each other, and you’d be right. They don’t really have anything to do with each other.
Yet, shockingly, all three of these examples have very strong correlations.
Nicolas Cage movies and swimming pool drownings showed a 67% positive correlation between 1999 and 2009.
Mozzarella consumption and civil engineering doctorates showed a 95% positive correlation from 2000 to 2009.
Finally, divorce rates in Maine and margarine consumption showed a 99.2% positive correlation between 2000 and 2009.
When you hear or see something with such a strong correlation, you might start to wonder why. What is the causal link between these things?
In the examples I just gave, there really isn’t one. It was all just chance. They were discovered by data mining. If you just keep comparing data sets of everything, eventually you’ll find two sets of variables that correlate.
These are spurious examples.
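To see how easy it is to dig up a spurious correlation, here is a rough sketch in Python. It assumes 200 made-up data series of pure random noise, each with 10 data points (think ten years of annual figures), and then searches every possible pair for the strongest positive correlation. The function names, the seed, and the sizes are all arbitrary choices for illustration, not anyone's actual method.

```python
import random
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def most_correlated_pair(num_series=200, length=10, seed=0):
    """Generate unrelated random series and return the best-correlated pair."""
    rng = random.Random(seed)
    # Every series is independent noise, so no pair is truly related.
    series = [[rng.gauss(0, 1) for _ in range(length)]
              for _ in range(num_series)]
    best = max(
        combinations(range(num_series), 2),
        key=lambda pair: pearson(series[pair[0]], series[pair[1]]),
    )
    return best, pearson(series[best[0]], series[best[1]])

pair, r = most_correlated_pair()
print(f"Series {pair[0]} and {pair[1]} correlate at {r:.2f}")
```

With nearly 20,000 pairs to choose from, the winning pair will almost certainly show a very strong correlation, even though every series was generated independently of all the others.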
I don’t think anyone listening really believes that new Nicolas Cage movies are causing people to jump into swimming pools... but then again, he has made a lot of really bad movies.
I bring up these examples to prove the point that correlation doesn’t necessarily imply causation.
I have to mention this because the confusion comes into play from the fact that the reverse is true. If there is causation, there will be a correlation.
Let’s look at another trivial example that illustrates this point.
There is a positive correlation between the number of points a sports team scores (and it really doesn’t matter the sport) and the number of wins.
This is because there is direct causation between points and wins. The team that scores the most points will win a game. Over the course of a season, the team with more points will probably win more games.
This is not perfect, of course. A team could lose a bunch of very close games and then win one game in a blowout. However, this relationship is very strong the more games you look at.
The English Premier League season just ended, and looking at the end-of-year standings, the teams with the four highest goal differentials were the top four teams, and the teams with the six worst were the bottom six.
Things, however, can be, and usually are, a lot messier than these trivial examples.
In the 19th century, lung cancer was a relatively rare form of cancer. However, in the 20th century, cases of lung cancer exploded. Researchers soon found a very strong correlation between people who smoked cigarettes and people with lung cancer.
Very early on, many people rightly pointed out that correlation doesn’t imply causation. Just because people smoked doesn’t mean that it caused cancer. As late as 1960, a sizable share of doctors in the United States didn’t think that the case linking smoking and lung cancer had been firmly established.
While correlation doesn’t imply causation, if there were causation, there would be a correlation.
Eventually, by conducting experiments at the cellular level, researchers determined that smoking causes cancer, and no one really doubts this relationship anymore.
The case between smoking and cancer is pretty straightforward. But let’s add another variable into the mix.
To the best of my knowledge, no study on this has ever been done, but for the sake of argument, let’s assume that there is a positive correlation between people who carry lighters in their pocket and cancer.
Even though this study has never been done, I’d be willing to bet there is some positive correlation between these two things.
Why would there be a correlation between carrying a lighter and cancer? Do lighters cause cancer? Are people with cancer compelled to carry a lighter?
In this case, it would be an example of a confounding variable.
There is some other thing that is behind both of the variables. In this example, the confounding variable would be smoking. Smokers are more likely to get cancer, and smokers are more likely to carry a lighter. Hence, people with lighters would be more likely to have cancer.
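The lighter example can be sketched as a small simulation in Python. Every number here is invented for illustration: I'm assuming 20% of people smoke, that smokers carry a lighter 80% of the time versus 5% for non-smokers, and that cancer strikes 15% of smokers versus 1% of non-smokers. Crucially, in this made-up world the lighter itself has no effect on cancer at all.

```python
import random

def simulate(n=100_000, seed=1):
    """Return a list of (smoker, lighter, cancer) tuples for n made-up people."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        smoker = rng.random() < 0.20
        # Both outcomes depend only on smoking, never on each other.
        lighter = rng.random() < (0.80 if smoker else 0.05)
        cancer = rng.random() < (0.15 if smoker else 0.01)
        people.append((smoker, lighter, cancer))
    return people

def cancer_rate(group):
    """Fraction of a group of (smoker, lighter, cancer) tuples with cancer."""
    group = list(group)
    return sum(cancer for _, _, cancer in group) / len(group)

people = simulate()

# Looking at everyone: lighters appear to be linked to cancer.
with_lighter = cancer_rate(p for p in people if p[1])
without_lighter = cancer_rate(p for p in people if not p[1])
print(f"Everyone: {with_lighter:.1%} with lighter, {without_lighter:.1%} without")

# Looking only at smokers: the apparent link goes away.
smokers_with = cancer_rate(p for p in people if p[0] and p[1])
smokers_without = cancer_rate(p for p in people if p[0] and not p[1])
print(f"Smokers:  {smokers_with:.1%} with lighter, {smokers_without:.1%} without")
```

Under these assumed rates, lighter carriers should show a cancer rate of roughly 12% against under 2% for everyone else, yet once you compare smokers only with other smokers, the lighter makes essentially no difference. Holding the confounding variable fixed like this is one common way to check whether a correlation is hiding something else.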
The problem of confounding variables is a huge one in most research. This is especially a big problem in nutritional science. The vast majority of papers published on nutrition are what are known as epidemiological studies. These are observational studies, surveys really, which only produce correlations.
Yet, more often than not, these studies are reported as if there is some sort of causation, or at least it is heavily implied. I’m sure you’ve seen headlines that report “eating X causes cancer or heart disease.”
Well, maybe it does, but you can’t determine that via a correlation. The problem with nutrition surveys is that there are so many confounding variables.
How many different foods have you consumed over the last six months? Could you possibly list them all, including how much of each food you ate?
It is extremely difficult, if not nearly impossible, to isolate a single variable that would be causal with some health result. On top of that, eating certain foods might correlate with certain lifestyle choices, like exercising regularly.
Another interesting example of a possible confounding variable comes from studies of the effects of alcohol consumption on heart disease.
There have been several studies showing that people who drink moderately showed lower risks of heart disease than those who drank heavily or those who didn’t drink at all.
Why would some alcohol be better than no alcohol or a lot of alcohol? Why is there a Goldilocks amount of alcohol?
It could be that moderate alcohol consumption wasn’t the cause of anything. It was just something that people who were already healthy did.
Another problem with studies is what is known as p-hacking. As I mentioned before, there can be strong correlations and weak correlations.
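When you test a hypothesis for a published scientific paper, you calculate a probability value, which goes by the variable p. It is the probability of seeing a result at least as strong as yours if there were really no relationship at all, and it usually has to be under 0.05, or 5%, for the result to count as statistically significant.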
There is nothing special about a p-value of 0.05. It is the value everyone uses because it is the value everyone uses.
The problem with this is that you don’t have to test that many variables to get something statistically significant by chance.
Let’s assume you gather data on 7 different variables and compare each of the 7 variables against each of the others. That is 21 possible pairs, which means that, at a 5% threshold, the odds are that at least one of those pairs will yield a statistically significant result purely by chance.
P-hacking is a very fancy word to describe throwing stuff against a wall to see what sticks.
You can then write your research paper on the two variables that correlate, totally ignoring and not telling anyone about the 20 other combinations you tried that yielded no results.
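The arithmetic behind the 7-variable example can be sketched in a few lines of Python. This is a back-of-the-envelope estimate that assumes each of the comparisons is an independent test with a 5% chance of a false positive, which is a simplification since pairs that share a variable aren't truly independent. The function name is made up for this example.

```python
from math import comb

def chance_of_false_positive(num_variables, alpha=0.05):
    """Return (number of pairs, chance that at least one looks significant).

    Assumes every pairwise test is independent and that no real
    relationship exists between any of the variables.
    """
    num_pairs = comb(num_variables, 2)
    # Chance that all tests come up empty, subtracted from 1.
    p_at_least_one = 1 - (1 - alpha) ** num_pairs
    return num_pairs, p_at_least_one

pairs, p_any = chance_of_false_positive(7)
print(f"{pairs} pairs, {p_any:.0%} chance of at least one false positive")
```

Under those assumptions, 7 variables give 21 pairs and roughly a 66% chance that at least one pair clears the 0.05 bar by luck alone.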
One final thing to know about correlation and causation is that oftentimes the causation can be reversed. Even if there is causation, it is often hard to figure out in what direction the causation flows.
One popular statistic which is often floated is that people who attend college will earn more money in their lifetime than people who do not.
The correlation is true. However, everyone just assumes that attending college is what is responsible for the increase in income. Believe it or not, there is shockingly little evidence to actually support this.
In fact, there is a very good argument that it might be the other way around. People who are destined to make more money just happen to be more likely to attend college.
This comes as a shock to most people, but there are ways to test it.
First, you can look at just the subset of people who had similar grades and test scores in high school and compare those who went to college and those who didn’t.
Those who didn’t go to college might have made that choice for personal or financial reasons, even though they could have been accepted.
What studies have found is that the group that could have gone to college but didn’t had similar lifetime incomes to those that did attend college. They were only slightly lower, and that doesn’t even factor in debt that may have been accrued going to college.
Likewise, there was a study comparing people who were accepted to and graduated from elite universities with those who applied but were not accepted and went to college elsewhere.
Again, the results showed that the lifetime earnings of the two groups were pretty much the same. Winning the lottery of getting accepted was secondary to just being part of a group that was smart and ambitious enough even to bother to apply.
Understanding the differences between correlation and causation is really important, and it is very easy to confuse them. The reason it is so confusing is that if a relationship is causal, it will show a correlation.
When you hear a news report or read a headline with some scientific finding that shows some link between two things, you should always have a big dose of skepticism, or at least think of possible confounding variables which might be a better explanation than what is presented.