# How to Lie With Statistics

## Podcast Transcript

Mark Twain popularized the saying, ‘There are three kinds of lies: lies, damned lies, and statistics,’ a line he attributed to the British Prime Minister Benjamin Disraeli.

The reason statistics gets its own category is that it is possible to use numbers to misrepresent the truth, distort reality, or outright lie.

However, if you know what to look for, you can catch misuses of statistics, and if you really pay attention, you can find these misuses almost everywhere.

In the past, I’ve done episodes on subjects such as logical fallacies, correlation vs causation, and survivor bias.

This episode is in the same ballpark in so far as it deals with critical thinking, but it is a bit different in that it is more nuanced. There is some overlap with the previous episodes I’ve done, but for the most part, it’s very different.

The reason why it’s different is that lying with statistics isn’t necessarily lying. It certainly can be lying, but more often than not, it might just be putting a positive spin on something or interpreting data in such a way that it supports a hypothesis.

The goal of this episode is to give you an idea of some of the things to look for when you hear the results of studies, surveys, or polling data. This is important because if you pay attention to the news, you will hear statistics reported all the time, and there are big problems with a lot of the statistics you see reported in the media.

Sometimes, it could be an honest mistake, and other times, it might actually be a case of fraud.

So, let’s start with the most extreme case, that of fraud. Statistics is ultimately about collecting data and then trying to interpret that data in a meaningful way.

Whether it is a poll, a survey, econometric data, or nutritional research, ultimately, there has to be trust in the researcher that they honestly collected the data.

However, there have been cases, far too many cases, of outright research fraud. In 2023, over 10,000 research papers were retracted from various journals. Not all of those retractions were fraud, but many of them were.

Wiley, a major publisher of academic journals, had to shutter 19 of its journals because the retractions were getting out of hand.

In January of this year, the Dana-Farber Cancer Institute, an affiliate of Harvard Medical School, had to retract six studies and correct the data in another 31. The papers were published by some of the most senior and distinguished researchers at the institute.

I don’t want to belabor the point on research fraud because that isn’t the focus of this episode, but you should be aware that it is a thing. The vast majority of published research is not fraudulent, but it does exist.

If the data is bad, then it doesn’t matter how good your statistical analysis is, because you will still get a worthless result.

So, let’s assume the data wasn’t completely made up, which is the case the vast majority of the time. There are still a whole bunch of other things that can go wrong.

If you have a data set, most researchers are looking for a statistically significant relationship between two things, and the standard way to measure that is with a p-value. A p-value is the probability that you would see a result at least as extreme as the one you observed if there were actually no real effect. Usually, a p-value of 0.05 or lower is what a researcher wants, which roughly means that a result that strong would show up by pure chance only 5% of the time.

I should note that there is nothing magical about 0.05. It’s just a number that has been used traditionally in research. There are some very good arguments that a more stringent p-value, maybe 0.01, should be used, but that is for another episode.
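To make that a bit more concrete, here is a minimal sketch in plain Python (my own illustration, not anything from the episode) that computes an exact two-sided p-value for a coin-flip experiment: if a supposedly fair coin comes up heads 60 times in 100 flips, how surprising is that under the assumption the coin really is fair?

```python
from math import comb

def binom_p_value(k, n, p=0.5):
    """Two-sided p-value: the probability, under the null hypothesis
    (a fair coin here), of an outcome at least as unlikely as k heads in n flips."""
    pmf = lambda i: comb(n, i) * p**i * (1 - p)**(n - i)
    observed = pmf(k)
    # Sum the probability of every outcome no more likely than the observed one.
    return sum(pmf(i) for i in range(n + 1) if pmf(i) <= observed)

print(round(binom_p_value(60, 100), 4))
```

Running this gives a p-value of about 0.057, just above the conventional 0.05 cutoff: 60 heads in 100 flips is suspicious, but not quite "significant" by the traditional standard.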

P-hacking is when you slice and re-slice the data, running test after test, until something crosses the significance threshold, and then present that result as if it had been your hypothesis all along.

If you have enough data, it isn’t hard to find something that will correlate at the 95% confidence level. There is an XKCD comic that makes a joke of this. Scientists investigate a claim that jelly beans cause acne. After finding no overall link, they test 20 individual colors of jelly beans until one, green, comes in with a p-value under 0.05, and the headline becomes that green jelly beans cause acne.
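You can see why the jelly bean approach works with a quick simulation. The sketch below (my own toy example) uses the fact that when there is no real effect, a p-value is uniformly distributed between 0 and 1, and estimates how often at least one of 20 independent tests comes up "significant" by pure chance.

```python
import random

random.seed(42)  # fixed seed so the simulation is repeatable

def at_least_one_false_positive(n_tests=20, alpha=0.05):
    # Under the null hypothesis (no real effect), each test independently
    # has an alpha chance of falling below the significance cutoff.
    return any(random.random() < alpha for _ in range(n_tests))

trials = 10_000
rate = sum(at_least_one_false_positive() for _ in range(trials)) / trials
print(f"Chance at least one of 20 tests 'finds' something: {rate:.2f}")
```

The theoretical answer is 1 − 0.95²⁰ ≈ 0.64, and the simulation lands close to that: run 20 tests on pure noise and you will "discover" something nearly two-thirds of the time.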

Something similar to p-hacking is cherry-picking data. Unlike p-hacking, where you comb through the data looking for a hypothesis that fits, with cherry-picking you simply ignore the data that doesn’t fit your hypothesis.

This could be a form of data fraud insofar as you are committing the crime of omission rather than commission.

Let’s suppose you have a hypothesis, you collect your data, and everything looks great, except for the fact that you have some outlier data that messes everything up. You could conveniently leave that data out to make your data set look better.

Cherry-picking data happens all the time. If a politician wants to show that crime is increasing or decreasing, they just have to pick the right crime to look at. One type of crime might be going up while others are going down.

You could also pick the right time frame to illustrate a point. By picking the right start and end points, you can make a stock look bad or good. You could say that a stock has gone up 5% over the last month, totally ignoring the fact that it is actually down 80% for the entire year.
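Here is a toy illustration of that window-picking trick, with made-up monthly prices for a hypothetical stock that collapsed early in the year and then ticked up slightly at the end:

```python
# Hypothetical monthly closing prices for a stock over one year (invented data).
prices = [100, 90, 70, 55, 40, 30, 25, 22, 20, 18, 19, 20]

def pct_change(start, end):
    return (end - start) / start * 100

last_month = pct_change(prices[-2], prices[-1])  # just the final month
full_year = pct_change(prices[0], prices[-1])    # the whole year

print(f"Last month: {last_month:+.1f}%")  # up about 5%
print(f"Full year:  {full_year:+.1f}%")   # down 80%
```

Both numbers are computed from the same price series; the only difference is where you choose to start measuring.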

Another statistical problem that crops up quite often is having a sample size that is too small.

You might have seen the TV commercial that says 4 out of 5 dentists recommend sugarless gum for their patients who chew gum.

Well, how many dentists did they ask? If they literally only asked five dentists, that is a very small sample size.
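We can actually put a number on how unreliable a five-dentist survey would be. This sketch (again, my own illustration) computes how often a survey of just five dentists would report "4 out of 5" even if, in reality, only half of all dentists made the recommendation.

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability of at least k successes in n trials, each with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# If only 50% of all dentists actually recommended sugarless gum,
# how often would a random sample of five still show "4 out of 5"?
print(round(prob_at_least(4, 5, 0.5), 3))
```

The answer works out to about 0.19, so nearly one survey in five would produce the famous "4 out of 5" headline even if dentists were actually split down the middle.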

This is always a big problem when polling elections. People with similar political opinions tend to clump together. They might congregate in the same city, same occupation, same religion, or social circles. If you take too small of a sample size, you run a high risk of only sampling people from a single group.

One area where you see small sample sizes is in presidential elections. Every presidential election cycle, you see people who try to show trends in elections. The problem is that presidential elections don’t happen that often. There are only 25 that take place each century.

Go back even ten presidential elections, and much of the current electorate wasn’t even born. So, trying to glean trends from such a small sample size is very difficult to do.

Certain types of medical research can’t avoid the problem of small sample sizes. Some rare diseases affect only a very small number of people, which makes studying them difficult. In such cases, there isn’t much you can do about it other than avoid statistical techniques that assume a large sample.

One of the most common misuses of statistics is misleading percentages and proportions.

This happens quite frequently when talking about risks.

Let’s assume that we are talking about something that has a risk of causing cancer, and just for the sake of argument, let’s assume that the risk is real.

One thing you might hear is that something increases the risk of cancer. Knowing that something increases the risk of cancer actually doesn’t tell you much.

Leaving the house increases the risk of getting hit by a car, but people do it all the time. We do it because although the risk is real, it is very low.

Knowing that something increases your risk of cancer doesn’t mean much without knowing what the absolute risk is.

If there is a type of cancer that someone has a 1 in 100,000 risk of contracting, and you do X that increases that risk to 2 in 100,000, it can be described in three different ways.

You can say that doing X increases your risk of cancer. That would be true, but it doesn’t give the magnitude of the increase or the size of the risk.

You could also say that doing X increases your risk of cancer by 100%. This sounds more dramatic, and many people confuse a percentage increase in relative risk with absolute risk. Going from 1 in 100,000 to 2 in 100,000 is indeed a 100% increase, but many people will mistakenly hear that as a 100% chance of getting cancer.

The other way you could describe it is to say that doing X will increase your risk of cancer by 1 in 100,000. This is also true, but it sounds much less menacing than a 100% increase.

All three of these descriptions are technically true; which one gets used depends on whether someone is trying to play down or magnify the risk.
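The arithmetic behind those three framings fits in a few lines. Using the hypothetical 1-in-100,000 numbers from above:

```python
baseline = 1 / 100_000  # risk without exposure (hypothetical numbers)
exposed = 2 / 100_000   # risk after doing X

relative_increase = (exposed - baseline) / baseline * 100  # in percent
absolute_increase = exposed - baseline                     # as a raw probability

print(f"Relative increase: {relative_increase:.0f}%")            # the scary framing
print(f"Absolute increase: {absolute_increase * 100_000:.0f} in 100,000")  # the mild framing
```

Same underlying risk, two very different-sounding headlines: a "100% increase" versus "1 extra case in 100,000."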

Another big problem with any sort of data collection has to do with leading questions. Most people who make surveys know that you can get the results you want by just asking the right question.

Let’s say you are doing a customer satisfaction survey, and you want to show that people were very satisfied with your product or service.

The question you would want to ask is, “How satisfied were you with our service?”

Logically, you could phrase the question as, “How dissatisfied were you with our service?” But that would be phrasing the question as a negative instead of a positive.

The first question assumes that you were satisfied, and the question only asks how much.

This is very important when it comes to election or issue polling. One thing to look for in any poll is to see if they provide the question that was asked in the poll results.

Most major presidential polls provide this information as well as providing data on whether the interviewee intends to vote or is a registered voter.

Nutritional surveys often run into problems with data collection and questions. In these surveys, the problem isn’t leading questions. The problem is asking questions that people either can’t answer accurately from memory or feel pressure to answer “correctly.”

Large-scale nutritional studies will give people food surveys that ask them what they’ve eaten over the last six months to a year.

Most people can’t remember everything they’ve eaten over that long period of time. The number of cookies, pork chops, and apples they’ve consumed is, at best, a guess.

The problem is that people often give the answer that they think they should give. They will overreport the number of apples, for example, and underreport the number of cookies because they want to appear to be eating right.

Conducting statistical surveys or polling is a very difficult thing to do. I’ve only covered some of the more obvious problems with statistics, which are easy to observe. There are actually many more technical issues that can come up in statistics that can only be detected with full access to data and the techniques used, and full data sets are often not made available.

Whenever I hear “research shows,” or something to that effect, I always take it with a grain of salt, at least initially. It might be true, but one survey or poll is usually not enough to establish anything. It is only through replication of the results, which often never happens, that you can really determine whether a statistical finding is in fact true.