P-values Broke Scientific Statistics—Can We Fix Them?


[♪ INTRO] A little over a decade ago, a neuroscientist
stopped by a grocery store on his way to his lab to buy a large Atlantic salmon. The fish was placed in an MRI machine, and then it completed what was called an
“open-ended mentalizing task” where it was asked to determine the emotions that were being experienced by different people in photos. Yes, the salmon was asked to do that. The
dead one from the grocery store. But that’s not the weird part. The weird part is that researchers found that so-called significant activation occurred in neural tissue in a couple places in the dead fish. Turns out, this was a little bit of a stunt. The researchers weren’t studying the mental abilities of dead fish; they wanted to make a point about statistics,
and how scientists use them. Which is to say, stats can be done wrong,
so wrong that they can make a dead fish seem alive. A lot of the issues surrounding scientific
statistics come from a little something called a p-value. The p stands for probability, and it refers to the probability that you would have gotten results at least as extreme as yours just by chance. There are lots of other ways to provide statistical
support for your conclusion in science, but p-value is by far the most common, and, I mean, it’s literally what scientists mean when they report that their findings are “significant”. But it’s also one of the most frequently misused and misunderstood parts of scientific research. And some think it’s time to get
rid of it altogether. The p-value was first proposed by a statistician
named Ronald Fisher in 1925. Fisher spent a lot of time thinking about how to determine if the results of a study were really meaningful. And, at least according to some accounts, his big breakthrough came after a party in the early 1920s. At this party there was a fellow scientist
named Muriel Bristol, and reportedly, she refused a cup of tea from Fisher because he had added milk after the tea was poured. She only liked her tea when the milk was added
first. Fisher didn’t believe she could really taste
the difference, so he and a colleague designed an experiment to test her assertion. They made eight cups of tea, half of which
were milk first, and half of which were tea first. The order of the cups was random, and, most
importantly, unknown to Bristol, though she was told there would be four of each kind. Then, Fisher had her taste each cup one by one and say whether it was milk-first or tea-first. And to Fisher’s great surprise, she went
8 for 8. She guessed correctly every time which cup was tea-first and which was milk-first! And that got him to thinking, what are the
odds that she got them all right just by guessing? In other words, if she really couldn’t taste
the difference, how likely would it be that she got them all right? He calculated that are 70 possible orders
for the 8 cups if there are four of each mix. Therefore, the probability that she’d guess
the right one by luck alone is 1 in 70. Written mathematically, the value of p is about 0.014. That, in a nutshell, is a p-value, the probability that you’d get that result
if chance is the only factor. In other words, there’s really no relationship
between the two things you’re testing, in this case, how tea is mixed versus how it
tastes, but you could still wind up with data that suggest there is a relationship. Of course, the definition of “chance”
varies depending on the experiment, which is why p-values depend a lot on experimental
design. Say Fisher had only made 6 cups, 3 of each tea mix. Then, there are only 20 possible orders for the cups, so the odds of getting them all correct are 1 in 20, a p-value of 0.05.
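By the way, that arithmetic is easy to check for yourself. Here’s a quick sketch in Python, obviously not anything Fisher had in 1925, that reproduces both the 8-cup and the 6-cup numbers:

```python
from math import comb

# Fisher's design: the taster has to split the cups into two equal groups,
# so the number of possible orderings is "n choose n/2".
for n_cups in (8, 6):
    orderings = comb(n_cups, n_cups // 2)  # 70 orderings for 8 cups, 20 for 6
    p_value = 1 / orderings                # the chance of guessing them all correctly
    print(f"{n_cups} cups: {orderings} possible orderings, p = {p_value:.3f}")
```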
Fisher went on to describe an entire field of statistics based on this idea, which we now call Null Hypothesis Significance Testing. The “null hypothesis” refers to the experiment’s
assumption of what “by chance” looks like. Basically, researchers calculate how likely
it would be to get the data they did if the effect they’re testing for didn’t exist. Then, if the results are extremely unlikely
to occur under the null hypothesis, they can infer that it isn’t true. So, in statistical speak, with a low enough p-value, they can reject the null hypothesis, leaving them with whatever alternate hypothesis they had as the explanation for the results.
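One way to picture what the null hypothesis is doing here is to simulate it. Here’s a rough sketch, using the 8-cup tea setup from earlier: a pretend taster with no ability at all just shuffles the labels, and we count how often pure chance reproduces a perfect score.

```python
import random

# A taster who can't tell the difference is just assigning the labels at random.
def random_guess_is_perfect():
    truth = ["milk first"] * 4 + ["tea first"] * 4
    guess = truth.copy()
    random.shuffle(guess)
    return guess == truth

trials = 100_000
hits = sum(random_guess_is_perfect() for _ in range(trials))
print(f"Estimated p-value: {hits / trials:.3f}")  # hovers around 1/70, about 0.014
```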
The question becomes: how low does a p-value have to be before you can reject that null hypothesis? Well, the standard answer used in science
is less than 1 in 20 odds, or a p-value below 0.05. The problem is, that’s an arbitrary choice. It also traces back to Fisher’s 1925 book,
where he said 1 in 20 was quote “convenient”. A year later, he admitted the cutoff was somewhat
subjective, but that 0.05 was generally his personal preference. Since then, the 0.05 threshold has become
the gold standard in scientific research. A p of less than 0.05, and your results are
quote “significant”. It’s often talked about as determining whether
or not an effect is real. But the thing is, a result with a p-value of 0.049 isn’t more true than one with a p-value of 0.051. It’s just ever so slightly
less likely to be explained by chance or sampling error. This is really key to understand. You’re
not more right if you get a lower p-value, because a p-value says nothing about how correct
your alternate hypothesis is. Let’s bring it back to tea for a moment. Bristol aced Fisher’s 8-cup study by getting them all correct, which, as we noted, has a
p-value of 0.014, solidly below the 0.05 threshold. But it being unlikely that she randomly guessed
doesn’t prove she could taste the difference. See, it tells us nothing about other possible
explanations for her correctness. Like, if the teas had different colors rather
than tastes. Or she secretly saw Fisher pouring each cup! Also, it still could have been a one-in-seventy
fluke. And sometimes (one might even argue often), 1 in 20 is not a good enough threshold to really rule out that a result is a fluke. Which brings us back to that seemingly undead
fish. The spark of life detected in the salmon was actually an artifact of how MRI data is collected and analyzed. See, when researchers analyze MRI data, they look at small units about a cubic millimeter or two in volume. So for the fish, they took
each of these units and compared the data before and after the pictures were shown to
the fish. That means even though they were just looking
at one dead fish’s brain before and after, they were actually making multiple comparisons,
potentially, thousands of them. The same issue crops up in all sorts of big
studies with lots of data, like nutritional studies where people provide detailed diet information about hundreds of foods, or behavioral studies where participants fill out surveys
with dozens of questions. In all cases, even though any single comparison is unlikely to come up significant just by chance, with enough comparisons, you’re bound to find some false positives. There are statistical solutions for this problem, of course, which are simply known as multiple comparison corrections. Though they can get fancy, they usually amount to lowering the threshold for p-value significance.
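Here’s a toy sketch of that problem, and to be clear, it’s not the salmon study’s actual pipeline: thousands of comparisons run on pure noise, first at the usual 0.05 threshold, then with the simplest correction, a Bonferroni adjustment, which just divides the threshold by the number of tests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, alpha = 5000, 0.05

# Every comparison here is noise versus noise, so any "significant" result
# is a false positive by construction.
p_values = np.array([
    stats.ttest_ind(rng.normal(size=20), rng.normal(size=20)).pvalue
    for _ in range(n_tests)
])

print("Significant at 0.05:         ", int((p_values < alpha).sum()))            # roughly 250
print("After Bonferroni correction: ", int((p_values < alpha / n_tests).sum()))  # usually 0
```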
And to their credit, the researchers who looked at the dead salmon also ran their data with multiple comparison corrections, and when they did, their data was no longer significant. But not everyone uses these corrections. And though individual studies might give various
reasons for skipping them, one thing that’s hard to ignore is that researchers are under a lot of pressure to publish their work, and significant results are more likely to get
published. This can lead to p-hacking: the practice of
analyzing or collecting data until you get significant p-values. This doesn’t have to be intentional, because researchers make many small choices that can lead to different results, like we saw with 6 versus
8 cups of tea. This has become such a big issue because,
unlike when these statistics were invented, people can now run tests lots of different
ways fairly quickly and cheaply, and just go with what’s most likely to get their work
published.
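To see how much that can matter, here’s a sketch of one classic form of p-hacking, sometimes called optional stopping: a made-up experiment with no real effect gets re-tested every time a few more data points come in, and stops the moment p dips below 0.05. The exact numbers here are arbitrary; the inflation isn’t.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Both groups are drawn from the same distribution, so there is no real effect.
def peek_until_significant(start_n=10, step=5, max_n=100, alpha=0.05):
    a = list(rng.normal(size=start_n))
    b = list(rng.normal(size=start_n))
    while len(a) <= max_n:
        if stats.ttest_ind(a, b).pvalue < alpha:
            return True                      # stop early and report "significance"
        a.extend(rng.normal(size=step))
        b.extend(rng.normal(size=step))
    return False

runs = 1000
rate = sum(peek_until_significant() for _ in range(runs)) / runs
print(f"False positive rate with peeking: {rate:.0%}")  # well above the advertised 5%
```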
Because of all of these issues surrounding p-values, some are arguing that we should get rid of them altogether. And one journal has
totally banned them. And many that say we should ditch the p-value
are pushing for an alternate statistical system called Bayesian statistics. P-values, by definition, only examine null
hypotheses. The result is then used to infer if the alternative is likely. Bayesian statistics actually look at the probability
of both the null and alternative hypotheses. What you wind up with is a ratio of how well one explanation predicts the data compared to the other. This is called a Bayes factor. And this is a much better answer if you want to know how likely you are to be wrong.
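As a very rough sketch of what that looks like, here’s a Bayes factor for a simplified version of the tea experiment: each of Bristol’s eight calls is treated as an independent 50/50 guess under the null, and her true hit rate gets a flat prior under the alternative. That’s not quite Fisher’s actual design, since she knew there were four cups of each kind, but it shows the idea.

```python
from scipy import stats, integrate

n_cups, n_correct = 8, 8

# Null hypothesis: she's purely guessing, so each cup is a coin flip.
likelihood_null = stats.binom.pmf(n_correct, n_cups, 0.5)

# Alternative: she has *some* ability, but we don't know how much, so we
# average the likelihood over every possible hit rate (a flat prior).
likelihood_alt, _ = integrate.quad(
    lambda p: stats.binom.pmf(n_correct, n_cups, p), 0, 1
)

bayes_factor = likelihood_alt / likelihood_null
print(f"Bayes factor, some ability vs. pure guessing: about {bayes_factor:.0f} to 1")  # ~28
```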
This system was around when Fisher came up with p-values. But, depending on the dataset, calculating Bayes factors can require some serious computing power, power that wasn’t available at the time,
since, y’know, it was before computers. Nowadays, you can have a huge network of computers
thousands of miles from you to run calculations while you throw a tea party. But the truth is, replacing p-values with
Bayes factors probably won’t fix everything. A loftier solution is to completely separate
a study’s publishability from its results. This is the goal of two-step manuscript submission, where you submit an introduction to your study and a description of your method, and the journal decides whether to publish before seeing your results. That way, in theory at least, studies would get published based on whether they represent good science, not whether they worked out
the way researchers hoped, or whether a p-value or Bayes factor was more or less than some arbitrary threshold. This sort of idea isn’t widely used yet, but it may become more popular as statistical significance comes under sharper criticism. In the end, hopefully, all this controversy
surrounding p-values means that academic culture is shifting toward a clearer portrayal of what research results do and don’t really show. And that will make things more accessible
for all of us who want to read and understand science, and keep any more zombie fish from
showing up. Now, before I go make myself a cup of Earl
Grey, milk first, of course, I want to give a special shout out to today’s President
of Space, SR Foxley. Thank you so much for your continued support! Patrons like you give
us the freedom to dive deep into complex topics like p-values, so really, we can’t thank
you enough. And if you want to join SR in supporting this channel and the educational content we make here at SciShow, you can learn more at Patreon.com/SciShow. Cheerio! [♪ OUTRO]