# P-values Broke Scientific Statistics—Can We Fix Them?

[♪ INTRO] A little over a decade ago, a neuroscientist

stopped by a grocery store on his way to his lab to buy a large Atlantic salmon. The fish was placed in an MRI machine, and then it completed what was called an

“open-ended mentalizing task” where it was asked to determine the emotions that were being experienced by different people in photos. Yes, the salmon was asked to do that. The

dead one from the grocery store. But that’s not the weird part. The weird part is that researchers found that so-called significant activation occurred in neural tissue in a couple places in the dead fish. Turns out, this was a little bit of a stunt. The researchers weren’t studying the mental abilities of dead fish; they wanted to make a point about statistics,

and how scientists use them. Which is to say, stats can be done wrong,

so wrong that they can make a dead fish seem alive. A lot of the issues surrounding scientific

statistics come from a little something called a p-value. The p stands for probability, and it refers to the probability that you’d get results at least as extreme as yours if chance alone were at work. There are lots of other ways to provide statistical

support for your conclusion in science, but the p-value is by far the most common, and, I mean, it’s literally what scientists mean when they report that their findings are “significant”. But it’s also one of the most frequently misused and misunderstood parts of scientific research. And some think it’s time to get

rid of it altogether. The p-value was popularized by a statistician

named Ronald Fisher in 1925. Fisher spent a lot of time thinking about how to determine if the results of a study were really meaningful. And, at least according to some accounts, his big breakthrough came after a party in the early 1920s. At this party there was a fellow scientist

named Muriel Bristol, and reportedly, she refused a cup of tea from Fisher because he had added milk after the tea was poured. She only liked her tea when the milk was added

first. Fisher didn’t believe she could really taste

the difference, so he and a colleague designed an experiment to test her assertion. They made eight cups of tea, half of which

were milk first, and half of which were tea first. The order of the cups was random, and, most

importantly, unknown to Bristol, though she was told there would be four of each cup. Then, Fisher had her taste each tea one by

one and say whether that cup was milk-first or tea-first. And to Fisher’s great surprise, she went

8 for 8. She guessed correctly every time which cup was tea-first and which was milk-first! And that got him to thinking, what are the

odds that she got them all right just by guessing? In other words, if she really couldn’t taste

the difference, how likely would it be that she got them all right? He calculated that there are 70 possible orders

for the 8 cups if there are four of each mix. Therefore, the probability that she’d guess

the right one by luck alone is 1 in 70. Written mathematically, the value of P is about 0.014. That, in a nutshell, is a p-value, the probability that you’d get that result

if chance is the only factor. In other words, there’s really no relationship

between the two things you’re testing, in this case, how tea is mixed versus how it

tastes, but you could still wind up with data that suggest there is a relationship. Of course, the definition of “chance”

varies depending on the experiment, which is why p-values depend a lot on experimental

design. Say Fisher had only made 6 cups, 3 of each

tea mix. Then, there are only 20 possible orders for the cups, so the odds of getting them all correct are 1 in 20, a p-value of 0.05. Fisher went on to describe an entire field

of statistics based on this idea, which we now call Null Hypothesis Significance Testing. The “null hypothesis” refers to the experiment’s

assumption of what “by chance” looks like. Basically, researchers calculate how likely

it is that they’d have gotten the data they did, assuming the effect they’re testing

for doesn’t exist. Then, if the results are extremely unlikely

to occur if the null hypothesis is true, then they can infer that it isn’t. So, in statistical speak, with a low enough

p-value, they can reject the null hypothesis, leaving them with whatever alternate hypothesis

they had as the explanation for the results. The question becomes, how low does a p-value

have to be before you can reject that null hypothesis? Well, the standard answer used in science

is less than 1 in 20 odds, or a p-value below 0.05. The problem is, that’s an arbitrary choice. It also traces back to Fisher’s 1925 book,

where he said 1 in 20 was quote “convenient”. A year later, he admitted the cutoff was somewhat

subjective, but that 0.05 was generally his personal preference. Since then, the 0.05 threshold has become

the gold standard in scientific research. A p of less than 0.05, and your results are

quote “significant”. It’s often talked about as determining whether

or not an effect is real. But the thing is, a result with a p-value of 0.049 isn’t more true than one with a p-value of 0.051. It’s just ever so slightly

less likely to be explained by chance or sampling error. This is really key to understand. You’re

not more right if you get a lower p-value, because a p-value says nothing about how correct

your alternate hypothesis is. Let’s bring it back to tea for a moment. Bristol aced Fisher’s 8-cup study by getting them all correct, which as we noted, has a

p-value of 0.014, solidly below the 0.05 threshold. But it being unlikely that she randomly guessed

doesn’t prove she could taste the difference. See, it tells us nothing about other possible

explanations for her correctness. Like, if the teas had different colors rather

than tastes. Or she secretly saw Fisher pouring each cup! Also, it still could have been a one-in-seventy

fluke. And sometimes, one might even argue often, 1 in 20 is not a good enough threshold to really rule out that a result is a fluke. Which brings us back to that seemingly undead

fish. The spark of life detected in the salmon was actually an artifact of how MRI data is collected and analyzed. See, when researchers analyze MRI data, they look at small units about a cubic millimeter or two in volume. So for the fish, they took

each of these units and compared the data before and after the pictures were shown to

the fish. That means even though they were just looking

at one dead fish’s brain before and after, they were actually making multiple comparisons,

potentially thousands of them. The same issue crops up in all sorts of big

studies with lots of data, like nutritional studies where people provide detailed diet information about hundreds of foods, or behavioral studies where participants fill out surveys

with dozens of questions. In all cases, even though each individual

comparison is unlikely to turn up a false positive, with enough comparisons, you’re bound to find some. There are statistical solutions for this problem,

of course, which are simply known as multiple comparison corrections. Though they can get fancy, they usually amount

to lowering the threshold for p-value significance. And to their credit, the researchers who looked

at the dead salmon also ran their data with multiple comparison corrections. When they

did, their data was no longer significant. But not everyone uses these corrections. And though individual studies might give various

reasons for skipping them, one thing that’s hard to ignore is that researchers are under a lot of pressure to publish their work, and significant results are more likely to get

published. This can lead to p-hacking: the practice of

analyzing or collecting data until you get significant p-values. This doesn’t have to be intentional, because researchers make many small choices that lead to different results, like we saw with 6 versus

8 cups of tea. This has become such a big issue because,

unlike when these statistics were invented, people can now run tests lots of different

ways fairly quickly and cheaply, and just go with what’s most likely to get their work

published. Because of all of these issues surrounding

p-values, some are arguing that we should get rid of them altogether. And one journal has

totally banned them. And many who say we should ditch the p-value

are pushing for an alternate statistical system called Bayesian statistics. P-values, by definition, only examine null

hypotheses. The result is then used to infer if the alternative is likely. Bayesian statistics actually look at the probability

of both the null and alternative hypotheses. What you wind up with is an exact ratio of

how likely one explanation is compared to another. This is called a Bayes factor. And this is a much better answer if you want

to know how likely you are to be wrong. This system was around when Fisher came up

with p-values. But, depending on the dataset, calculating Bayes factors can require some serious computing power, power that wasn’t available at the time,

since, y’know, it was before computers. Nowadays, you can have a huge network of computers

thousands of miles from you to run calculations while you throw a tea party. But the truth is, replacing p-values with

Bayes factors probably won’t fix everything. A loftier solution is to completely separate

a study’s publishability from its results. This is the goal of two-step manuscript submission, where you submit an introduction to your study and a description of your method, and the journal decides whether to publish before seeing your results. That way, in theory at least, studies would get published based on whether they represent good science, not whether they worked out

the way researchers hoped, or whether a p-value or Bayes factor was more or less than some arbitrary threshold. This sort of idea isn’t widely used yet, but it may become more popular as statistical significance meets sharper criticism. In the end, hopefully, all this controversy

surrounding p-values means that academic culture is shifting toward a clearer portrayal of what research results do and don’t really show. And that will make things more accessible

for all of us who want to read and understand science, and keep any more zombie fish from

showing up. Now, before I go make myself a cup of Earl

Grey, milk first, of course, I want to give a special shout out to today’s President

of Space, SR Foxley. Thank you so much for your continued support! Patrons like you give

us the freedom to dive deep into complex topics like p-values, so really, we can’t thank

you enough. And if you want to join SR in supporting this channel and the educational content we make here at SciShow, you can learn more at Patreon.com/SciShow. Cheerio! [♪ OUTRO]
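The multiple-comparisons problem behind the zombie salmon, and why corrections like Bonferroni kill those false positives, can be sketched with a quick simulation. This is a toy illustration, not the study's actual analysis; the 1,000-comparison count and the random seed are arbitrary assumptions:

```python
import random

random.seed(42)

# Under the null hypothesis, p-values are uniformly distributed on [0, 1].
# Simulate one p-value per comparison -- e.g. one per small unit of brain
# volume in a scan where nothing is actually happening.
n_comparisons = 1000
p_values = [random.random() for _ in range(n_comparisons)]

alpha = 0.05
naive_hits = sum(p < alpha for p in p_values)

# Bonferroni correction: divide the threshold by the number of comparisons,
# so the chance of even one false positive across all tests stays near alpha.
bonferroni_alpha = alpha / n_comparisons
corrected_hits = sum(p < bonferroni_alpha for p in p_values)

print("Uncorrected 'significant' comparisons:", naive_hits)   # roughly 50 expected
print("Bonferroni-corrected:", corrected_hits)                # almost always 0
```

With pure noise, about 1 in 20 comparisons clears the 0.05 bar just by chance, which is exactly how a dead fish ends up with "activated" brain regions; after correction, essentially nothing survives.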

There is a typo at 7:37! The P-value for 6 tea cups is 0.05, not 0.5. Thanks to everyone who pointed it out!
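For anyone who wants to double-check the arithmetic from the video and the correction above: with equal numbers of each cup type, the number of possible orderings is a binomial coefficient, and a perfect score has probability one over that count. A minimal sketch in Python:

```python
from math import comb

# 8 cups, 4 milk-first: count the ways to choose which 4 positions are milk-first
orderings_8 = comb(8, 4)
p_8 = 1 / orderings_8
print(orderings_8, round(p_8, 3))  # 70 orderings -> p is about 0.014

# 6 cups, 3 of each mix
orderings_6 = comb(6, 3)
p_6 = 1 / orderings_6
print(orderings_6, p_6)  # 20 orderings -> p = 0.05
```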

The Pee-Value

IMO, there should be some body that requires journals to publish a specific percentage of "replication reports." Studies that cannot be replicated are useless, regardless of the p value. Or, require that a study be replicated by an independent group before any results are published.

Coming from an "aspiring" industrial and systems engineer a few dots were connected that were left distant from the few statistics and probability classes i have taken at university. Hypothesis testing and Bayes Theorem have made a bit more sense to me. I praise You

So the tea and milk issue helps explain the problem with the P value more than what this video says. Because there is a physical difference due to the way the liquids are distributed depending on which one is being used in a pouring and effect. So when we have a concept we should always be actively trying to disprove it through multiple experiments and approaches most notably through trying to account for all the properties of the issue rather than defaulting to probabilities that easily can obscure characteristics

I can taste if coffee has been poured onto milk or milk been poured into coffee. It is sweeter if coffee is poured into the milk. And I think the reason is best consider through the point of contact. If you pour coffee a small portion of the coffee is being rearranged by the milk. And eventually whatever mechanism may come into play here doesn't matter but this is enough 2 then create a follow-up experiment with an attempt to harvest coffee that has had this particular interaction and then perhaps being able to concentrate the difference in taste in somehow maintaining equal proportion of coffee and milk

Hi, great video. It’s important to know that: 1) the ability to publish reproduction studies is key, 2) physics uses insane p-values.

Critical Hits FTW

Confirmed v8 > v6

And this is exactly why they keep blaming "climate change" on mans activity. They keep running the numbers until they get what they want.

Bayesian statistics are very likely to get you blank stares from reviewers.

Okay, so I tested things out, with the tea, and let me tell you something. There is a slight difference. I think you all should do a live show where you test out the milk and tea thing. Each host goes through and we can all see if there is a main notable difference. I think the key thing is. Is well you are adding more tea to the milk. So you want to have exact ratios of tea to milk. Then as exact steeping for each cup as you can get. It would be a nifty thing to do!

As a side point, it would have been nice to mention why tea can taste different depending on the order of the milk and tea into a cup.

I agree with Judy Tunanuda from SNL " It could happen".

Water first you heathen!

This explains the anti-vaccine research linking it to autism

I have been heartbroken about all of the systemic fraud that has been uncovered in science lately. Everybody has agendas, egos, conflicts of interests, etc. Two step manuscript submission would certainly help, but based on what we know about human nature I don't know if we can ever truly trust academia 100% ever again. Peer reviews need to get brutal, rather than everybody wanting to support their colleagues or further their agendas. The current peer review system seems like a giant circle jerk to me.

Literally soooo important in every science class to watch this video. My tchr just makes us do this test for every report and it can be quite dumb

Dude p values fuckin sucks

https://xkcd.com/1132/

Plot Twist: the fish was alive that whole time

Earl Grey with Milk?? You Savage!!

8:45 "Before computers"

Even though I was born before home computing was a thing, its still crazy to think that "before computers" was really not long ago lol…. Trippy.

P of 0.05 is definitely not good enough. d20s really don't offer that much variation after all.

I worry that 2 step manuscripts will suffer from the political/sociological bias of the boards more than p value would, we already see this perversion of publishing in pseudo sciences like gender/feminism and i worry a lot of studies wont even be green lit from the start simply because lets say it aims to proves gender is not a social construct or something like that(its just an example) .

It is a step in the right direction but i wish we could have objective greenlighting is all im saying, on the other hand it does add objectivity because then they can't approve based on whether the result fits their narrative?

Objectivity is so hard to come by these days

The only thing more surprising is that this is not very surprising. Fixing the P-value would fix nothing. The competitive nature of aquiring grants and funding as well as the drive to get published is far more damaging. These along with things like confirmation bias are eroding our trust in the some of the most important fields of study that impact our daily lives. Once it passes a certain point that trust we placed in those pushing our boundaries, will be all but impossible to restore in a single generation.

So they used a fish to disprove the value of mr. Fisher? xD love science… xD

How do you add milk first?

First of all: you rock, and I may be in love

2nd: you gotta account for bias while working with p ( not to mention why even acknowledge anything based on low pool count)

That said hand me a suitable stand in for p in large date sets and I'm in

The 2 stage system sounds way better, as now studies that come back as inconclusive will get published.

Guys I was struck by a question today: do guys store fat in the scrotum? If not why?

The tea thing is real so is food tasting different after the microwave.

1/72 keeps getting mentioned, however what is the possibility that she gets every single cup correct? If she had eight cups correct, the statistical significance dramatically increases. 8 times in a row brings some significance to the P-value. While this is not an end-all, be-all (correlation), it still holds some prevalence to statistical analysis.

I forgot to mention, this is one of the most amazing videos I've seen on You Tube. Thank you for covering this subject.

Olivia is in fine form this video

Bring back old host

I'm really glad people stopped shitting on this woman's characteristics and appearance.

Milk first in solidarity for the proletariat!

Anybody know where the video at 7:25 is from? I don't think that blue box is a Weller soldering station.

In the research center where i used to work, they even meme the P=0.05:

"Yeah, you can use that… if you want to be a casual joke!!" XD

she probably was able to tell the difference because the milk first cups were colder

So…you're telling me pee has nothing to do with this and I should stop serving pee with milktea?

3:50 …looks like a Mustard Glass.

Milk first??)? Psychopath!

Usually, math is like a microscope, allowing you to see the inner workings of the Universe. Statistics is a hammer. good for building or bludgeoning.

the scientic reference is not that scientific,no wonder theres art in medicine.

oi i wrote like 10 criticisms of this and every minute you guys shut me down. not sure if any went thru, but if they did, you probably fixed it. sample size as example.

I am really really mad. Who the hell thumbs-down this video? 303? Dumb idiots piss me off! Beautiful video! Keep it up, think I have to go to Patreon to support, given my daughters and son immensely love the videos and learn a lot.

Physical law proves bumblebees and helicopters can't fly!

I dont agree with this entirely..the problem is not the 5%…in fact..i dont use 5%…you just check the p-value and know the "odds" of being right.

And saying that rejecting the NH doesnt mean that the AH is correct..is also not a good point. If the experiment was done correctly the only thing that should have changed is how the tea was made. You cant say the woman saw them making the tea.

I understand your point…but I would have added those comments in the video.

this is the salmon of doubt

I also believe it is unreasonable to talk about probability of guessing in the case of tea-cups. it wasn't a game of heads or tails. the woman could not have guessed it because she knew. mathematically there's 0 chance of guessing when you are not doing the guessing.

Great video. Thanks

Two flaws in your thinking of vilifying p values. One, the p value is stated before any research happens and are not set in stone to be 0.05. They pick the value in correspondence with the harm level if they are wrong. With medicine that could potentially harm the user, it needs to be very effective for it to pass. So researchers would use a much smaller p value of 0.01 or 0.001 to make sure that they aren't hurting people. Number two is science isn't about one study. One study should not be used to define if things are true or not. You need to look at converging evidence that what you think is true, actually is. That means multiple studies need to show the same thing for them to be accepted as good enough. You never can confirm cause and effect unless it's a true experiment, but researchers are very good at finding out what actually happens in our world. They already struggle with getting funding and participants who want to be in the study. I don't think this video is helpful in getting research where it needs to be

Wow that sounds like a hectic party!

SR FOXLEY keep the god work of supporting Sci-show and Zod

Mother p-hackers!

this is really a good video.

Earl Grey, milk first, cheerio! Good grief, she's perfect!

Oh, and the informative and educationally entertaining video was, as always, good too.

wow i feel like ive wasted a whole semester learning stats

It seems to me the larger issue is brought up at the end. Researchers are being encouraged to manipulate the data to get results that are favorable so they can be published. But, here's the deal, a negative result is just as helpful as a positive result. I see the solution as this: before a researcher performs the experiment, a journal needs to guarantee that the results (no matter what they are) will be published. This benefits everyone in several ways: 1)Researcher knows results will be published, which encourages neutrality 2) Journal can reject "uninteresting" subjects, redirecting researchers to subjects that are of more interest to the wider community 3) Researchers who results claim to be significant, don't stand alone, but can be compared against similar published research.

I think the last one is very important, because if the P-Value is .05 and 20 researchers are looking at the same thing; currently 19 don't get published because their work didn't turn up anything significant. But, 1 researcher does find significance, so his work is published. Only issue is that, he's the 1 in 20 false positive; which if all 20 researchers work was published would become evident very quickly.

Even worse, some journals will never publish followup studies, so there isn't any fame in

checking results. At least astrophysics have a “more places have to see it before it counts” thing going.

I found this video to be an under-description of the use of p-values and a misrepresentation of the purpose of the dead fish experiment.

1) "1 in 20" is the standard, and yes, arbitrary cutoff, but it is not enough to draw a scientific conclusion in any reputable journal. Such a journal requires around 3 independent validations of a conclusion, usually including positive and negative controls. The joint probability of 3 independent trials, diverging from a negative control and converging on a positive control, is always much lower than 0.05. Fisher's arbitrary cutoff is exactly a "convenience." The cutoff doesn't matter; it's consistency across independent queries that makes a conclusion stable. Your video makes it seem as though scientists are rolling a 20-sided die (repeated multiple times in the comments) and making totally arbitrary conclusions.

2) The dead fish experiment, as you eventually mentioned as a side note, was an important objection to neuroscientists' refusal to use multiple test corrections and a demonstration of the spuriously significant results (most much smaller than 0.05) generated by continued sampling from the null distribution. An issue resolved by multiple test corrections, which are not "simply calculating a lower p-value;" we've come a long way from Bonferroni, and FDR literally estimates the background distribution and subtracts it off. Despite your blanket discredit of broad studies, it is pretty rare for a paper neglecting necessary multiple test corrections to get accepted.

In my opinion, the misunderstandings and misrepresentations in your video work to undermine the credibility of scientists.

Shhhh this is how a ton of climate change "science" is produced.

When you have nothing to research about in life.

Just do mri of dead fish and just keep argue on stupid reasoning.

I believe milk first tastes different. Maybe that is just in my head but it feels fancier.

And this is why, in second year stats, you learn methods to control the family-wide error rate

The problem is not the p-value method. Is with the assumptions you made for the test statistics to work correctly. Don't blame statistics, blame bad scientists.

Also Bayes factor and p-values, that being the Bayesian or frequentist approach, change only in the fundamentals of how you view probability, but doesn't change it's "robustness to error" by itself. Again, it's all in the assumptions.

To do correct statistics you need to

study statistics. Understand what you use, and there will be no problem.

Better video title: "scientists abuse statistics to lie to others… can we implement techniques to avoid incorrect assessments of data?"

This is like blind auditions for applicants for positions in symphony orchestras. This practice has become standard among orchestras, resulting in more unbiased choices and greater diversity. Wonderful! I hope the journals adopt the two step process. But will that help us find the studies that confirm or negate previous studies? Will we in fact get the best science, or only the most sensational hypotheses?

It's easy to tell, one will be warmer than the other. So that's one way to guess them all correctly.

Actually, tea first and milk first DOES have a flavor variability due to three factors. The cup, air mixing and thermal blending temperatures. 118% of scientists who use statistics fail to identify variables with major effects on outcome.

Thats why report type 1 and type 2 errors

https://periodicos.ufsm.br/cienciaenatura/article/view/13195 Please see this alternative to p-values. This is solid.

I have much less problem with P when it’s derived from a calculated F when compared to the serious problem of incorrect test statistics, especially with for non-normally distributed data and repeated measures.

ENTROPY!!!

Me at 8pm: I should really go to sleep

Me at 3am:

SaLmON iN aN fMrI

i think P-value should be P-valued itself!

Idea is good but politics and money would rule out good studies

Have my science babies!

"p-value the probability that you would have gotten the results you did just by chance". Way to get it completely wrong in the first 60 seconds. I suggest a reading of "The ASA's Statement on p-Values…", or better yet, having a competent mathematician review your productions involving mathematics before publication to youtube.

The problem isn't with the p-value. It's with people's understanding of what the p-value is. In six sigma, we make sure to call things 'statistically significant' when a p-value is low enough, but 'practical significance' needs to be taken into consideration too. There is always the possibility of issues with the experiment execution, confounding variables due to poor design, or statistical flukes.

youtube is also giving p values to its content creators now

This is such a difficult topic to teach, and you did a marvelous job. I will use this video in my classes.

The problem is statistics of science is not science.

I have a solution!

Drink green tea. No milk.

You're welcome

What happens when you get published? Why is it such a big deal? Do you get paid for it or something? Because I just don't understand the significance of getting published in an era where there are hundreds of websites that you can push content to for free.

If you add milk to hot tea you scald the milk. You need to add the tea to the milk to temper the milk and keep it from scalding. Hands down they taste different.

I like my tea best when the milk is added after. For coffee i'm fine either way though, but i do think there's a slight difference in taste. It probably has to do with the (maximum) temperature the milk gets, as boiled milk (or pasteurized, for that matter) definitively tasted different from unboiled.

"Studying mental abilities of dead fish…" rofl

AMAZING VIDEO thank you! Subscribed in order to check out other stuff you have.

Repeating experiments with same conclusions does multiply the P-factor and should be way easier to do than change the whole statistics approach.

Regarding the tea vs milk poured first in 8 cups, isn't a guess getting all orders correct randomly arrived at only one in 2^8 times if 4 are tea first and 4 are milk first? That is, 50:50 each time times eight? What am I missing?

Tea milk first is not tea. It is milky hot water and sadness.

You can’t use 8 cups or 6 cups, sample size too small for p-testing does not follow normal distribution.

The new publication approach sounds brilliant

so we are all here for a dead fish?

As a stats student and layperson I can say with 95% certainty that the science of statistics needs to be overhauled. Because it’s arbitrary b******t.

lets call it then Tea-value

Shamefully misleading or worse, ill informed. Like correlations and other methods, statistical results don’t prove anything – they merely support well rationalized hypotheses. Further, the actual research design must preclude extraneous effects. Furthermore, appropriate methods would need to be used (i.e., multivariate in this case) and the strength of the covariance reported alongside. Also, sample size based on a priori power calculations are typically used to avoid approaches such as overpowering.

Now if there’s purposeful dishonesty in research, it doesn’t affect the validity of the method, nor would it make results less “hackable”.