(A longer version of this is going in the technical notes for my book, so there are references here to it.)
We can use an analysis like the one in [book chapter] to understand the so-called ``replication crisis'': the widespread failure of published research results to replicate.
We'll sidestep the controversies about suspected fraud in specific results and show how the same random clusters that happen in car crashes and diseases can produce this replication crisis, i.e. they can create positive results from data in which no real phenomenon exists.
(Some authors of the papers that failed to replicate have admitted to various levels of fraud and several papers have been retracted due to strong indications of fraud. We're not disputing that fraud exists. But here we show that large numbers of well-intentioned researchers in a field, combined with a publication bias towards positive results, can generate false results just by probability alone.)
As we saw in the Statistics Are Weird Interlude, when using a sample to make judgments about the population, a traditional (frequentist) way to express confidence in the results is to test them and report a confidence level. For historical reasons, many fields accept 95% as that confidence level.
Reminder: The general public assumes that a 95% confidence level means that the phenomenon is true with probability $0.95$, given the test results. That's not how it works: the 95% confidence means that the probability of getting a (false) positive result when the phenomenon doesn't exist is at most $0.05$. In other words, the public perception is a probability of the truth of the phenomenon conditional on the result of the test, while the real meaning is a probability of the result conditional on the truth (or not) of the phenomenon, and in general
\[ \text{[perception] } \Pr(\text{phenomenon}|\text{result}) \neq \Pr(\text{result}|\text{phenomenon}) \text{ [reality]}\]
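To see how different these two can be, here's a quick illustration with made-up numbers (my own, purely for illustration and not from the book): suppose only 10% of the hypotheses a field tests are actually true, and suppose a test detects a true phenomenon with probability $0.8$ (its power) while giving a false positive with probability $0.05$. Then by Bayes's rule the probability that a phenomenon is real given a positive result is
\[ \Pr(\text{phenomenon}|\text{result}) = \frac{0.8 \times 0.1}{0.8 \times 0.1 + 0.05 \times 0.9} = \frac{0.08}{0.125} = 0.64, \]
nowhere near the $0.95$ the public reads into the confidence level.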
There's a schism in statistics about whether and how to use uninformative priors and chained applications of Bayes's rule ---as described in the Bayesian Interlude--- to deal with multiple experiments (the frequentists' approach being ``not to use them at all''). We'll sidestep those discussions here, but point out their implications further down. End of reminder.
So, the probability that we get a false positive from testing a data set for a non-existent phenomenon at the 95% confidence level is $p \le 0.05$; for our modeling we'll use the boundary value, $p = 0.05$.
Let's say we have a team of scientists investigating the very important phenomenon of whether liking pineapple on pizza makes a person more likely to put real cream, rather than non-dairy creamer, in their coffee.
To investigate this important matter, the team chooses a sample of the population and runs a controlled experiment. Maybe the result is positive, maybe it's negative; but since they have funding, the scientists can run more experiments, controlling for different variables (is the pizza deep-dish? do people put honey or sugar in the coffee along with their cream or creamer? these variables are called covariates or, sometimes, controls). Say the team runs 100 experiments in total.
Now we make our important assumption, which we can do because we always play in God-mode: in reality there's no relationship between pineapple on pizza and the choice of cream or creamer in coffee.*
Because of that assumption, all the positives in the data will be false positives, and we know those happen with probability $p = 0.05$, by our setup. So in 100 experiments there should be an average of 5 false positives; the number of false positives is itself a random variable, say $N$, which (being a count of many independent trials, each with a small probability of success) is, to a very good approximation, distributed Poisson with parameter 5. The next figure shows that distribution.
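If you want to reproduce that figure yourself, here's a minimal sketch in Python (my own, not the code behind the book's figures), assuming scipy and matplotlib are available:

```python
import numpy as np
from scipy.stats import poisson
import matplotlib.pyplot as plt

lam = 5              # expected false positives in 100 experiments at p = 0.05
k = np.arange(0, 16) # possible counts of positive results

# PMF of N ~ Poisson(5): probability of seeing exactly k false positives
plt.bar(k, poisson.pmf(k, lam))
plt.xlabel("number of positive results in 100 experiments")
plt.ylabel("probability")
plt.title("False positives when there is no phenomenon")
plt.show()
```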
We can also ask some simple questions: how likely is the team to observe at least, say, 7 positive results? How about 10? These are easily calculated:
\[ \Pr(N\ge 7) = 1-F_N(6) = 0.24 \quad\text{and}\quad \Pr(N\ge 10) = 1-F_N(9) = 0.03.\]
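These tail probabilities come straight from the Poisson survival function; a small sketch, again assuming scipy:

```python
from scipy.stats import poisson

lam = 5
# Pr(N >= 7) = 1 - F_N(6) and Pr(N >= 10) = 1 - F_N(9)
print(poisson.sf(6, lam))   # ~0.238
print(poisson.sf(9, lam))   # ~0.032
```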
So, in almost one-quarter of the parallel universes that people use to interpret probabilities as frequencies there'll be seven or more positive results, but in only 3% of them will there be ten or more.
This may look like cherry-picking of experiments, but it's very easy for a well-meaning team to come up with reasons why an experiment that failed to produce the desired result had a design flaw; whereas the usual form of cherry-picking, selecting data points within an experiment, is generally recognized by all researchers as fraud.**
But wait, there's more…
Important scientific questions like the relationship of pineapple on pizza to real cream in coffee attract large numbers of people to scientific research; so there are other teams investigating this important phenomenon. With that in mind, we can ask how likely it is to see a paper with 7, 8, 9, or even 10 positive results, given a total of $T$ teams investigating this question.
The next figure shows the probability of at least one team in $T$ finding at least $N$ positive results at the 95% confidence level when each team runs 100 experiments; remember that this is given that there's no phenomenon.
With 10 teams there's a 50:50 chance that at least one team finds 9 or more positive results. That's usually considered a very high level of cross-validation (each experiment validating and being validated by at least 8 others). And it's all based on a non-existent phenomenon.
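The curves behind that figure are easy to recompute; here's a sketch (again scipy, my own code and parameter names) that checks the 50:50 claim for $T = 10$ teams and $N = 9$:

```python
from scipy.stats import poisson

lam = 5   # expected false positives per team (100 experiments at p = 0.05)
T = 10    # number of teams
N = 9     # threshold of positive results

# Pr(a single team sees >= N positives) = 1 - F_N(N - 1)
p_one_team = poisson.sf(N - 1, lam)

# Pr(at least one of T independent teams sees >= N positives)
p_any_team = 1 - (1 - p_one_team) ** T
print(p_any_team)   # ~0.51
```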
Because scientific publications (and the mass media) tend to publish only positive results (results that show something different from what was expected), this publication bias, together with the large number of teams $T$ and the repeated experiments (here we used 100 experiments per team), can create the illusion of phenomena where none exist. Later, when these phenomena are taken as the default position, other researchers find them to be non-existent, and the replication crisis we're seeing now ensues.
It's not a replication crisis, it's a ``too many phenomena were artifacts'' crisis.***
- - - -
* As this is a blog post and not the book or technical notes, here are my opinions on these weighty matters: pineapple on pizza, yes (but no pizza in the P:E diet, so it's a moot point); creamer, cream, honey, sugar, syrups, or any other adulterants in coffee, Heck no!
** Note how the Bayesian approach deals with chains of experiments and with non- or low-informative results: the probability of the phenomenon given the result, $\Pr(P|R)$, comes from Bayes's rule,
\[\Pr(P|R) = \frac{\Pr(R|P) \, \Pr(P)}{\Pr(R)}.\]
Bayesians can easily integrate new results with previous experiments through the prior $\Pr(P)$, thereby chaining inferences instead of discarding experiments that ``didn't work.'' And a low-informative result, one where $\Pr(R|P) \simeq \Pr(R)$, i.e. the result happens with almost the same probability whether or not the phenomenon is true, is automatically accounted for by making $\Pr(P|R) \simeq \Pr(P)$; in other words, its effect is negated without throwing away its data, as cherry-picking would.
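As a toy illustration of that chaining (my own numbers, purely for illustration): starting from an uninformative prior and updating once per experiment, an experiment whose result is nearly as likely with or without the phenomenon barely moves the posterior.

```python
def update(prior, p_result_given_true, p_result_given_false):
    """One Bayes's-rule update: returns Pr(P | R) from Pr(P) and the two likelihoods."""
    evidence = p_result_given_true * prior + p_result_given_false * (1 - prior)
    return p_result_given_true * prior / evidence

prior = 0.5  # uninformative starting point

# An informative positive result: much likelier if the phenomenon is real.
prior = update(prior, p_result_given_true=0.80, p_result_given_false=0.05)
print(prior)  # ~0.94

# A low-informative result: Pr(R|P) ~ Pr(R|not P), so the posterior barely moves.
prior = update(prior, p_result_given_true=0.50, p_result_given_false=0.48)
print(prior)  # ~0.94, essentially unchanged
```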
*** Granted, there was some fraud as well.