Tuesday, November 24, 2020

We don't need fraud to get a replication crisis

(A longer version of this is going in the technical notes for my book, so there are references here to it.)

We can use an analysis like the one in [book chapter] to understand the so-called ``replication crisis,'' of many research papers failing to replicate.

We'll sidestep the controversies about suspicion of fraud in specific results and show how the same random clusters that happen in car crashes and diseases can lead to this replication crisis (i.e. they create apparent positive results from data with no underlying phenomenon).

(Some authors of the papers that failed to replicate have admitted to various levels of fraud and several papers have been retracted due to strong indications of fraud. We're not disputing that fraud exists. But here we show that large numbers of well-intentioned researchers in a field, combined with a publication bias towards positive results, can generate false results just by probability alone.)

As we saw in the Statistics Are Weird Interlude, when using a sample to make judgments about the population, a traditional (frequentist) way to express confidence in the results is by testing them and reporting a confidence level. For historical reasons many fields accept 95% as that confidence level.

Reminder: The general public assumes that a 95% confidence level means that the phenomenon is true with probability $0.95$, given the test results. That's not how it works: the 95% refers to the probability of the test result given the phenomenon, not the other way around (in a test at that level, the probability of a positive result when the phenomenon doesn't exist is at most $0.05$). In other words, the public perception is a probability of the truth of the phenomenon conditional on the result of the test, while the real meaning is a probability of the result conditional on the truth of the phenomenon, and in general

\[ \text{[perception] } \Pr(\text{phenomenon}|\text{result}) \neq \Pr(\text{result}|\text{phenomenon}) \text{ [reality]}\]
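To see how far apart these two probabilities can be, here's a worked example with purely illustrative numbers (assumptions for the sake of the example, not estimates from any study): suppose only 1 in 100 of the hypotheses a field tests is actually true, the test detects a true phenomenon with probability $0.8$, and false positives happen with probability $0.05$. Then Bayes's rule gives

\[ \Pr(\text{phenomenon}|\text{positive result}) = \frac{0.8 \times 0.01}{0.8 \times 0.01 + 0.05 \times 0.99} \approx 0.14, \]

nowhere near the $0.95$ of the public perception.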

There's a schism in statistics about whether and how to use uninformative priors and chained Bayes rule ---as described in the Bayesian Interlude--- to deal with multiple experiments (the frequentists' approach being ``not to use them at all''). We'll sidestep those discussions here, but point out their implications further down. End of reminder.

So, the probability that we get a false positive from testing a data set for a non-existent phenomenon at the 95% confidence level is $p \le 0.05$, and we'll use the boundary value for our modeling, $p = 0.05$.

Let's say we have a team of scientists investigating the very important phenomenon of whether liking pineapple on pizza makes a person more likely to put real cream, rather than non-dairy creamer, in their coffee.

To investigate this important matter, the team chooses a sample of the population and runs some controlled experiment. Maybe the result is positive, or maybe it's negative; but since they have funding, the scientists can run more experiments, controlling for different variables (is the pizza deep-dish? do people put honey or sugar in the coffee with their cream/creamer? these are called covariates or, sometimes, controls). Say the team runs 100 experiments in total.

Now we make our important assumption, which we can do because we always play in God-mode: in reality there's no relationship between pineapple on pizza and the choice of cream or creamer in coffee.*

Because of that assumption, all the positives in the data will be false positives, and we know that those happen with $p = 0.05$, by our setup. So, in 100 experiments there should be an average of 5 false positives; the number of false positives itself is a random variable, say $N$, distributed Binomial$(100, 0.05)$, which is closely approximated by a Poisson distribution with parameter 5, the approximation we'll use. The next figure shows that distribution.



We can also ask some simple questions: how likely is the team to observe at least, say 7 positive results? How about 10? These are easily calculated:

\[ \Pr(N\ge 7) = 1-F_N(6) = 0.24 \quad\text{and}\quad \Pr(N\ge 10) = 1-F_N(9) = 0.03.\]

So, in almost one-quarter of the parallel universes that people use to interpret probabilities as frequencies there'll be seven or more positive results, but in only 3% will there be ten or more.
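For readers who want to check these numbers, here's a minimal sketch in R (base R only; the Poisson model is the one described above):

    # Number of false positives in 100 tests, modeled as Poisson with mean 5
    lambda <- 5

    # Pr(N >= 7) and Pr(N >= 10): one minus the CDF at 6 and at 9
    1 - ppois(6, lambda)   # ~0.24
    1 - ppois(9, lambda)   # ~0.03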

This may look like cherry-picking of experiments, but it's very easy for a well-meaning team to come up with reasons why an experiment that failed to produce the desired result had a design flaw. By contrast, the usual form of cherry-picking, selecting the data points within an experiment, is generally recognized by researchers as fraud.**

But wait, there's more…

Important scientific questions like the relationship of pineapple on pizza to real cream in coffee attract large numbers of people to scientific research; so there are other teams investigating this important phenomenon. When we consider that, we can ask how likely it is for us to see a paper with 7, 8, 9, or even 10 positive results, given a total number of teams $T$ investigating this question.

The next figure shows the probability of at least one team in $T$ finding at least $N$ positive results at a 95% confidence level when each team runs 100 experiments; remember that this is given that there's no phenomenon.



With 10 teams there's a 50:50 chance of a team finding 9 or more positive results. That's usually considered a very high level of cross validation (each experiment validating and being validated by at least 8 others). And it's all based on a non-existent phenomenon.
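Here's a minimal R sketch of how that figure's numbers can be computed, assuming each team's false-positive count is an independent Poisson with mean 5 (the function name is mine):

    # Pr(at least one of `teams` teams gets n or more false positives),
    # each team's count an independent Poisson with mean 5
    p_at_least_one <- function(n, teams, lambda = 5) {
      1 - ppois(n - 1, lambda)^teams
    }

    p_at_least_one(n = 9, teams = 10)   # roughly 0.5, the 50:50 chance quoted above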

Because scientific publications (and the mass media) tend to publish only positive results (results that show something different from what was expected), this publication bias together with the large numbers of teams $T$ and repeated experiments (here we used 100 experiments per team) can create the illusion of phenomena where none exist. Later, when these phenomena are assumed to be the default position, other researchers find them to be non-existent, and the replication crisis that we're seeing now ensues.

It's not a replication crisis, it's a ``too many phenomena were artifacts'' crisis.***



- - - -

* As this is a blog post and not the book or technical notes, here are my opinions on these weighty matters: pineapple on pizza, yes (but no pizza in the P:E diet, so it's a moot point); creamer, cream, honey, sugar, syrups, or any other adulterants in coffee, Heck no! 


** Note how the Bayesian approach deals with chains of experiments and non/low-informative results: the probability of the phenomenon given the result, $\Pr(P|R)$, given by Bayes's rule is

\[\Pr(P|R) = \frac{\Pr(R|P) \, \Pr(P)}{\Pr(R)}.\]

Bayesians can easily integrate new results with previous experiments through the $\Pr(P)$, therefore chaining inference instead of disposing of experiments that "didn't work." And a low-informative result, where $\Pr(R|P) \simeq \Pr(R)$, i.e. the result happens with almost the same probability whether or not the phenomenon is true, will be automatically accounted for as such by making $\Pr(P|R) \simeq \Pr(P)$, in other words negating its effect without disposing of its data as cherry-picking would.
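A quick numerical illustration of both points (the numbers are made up for the example): start from a prior $\Pr(P) = 0.5$. A first result $R_1$ with $\Pr(R_1|P) = 0.6$ and $\Pr(R_1|\neg P) = 0.3$ updates the belief to

\[ \Pr(P|R_1) = \frac{0.6 \times 0.5}{0.6 \times 0.5 + 0.3 \times 0.5} = 2/3. \]

A second, low-informative result $R_2$ with $\Pr(R_2|P) = \Pr(R_2|\neg P) = 0.5$ leaves it there: $\Pr(P|R_1, R_2) = \frac{0.5 \times 2/3}{0.5 \times 2/3 + 0.5 \times 1/3} = 2/3$. The chain keeps the data from the uninformative experiment without letting it distort the conclusion.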


*** Granted, there was some fraud as well.

Wednesday, November 18, 2020

A thought about the DANMASK study and presenting quantitative results


This post is about the analysis and presentation of the results, not about the substantive question of whether to wear masks. Link to the study

The main point here is that the way the results are presented, without a comparison counterfactual, makes it difficult to understand what the study really means:



So, without further ado, the results themselves. Cutting through a lot of important but uninteresting stuff, there are four numbers that matter:

Size of the sub-sample with masks: 2392

Number of infected in the mask sub-sample: 43

Size of the sub-sample with no masks: 2470

Number of infected in the no-mask sub-sample: 52

From these four numbers we can compute the incidence of infection given masks (1.8%) and no masks (2.1%). We can also test these numbers in a variety of ways, including using the disaggregate data to calibrate a logit model (no, I won't call it ``logistic regression''), but for now let's look at those two incidences only.


A likelihood ratio "test"


Here's a simple test we like: the likelihood ratio between two hypotheses, one in which both samples are drawn from a common incidence (1.95%) and one in which each sample is drawn from its own incidence. In other words, we want

\[ LR = \frac{\Pr(52 \text{ pos out of } 2470|p = 0.021)\, \Pr(43 \text{ pos out of } 2392|p = 0.018)}{\Pr(52 \text{ pos out of } 2470|p = 0.0195)\, \Pr(43 \text{ pos out of } 2392|p = 0.0195)}\]

Using log-space computations to get around precision problems, we get $LR = 1.083$.

In other words, it's only 8.3% more likely that the data comes from two groups with different incidences than from a group with a common incidence. To be even minimally convinced we'd want that likelihood ratio to be at least 20 or so, so $LR = 1.083$ supports the frequentist analysis that these numbers seem to come from the same population.

(Yes, this is the same test we apply to Rotten Tomatoes ratings.)
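For the curious, here's a minimal R sketch of that computation using binomial likelihoods in log-space (one reasonable modeling choice; the exact value depends on the likelihood and rounding choices, but it stays close to 1 and nowhere near 20):

    # Likelihood ratio: each sample with its own incidence vs. a common incidence
    n_mask   <- 2392; x_mask   <- 43
    n_nomask <- 2470; x_nomask <- 52

    p_mask   <- x_mask / n_mask                             # ~0.018
    p_nomask <- x_nomask / n_nomask                         # ~0.021
    p_common <- (x_mask + x_nomask) / (n_mask + n_nomask)   # ~0.0195

    # log-space to avoid precision problems, then exponentiate at the end
    log_lr <- dbinom(x_mask,   n_mask,   p_mask,   log = TRUE) +
              dbinom(x_nomask, n_nomask, p_nomask, log = TRUE) -
              dbinom(x_mask,   n_mask,   p_common, log = TRUE) -
              dbinom(x_nomask, n_nomask, p_common, log = TRUE)
    exp(log_lr)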


Presenting the results: make comparisons!


A problem with the paper is that the academese of the results is hard for many people to understand. One way to make the lack of effect of masks more obvious is to create a comparison with an alternative. We choose a simple 2:1 protection ratio, which is weak protection (a person wearing a mask has half the likelihood of infection of someone with no mask), but is enough to make the point.

Since we want to make a fair comparison, we need to use the same size population for both the mask and no-mask conditions (we'll choose 2400 as it's in the middle of those sample sizes) and infection rates similar to those of the test (we choose 1.25% and 2.5% for mask and no-mask, respectively). Now all we need to do is plot and compare:



(The more eagle-eyed readers will notice that these are Poisson distributions.)
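Here's a minimal R sketch of such a comparison figure; the four Poisson parameters are my reading of the setup above (observed incidences and the 2:1 hypothetical, all scaled to a common population of 2400), so treat them as assumptions rather than the exact values behind the original plot:

    # Four Poisson distributions on a common population of 2400:
    # the study's two observed incidences and the hypothetical 2:1 pair
    n <- 2400
    lambdas <- c(mask_observed   = n * 43 / 2392,   # ~43
                 nomask_observed = n * 52 / 2470,   # ~51
                 mask_2to1       = n * 0.0125,      # 30
                 nomask_2to1     = n * 0.025)       # 60

    k <- 0:100
    pmf <- sapply(lambdas, function(l) dpois(k, l))
    matplot(k, pmf, type = "l", lty = 1, lwd = 2,
            xlab = "number of infections", ylab = "probability")
    legend("topright", legend = names(lambdas), col = 1:4, lty = 1, lwd = 2)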

The comparison with a hypothetical, even one as basic as a 2:1 protection ratio, makes the point that the distributions on the left overlap a lot and therefore there's a fair chance that they come from the same population (in other words, that there's no difference in incidence of infection between the use and non-use of masks).


Bayesians (and more attentive frequentists) might note at this point that having non-significant differences isn't the same thing as having a zero effect size; and that a richer model (including the distributions of the estimates themselves, which are random variables) might be useful to drive policy.

But for now, the point is that those four lines in the figure are much easier to interpret than the word-and-number salad under the subheading "results" in the paper itself.


Tuesday, November 10, 2020

Why is it so hard for people to change their minds?

The usual explanation is that people discount information that contradicts their beliefs. Let's build a model to explore this idea. 

We analyze the evolution of beliefs of a decision-maker receiving outside information, and to start we'll assume that the decision-maker is rational in the Bayesian sense, i.e. uses Bayes's rule with the correct conditional probabilities to update beliefs.

We call the variable of interest $X$; it's binary: true/false, 1/0, red/blue. For the purposes of this discussion $X$ could be the existence of water on Pluto or whether the New England Patriots deflate their footballs.

There's a true $X \in \{0,1\}$ out there, but the decision-maker doesn't know it. We will denote the probability of $X=1$ by $p$, and since there's information coming in, we index it: $p[n]$ is the probability that $X=1$ given $n$ pieces of information $X_1 \ldots X_n$.

We start with $p[0] = 1/2$, as the decision-maker doesn't know anything. A piece of information $X_i$ comes in, with reliability $r_i$, defined as $\Pr(X_i = X) = r_i$; in other words false positives and false negatives have the same probability, $(1-r_i)$.

Using Bayes's rule to update information, we have

$p[i] = \frac{r_i \, p[i-1]}{r_i \, p[i-1] + (1-r_i)(1-p[i-1])}$, if $X_i = 1$ and

$p[i] = \frac{(1-r_i) \, p[i-1]}{r_i \, (1-p[i-1]) + (1-r_i) \, p[i-1]}$, if $X_i = 0$.

For illustration, let's have the true $X=1$ (so there's indeed water on Pluto and/or the Patriots do deflate their balls), and $r_i = r$, fixed for all $i$; with these definitions, $\Pr(X_i = 1) = r$. We can now iterate $p[i]$ using some random draws of $X_i$ consistent with that $r$; here are some simulations of the path of $p[i]$, three each for $r = 0.6, 0.7, 0.8$.*
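Here's a minimal R sketch of that simulation (the function name, number of steps, and seed are arbitrary choices of mine):

    # Bayesian belief updating with symmetric reliability r; the truth is X = 1
    simulate_beliefs <- function(r, steps = 60, p0 = 0.5) {
      p <- numeric(steps + 1)
      p[1] <- p0
      for (i in 1:steps) {
        x <- rbinom(1, 1, r)   # Pr(X_i = 1) = r, because X = 1
        p[i + 1] <- if (x == 1) {
          r * p[i] / (r * p[i] + (1 - r) * (1 - p[i]))
        } else {
          (1 - r) * p[i] / ((1 - r) * p[i] + r * (1 - p[i]))
        }
      }
      p
    }

    set.seed(42)   # arbitrary seed, just for reproducibility
    matplot(sapply(c(0.6, 0.7, 0.8), simulate_beliefs),
            type = "l", lty = 1, xlab = "i", ylab = "p[i]")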



Essentially, truth wins out eventually. The more reliable the information, the faster the convergence. So, that whole "it's easier to fool someone than to get them to realize they were fooled" was wrong, wasn't it?

Only if people are Bayesian updaters with accurate perception of the reliability. In particular, when they don't let their beliefs bias that perception.

Uh-oh!

Let us consider the case of biased perception. The simplest approach is to consider that the decision-maker's perception of reliability depends on whether the $X_i$ is in support or against current beliefs.

For simplicity the true reliability of information will still be a constant, denoted $r$; but the decision-maker uses an $r_i$ that depends on $X_i$ and $p[i-1]$: if they agree (for example $p[i-1]>1/2$ and $X_i = 1$), then $r_i = r$; if they don't (for example $p[i-1]<1/2$ and $X_i = 1$), then $r_i = (1-r)$.

Note that the $X_i$ are still generated by a process that has $\Pr(X_i = 1) = r$, but now the decision-maker's beliefs are updated using the $r_i$, which are only correct  ($r_i = r$) for draws of $X_i$ that are consistent with the beliefs $p[i-1]$, and are precisely opposite ($r_i = 1-r$) otherwise.

To illustrate this biased behavior, in the following charts we force $X_1 = 0$ (recall that $X=1$), so that the decision-maker starts with the wrong information.
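And a sketch of the biased version; the tie-break at exactly $p[i-1] = 1/2$ is my assumption (the text only defines the two strict cases), and the first draw is forced to $X_1 = 0$ as described:

    # Biased updating: the perceived reliability depends on whether X_i
    # agrees with the current belief; the first draw is forced to X_1 = 0
    simulate_biased <- function(r, steps = 60, p0 = 0.5) {
      p <- numeric(steps + 1)
      p[1] <- p0
      for (i in 1:steps) {
        x <- if (i == 1) 0 else rbinom(1, 1, r)
        agrees <- (x == 1 && p[i] >= 0.5) || (x == 0 && p[i] <= 0.5)
        ri <- if (agrees) r else 1 - r   # biased perceived reliability
        p[i + 1] <- if (x == 1) {
          ri * p[i] / (ri * p[i] + (1 - ri) * (1 - p[i]))
        } else {
          (1 - ri) * p[i] / ((1 - ri) * p[i] + ri * (1 - p[i]))
        }
      }
      p
    }

The paths can be plotted the same way as in the unbiased sketch above.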



These are just a few of the many paths, but they illustrate three elements that tended to be common across most simulations:

  1. There's a lot more volatility in the beliefs, and much slower convergence. Sometimes, like the middle case with $r=0.8$, there's a complete flip from $p[i] \simeq 0$ to a quick convergence to the true $p[i] \simeq 1$; this was rare but worth showing one example in the image.
  2. There are many cases when the decision-maker stays very close to the wrong $p[i]\simeq 0$ for very long periods (sometimes for the total length of the simulation, 1000 steps; the graphs are for the first 60 because that was enough for illustration).
  3. The higher the reliability the more volatile the results can be, unlike in the case with fixed $r_i$. In general increasing reliability $r$ didn't help much with convergence or stability.

So, when people bias their perceptions (which might come from distorting the perceived reliability, as simulated here, or from ignoring information that contradicts their beliefs, which has a similar effect), it takes a lot of counteracting new information to undo the effect of bad early information (the $X_1 = 0$).

Lucky for us, people in the real world don't have these biases, they're all perfect Bayesians. Otherwise, things could get ugly.

😉


- - - - 

* In case it's not obvious, the effect of $r_i$ is symmetric around 0.5: because of how the information is integrated, the patterns for $r=0.6$ and $r=0.4$ are identical. As a consequence, when $r=0.5$ there's no learning at all and the decision-maker never moves away from $p = 0.5$.

OBSERVATION: There are also strategic and signaling reasons why people don't publicly change their minds, because that can be used against them by other players in competitive situations; but that's a more complicated — and to some extent trivial, because obvious in hindsight — situation, since it involves the incentives of many decision-makers and raises questions of mechanism design.

Sunday, November 1, 2020

Why can't we have civil discussions?

Partly, or mostly, because some people profit handsomely from the lack of civil discourse, and their audiences, to a large extent, like the demonizing of the opposition.


Monetizing the echo chamber versus persuading the opposition


To illustrate these two approaches to being a public intellectual [broadly defined], consider the difference between Carl Sagan and Person 2, both of whom were not fans of religion:

[Vangelis music plays] Carl Sagan: The Cosmos is all that is, or ever was, or ever will be. [...] The size and age of the Cosmos are beyond normal human understanding. Lost somewhere between immensity and eternity is our tiny planetary home, the Earth. [...] In the last few millennia we've made astonishing discoveries about the Cosmos and our place in it.
Person 2 [who will remain anonymous] wrote the following in a book:
The God of the Old Testament is arguably the most unpleasant character in all fiction: jealous and proud of it; a petty, unjust, unforgiving control-freak; a vindictive, bloodthirsty ethnic cleanser; a misogynistic, homophobic, racist, infanticidal, genocidal, filicidal, pestilential, megalomaniacal, sado-masochistic, capriciously malevolent bully.
Carl Sagan welcomed all into his personal voyage (the subtitle of the Cosmos show) to explore the Cosmos. Sagan's approach was a non-confrontational way to make religious people question, for example, young-Earth creationism. Sagan was persuading the opposition.

Person 2 wrote books (and gave speeches and appeared on television programs) to make his audience feel superior to the religious, just for not being religious; this is what is called monetizing the echo chamber.

Some public intellectuals are more interested in monetizing their echo chamber than in persuading the opposition. This has dire consequences for public discourse: it aggravates divisions, demonizes positions and people, and makes cooperation difficult across party lines.

But it appeals to some of the darker parts of human nature and is easier than persuading the opposition, so a lot of people seem to prefer it (both speakers/authors and audiences). And since it tends to demonize the opposition, it also gives that opposition both incentives and material to respond in the same fashion.

In other words, it's a self-reinforcing mechanism and one that makes us all worse-off in the long term, as divisions and demonizations aren't conducive to civil and productive societies.



A simple model of this behavior


Let's define the neutral position (as in we don't badmouth anyone and stick to the science) as the zero point on some scale, and consider the evolution of the discourse by two opposing factions, $A$ and $B$. Movement away from zero means that these factions are more and more focused on monetizing their echo chamber rather than convincing those who may be neutral or even believe in the opposite position.

We model the evolution of the discourse by a discrete stochastic process (i.e. the positions change with time but there are some random events that affect their movement). Each side has a drift $D$ in the direction of their position and also a tendency to respond to the other side's positions. To model this response, we'll use a tit-for-tat approach, in which the response at a given time is proportional to what the other side's position was in the previous time. In other words, we have the following dynamics:

$A[i+1] = A[i] + D  - S \times B[i] + \epsilon_{Ai} \qquad\text{with } \epsilon_{Ai} \sim \mathrm{N}(0, \sigma)$

and 

$B[i+1] = B[i] - D - S \times A[i] + \epsilon_{Bi} \qquad\text{with } \epsilon_{Bi} \sim \mathrm{N}(0, \sigma)$

With this formulation $A$ tends to drift to positive values and $B$ tends to drift to negative values, assuming that $D$ is positive. The following figure shows how the dynamics evolve for some values of the $D$ and $S$ parameters.


Note how in all cases we have divergence from zero (i.e. monetizing of the echo chamber, to the exclusion of persuading the opposition) and in particular how the increase in tit-for-tat sensitivity leads to acceleration of this process.


(The point of making a model like this [a theoretical model] is to see what happens when we change the parameters. I recommend readers make their own versions to play around with; mine were made in the R programming language on RStudio and Tuftized/prettified for display on Apple Keynote, but it's easy enough to set that model up in a spreadsheet.) 
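In that spirit, here's a minimal R version along the same lines; the function name and the $D$, $S$, and $\sigma$ defaults are placeholders to play with, not the values used for the figure above:

    # Discourse dynamics: drift D toward each side's pole, tit-for-tat
    # sensitivity S to the other side's previous position, Gaussian noise
    simulate_discourse <- function(D = 0.1, S = 0.05, sigma = 0.1, steps = 100) {
      A <- B <- numeric(steps)   # both sides start at the neutral zero
      for (i in 1:(steps - 1)) {
        A[i + 1] <- A[i] + D - S * B[i] + rnorm(1, 0, sigma)
        B[i + 1] <- B[i] - D - S * A[i] + rnorm(1, 0, sigma)
      }
      cbind(A = A, B = B)
    }

    set.seed(1)   # arbitrary seed
    matplot(simulate_discourse(), type = "l", lty = 1,
            xlab = "time", ylab = "position")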


What can we do?


Since this is a systemic problem (one of large-scale incentives and human nature), there's little that can be done with individual actions, but there are a few good starting points:

  1. Avoid giving attention (much less money) to people who are monetizing the echo chamber, that is, people who are deliberately making our discourse less civil for profit.
  2. Focus on positives rather than negatives: despite the popularity of Person 2, there are plenty of science popularizers who focus, as Carl Sagan did, on the majesty of the Universe and the power of science and technology.
  3. Point out this difference, between monetizers of the echo chamber and persuaders of the opposition, to other people, so that they too can start focusing on the positive.
  4. Reflect on what we do, ourselves: do we focus on the negative or the positive? (This has led to a lot fewer tweets on my part, as I either stop snarky responses before I tweet them or delete them soon after realizing they aren't helping.)*




- - - -
* Note that pointing out a technical or modeling error is not covered by this: if someone says 2+2=5, the right thing is not to make fun of them for it, but rather correct them. However, ignoring the error is not positive; the error must be corrected, though the person making it need not be demonized (it's counterproductive to demonize people who are wrong, as that leads to defensiveness). Carl Sagan in the example above repeatedly undermines young-Earth creationism without ever mentioning it. 


ORIGIN: This post was motivated by a video on the Covid-19 situation by a person with whom I would agree about 95% on substance. But that video was such a collection of out-group demonizing and in-group pandering that it did change one mind: mine. Not about the substantive matters, but about subscribing to or otherwise following the social media of people whose entire schtick is monetizing their echo chamber.