Sunday, April 12, 2020

Random clusters: how we make too much of coincidences

Understanding randomness and coincidence


If we flip a fair coin 1000 times, how likely is it that we see a sequence of 10 tails somewhere?

Give that problem a try, then read on.

We start by computing the probability that a sequence of ten flips is all tails:

$\Pr(\text{10 tails in a row}) = 2^{-10} = 1/1024$

There are 991 windows of 10 consecutive flips in 1000 flips (starting at positions 1 through 991), so we might be tempted to say that the probability of at least one all-tails window is 991/1024.

This is obviously the wrong answer, because if the question had been for a sequence of 10 tails in 2000 flips the same logic would yield a probability of 1991/1024, which is greater than 1.

(People make this equivalence between event disjunction and probability addition all the time; it's wrong every single time they do it. The rationale, inasmuch as there's one, is that "and" translates to multiplication, so they expect "or" to translate to addition; it doesn't.)

The correct probability calculation starts by asking for the probability that none of those 991 ten-flip windows is all tails. Treating the windows as if they were independent, that probability is

$(1 - \Pr(\text{10 tails in a row}))^{991}$.

This is the probability of the complement of the event we want (at least one run of 10 tails in 1000 flips), which makes the probability of the event we want

$1 - (1 - \Pr(\text{10 tails in a row}))^{991} \approx 0.62$.

So, by this estimate, 62% of the time a 1000-flip sequence will show, somewhere, a run of 10 tails. (The windows overlap, so they aren't really independent; an exact computation puts the probability at about 0.38. Either way, the "one in a thousand" event is hundreds of times more likely than 1/1024 suggests.)
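These numbers can be checked directly. Here is a short Python sketch (the post's own computations use R; Python is used here purely for illustration, and `prob_tail_run` is a name chosen for this example) that computes both the independence approximation and the exact probability, the latter via a Markov chain whose state is the length of the current tail run:

```python
# Probability of at least one run of 10 tails in 1000 fair coin flips.

def prob_tail_run(n_flips=1000, run_len=10):
    """Exact probability, via a Markov chain whose state is the length
    of the current tail run (absorbing once a full run has occurred)."""
    state = [1.0] + [0.0] * (run_len - 1)  # state[i]: P(current run has length i)
    hit = 0.0                              # P(a full run has already occurred)
    for _ in range(n_flips):
        new = [0.0] * run_len
        new[0] = 0.5 * sum(state)          # heads resets the run to length 0
        for i in range(run_len - 1):
            new[i + 1] = 0.5 * state[i]    # tails extends the run by one
        hit += 0.5 * state[-1]             # tails at length 9 completes a run
        state = new
    return hit

approx = 1 - (1 - 2 ** -10) ** 991         # independence approximation, about 0.62
exact = prob_tail_run()                    # exact value, about 0.38
print(round(approx, 3), round(exact, 3))
```

The exact value comes out near 0.38, below the independence approximation's 0.62, because overlapping windows aren't independent (a run of eleven tails contains two all-tails windows, so the approximation double-counts); either way, the supposedly one-in-a-thousand pattern is quite likely to appear.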

Now consider Bob, a journalist, who sees those 10 tails and hurries home to write about the "one in a thousand" event that he just witnessed, demanding a coin fairness validation authority, and 4 trillion dollars in QE for the Fed, 1.5 trillion dollars for politically connected businesses, 1100 pages of legislation containing every legislator's wishlist, an app on everyone's phone, and an embedded chip in everyone's wrist to track their vaccine statu… I mean to check for fairness in coin flips.

And this brings us to the problem of random clustering.


Random clusters


Say there's an event that happens with probability 14.5/100,000 per person-year (that's the flu death rate for California in 2018, according to the CDC).

What's the probability that in a given year we see a cluster of more than 30 events in a single week (roughly 50% above the weekly average, as we'll see) in the San Francisco Bay Area (population 7,000,000 for this example)? How about more than 40 events (about twice the average)?

Try it for yourself.

Done?

Sure you don't want to try to do it first?

Okay, here we go.

First we need the distribution of the number of events per week. Assume the 14.5/100,000 per person-year is spread uniformly over the 52 weeks. (It isn't; the deaths happen mostly during flu season, when the weekly average is higher. But the uniform assumption is the conservative choice, since it makes large weekly clusters look rarer, which strengthens the case for random clusters.) The individual event probability is then

$p = 14.5/(100,000 \times 52) = 0.0000027885$ events per person-week.

With 7,000,000 people and such a tiny per-person probability, the binomial number of events per week is very well approximated by a Poisson distribution, so that's what we'll use. (The Poisson distribution is unbounded above, but the error we make by assuming it is trivial, given that the cluster sizes we care about are so small compared to the size of the population.)

If the probability of an individual event per person-week is $p$, then the average number of events per week, which is also the parameter of the Poisson distribution, is $\mu = 7{,}000{,}000 \times p = 19.51923$.

Our number of events per week, a random variable $P$, will be distributed Poisson with parameter $\mu$. The probability that a given week has more than $N$ events is $1-F_P(N)$, where $F_P(N)$ is the c.d.f. of $P$ evaluated at value $N$; hence the probability of a year with at least one week where the number of events (we'll call that the cluster size) exceeds $N$ is

$1- F_P(N)^{52}$, where $P \sim \mathrm{Poisson}(19.52)$.

We can now use the miracle of computers and spreadsheets (or the R programming language, which is my preference) to compute and plot the probability that there's a cluster size of at least $N$ events on any week in one year:
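For instance, here is a plain-Python equivalent of that computation (the Poisson c.d.f. is hand-rolled so no packages are needed; `prob_cluster_in_year` is a name chosen here for illustration):

```python
import math

MU = 7_000_000 * 14.5 / (100_000 * 52)   # about 19.52 expected events per week

def poisson_cdf(n, mu):
    """P(X <= n) for X ~ Poisson(mu), summing the pmf term by term."""
    term, total = math.exp(-mu), 0.0     # term starts at pmf(0)
    for k in range(n + 1):
        total += term
        term *= mu / (k + 1)             # pmf(k+1) from pmf(k)
    return total

def prob_cluster_in_year(n, mu=MU, weeks=52):
    """Probability that at least one week of the year has more than n events."""
    return 1 - poisson_cdf(n, mu) ** weeks

print(prob_cluster_in_year(30))   # about 0.4    (a week 50%+ above the average)
print(prob_cluster_in_year(40))   # about 0.0008 (a week about twice the average)
```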


So, from randomness alone, we expect that with 40% probability there will be a week with a cluster 50% larger than the average somewhere in the Bay Area, and with 0.08% probability a week with a cluster twice as large.

That 0.08% looks very small… until we realize that there are thousands of townships in the US. If each has a similar probability (different in detail, since populations and death rates differ, but generally of the same order of magnitude, so we'll just use the same number for illustration), the aggregate probability for, say, 1000 townships is

$1- (1-0.0008)^{1000} = 0.55$.
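As a quick arithmetic check of that aggregation (assuming, as the text does, that the 1000 townships are independent):

```python
p_township = 0.0008                      # per-township, per-year cluster probability
p_any = 1 - (1 - p_township) ** 1000     # at least one township shows such a cluster
print(round(p_any, 2))                   # 0.55
```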

A cherry-picking person can, with some effort, find one such cluster in at least half of all years. And that's the problem: everything computed here is driven by pure randomness, yet clusters like these are routinely interpreted as the product of some other process.

"After all, could randomness really cause such outliers?"

Yes. Randomness was the only process assumed above, all numbers were driven by the same process, and those so-called outliers were nothing but the result of the same processes that create the averages.

Remember that the next time people point at scary numbers like 50% or 100% above average.



Mathematical note (illustration, not proof)


A small numerical illustration, supporting the assumption above that the various townships, with their different $\mu$, have similar (not equal) probabilities for clusters of a given size relative to their own averages:
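The chart itself isn't reproduced here, but the idea can be sketched numerically in Python (the $\mu$ values and the 1.5 cutoff below are hypothetical choices for illustration): for several weekly averages $\mu$, compute the probability that some week in the year sees more than $1.5\mu$ events.

```python
import math

def poisson_cdf(n, mu):
    """P(X <= n) for X ~ Poisson(mu), summing the pmf term by term."""
    term, total = math.exp(-mu), 0.0
    for k in range(n + 1):
        total += term
        term *= mu / (k + 1)
    return total

def p_year_cluster(mu, ratio=1.5, weeks=52):
    """P(at least one week with more than ratio*mu events) in a year."""
    n = math.ceil(ratio * mu)
    return 1 - poisson_cdf(n, mu) ** weeks

for mu in (10, 19.52, 30):
    print(mu, round(p_year_cluster(mu), 3))
```

The probabilities differ with $\mu$ (smaller towns fluctuate more, relative to their own average), but for averages within a factor of a few of each other they stay within roughly an order of magnitude of each other, which is the sense in which the townships are "similar".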


Sunday, April 5, 2020

An irritating logical error that permeates the COVID-19 discussion

Apparently many people in influential media outlets don't know that the proposition
"most patients who die of COVID19 have [a given precondition]"
doesn't imply the proposition
"most COVID19 patients with [a given precondition] die."
This is a logic error; it's made in all sorts of news outlets and social media right now. 

Don't take my word that this is a logic error; here's a counter-example:
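One can also construct a counter-example in a few lines of Python, with made-up numbers (the cohort below is hypothetical):

```python
# Hypothetical cohort: 1000 patients, 800 of whom have precondition X.
patients_with_x = 800
deaths = 10
deaths_with_x = 8

# "Most patients who die have X" -- true: 80% of the deaths.
share_of_deaths_with_x = deaths_with_x / deaths
# "Most patients with X die" -- false: only 1% of them do.
fatality_rate_given_x = deaths_with_x / patients_with_x

print(share_of_deaths_with_x, fatality_rate_given_x)   # 0.8 0.01
```

When a precondition is common in the population, it can account for most of the deaths without being especially deadly; the two conditional probabilities have different denominators.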


Saturday, April 4, 2020

Sampling on the dependent variable

I think "sampling on the dependent variable" is about to beat "ignoring hidden factor correlation" as the most common data analysis error in the wild.

Walter is a very popular professor of physics at the Boston Institute of Technology [name of institution cleverly disguised for legal reasons], who teaches a 400-student class in the largest classroom at BIT. The other 20 professors of physics are very boring, so at any given time they have on average 5 students each in their classes.

A journalist stands at the door of the physics department and asks every fourth student how full their physics classes are. The results are as follows:

- 100 students say that their class was completely full;
- 25 students say that their class was mostly empty.

This is reported as "4 out of 5 classes are full at BIT; new building needed to address the lack of space, new faculty must be hired urgently."

Did you see the error? It's subtle.

Here's a visualization of the process to help see it:


In fact, only one class is full. The problem is that the likelihood of a student being in the sample (a random sample of students coming out of the building) is proportional to the variable of interest (the number of students in the class); in other words, the journalist is sampling on the dependent variable.

The more full a class is, the more over-represented that class will be in the sample of students.
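The journalist's tally can be reproduced directly from the numbers above (a Python sketch; the variable names are mine):

```python
# One full class of 400 students; twenty boring classes of ~5 students each.
sizes = [400] + [5] * 20
sample = [s / 4 for s in sizes]   # interviewing every 4th student samples
                                  # each class in proportion to its size

full_reports = sample[0]          # students who report a full class
empty_reports = sum(sample[1:])   # students who report a mostly empty class
print(full_reports, empty_reports)                    # 100.0 25.0
print(full_reports / (full_reports + empty_reports))  # 0.8, i.e. "4 out of 5"
```

Four out of five *students* report a full class, even though only one of the 21 *classes* is full.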

This looks like some rare error, the kind of thing that would only happen to hapless journalists, except that it happens all the time and in serious circumstances.

Consider the case of health authorities trying to determine the seriousness of a condition, namely how many of the people with the condition die. They could count the cases that get tested and compute the fraction of those that die. (That's what most of the preliminary COVID-19 case fatality rate numbers in the media are.)

And that's the same error that the journalist made.

In this case, the dependent variable is not the size of the class, it's the seriousness of the disease, and the sampling problem is not with the number of students in a class, it's with which people choose to get tested. People choose to get tested (or get tested at a hospital when admitted) because they have symptoms that make them take the trouble.

In other words, the more serious patient P's disease is, the more likely P is to be tested (sampling on the dependent variable, again), and more of these tested patients will die than if the testing were done on a random sample of the population.

(This is different from the truncation argument made in this previous post. Truncation is also a type of sampling on the dependent variable; a form that is easier to correct, as the non-truncated part of the sample distribution is the same as the population distribution up to scaling.)

To illustrate the effect of different degrees of sampling on the dependent variable, let us consider the case of a uniformly distributed variable (in the population) and different degrees of sampling:


Let's consider two persons, A with $x_A=0.2$ and B with $x_B=0.4$. With correct, random sampling, A and B would have an equal chance of being in the sample. With sampling proportional to $x$, B would be twice as likely to be in the sample as A, biasing the sample average upwards relative to the population; with sampling proportional to $x^2$, B would be four times as likely to be in the sample, which would bias the sample average even more.

To put this $x^2$ in the context of, for example, COVID-19 tests, a sampling proportional to $x^2$ means that people in a group X with symptoms twice as bad as people in a group Y will be four times more likely to seek treatment (and be tested). Basically, each person in group X will be counted four times more often in the statistics than each person in group Y (the group with less serious symptoms).

(The other degrees, $x^3$ and $x^4$, capture cases where people avoid the hospital unless their symptoms are serious or very serious.)

As we can see from the charts in the image above, the more distortion of the underlying population distribution the sampling process creates, the higher the sample average, and all the while the population average stays at a constant 1/2.
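For a uniform(0,1) population the effect has a closed form: sampling with weight proportional to $x^k$ moves the sample mean from $1/2$ up to $(k+1)/(k+2)$, since $\int_0^1 x \cdot x^k\,dx \,/ \int_0^1 x^k\,dx = (k+1)/(k+2)$. A quick numerical check (Python; `biased_mean` is a name chosen here for illustration):

```python
def biased_mean(k, steps=100_000):
    """Mean of a uniform(0,1) variable under sampling weight proportional to x**k,
    computed by midpoint-rule numerical integration."""
    dx = 1.0 / steps
    xs = [(i + 0.5) * dx for i in range(steps)]   # midpoint grid on (0, 1)
    weight_total = sum(x ** k for x in xs)
    return sum(x * x ** k for x in xs) / weight_total

for k in range(5):
    print(k, round(biased_mean(k), 3))   # 0.5, 0.667, 0.75, 0.8, 0.833
```

The population mean stays at 1/2 throughout; only the sampling weights change, and the sample mean climbs toward 1 as the bias sharpens.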

Sampling on the dependent variable: something to keep in mind when people talk about dire situations in the news.