
Sunday, April 12, 2020

Random clusters: how we make too much of coincidences

Understanding randomness and coincidence


If we flip a fair coin 1000 times, how likely is it that we see a sequence of 10 tails somewhere?

Give that problem a try, then read on.

We start by computing the probability that a sequence of ten flips is all tails:

$\Pr(\text{10 tails in a row}) = 2^{-10} = 1/1024$

There are 991 sequences of 10 consecutive flips in 1000 flips, so we might be tempted to say that the probability of at least one sequence of 10 tails is 991/1024.

This is obviously the wrong answer, because if the question had been for a sequence of 10 tails in 2000 flips the same logic would yield a probability of 1991/1024, which is greater than 1.

(People make this equivalence between event disjunction and probability addition all the time; it's wrong every single time they do it. The rationale, inasmuch as there's one, is that "and" translates to multiplication, so they expect "or" to translate to addition; it doesn't.)
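
(For example, the probability of at least one tail in two flips of a fair coin is $1 - (1/2)^2 = 3/4$, not $1/2 + 1/2 = 1$.)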

A better calculation starts by asking for the probability that none of those 991 ten-flip sequences is all tails; treating the sequences as independent, that is

$(1 - \Pr(\text{10 tails in a row}))^{991}$.

This is the probability of the complement of the event we want (at least one sequence of 10 tails in 1000 flips), which makes the probability of the event we want

$1 - (1 - \Pr(\text{10 tails in a row}))^{991} = 0.62$.

Strictly speaking the overlapping sequences aren't independent, so this overstates the true probability; an exact run-length calculation (or the simulation sketched below) puts it at about 39%. Either way, a run of 10 tails somewhere in 1000 flips is a common occurrence, not a one-in-a-thousand event.
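
A quick sketch in R of both numbers, the independence approximation and a brute-force simulation (20,000 replications is an arbitrary choice of mine):

p10 <- 0.5^10                     # probability that a given 10-flip window is all tails
windows <- 1000 - 10 + 1          # 991 overlapping windows
1 - (1 - p10)^windows             # ~0.62, treating the windows as independent

set.seed(1)
runs <- replicate(20000, {
  flips <- rbinom(1000, 1, 0.5)   # 1 = tails, 0 = heads
  r <- rle(flips)
  any(r$lengths[r$values == 1] >= 10)
})
mean(runs)                        # ~0.39, the probability of an actual run of 10 tails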

Now consider Bob, a journalist, who sees those 10 tails and hurries home to write about the "one in a thousand" event that he just witnessed, demanding a coin fairness validation authority, 4 trillion dollars in QE for the Fed, 1.5 trillion dollars for politically connected businesses, 1100 pages of legislation containing every legislator's wishlist, and an app on everyone's phone and an embedded chip in everyone's wrist to track their vaccine statu… I mean, to check for fairness in coin flips.

And this brings us to the problem of random clustering.


Random clusters


Say there's an event that happens with probability 14.5/100,000 per person-year (that's the flu death rate for California in 2018, according to the CDC).

What's the probability that in a given year we see a week with a cluster of more than 30 events (roughly 50% above the weekly average, which we'll compute below) in the San Francisco Bay Area (population 7,000,000 for this example)? How about more than 40 events (roughly twice the average)?

Try it for yourself.

Done?

Sure you don't want to try to do it first?

Okay, here we go.

First we need to see what the distribution of the number of events per week is. Assuming that the 14.5/100,000 per person-year is distributed uniformly over the 52 weeks (which it isn't, but by choosing this we make the case for random clusters stronger, because the deaths happen mostly during flu season and the average during flu season is higher), the individual event probability is

$p = 14.5/(100,000 \times 52) = 0.0000027885$ events per person-week.

A population of 7,000,000 is much bigger than 30, so we use the Poisson distribution for the number of events/week. (The Poisson distribution is unbounded, but the error we make by assuming it is trivial, given that the cluster sizes we care about are so small compared to the size of the population.)

If the probability of an individual event per person-week is $p$, then the average number of events per week is $\mu = 7000000 \times p = 19.51923$; helpfully, this average is also the parameter of the Poisson distribution.

Our number of events per week, a random variable $P$, will be Poisson-distributed with parameter $\mu$. The probability that a given week has more than $N$ events is $(1-F_P(N))$, where $F_P(N)$ is the c.d.f. of $P$ evaluated at $N$; hence the probability of a year with at least one week where the number of events (we'll call that the cluster size) exceeds $N$ is

$1- F_P(N)^{52}$ where $P \sim \mathrm{Poisson}(19.52)$.

We can now use the miracle of computers and spreadsheets (or the R programming language, which is my preference) to compute and plot the probability that there's a cluster size of at least $N$ events on any week in one year:
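
In R, something along these lines does it (the range of cluster sizes to plot is an arbitrary choice of mine):

p  <- 14.5 / (100000 * 52)        # events per person-week
mu <- 7000000 * p                 # expected events per week, ~19.52
N  <- 20:45                       # candidate cluster sizes
p_year <- 1 - ppois(N, mu)^52     # P(at least one week in the year exceeds N events)
plot(N, p_year, type = "b",
     xlab = "cluster size N", ylab = "P(some week in the year exceeds N)")
p_year[N == 30]                   # ~0.40, the 40% below
p_year[N == 40]                   # ~0.0008, the 0.08% below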


So, from randomness alone, we expect that with 40% probability there'll be one week with a cluster more than 50% larger than the average in the Bay Area, and with 0.08% probability one more than twice as large.

That 0.08% looks very small… until we realize that there are thousands of townships in the US, and if each of them has a similar probability (different in detail because populations and death rates are different, but generally in the same order of magnitude, so we'll just use the same number for illustration), the aggregate number for, say, 1000 townships will be

$1- (1-0.0008)^{1000} = 0.55$.
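
A one-line check in R:

1 - (1 - 0.0008)^1000   # ~0.55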

Any cherry-picking person can, with some effort, find one such cluster in at least half the years. And that's the problem: everything said here is driven by pure probability, yet the clusters are invariably interpreted as being caused by some other process.

"After all, could randomness really cause such outliers?"

Yes. Randomness was the only process assumed above, all numbers were driven by the same process, and those so-called outliers were nothing but the result of the same processes that create the averages.

Remember that the next time people point at scary numbers like 50% or 100% above average.



Mathematical note (illustration, not proof)


A small numerical illustration to support the assumption above that townships with different $\mu$ have similar (not equal) cluster probabilities, shown as a function of the cluster size relative to each township's own average:
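
One way to generate such an illustration in R (the $\mu$ values and the range of ratios below are arbitrary picks of mine, just to show the shape):

mus   <- c(15, 20, 25, 30)             # hypothetical weekly averages for different townships
ratio <- seq(1.2, 2.2, by = 0.1)       # cluster size relative to each township's own average
tails <- sapply(mus, function(mu) 1 - ppois(ceiling(ratio * mu), mu))
matplot(ratio, tails, type = "b", log = "y",
        xlab = "cluster size / average", ylab = "P(a given week exceeds that size)")
legend("topright", legend = paste("mu =", mus), col = 1:4, lty = 1:4)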


Sunday, April 5, 2020

An irritating logical error that permeates the COVID-19 discussion

Apparently many people in influential media outlets don't know that the proposition
"most patients who die of COVID19 have [a given precondition]"
doesn't imply the proposition
"most COVID19 patients with [a given precondition] die."
This is a logic error; it's made in all sorts of news outlets and social media right now. 

Don't take my word that this is a logic error; here's a counter-example:
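
Take some made-up round numbers, chosen only to make the arithmetic obvious (in R, though a napkin works just as well):

patients       <- 1000000   # hypothetical: all COVID19 patients
with_precond   <- 100000    # hypothetical: patients with the precondition
deaths         <- 1000      # hypothetical: total deaths
deaths_precond <- 800       # hypothetical: deaths among patients with the precondition

deaths_precond / deaths         # 0.8: most patients who die have the precondition
deaths_precond / with_precond   # 0.008: yet less than 1% of patients with the precondition die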


Sunday, March 1, 2020

Fun with COVID-19 Numbers for March 1, 2020

NOTA BENE: The coronavirus COVID-19 is a serious matter and we should be taking all reasonable precautions to minimize contagion and stay healthy. But there's a lot of bad quantitative thinking that's muddling the issue, so I'm collecting some of it here.


Death Rate I: We can't tell, there's no good data yet.


This was inspired by a tweet by Ted Naiman, MD, whose Protein-to-Energy ratio analysis of food I credit for at least half of my weight loss (the other half I credit to P. D. Mangan, for the clearest argument for intermittent fasting, which convinced me); so this is not about Dr Naiman's tweet, just that his was the tweet I saw with a variation of this proposition:

"COVID-19 is 'like the flu,' except the death rate is 30 to 50 times higher."

But here's the problem with that proposition: we don't have reliable data to determine that. Here are two simple arguments that cast some doubt on the proposition:

⬆︎ How the death rate could be higher: government officials and health organizations under-report the number of deaths in order to contain panic or to minimize criticism of government and health organizations; it's also possible that some deaths from COVID-19 are attributed to conditions that were aggravated by COVID-19, for example being reported as deaths from pneumonia.

⬇︎ How the death rate could be lower: people with mild cases of COVID-19 don't report them and treat themselves with over-the-counter medication (to avoid being taken into forced quarantine, for example); hence the cases known to the health organizations are biased towards the more serious cases, which are more likely to end in death.

How much we believe the first argument applies depends on how much we trust the institutions of the countries reporting, and... you can draw your own conclusions!

To illustrate the second argument, consider the incentives of someone with flu-like symptoms and let's rate their seriousness or aversiveness, $a$, as a continuous variable ranging from zero (no symptoms) to infinity (death). We'll assume that the distribution of $a$ is an exponential, to capture thin tails, and to be simple let's make its parameter $\lambda =1$.

Each sick patient will have to decide whether to seek treatment other than over-the-counter medicine, but depending on the health system that might come with a cost (being quarantined at home, or being quarantined in "sick wards," for example); let's call that cost, on the same scale of aversiveness, $c$.

What we care about is how the average aversiveness that is reported changes with $c$. Note that if everyone reported their $a$, that average would be $1/\lambda = 1$, but what we observe is a self-selected subset, so we need $E[a | a > c]$, which we can compute easily, given the exponential distribution, as

\[
E[a | a > c]
=
\frac{\int_{c}^{\infty} a \, f_A(a) da }{1 - F_A(c)}
=
\frac{\left[ - \exp(-a)(a+1)\right]^{\infty}_{c}}{\exp(-c)}
= c + 1
\]
Note that the probability of being reported is $\Pr(a>c) = \exp(-c)$, so as the cost of reporting goes up, a vanishingly small percentage of cases are reported, but their severity increases [linearly, but that's an artifact of the simple exponential] with the cost. That's the self-selection bias in the second argument above.

A plot for $c$ between zero (everyone reports their problems) and 5 (the cost of reporting is so high that only the sickest 0.67% risk reporting their symptoms to the authorities):
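
In R, the plot (and a quick simulation check of the algebra) can be produced with something like this; the grid and sample size are arbitrary choices of mine:

c_grid <- seq(0, 5, by = 0.1)
par(mfrow = c(1, 2))
plot(c_grid, c_grid + 1, type = "l",
     xlab = "reporting cost c", ylab = "average reported seriousness E[a | a > c]")
plot(c_grid, exp(-c_grid), type = "l",
     xlab = "reporting cost c", ylab = "fraction of cases reported, Pr(a > c)")

set.seed(1)
a <- rexp(1000000, rate = 1)   # the population's aversiveness, Exponential(1)
mean(a[a > 2])                 # ~3, matching E[a | a > c] = c + 1 at c = 2
mean(a > 2)                    # ~0.135, matching exp(-2)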


Remember that for all cases in this plot the average aversiveness/seriousness doesn't change: it's fixed at 1, and everyone has the disease, with around 63% of the population having less than the average aversiveness/seriousness. But if the cost of reporting is, for example, twice the aversiveness of the average (in other words, people dislike being put in involuntary quarantine twice as much as they dislike the symptoms of the average case), only the sickest 13.5% of people will seek help from the authorities/health organizations, and the average seriousness among those reported cases will be 3 (three times the average seriousness of the disease in the general population).*

With mixed incentives for all parties involved, it's difficult to trust the current reported numbers.


Death Rate II: Using the data from the Diamond Princess cruise ship.


A second endemic problem is arguing about small differences in the death rate, based on small data sets. Many of these differences are indistinguishable statistically, and to be nice to all flavors of statistical testing we're going to compute likelihood ratios, not rely on simple point estimate tests.

The Diamond Princess cruise ship is as close as one gets to a laboratory experiment in COVID-19, but there's a small-numbers problem; we'll only get good estimates when we have large-scale, high-quality data. Thanks to @Clarksterh on Twitter for the idea.

Using data from Wikipedia for Feb 20, there were 634 confirmed infections (328 asymptomatic) aboard the Diamond Princess and as of Feb 28 there were 6 deaths among those infections. The death rate is 6/634 = 0.0095.

(The ship's population isn't representative of the general population, being older and richer, but that's not what's at stake here. This is about fixating on the point estimates and small differences thereof. There's also a delay between the diagnosis and the death, so these numbers might be off by a factor of two or three.)

What we're doing now: using $d$ as the death rate, $d = 0.0095$ is the maximum likelihood estimate, so it will give the highest probability for the data, $\Pr(\text{6 dead out of 634} | d = 0.0095)$. Below, we calculate and plot the likelihood ratio between that probability and the computed probability of the data for other candidate death rates, $d_i$.**

\[LR(d_i) = \frac{\Pr(\text{6 dead out of 634} | d = 0.0095)}{\Pr(\text{6 dead out of 634} | d = d_i)}\]
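
In R, a sketch of this computation (the grid of candidate rates is my own choice; the binomial likelihood is the one in the footnote):

d_mle <- 6 / 634                                 # ~0.0095
d_i   <- seq(0.001, 0.05, by = 0.0005)           # candidate death rates
LR    <- dbinom(6, 634, d_mle) / dbinom(6, 634, d_i)
plot(d_i, LR, type = "l", log = "y",
     xlab = "candidate death rate", ylab = "likelihood ratio")
dbinom(6, 634, d_mle) / dbinom(6, 634, 0.001)    # ~3434
dbinom(6, 634, d_mle) / dbinom(6, 634, 0.03)     # ~510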


We can't reject any rates between 0.5% and 1.5% with any confidence (okay, some people using single-sided point tests with marginal significance might narrow that a bit, but let's not rehash old fights here), and that's a three-fold range. And there are still a lot of issues with the data.

On the other hand...

It's easy to see that the COVID-19 death rate is much higher than that of the seasonal flu (0.1%): using the data from the Diamond Princess, $LR(0.001) = 3434.22$, which should satisfy both the most strong-headed frequentists and Bayesians that these two rates are different. Note that $LR(0.03) = 510.01$, which shows that, with the data above, the Diamond Princess also invalidates the 3% death rate. (Again, noting that the numbers might be off by a factor of two or three in either direction due to the delay in diagnosing the infection and between diagnosis and recovery or death.)

As with most of these analyses, disaggregate clinical data will be necessary to establish these rates, which we're estimating from much less reliable [aggregate] epidemiological data.



Stay safe: wash hands, don't touch your face, avoid unnecessary contact with other people. 



- - - - - 

* A friend pointed out that there are some countries or subcultures where hypochondria is endemic, and that would lead to underestimation of the seriousness of the disease; this model ignores that, but anecdotally I've met people who get doctor's appointments because they have DOMS and want the doctor to reassure them that it's normal and prescribe painkillers and anti-inflammatories, among other borderline psychotic behaviors...


** We're just computing the binomial here, no assumptions beyond that:

$\Pr(\text{6 dead out of 634} | d = d_i) = C(634,6) \, d_i^6 (1-d_i)^{628}$,

and since we use a ratio the big annoying combinatorials cancel out.
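
Explicitly, since the $C(634,6)$ term appears in both numerator and denominator,

\[LR(d_i) = \left(\frac{0.0095}{d_i}\right)^{6} \left(\frac{1-0.0095}{1-d_i}\right)^{628}.\]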