Saturday, April 4, 2020

Sampling on the dependent variable

I think "sampling on the dependent variable" is about to beat "ignoring hidden factor correlation" as the most common data analysis error in the wild.

Walter is a very popular professor of physics at the Boston Institute of Technology [name of institution cleverly disguised for legal reasons], who teaches a 400-student class in the largest classroom at BIT. The other 20 professors of physics are very boring, so at any given time they have on average 5 students each in their classes.

A journalist stands at the door of the physics department and asks every fourth student how full their physics classes are. The results are as follows:

- 100 students say that their class was completely full;
- 25 students say that their class was mostly empty.

This is reported as "4 out of 5 classes are full at BIT; new building needed to address the lack of space, new faculty must be hired urgently."

Did you see the error? It's subtle.

Here's a visualization of the process to help see it:


In fact, only one class out of 21 is full. The problem is that the likelihood of a student being in the sample (a random sample of students coming out of the building) is proportional to the variable of interest (the number of students in the class); in other words, the journalist is sampling on the dependent variable.

The more full a class is, the more over-represented that class will be in the sample of students.
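To make the arithmetic concrete, here is a minimal Python sketch of the BIT example. The class sizes come from the story above; the names `FULL_THRESHOLD` and `survey_every_nth`, and the assumption that students leave the building grouped by class, are mine and only there for illustration.

```python
from fractions import Fraction

# Class sizes from the example: one 400-student class and 20 classes of 5.
class_sizes = [400] + [5] * 20
FULL_THRESHOLD = 400  # assumption: only the 400-student class counts as "full"

def share_of_full_classes(sizes):
    """Fraction of classes that are full (each class counted once)."""
    full = sum(1 for s in sizes if s >= FULL_THRESHOLD)
    return Fraction(full, len(sizes))

def share_of_students_in_full_classes(sizes):
    """Fraction of students sitting in a full class (each student counted once)."""
    in_full = sum(s for s in sizes if s >= FULL_THRESHOLD)
    return Fraction(in_full, sum(sizes))

def survey_every_nth(sizes, n=4):
    """Ask every n-th student; return (answers 'full', answers 'not full')."""
    # Assumption: students walk out grouped by class, in the order of `sizes`.
    answers_by_student = [s >= FULL_THRESHOLD for s in sizes for _ in range(s)]
    sampled = answers_by_student[n - 1::n]
    full = sum(sampled)
    return full, len(sampled) - full

print("classes that are full:    ", share_of_full_classes(class_sizes))
print("students in a full class: ", share_of_students_in_full_classes(class_sizes))
print("journalist's tally:       ", survey_every_nth(class_sizes, 4))
```

The journalist's tally estimates the second quantity (the share of students in a full class), but the headline reports it as if it were the first (the share of classes that are full).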

This looks like some rare error, the kind of thing that would only happen to hapless journalists, except that it happens all the time and in serious circumstances.

Consider the case of health authorities trying to determine the seriousness of a condition, namely how many of the people with the condition die. They could count the cases that get tested and compute the fraction of those that die. (That's what most of the preliminary COVID-19 case fatality rate numbers in the media are.)

And that's the same error that the journalist made.

In this case, the dependent variable is not the size of the class, it's the seriousness of the disease, and the sampling problem is not with the number of students in a class, it's with the people who choose to get tested. These people choose to get tested (or get tested at a hospital when admitted) because they have symptoms that make them take the trouble.

In other words, the more serious the level of the disease a patient P has, the more likely P will be tested (sampling on the dependent variable, again), and more of these tested patients will die than if the testing were done on a random sample of the population.

(This is different from the truncation argument made in this previous post. Truncation is also a type of sampling on the dependent variable; a form that is easier to correct, as the non-truncated part of the sample distribution is the same as the population distribution up to scaling.)

To illustrate the effect of different degrees of sampling on the dependent variable, let us consider the case of a uniformly distributed variable (in the population) and different degrees of sampling:


Let's consider two persons, A with $x_A=0.2$ and B with $x_B=0.4$. With correct, random, sampling, A and B would have an equal chance of being in the sample. With sampling proportional to $x$, B would be twice as likely to be in the sample as A, biasing the sample average upwards relative to the population; with sampling proportional to $x^2$, B would be four times as likely to be in the sample as A, which would bias the sample average even more.
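For readers who like to see the arithmetic, here is a short derivation, assuming $x$ is uniform on $[0,1]$ in the population and the chance of being sampled is proportional to $x^k$ (the exponent $k$ and the density $f_k$ are my notation, not part of the example above). The relative chance of B versus A being sampled, the density of $x$ within the sample, and the resulting sample average are:

$$
\frac{x_B^k}{x_A^k} = \left(\frac{0.4}{0.2}\right)^k = 2^k,
\qquad
f_k(x) = \frac{x^k}{\int_0^1 t^k \, dt} = (k+1)\,x^k,
\qquad
E_k[x] = \int_0^1 x \,(k+1)\,x^k \, dx = \frac{k+1}{k+2}.
$$

So random sampling ($k=0$) gives $1/2$, sampling proportional to $x$ gives $2/3$, and sampling proportional to $x^2$ gives $3/4$.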

To put this $x^2$ in the context of, for example, COVID-19 tests, sampling proportional to $x^2$ means that people in a group X with symptoms twice as bad as people in a group Y will be four times as likely to seek treatment (and be tested). Basically, each person in group X will be counted four times as often in the statistics as each person in group Y (the group with less serious symptoms).

(The other degrees, $x^3$ and $x^4$, capture cases where people avoid the hospital unless their symptoms are serious or very serious.)

As we can see from the charts in the image above, the more distortion of the underlying population distribution the sampling process creates, the higher the sample average, and all the while the population average stays at a constant 1/2.
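The same pattern can be checked with a quick simulation. This is a minimal sketch, not the code behind the charts: it assumes a simulated population of 100,000 uniform values, draws samples with replacement using weights $x^k$, and the function name `biased_sample_mean`, the seed, and the sample sizes are arbitrary choices of mine.

```python
import random

def biased_sample_mean(population, k, sample_size, rng):
    """Mean of a sample drawn with weight x**k per person (k=0 is a fair random sample)."""
    weights = [x ** k for x in population]
    # random.choices samples with replacement, with probability proportional to the weights.
    sample = rng.choices(population, weights=weights, k=sample_size)
    return sum(sample) / len(sample)

rng = random.Random(2020)  # fixed seed so the sketch is reproducible
population = [rng.random() for _ in range(100_000)]  # uniform on [0, 1)
population_mean = sum(population) / len(population)

# One sample mean per degree of sampling on the dependent variable, k = 0..4.
sample_means = [
    biased_sample_mean(population, k, sample_size=50_000, rng=rng)
    for k in range(5)
]

for k, mean_k in enumerate(sample_means):
    print(f"weight x^{k}: sample mean = {mean_k:.3f}, population mean = {population_mean:.3f}")
```

For a uniform population the sample means should land close to $(k+1)/(k+2)$, that is, roughly 1/2, 2/3, 3/4, 4/5 and 5/6, while the population mean does not move.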

Sampling on the dependent variable: something to keep in mind when people talk about dire situations in the news.