I think "sampling on the dependent variable" is about to beat "ignoring hidden factor correlation" as the most common data analysis error in the wild.
Walter is a very popular professor of physics at the Boston Institute of Technology [name of institution cleverly disguised for legal reasons], who teaches a 400-student class in the largest classroom at BIT. The other 20 professors of physics are very boring, so at any given time they have on average 5 students each in their classes.
A journalist stands at the door of the physics department and asks every fourth student how full their physics classes are. The results are as follows:
- 100 students say that their class was completely full;
- 25 students say that their class was mostly empty.
This is reported as "4 out of 5 classes are full at BIT; new building needed to address the lack of space, new faculty must be hired urgently."
Did you see the error? It's subtle.
Here's a visualization of the process to help see it:
In fact, only one class is full. The problem is that the likelihood of a student being in the sample (a random sample of students coming out of the building) is proportional to the variable of interest (the number of students in the class); in other words, the journalist is sampling on the dependent variable.
The more full a class is, the more over-represented that class will be in the sample of students.
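The journalist's numbers can be reproduced with a quick simulation (a sketch, using the hypothetical enrollments from the story: one class of 400, twenty classes of 5):

```python
import random

# Hypothetical enrollments from the story: one full class of 400 students,
# twenty sparse classes of 5 students each (100 students total).
class_sizes = [400] + [5] * 20

# Build the student population; each student "carries" the size of their class.
students = [size for size in class_sizes for _ in range(size)]

random.seed(0)
sample = random.sample(students, 125)  # the journalist's every-fourth-student sample

share_full = sum(1 for s in sample if s == 400) / len(sample)

print(f"share of classes that are full:          {1 / len(class_sizes):.0%}")  # 5%
print(f"share of sampled students in full class: {share_full:.0%}")            # ≈ 80%
```

Only 1 class in 21 is full, but roughly 4 in 5 sampled students come from it, because a student's chance of being sampled is proportional to the size of their class.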
This looks like some rare error, the kind of thing that would only happen to hapless journalists, except that it happens all the time and in serious circumstances.
Consider the case of health authorities trying to determine the seriousness of a condition, namely how many of the people with the condition die. They could count the cases that get tested and compute the fraction of those that die. (That's what most of the preliminary COVID-19 case fatality rate numbers in the media are.)
And that's the same error that the journalist made.
In this case, the dependent variable is not class size but seriousness of disease, and the sampling problem is not with the number of students in a class but with the people who choose to get tested. These people choose to get tested (or get tested at a hospital when admitted) because they have symptoms that make them take the trouble.
In other words, the more serious the disease in a patient P, the more likely P is to be tested (sampling on the dependent variable, again), and more of these tested patients will die than if the testing were done on a random sample of the population.
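A toy simulation makes the inflation concrete. All numbers here are made up for illustration: severity is uniform on (0,1), the chance of dying grows linearly with severity, and the chance of getting tested grows with the square of severity (worse symptoms, much likelier to seek a test).

```python
import random

random.seed(7)
N = 200_000
severity = [random.random() for _ in range(N)]  # hypothetical severity in (0, 1)

# Toy assumption: probability of dying is 0.1 * severity.
dies = [random.random() < 0.1 * s for s in severity]

true_rate = sum(dies) / N  # fatality rate in the whole population, ≈ 5%

# People get tested with probability proportional to severity squared:
tested = random.choices(range(N), weights=[s ** 2 for s in severity], k=50_000)
observed_rate = sum(dies[i] for i in tested) / len(tested)

print(f"true fatality rate:    {true_rate:.1%}")      # ≈ 5.0%
print(f"observed among tested: {observed_rate:.1%}")  # ≈ 7.5%, inflated
```

The fatality rate computed from the tested group overstates the population rate, with no bad faith anywhere in the process: the distortion comes entirely from who shows up to be counted.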
(This is different from the truncation argument made in this previous post. Truncation is also a type of sampling on the dependent variable; a form that is easier to correct, as the non-truncated part of the sample distribution is the same as the population distribution up to scaling.)
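A quick sketch of why truncation is the easy case: above the cutoff, a truncated sample keeps the population's shape and only needs rescaling (a uniform population is assumed here for simplicity).

```python
import random

random.seed(1)
population = [random.random() for _ in range(100_000)]  # Uniform(0,1)

cutoff = 0.6
observed = [x for x in population if x >= cutoff]  # only cases above cutoff seen

# Above the cutoff, the observed density equals the population density
# divided by P(X >= cutoff), so equal-width bins hold (roughly) equal counts.
bin_counts = [sum(1 for x in observed if lo <= x < lo + 0.1)
              for lo in (0.6, 0.7, 0.8, 0.9)]

rescale = len(population) / len(observed)  # estimates 1 / P(X >= cutoff) = 2.5
print(bin_counts, round(rescale, 2))
```

Multiplying the truncated counts by `rescale` recovers the population counts on the observed region, which is exactly the correction that is unavailable when inclusion probability varies smoothly with the variable of interest.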
To illustrate the effect of different degrees of sampling on the dependent variable, let us consider the case of a uniformly distributed variable (in the population) and different degrees of sampling:
Let's consider two persons, A with $x_A=0.2$ and B with $x_B=0.4$. With correct, random sampling, A and B would have an equal chance of being in the sample. With sampling proportional to $x$, B would be twice as likely to be in the sample as A, biasing the sample average upwards relative to the population; with sampling proportional to $x^2$, B would be four times as likely to be in the sample, which would bias the sample average even more.
To put this $x^2$ in the context of, for example, COVID-19 tests, sampling proportional to $x^2$ means that people in a group X with symptoms twice as bad as those in a group Y will be four times as likely to seek treatment (and be tested). Basically, each person in group X will be counted four times as often in the statistics as each person in group Y (the group with less serious symptoms).
(The other degrees, $x^3$ and $x^4$, capture cases where people avoid the hospital unless their symptoms are serious or very serious.)
As we can see from the charts in the image above, the more the sampling process distorts the underlying population distribution, the higher the sample average, while the population average stays constant at 1/2.
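For readers who want to reproduce the effect numerically, here is a short sketch: draw from a Uniform(0,1) population with inclusion probability proportional to $x^k$, for $k = 0$ (correct random sampling) through $k = 4$, and compare the sample means.

```python
import random

random.seed(42)
population = [random.random() for _ in range(100_000)]  # Uniform(0,1), mean 1/2

def biased_mean(xs, k, n=50_000):
    """Mean of a sample drawn with probability proportional to x**k."""
    sample = random.choices(xs, weights=[x ** k for x in xs], k=n)
    return sum(sample) / n

for k in range(5):
    # theory: weighting Uniform(0,1) by x**k gives a sample mean of (k+1)/(k+2)
    print(f"x^{k}: sample mean ≈ {biased_mean(population, k):.3f} "
          f"(theory {(k + 1) / (k + 2):.3f}); population mean stays 0.500")
```

The sample mean climbs from 1/2 (random sampling) toward 2/3, 3/4, 4/5, and 5/6 as the sampling gets more dependent on $x$, even though nothing about the population changes.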
Sampling on the dependent variable: something to keep in mind when people talk about dire situations in the news.
Non-work posts by Jose Camoes Silva; repurposed in May 2019 as a blog mostly about innumeracy and related matters, though not exclusively.
Saturday, April 4, 2020
Wednesday, September 28, 2011
What to do about psychological biases? The answer tells a lot... about you.
There are many documented cases of behavior deviating from the normative "rational" prescription of decision sciences and economics. For example, in the book Predictably Irrational, Dan Ariely tells us how he got a large number of Sloan School MBA students to change their choices using an irrelevant alternative.
The Ariely example has two groups of students choose a subscription type for The Economist. The first group was given three options to choose from: (online only, $\$60$); (paper only, $\$120$); or (paper+online, $\$120$). Overwhelmingly they chose the last option. The second group was given two options: (online only, $\$60$) or (paper+online, $\$120$). Overwhelmingly they chose the first option.
Since no one chooses the (paper only, $\$120$) option, it should be irrelevant to the choices. However, removing it makes a large number of respondents change their minds. This is what is called a behavioral bias: an actual behavior that deviates from "rational" choice. (Technically these choices violate the Strong Axiom of Revealed Preference.)
(If you're not convinced that the behavior described is irrational, consider the following isomorphic problem: a waiter offers a group of people three desserts: ice cream, chocolate mousse, and fruit salad; most people choose the fruit salad, no one chooses the mousse. Then the waiter apologizes: it turns out there's no mousse. At that point most of the people who had ordered fruit salad switch to ice cream. This behavior is the same -- use some letters to represent options to remove any doubt -- as the one in Ariely's example. And few people would consider the fruit salad to ice-cream switchers rational.)
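Following the suggestion to use letters, here is a minimal formalization (a hypothetical encoding; A, B, C stand for the options in either story):

```python
# A = online-only $60 (ice cream), B = paper-only $120 (mousse),
# C = paper+online $120 (fruit salad); each menu maps to the chosen option.
choices = {
    frozenset({"A", "B", "C"}): "C",  # with the decoy B present, C wins
    frozenset({"A", "C"}): "A",       # B removed, the choice flips to A
}

def decoy_flips_choice(choices):
    """True if dropping a never-chosen option changes what is chosen
    (an inconsistency in the revealed preferences)."""
    for menu, picked in choices.items():
        for submenu, subpicked in choices.items():
            if submenu < menu and picked in submenu and subpicked != picked:
                return True
    return False

print(decoy_flips_choice(choices))  # True: the pattern is inconsistent
```

The same two lines of the `choices` dictionary describe both the subscription experiment and the dessert story, which is the point of the isomorphism.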
Ok, so people do, in some cases (perhaps in a majority of cases) behave in "irrational" ways, as described by the decision science and economics models. This is not entirely surprising, as those models are abstractions of idealized behavior and people are concrete physical entities with limitations and -- some argue -- faulty software.
What is really enlightening is how people who know about this feel about the biases.
IGNORE. Many academic economists and others who use economics models try to ignore these biases. Inasmuch as these biases can be more or less important depending on the decision, the persons involved, and the context, this ignorance might work for the economists, for a while. However, pretending that reality is not real is not a good foundation for Science, or even life.
ATTACK. A number of people use the existence of biases as an attack on established economics. This is how science evolves, with theories being challenged by evidence and eventually changing to incorporate the new phenomena. Some people, however, may be motivated by personal animosity towards economics and decision sciences; this creates a bad environment for knowledge evolution -- it becomes a political game, never good news for Science.
EXPLOIT. Books like Nudge make this explicit, but many people think of these biases as a way to manipulate others' behavior. Manipulate is the appropriate verb here, since these people (maybe with what they think is the best of intentions -- I understand these pave the way to someplace...) want to change others' behavior without actually telling these others what they are doing. In addition to the underhandedness that, were this a commercial application, the Nudgers would be trying to outlaw, this type of attitude reeks of "I know better than others, but they are too stupid to agree." Underhanded manipulation presented as a virtue; the world certainly has changed a lot.
ADDRESS AND MANAGE. A more productive attitude is to design decisions and information systems to minimize the effect of these biases. For example, in the decision above, both scenarios could be presented, the inconsistency pointed out, and then a separate part-worth question could be addressed (i.e., what is each of the two elements -- print and online -- worth separately?). Note that this is the one attitude that treats behavioral biases as damage and finds ways to route decisions around them, unlike the other three attitudes.
In case it's not obvious, my attitude towards these biases is to address and manage them.