
Wednesday, February 3, 2021

How to recommend a book one hasn't read



(Yes, we could just write a blind positive recommendation. That's apparently a common approach in traditional publishing circles. This post is about something else.)

Phil Tetlock recommends Tim Harford's book "The Data Detective" in the tweet above. Having read some Tim Harford books in the past, and knowing Phil Tetlock, I can second that recommendation using Bayes's rule, even though I haven't read the book (I'm not in the demographic for it).

How?

In three steps:

STEP 1: Prior probabilities. As far as I recall, the Tim Harford books I read were good in the two dimensions I care about for popularization of technical material: they didn't have any glaring errors (I would remember that) and they were well written without falling into the "everything ancillary but none of the technical detail" trap of so much popularization. So, the probability that this book is good, in the absence of information (the prior probability), is high.

Note that in a simple world where a book is either good or bad, we have Pr(not good) = 1 – Pr(good); so we can plot the informativeness of that prior distribution using the odds ratio (where informativeness increases with distance from 1; note the log scale):
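(For concreteness, the odds ratio being plotted is just

\[ \frac{\Pr(\text{good})}{\Pr(\text{not good})} = \frac{\Pr(\text{good})}{1 - \Pr(\text{good})}, \]

which equals 1 when the prior is completely uninformative.)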


STEP 2: Conditional probabilities. To integrate the information that Phil Tetlock recommends this book, we need to know how likely he is to recommend any book when it's good and how likely he is to recommend any book when it's bad. Note that these are not complementary probabilities: there are some people who recommend all books, regardless of quality, so for those people these two probabilities would both be 1; observing a tweet from one of these people would be completely uninformative: the posterior probability would be the same as the prior (check that if you don't believe me*).
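A quick version of that check: if $\Pr(\text{recommend}|\text{good}) = \Pr(\text{recommend}|\text{not good}) = 1$, then

\[ \Pr(\text{good}|\text{recommend}) = \frac{1 \times \Pr(\text{good})}{1 \times \Pr(\text{good}) + 1 \times \Pr(\text{not good})} = \Pr(\text{good}), \]

that is, the posterior is just the prior.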

Having known Phil Tetlock for some years now, I'm fairly certain that his recommendation is informative, i.e. Pr(recommend | good) is much larger than Pr(recommend | not good).

STEP 3: Posterior probabilities. Putting the prior and conditional probabilities together, we can use Bayes's rule (below) to determine that the probability that the book is good, given the tweet, is high.
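Spelled out in this notation, Bayes's rule is

\[ \Pr(\text{good}|\text{recommend}) = \frac{\Pr(\text{recommend}|\text{good}) \, \Pr(\text{good})}{\Pr(\text{recommend}|\text{good}) \, \Pr(\text{good}) + \Pr(\text{recommend}|\text{not good}) \, \Pr(\text{not good})}, \]

and with a high prior and $\Pr(\text{recommend}|\text{good}) \gg \Pr(\text{recommend}|\text{not good})$, the posterior comes out even higher than the prior.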



As with all Bayesian models of beliefs (that is, not calibrated on measurements or actuarial statistics), these are subjective probabilities. Still, I stand by my seconding of Phil Tetlock's recommendation.


- - - -

* If you're the trusting type that believes without checking, I have a lovely oceanside villa in Kansas City, MO to sell. Trust but verify, as the Committee for State Security used to say.


Tuesday, November 10, 2020

Why is it so hard for people to change their minds?

The usual explanation is that people discount information that contradicts their beliefs. Let's build a model to explore this idea. 

We analyze the evolution of beliefs of a decision-maker receiving outside information, and to start we'll assume that the decision-maker is rational in the Bayesian sense, i.e. uses Bayes's rule with the correct conditional probabilities to update beliefs.

We call the variable of interest $X$; it's binary: true/false, 1/0, red/blue. For the purposes of this discussion $X$ could be the existence of water on Pluto or whether the New England Patriots deflate their footballs.

There's a true $X \in \{0,1\}$ out there, but the decision-maker doesn't know it. We will denote the probability of $X=1$ by $p$, and since there's information coming in, we index it: $p[n]$ is the probability that $X=1$ given $n$ pieces of information $X_1 \ldots X_n$.

We start with $p[0] = 1/2$, as the decision-maker doesn't know anything. A piece of information $X_i$ comes in, with reliability $r_i$, defined as $\Pr(X_i = X) = r_i$; in other words false positives and false negatives have the same probability, $(1-r_i)$.

Using Bayes's rule to update information, we have

$p[i] = \frac{r_i \, p[i-1]}{r_i \, p[i-1] + (1-r_i)(1-p[i-1])}$, if $X_i = 1$ and

$p[i] = \frac{(1-r_i) \, p[i-1]}{(1-r_i) \, p[i-1] + r_i \, (1-p[i-1])}$, if $X_i = 0$.

For illustration, let's have the true $X=1$ (so there's indeed water on Pluto and/or the Patriots do deflate their balls), and $r_i = r$, fixed for all $i$; with these definitions, $\Pr(X_i = 1) = r$. We can now iterate $p[i]$ using some random draws for $X_i$ consistent with that $r$; here are some simulations of the path of $p[i]$, three each for $r = 0.6, 0.7, 0.8$.*
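For those who want to replicate the simulations, here's a minimal sketch in Python (numpy is assumed; the number of steps and the seeds are arbitrary illustration choices):

    import numpy as np

    def bayes_update(p_prev, x_i, r_i):
        # One step of Bayes's rule for Pr(X = 1), as in the formulas above.
        if x_i == 1:
            return r_i * p_prev / (r_i * p_prev + (1 - r_i) * (1 - p_prev))
        return (1 - r_i) * p_prev / ((1 - r_i) * p_prev + r_i * (1 - p_prev))

    def simulate_path(r, n_steps=60, seed=None):
        # Belief path p[0..n] for an unbiased Bayesian decision-maker; true X = 1.
        rng = np.random.default_rng(seed)
        p = 0.5                                   # p[0]: no information yet
        path = [p]
        for _ in range(n_steps):
            x_i = 1 if rng.random() < r else 0    # Pr(X_i = 1) = r, since X = 1
            p = bayes_update(p, x_i, r)
            path.append(p)
        return path

    # Three paths for each reliability, as in the charts.
    paths = {r: [simulate_path(r, seed=s) for s in range(3)] for r in (0.6, 0.7, 0.8)}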



Essentially, truth wins out eventually. The more reliable the information, the faster the convergence. So, that whole "it's easier to fool someone than to get them to realize they were fooled" line was wrong, wasn't it?

Only if people are Bayesian updaters with accurate perception of the reliability. In particular, when they don't let their beliefs bias that perception.

Uh-oh!

Let us consider the case of biased perception. The simplest approach is to assume that the decision-maker's perception of reliability depends on whether $X_i$ supports or contradicts current beliefs.

For simplicity, the true reliability of the information will still be a constant, denoted $r$; but the decision-maker uses an $r_i$ that depends on $X_i$ and $p[i-1]$: if they agree (for example, $p[i-1]>1/2$ and $X_i = 1$), then $r_i = r$; if they don't (for example, $p[i-1]<1/2$ and $X_i = 1$), then $r_i = (1-r)$.

Note that the $X_i$ are still generated by a process that has $\Pr(X_i = 1) = r$, but now the decision-maker's beliefs are updated using the $r_i$, which are only correct ($r_i = r$) for draws of $X_i$ that are consistent with the beliefs $p[i-1]$, and are precisely opposite ($r_i = 1-r$) otherwise.

To illustrate this biased behavior, in the following charts we force $X_1 = 0$ (recall that $X=1$), so that the decision-maker starts with the wrong information.
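A sketch of the biased variant, reusing bayes_update from the snippet above (how to handle the exactly-indifferent $p[i-1] = 1/2$ case is my own assumption, since the biased $r_i$ is only defined for beliefs that lean one way or the other):

    def simulate_biased_path(r, n_steps=60, seed=None):
        # Belief path with belief-dependent perceived reliability; true X = 1.
        # Information is still generated with Pr(X_i = 1) = r, but the decision-maker
        # uses r_i = r when X_i agrees with the current belief and r_i = 1 - r when
        # it contradicts it. The first draw is forced to X_1 = 0, as in the charts.
        rng = np.random.default_rng(seed)
        p = 0.5
        path = [p]
        for i in range(n_steps):
            x_i = 0 if i == 0 else (1 if rng.random() < r else 0)
            if p == 0.5:
                r_i = r                    # no belief to defend yet (assumption)
            else:
                agrees = (x_i == 1) == (p > 0.5)
                r_i = r if agrees else 1 - r
            p = bayes_update(p, x_i, r_i)
            path.append(p)
        return path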



These are just a few of the many possible paths, but they illustrate three elements that tended to be common across most simulations:

  1. There's a lot more volatility in the beliefs, and much slower convergence. Sometimes, as in the middle case with $r=0.8$, there's a complete flip from $p[i] \simeq 0$ to a quick convergence to the true $p[i] \simeq 1$; this was rare, but worth showing one example of in the image.
  2. There are many cases in which the decision-maker stays very close to the wrong $p[i]\simeq 0$ for very long periods (sometimes for the total length of the simulation, 1000 steps; the graphs show only the first 60 because that was enough for illustration).
  3. The higher the reliability, the more volatile the results can be, unlike in the case with fixed $r_i$. In general, increasing the reliability $r$ didn't help much with convergence or stability.

So, once people start biasing their perceptions (which might come from distorting the reliability, as simulated here, or from ignoring information that contradicts their beliefs, which has a similar effect), it takes a lot of counteracting new information to undo the effect of bad early information (the $X_1 = 0$).

Lucky for us, people in the real world don't have these biases, they're all perfect Bayesians. Otherwise, things could get ugly.

😉


- - - - 

* In case it's not obvious, the effect of $r_i$ is symmetric around 0.5: because of how the information is integrated, the patterns for $r=0.6$ and $r=0.4$ are identical. As a consequence, when $r=0.5$ there's no learning at all and the decision-maker never moves away from $p = 0.5$.

OBSERVATION: There are also strategic and signaling reasons why people don't publicly change their minds, because that can be used against them by other players in competitive situations; but that's a more complicated — and to some extent trivial, because obvious in hindsight — situation, since it involves the incentives of many decision-makers and raises questions of mechanism design.

Friday, January 13, 2017

Medical tests and probabilities

You may have heard this one, but bear with me.

Let's say you get tested for a condition that affects ten percent of the population and the test is positive. The doctor says that the test is ninety percent accurate (presumably in both directions). How likely is it that you really have the condition?

[Think, think, think.]

Most people, including most doctors themselves, say something close to $90\%$; they might shade that number down a little, say to $80\%$, because they understand that "the base rate is important."

Yes, it is. That's why one must do the computation rather than fall prey to anchor-and-adjustment biases.

Here's the computation for the example above:
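With base rate $\Pr(\text{sick}) = 0.10$ and accuracy $0.90$ in both directions, Bayes's rule gives

\[ \Pr(\text{sick}|\text{positive}) = \frac{0.10 \times 0.90}{0.10 \times 0.90 + 0.90 \times 0.10} = \frac{0.09}{0.18} = 0.5. \]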


One-half. That's the probability that you have the condition given the positive test result.

We can get a little more general: if the base rate is $\Pr(\text{sick}) = p$ and the accuracy (assumed symmetric) of the test is $\Pr(\text{positive}|\text{sick}) = \Pr(\text{negative}|\text{not sick})  = r $, then the probability of being sick given a positive test result is

\[ \Pr(\text{sick}|\text{positive}) = \frac{p \times r}{p \times r + (1- p) \times (1-r)}. \]

The following table shows that probability for a variety of base rates and test accuracies (again, assuming that the test is symmetric, that is, that the probabilities of a false positive and of a false negative are the same; more about that below).
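A minimal sketch of how such a table can be computed (the grid of base rates and accuracies below is just an illustrative choice):

    def posterior_sick(p, r):
        # Pr(sick | positive) for base rate p and symmetric test accuracy r.
        return p * r / (p * r + (1 - p) * (1 - r))

    base_rates = (0.001, 0.01, 0.05, 0.10, 0.25, 0.50)   # illustrative values
    accuracies = (0.80, 0.90, 0.95, 0.99)

    print("base rate" + "".join(f"{r:>8.2f}" for r in accuracies))
    for p in base_rates:
        print(f"{p:>9.3f}" + "".join(f"{posterior_sick(p, r):>8.3f}" for r in accuracies))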


A quick perusal of this table shows some interesting things, such as the really low probabilities, even with very accurate tests, for the very small base rates (so, if you get a positive result for a very rare disease, don't fret too much; do the follow-up).


There are many philosophical objections to all the above, but as a good engineer I'll ignore them all and go straight to the interesting questions that people ask about that table, for example, how the accuracy or precision of the test works.

Let's say you have a test of some sort (cholesterol, blood pressure, etc.); it produces some output variable that we'll assume is continuous. Then, there will be a distribution of these values for people who are healthy and, if the test is of any use, a different distribution for people who are sick. The scale is the same, but, for example, healthy people have, let's say, blood pressure values centered around 110 over 80, while sick people have blood pressure values centered around 140 over 100.

So, depending on the variables measured, the technology available, and the combination of variables used, there can be more or less overlap between the distributions of the test variable for healthy and sick people.

Assuming for illustration normal distributions with equal variance, here are two different tests, the second one being more precise than the first one:



Note that these distributions are fixed by the technology, the medical variables, the biochemistry, etc.; the two examples above would, for example, be the difference between comparing blood pressures (test 1) and measuring some blood chemical that is more closely associated with the medical condition (test 2), not some statistical magic performed on the same variable.

Note that there are other ways that a test A can be more precise than test B, for example if the variances for A are smaller than for B, even if the means are the same; or if the distributions themselves are asymmetric, with longer tails on the appropriate side (so that the overlap becomes much smaller).

(Note that the use of normal distributions with similar variances above is for illustration purposes only; most actual tests have significant asymmetries and different variances for the healthy versus sick populations. It's something that people who discover and refine testing technologies rely on to come up with their tests. I'll continue to use the same-variance normals in my examples, for simplicity.)


A second question that interested (and interesting) people ask about these numbers is why the tests are symmetric (the probability of a false positive equal to that of a false negative). 

They are symmetric in the examples we use to explain them, since it makes the computation simpler. In reality almost all important preliminary tests have a built-in bias towards the most robust outcome.

For example, many tests for dangerous conditions have a built-in positive bias, since the outcome of a positive preliminary test is more testing (usually followed by relief since the positive was a false positive), while the outcome of a negative can be lack of treatment for an existing condition (if it's a false negative).

To change the test from a symmetric error to a positive bias, all that is necessary is to move the threshold between positive and negative toward the negative side:
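As a sketch of that threshold shift (the means, variances, and thresholds below are made-up illustration values, and scipy is assumed for the normal distributions): moving the cutoff toward the healthy side trades false negatives for false positives.

    from scipy.stats import norm

    healthy = norm(loc=100, scale=10)   # test-variable distribution, healthy people
    sick = norm(loc=130, scale=10)      # test-variable distribution, sick people

    for threshold in (115, 110, 105):   # moving toward the healthy (negative) side
        fp = healthy.sf(threshold)      # Pr(positive | not sick): healthy above cutoff
        fn = sick.cdf(threshold)        # Pr(negative | sick): sick below cutoff
        print(f"threshold {threshold}: false positive {fp:.3f}, false negative {fn:.3f}")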



In fact, if you, the patient, have access to the raw data (you should be able to, at least in the US where doctors treat patients like humans, not NHS cost units), you can see how far off the threshold you are and look up actual distribution tables on the internet. (Don't argue these with your HMO doctor, though, most of them don't understand statistical arguments.)

For illustration, here are the posterior probabilities for a test that has bias $k$ in favor of false positives, understood as $\Pr(\text{positive}|\text{not sick}) = k \times \Pr(\text{negative}|\text{sick})$, for some different base rates $p$ and probability of accurate positive test $r$ (as above):
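A small sketch of that computation, with $\Pr(\text{positive}|\text{sick}) = r$ and $\Pr(\text{positive}|\text{not sick}) = k \times (1-r)$ as defined above (the example values are the ones from the earlier ten-percent example):

    def posterior_sick_biased(p, r, k):
        # Pr(sick | positive) with base rate p, Pr(positive | sick) = r,
        # and Pr(positive | not sick) = k * (1 - r).
        return p * r / (p * r + (1 - p) * k * (1 - r))

    print(posterior_sick_biased(p=0.10, r=0.90, k=1))   # symmetric test: 0.50
    print(posterior_sick_biased(p=0.10, r=0.90, k=3))   # biased toward positives: 0.25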


So, this is good news: if you get a scary positive test for a dangerous medical condition, that test is probably biased towards false positives (because of the scary part), and therefore the probability that you actually have that scary condition is much lower than you'd think, even if you'd been trained in statistical thinking (because that training, for simplicity, almost always uses symmetric tests). So be a little more relaxed when getting the follow-up test.


There's a third interesting question that people ask when shown the computation above: the probability of someone getting tested to begin with. It's an interesting question because in all these computational examples we assume that the population that gets tested has the same distribution of sick and healthy people as the general population. But the decision to be tested is usually a function of some reason (mild symptoms, hypochondria, job requirement), so the population of those tested may have a higher incidence of the condition than the general population.

This can be modeled by adding elements to the computation, which makes the computation more cumbersome and detracts from its value in making the point that base rates are very important. But it's a good elaboration, and many models used by doctors over-estimate base rates precisely because they miss this probability of being tested. More good news there!
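One way to sketch that elaboration: treat getting tested as one more conditioning event, and replace the population base rate $p$ with

\[ \Pr(\text{sick}|\text{tested}) = \frac{\Pr(\text{tested}|\text{sick}) \, p}{\Pr(\text{tested}|\text{sick}) \, p + \Pr(\text{tested}|\text{not sick}) \, (1-p)}, \]

which differs from $p$ whenever the sick and the healthy get tested at different rates.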


Probabilities: so important to understand, so thoroughly misunderstood.


- - - - -
Production notes

1. There's nothing new above, but I've had to make this argument dozens of times to people and forum dwellers (particularly difficult when they've just received a positive result for some scary condition), so I decided to write a post that I can point people to.

2. [warning: rant] As someone who has railed against the use of spline drawing and quarter-ellipses in other people's slides, I did the right thing and plotted those normal distributions from the actual normal distribution formula. That's why they don't look like the overly-rounded "normal" distributions in some other people's slides: because those people make their "normals" with free-hand spline drawing and their exponentials with quarter ellipses. That's extremely lazy in an age when any spreadsheet, RStats, Matlab, or Mathematica can easily plot the actual curve. The people I mean know who they are. [end rant]