
Wednesday, June 12, 2019

A statistical analysis of reviews of L.A. Finest: audience vs. critics



"If numbers are available, let's use the numbers. If all we have are opinions, let's go with mine." -- variously attributed to a number of bosses.

There's a new police procedural this season, L.A. Finest, and Rotten Tomatoes has done it again: critics and audience appear to be at loggerheads. Like with The Orville, Star Trek Discovery, and the last season of Doctor Who.

But "appear to be" is a dequantified statement. And Rotten Tomatoes has numbers; so, what can these numbers tell us?

Before they can tell us anything, we need to write our question: first in words, then as a math problem. Then we can solve the math problem and that solution gets translated into a "words" answer, but now a quantified "words" answer.

The question, suggested by the Rotten Tomatoes numbers, is:
Do the critics and the audience use similar or opposite criteria to rate this show?
One way to answer this question, which would have been feasible in the past when Rotten Tomatoes had user reviews, would be to do text analytics on the reviews themselves. But now the user reviews are gone so that's no longer possible.

Another way, a simpler and cleaner way, is to use the data above.

To simplify, we'll assume that all ratings are binary, negative or positive, 0 or 1; there are some unobservable random factors that make a given person like a show more or less, so these ratings are random variables. For a given person $i$, the probability that that person likes L.A. Finest is captured by a parameter $\theta_i$ (which we don't observe, of course): the probability of that person giving a positive rating.

So, our question above is whether the $\theta_i$ of the critics and the $\theta_i$ of the audience are the same or "opposed." And what is "opposed"? If $i$ and $j$ use opposite criteria, the probability that $i$ gives a 1 is the probability that $j$ gives a 0, so $\theta_i = 1-\theta_j$.

We don't have the individual parameters $\theta_i$, but we can simplify again by assuming that all variation within each group (critics or audience) is random, so we really only need two $\theta$s: one per group.

We are comparing two situations, call them: hypothesis zero, $H_0$, meaning the critics and the audience use the same criteria, that is, they have the same $\theta$, call it $\theta_0$; and hypothesis one, $H_1$, meaning the critics use criteria opposite to those of the audience, so if the critics' $\theta$ is $\theta_1$, the audience's $\theta$ is $(1-\theta_1)$.

Yes, I know, we don't have $\theta_0$ or $\theta_1$. We'll get there.

Our "words" question now becomes the following math problem: how much more likely is it that the data we observe is created by $H_1$ versus created by $H_0$, or in a formula: what is the likelihood ratio

$LR = \frac{\Pr(\mathrm{Data}| H_1)}{\Pr(\mathrm{Data}| H_0)} $?

Observation: This is different from the usual statistics test: the usual test is whether the two distributions are different; we are testing for a specific type of difference, opposition. So there are in fact three states of the world: same, opposite, and different but not opposite; we want to compare the likelihood of the first two. If same is much more likely than opposite, then we conclude 'same.' If opposite is much more likely than same, we conclude 'opposite.' If same and opposite have similar likelihoods (for some notion of 'similar' we'd have to investigate), then we conclude 'different but not opposite.'

Our data is four numbers: number of critics $N_C = 10$, number of positive reviews by critics $k_C = 1$, number of audience members $N_A = 40$, number of positive reviews by audience members $k_A = 30$.

But what about the $\theta_0$ and $\theta_1$?

This is where the lofty field of mathematics gives way to the down and dirty world of estimation. We estimate $\theta$ by maximum likelihood, and the maximum likelihood estimator for the probability of a positive outcome of a binary random variable (called a Bernoulli variable) is the sample mean.

Yep, all those words to say "use the share of 1s as the $\theta$."

Not so fast. True, for $H_0$, we use the share of ones

$\theta_0 = (k_C + k_A)/(N_C + N_A) = 31/50 = 0.62$;

but for $H_1$, we need to address the audience's $1-\theta_1$ by reverse coding the zeros and ones, in other words,

$\theta_1 = (k_C + (N_A - k_A))/(N_C + N_A) = 11/50 = 0.22$.

Yes, those two fractions are "estimation." Maximum likelihood estimation, at that.

Now that we are done with the dirty statistics, we come back to the shiny world of math, by using our estimates to solve the math problem. That requires a small bit of combinatorics and probability theory, all in a single sentence:

If each individual data point is an independent and identically distributed Bernoulli variable, the sum of these data points follows the binomial distribution.

Therefore the desired probabilities, which are joint probabilities of two binomial distributions, one for the critics, one for the audience, are

$\Pr(\mathrm{Data}| H_0) = c(N_C,k_C) (\theta_0)^{k_C} (1- \theta_0)^{N_C- k_C} \times c(N_A,k_A) (\theta_0)^{k_A} (1- \theta_0)^{N_A- k_A}$

and

$\Pr(\mathrm{Data}| H_1) = c(N_C,k_C) (\theta_1)^{k_C} (1- \theta_1)^{N_C- k_C} \times c(N_A,k_A) (1 -\theta_1)^{k_A} (\theta_1)^{N_A- k_A}$.

Replacing the symbols with the estimates and the data we get

$\Pr(\mathrm{Data}| H_0) = 3.222\times 10^{-5}$;
$\Pr(\mathrm{Data}| H_1) = 3.066\times 10^{-2}$.

We can now compute the likelihood ratio,

$LR = \frac{\Pr(\mathrm{Data}| H_1)}{\Pr(\mathrm{Data}| H_0)} \approx 951$,

and translate that into words to make the statement
It's about 951 times more likely that the critics are using criteria opposite to those of the audience than the same criteria.
Isn't that a lot more satisfying than saying they "appear to be at loggerheads"?
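
For readers who want to check the arithmetic, here's a minimal Python sketch (Python 3.8+, for math.comb) that reproduces the whole computation:

from math import comb

# Rotten Tomatoes counts quoted above
N_C, k_C = 10, 1    # critics: total reviews, positive reviews
N_A, k_A = 40, 30   # audience: total reviews, positive reviews

# Maximum likelihood estimates under the two hypotheses
theta_0 = (k_C + k_A) / (N_C + N_A)          # H0, same criteria: 0.62
theta_1 = (k_C + (N_A - k_A)) / (N_C + N_A)  # H1, opposite criteria: 0.22

def binom_pmf(n, k, p):
    # probability of k positives in n Bernoulli(p) reviews
    return comb(n, k) * p**k * (1 - p)**(n - k)

pr_H0 = binom_pmf(N_C, k_C, theta_0) * binom_pmf(N_A, k_A, theta_0)
pr_H1 = binom_pmf(N_C, k_C, theta_1) * binom_pmf(N_A, k_A, 1 - theta_1)

print(pr_H0, pr_H1, pr_H1 / pr_H0)  # ~3.22e-05, ~3.07e-02, ~951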

Sunday, February 12, 2017

Word Thinkers and the Igon Value Problem

Nassim Nicholas Taleb did it again: "word thinkers," now a synonym for his previous coinage IYI (Intellectuals Yet Idiots).
I often say that a mathematician thinks in numbers, a lawyer in laws, and an idiot thinks in words. These words don’t amount to anything. 
A little unfair, though I've often cringed at the use of technical words by people who don't seem to know the meaning of those words. This sometimes leads to never-ending words-only arguments about things that can be determined in minutes with basic arithmetic or with a spreadsheet.


Not to rehash the Heisenberg traffic stop example, here's one from a recent discussion of the putative California secession from the US (and already mentioned in this blog): people discussed California's need for electricity, with the pro-Calexit people assuming that appropriate capacity could be added in a jiffy and the con-Calexit people assuming the state would instantly be blacked out.

No one thought of actually looking up the numbers and checking the needs. Using 2015 numbers, California would need to add about 15GW of new dispatchable generation for energy independence, assuming no demand growth. (Computations in this post.) So, that's a lot, but not insurmountable in, say, a decade with no regulatory interference. Maybe even less time, with newer technologies (yes, all nuclear; call it a French connection).

There was no advanced math in that calculation: literally add and divide. And the data was available online. But the "word thinkers" didn't think about their words as having meaning.

And that's it: the problem is not so much that they think in words, but rather that they don't attach any meaning to the words. They are just words, and all that matters is their aesthetic and signaling value.

Few things exemplify the problem of these words-without-meaning as well as The Igon Value Problem.

In a review of Malcolm Gladwell's collection of essays "What the dog saw and other adventures" for The New York Times, Steven Pinker coined that phrase, picking on a problem of Gladwell that is common to the words-without-meaning thinkers:
An eclectic essayist is necessarily a dilettante, which is not in itself a bad thing. But Gladwell frequently holds forth about statistics and psychology, and his lack of technical grounding in these subjects can be jarring. He provides misleading definitions of “homology,” “sagittal plane” and “power law” and quotes an expert speaking about an “igon value” (that’s eigenvalue, a basic concept in linear algebra). In the spirit of Gladwell, who likes to give portentous names to his aperçus, I will call this the Igon Value Problem: when a writer’s education on a topic consists in interviewing an expert, he is apt to offer generalizations that are banal, obtuse or flat wrong. [Emphasis added]
Educational interlude:
Eigenvalues of a square $[n\times n]$ matrix $M$ are the constants $\lambda_i$ associated with vectors $x_i$ such that $M \, x_i = \lambda_i \, x_i$. In other words, these vectors, called eigenvectors, are along the directions in $n$-dimensional space that are unchanged when operated upon by $M$; the $\lambda_i$ are proportionality constants that show how the vectors stretch in that direction. Because of this $n$-dimensional geometric interpretation, the $x_i$ are the matrix's "own vectors" (in German, eigenvectors) and by association the $\lambda_i$ are the "own values" (in German, you guessed it, eigenvalues). 
Eigenvectors and eigenvalues reveal the deep structure of the information content of whatever the matrix represents. For example: if $M$ is a matrix of covariances among statistical variables, the eigenvectors represent the underlying principal components of the variables; if $M$ is an incidence matrix representing network connections, the eigenvector with the highest eigenvalue ranks the centrality of the nodes in the network.
This educational interlude is a demonstration of the use of words (note that there's no actual derivation or computation in it) with deep meaning, in this case mathematical.
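
(For readers who'd rather see one computed than described, here's a minimal numpy sketch; the matrix is an arbitrary toy example of mine.)

import numpy as np

# A toy symmetric matrix, e.g. a 2x2 covariance matrix.
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(M)
print(eigenvalues)  # 3 and 1 (order may vary)

# Check the defining property M x = lambda x for the first pair:
x0 = eigenvectors[:, 0]
print(np.allclose(M @ x0, eigenvalues[0] * x0))  # True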

Being a purveyor of "generalizations that are banal, obtuse or flat wrong" hasn't harmed Gladwell; in fact, his success has spawned a cottage industry of what Taleb is calling word-thinkers, which apparently are now facing an impending rebellion.

Taleb talks about 'skin in the game,' which is a way of saying: have an outside validator. Not popularity, not social signaling; money, physical results, a verifiable mathematical proof. All of these come with the one thing word-thinkers avoid:

A clear succeed/fail criterion.

- - - - - - - - - -

Added 2/16/2017: An example of word-thinking over quantitative matters.

From a discussion about Twitter, motivated by their filtering policies:
Person A: "I wonder how long Twitter can burn money, billions/yr.  Who is funding this nonsense?"
My response: "Actually, from latest available financials, TWTR had a $\$ 77$ million positive cash flow last year. Even if its revenue were to dry up, the operational cash outflow is only $\$ 220$ million/year; with a $\$ 3.8$ billion cash-in-hand reserve, it can last around 17 years at zero inflow."
Numbers are easy to obtain and the only necessary computation is a division. But Person A didn't bother to (a) look up the TWTR financials, (b) search for the appropriate entries, and (c) do a simple computation.

That's the problem with word thinking about quantitative matters: those who take the extra quant step will always have the advantage. As far as truth and logic are concerned, of course.

Friday, January 13, 2017

Medical tests and probabilities

You may have heard this one, but bear with me.

Let's say you get tested for a condition that affects ten percent of the population and the test is positive. The doctor says that the test is ninety percent accurate (presumably in both directions). How likely is it that you really have the condition?

[Think, think, think.]

Most people, including most doctors themselves, say something close to $90\%$; they might shade that number down a little, say to $80\%$, because they understand that "the base rate is important."

Yes, it is. That's why one must do computation rather than fall prey to anchor-and-adjustment biases.

Here's the computation for the example above, using Bayes's rule with base rate $0.1$ and accuracy $0.9$:

\[ \Pr(\text{sick}|\text{positive}) = \frac{0.1 \times 0.9}{0.1 \times 0.9 + 0.9 \times 0.1} = \frac{0.09}{0.18} = 0.5 .\]

One-half. That's the probability that you have the condition given the positive test result.

We can get a little more general: if the base rate is $\Pr(\text{sick}) = p$ and the accuracy (assumed symmetric) of the test is $\Pr(\text{positive}|\text{sick}) = \Pr(\text{negative}|\text{not sick})  = r $, then the probability of being sick given a positive test result is

\[ \Pr(\text{sick}|\text{positive}) = \frac{p \times r}{p \times r + (1- p) \times (1-r)}. \]

The following table shows that probability for a variety of base rates and test accuracies (again, assuming that the test is symmetric, that is the probability of a false positive and a false negative are the same; more about that below).
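
Since the table was an image, here's a sketch in Python that regenerates one like it; the particular grid of base rates and accuracies is my choice, not necessarily the original one.

def pr_sick_given_positive(p, r):
    # p: base rate Pr(sick); r: symmetric accuracy Pr(pos|sick) = Pr(neg|not sick)
    return (p * r) / (p * r + (1 - p) * (1 - r))

base_rates = [0.001, 0.01, 0.1, 0.25, 0.5]
accuracies = [0.8, 0.9, 0.95, 0.99]

print("p \\ r " + "  ".join(f"{r:>6.2f}" for r in accuracies))
for p in base_rates:
    row = "  ".join(f"{pr_sick_given_positive(p, r):6.3f}" for r in accuracies)
    print(f"{p:<6g}" + row)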


A quick perusal of this table shows some interesting things, such as the really low probabilities, even with very accurate tests, for the very small base rates (so, if you get a positive result for a very rare disease, don't fret too much, do the follow-up).


There are many philosophical objections to all the above, but as a good engineer I'll ignore them all and go straight to the interesting questions that people ask about that table, for example, how the accuracy or precision of the test works.

Let's say you have a test of some sort, cholesterol, blood pressure, etc; it produces some output variable that we'll assume is continuous. Then, there will be a distribution of these values for people who are healthy and, if the test is of any use, a different distribution for people who are sick. The scale is the same, but, for example, healthy people have, let's say, blood pressure values centered around 110 over 80, while sick people have blood pressure values centered around 140 over 100.

So, depending on the variables measured, the type of technology available, the combination of variables, one can have more or less overlap between the distributions of the test variable for healthy and sick people.

Assuming for illustration normal distributions with equal variance, here are two different tests, the second one being more precise than the first one:



Note that these distributions are fixed by the technology, the medical variables, the biochemistry, etc; the two examples above would, for example, be the difference between comparing blood pressures (test 1) and measuring some blood chemical that is more closely associated with the medical condition (test 2), not some statistical magic made on the same variable.

Note that there are other ways that a test A can be more precise than test B, for example if the variances for A are smaller than for B, even if the means are the same; or if the distributions themselves are asymmetric, with longer tails on the appropriate side (so that the overlap becomes much smaller).

(Note that the use of normal distributions with similar variances above was only for example purposes; most actual tests have significant asymmetries and different variances for the healthy versus sick populations. It's something that people who discover and refine testing technologies rely on to come up with their tests. I'll continue to use the same-variance normals in my examples,  for simplicity.) 


A second question that interested (and interesting) people ask about these numbers is why the tests are symmetric (the probability of a false positive equal to that of a false negative). 

They are symmetric in the examples we use to explain them, since it makes the computation simpler. In reality almost all important preliminary tests have a built-in bias towards the most robust outcome.

For example, many tests for dangerous conditions have a built-in positive bias, since the outcome of a positive preliminary test is more testing (usually followed by relief since the positive was a false positive), while the outcome of a negative can be lack of treatment for an existing condition (if it's a false negative).

To change the test from a symmetric error to a positive bias, all that is necessary is to change the threshold between positive and negative towards the side of the negative:
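
In code, this is a sketch of that threshold shift, using the same-variance normal example (the means, 0 for healthy and 2 for sick, are my stand-ins):

from math import erf, sqrt

def normal_cdf(x, mu, sigma=1.0):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu_healthy, mu_sick = 0.0, 2.0

for threshold in (1.0, 0.5):  # 1.0 is the midpoint; 0.5 is shifted toward the negative side
    false_pos = 1 - normal_cdf(threshold, mu_healthy)  # healthy flagged positive
    false_neg = normal_cdf(threshold, mu_sick)         # sick flagged negative
    print(threshold, round(false_pos, 3), round(false_neg, 3))

# threshold 1.0: FP 0.159, FN 0.159 (symmetric)
# threshold 0.5: FP 0.309, FN 0.067 (biased toward false positives)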



In fact, if you, the patient, have access to the raw data (you should be able to, at least in the US where doctors treat patients like humans, not NHS cost units), you can see how far off the threshold you are and look up actual distribution tables on the internet. (Don't argue these with your HMO doctor, though, most of them don't understand statistical arguments.)

For illustration, here are the posterior probabilities for a test that has bias $k$ in favor of false positives, understood as $\Pr(\text{positive}|\text{not sick}) = k \times \Pr(\text{negative}|\text{sick})$, for some different base rates $p$ and probability of accurate positive test $r$ (as above):
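
The table itself was an image, but with the bias $k$ defined as above it's a one-line change to the earlier function (a sketch; the example values are mine):

def pr_sick_given_positive_biased(p, r, k):
    # Pr(pos|not sick) = k * Pr(neg|sick) = k * (1 - r)
    return (p * r) / (p * r + (1 - p) * k * (1 - r))

for k in (1, 2, 5, 10):  # k = 1 recovers the symmetric case
    print(k, round(pr_sick_given_positive_biased(0.1, 0.9, k), 3))

# 1: 0.5   2: 0.333   5: 0.167   10: 0.091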


So, this is good news: if you get a scary positive test for a dangerous medical condition, that test is probably biased towards false positives (because of the scary part) and therefore the probability that you actually have that scary condition is much lower than you'd think, even if you'd been trained in statistical thinking (because that training, for simplicity, almost always uses symmetric tests). Therefore, be a little more relaxed when getting the follow-up test.


There's a third interesting question that people ask when shown the computation above: the probability of someone getting tested to begin with. It's an interesting question because in all these computational examples we assume that the population that gets tested has the same distribution of sick and healthy people as the general population. But the decision to be tested is usually prompted by something (mild symptoms, hypochondria, a job requirement), so the population of those tested may have a higher incidence of the condition than the general population.

This can be modeled by adding elements to the computation, which makes it more cumbersome and detracts from its value in making the point that base rates are very important. But it's a good elaboration, and many models used by doctors over-estimate base rates precisely because they miss this probability of being tested. More good news there!


Probabilities: so important to understand, so thoroughly misunderstood.


- - - - -
Production notes

1. There's nothing new above, but I've had to make this argument dozens of times to people and forum dwellers (particularly difficult when they've just received a positive result for some scary condition), so I decided to write a post that I can point people to.

2. [warning: rant] As someone who has railed against the use of spline drawing and quarter-ellipses in other people's slides, I did the right thing and plotted those normal distributions from the actual normal distribution formula. That's why they don't look like the overly-rounded "normal" distributions in some other people's slides: those people make their "normals" with free-hand spline drawing and their exponentials with quarter-ellipses. That's extremely lazy in an age when any spreadsheet, RStats, Matlab, or Mathematica can easily plot the actual curve. The people I mean know who they are. [end rant]

Saturday, March 5, 2016

Powerlifters vs Gym Rats - A tale of two means

In my last post I wrote:

For example, some time ago I had a discussion with a friend about strength training. The gist of it was that powerlifters are typically much stronger than the average athlete, but they are also much fewer; because of that, in a typical gym the strongest athlete might not be a powerlifter, but as we get into regional competitions and national competitions, the winner is going to be a powerlifter.

And the explanation, which the friend didn't understand, was "because on the upper tail the difference between means is going to dominate the difference in sizes of the population."

So here's an illustration of what I meant, with pictures and numbers and bad jokes.

First let's make the setup explicit. That's the great power of math and numerical examples, making things explicit. "Powerlifters are typically much stronger than the average athlete" will be operationalized with four assumptions:
A1: There's some composite metric of strength we care about, call it $S$; we'll normalize it so that the gym-rat population has mean $\mu(S_{\mathrm{GR}}) = 0$ and variance $1$.
A2: The distribution of strength within the population of gym rats is Normally distributed. 
A3: The distribution of strength in the sub-population of powerlifters is also Normally distributed. 
A4: For illustration purposes only, we will assume that powerlifters have a mean $\mu(S_{\mathrm{PL}})$ of 2 and the same variance as the rest of the gym rats.
We operationalize "they are also much fewer" with
A5: For illustration, the number of powerlifters is $1\%$ of gym rats.
(Powerlifters are gym rats, so the distribution for $S_{\mathrm{GR}}$ includes these $1\%$, balanced by CrossFit people, who bring down the mean strength and IQ in the gym while raising the insurance premiums. Watch Elgintensity to understand.)

The following figure shows the distributions:




When we look at the people in a gym with above-average strength, that is people with $S_{\mathrm{GR}}>0$, we find that one-half of all gym rats have that, and $98\%$ of all powerlifters have that: $\Pr(S_{\mathrm{GR}}>0) = 0.5$ and $\Pr(S_{\mathrm{PL}}>0) = 0.98$. This is illustrated in the next figure:



Powerlifters are over-represented among the above-average-strength athletes, at about twice their share of the general gym population, but they are still only about $2\%$ of the total, as the roughly 2-fold over-representation is multiplied by their $1\%$ share.

As we become more selective, the over-representation goes up. For athletes that are at least one standard deviation above the mean, we have:



with $\Pr(S_{\mathrm{GR}}>1) = 0.16$ and $\Pr(S_{\mathrm{PL}}>1) = 0.84$. Powerlifters are over-represented 5-fold, so about $5\%$ of the total athletes in this category.

When we become more and more selective, for example when we compute the number of gym rats that have at least as much strength as the average powerlifter, $\Pr(S_{\mathrm{GR}}>2)$, we get



with $\Pr(S_{\mathrm{GR}}>2) = 0.023$ and $\Pr(S_{\mathrm{PL}}>2) = 0.5$, a 22-fold over-representation, meaning that of every six athletes in this category, one is a powerlifter. (Yes, one out of six, not one out of five. See if you can figure out why; if not, look at the solution for $S>6$ below and you'll understand. Or not, but that's a different problem.)

And as we look at subsets of stronger and stronger athletes, the over-representation of powerlifters becomes higher and higher: $\Pr(S_{\mathrm{GR}}>3) = 0.00135$ and $\Pr(S_{\mathrm{PL}}>3) = 0.159$, a $118$-fold ratio. There will be a few more powerlifters in this group than other gym rats; another way to say that is that powerlifters will be a little bit more than one-half of all gym rats that are at least one standard deviation stronger than the average powerlifter.

The ratios grow exponentially with increasing values for strength (the rare correct use of "exponentially" as they are ratios of Normal distribution tail probabilities; see below).

For $S>4$ the ratio is $718$, for $S>5$ the ratio is $4700$, and for $S>6$ the ratio is $32\,100$; in other words, there will be one non-powerlifter per group of $322$ gym rats with strength greater than 6 standard deviations above the mean of all gym rats.
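
These tail probabilities and ratios take a couple of lines to reproduce ($\Pr(Z > x) = \mathrm{erfc}(x/\sqrt{2})/2$ for a standard normal):

from math import erfc, sqrt

def upper_tail(x, mu=0.0):
    # Pr(S > x) for S ~ Normal(mu, 1)
    return 0.5 * erfc((x - mu) / sqrt(2))

for s in range(7):
    ratio = upper_tail(s, mu=2.0) / upper_tail(s, mu=0.0)  # powerlifters / gym rats
    print(s, round(ratio))

# S>2: ~22, S>4: ~718, S>6: ~32,100 -- the numbers above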

This is what the difference in the tails of Normals always implies: eventually the small size of the stronger population (powerlifters) becomes irrelevant, as the higher mean dominates.

See? That wasn't complicated at all.

-- -- -- --

For the mathematically inclined (strangely themselves over-represented in the set of powerlifters...)

Note that the ratio of probability density functions for the two Normal distributions in the post, for realizations of strength $S = x$ is
\[
\frac{f_{S}(x|\mu_{S}=2)}{f_{S}(x|\mu_{S}=0)}= \frac{e^{-(x-2)^2/2}}{e^{-x^2/2}}= e^{2x-2}
\]
which grows unbounded with $x$; no matter how small the fraction of powerlifters, say $\epsilon$, there's always a minimal $\bar S$ beyond which that ratio becomes greater than $1/\epsilon$. This means that at some point above $\bar S$ the ratio of the remaining tail probabilities itself becomes greater than $1/\epsilon$. (It's very easy to calculate $\bar S$ and I have done so; I'll leave it as an exercise for the dedicated reader...)

Oh, that's the rare occurrence of the correct use of "exponentially," which is usually incorrectly treated as a synonym for "convex."

Wednesday, March 2, 2016

Acalculia, innumeracy, or numerophobia?

I think there's an epidemic of number-induced brain paralysis going around.

There are quite a few examples of quant questions in interviews creating the mental equivalent of a frozen operating system (including this post by Sprezzaturian), but I think that there's something beyond that, something that applies in social situations and that affects people who should know better.

Here's a simple example. What is the orbital speed of the International Space Station, roughly? No, don't google it, calculate it. Orbital period is about 90 minutes, altitude (distance to ground) about 400km, Earth radius is about 6370km.

Seriously, this question stumps people with university degrees, including some in the life sciences who necessarily have taken college level science courses.

And what college-level math do you need to answer it? The formula for the circumference of a circle of radius $r$. Yes, $2\times\pi\times r$. The orbital velocity in km/h is the total number of kilometers per orbit ($2\times\pi\times (6370+400)$) divided by the time to orbit in hours ($1\frac{1}{2}$), that is around $28\,000$ km/h, which is close to the actual value, $27\,600$ km/h. (The orbit is an ellipse and takes more than 90 minutes.)
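
For the truly numerophobic, the whole thing in Python:

from math import pi

earth_radius_km = 6370
altitude_km = 400
period_h = 1.5

print(2 * pi * (earth_radius_km + altitude_km) / period_h)  # ~28,360 km/h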

Can it possibly be ignorance, innumeracy? Is it plausible that college-educated professionals don't know the circumference formula?  Nope, they can recite the formula when prompted.

Or is it acalculia? That they have a mental inability to do calculation? Nope, they can compute exactly how much I owe on the lunch bill for the extra crème brûlée and the expensive entrée.

No, I think it's a mild case of numerophobia, a mental paralysis created by the appearance of an unexpected numerical challenge in normal life. This is a problem, as most of the world can be perceived more deeply if one thinks like a quant all the time; many strange "paradoxes" become obvious when seen through the lens of numerical (or parametrical) thinking.

For example, some time ago I had a discussion with a friend about strength training. The gist of it was that powerlifters are typically much stronger than the average athlete, but they are also much fewer; because of that, in a typical gym the strongest athlete might not be a powerlifter, but as we get into regional competitions and national competitions, the winner is going to be a powerlifter.

"That's because on the upper tail the difference between means is going to dominate the difference in sizes of the population." That quoted sentence is what I said. I might as well have said "boo-blee-gaa-gee in-a-gadda-vida hidee-hidee-hidee-oh" for all the comprehension. The friend is an engineer. A numbers person. But apparently, numbers are work-domain only.

The awesome power of quant thinking is being blocked by this strange social numerophobia. We must fight it. Liberate your inner quant; learn to love numbers in all areas of life.

Everything is numbers.

Saturday, February 18, 2012

Analysis of the Tweets vs. Likes at the Monkey Cage

I find the question of what posts are more likely to be tweeted than liked a little strange; ideally one would want more of both.

The story so far:  a Monkey Cage post proposed some hypotheses for what characteristics of a post made it more likely to be tweeted than liked. Causal Loop did the analysis (linked at the Monkey Cage) using a composite index. Laudable as the analysis was (and how different Political Science is from the 1990s), I think I can improve upon it.

First, there are 51 (of 860 total) posts with zero likes and zero tweets. This is important information: these are posts that no one thought worthy of social media attention. Unlike Causal Loop, I want to keep these data in my dataset.

Second, instead of a ratio of likes to tweets (or more precisely, an index based on a modified ratio), I'll estimate separate models for likes and tweets, with comparable specifications. To see the problem with ratios consider the following three posts

Post A: 4 tweets, 2 likes
Post B: 8 tweets, 2 likes
Post C: 400 tweets, 200 likes

A ratio metric treats posts A and C as identical, while separating them from post B. But intuitively we expect a post like C, which generates a lot of social media activity in aggregate, to be different from posts A and B, which don't. (This scale insensitivity is a general characteristic of ratio measures.) This is one of the reasons I prefer disaggregate models. Another reason is that adding Google "+1"s would be trivial to a disaggregate model -- just run the same specifications for another dependent variable -- and complex to a ratio-based index.

To test various hypotheses one can use appropriate tests on the coefficients of the independent variables in the models or simulations to test inferences when the specifications are different (and a Hausman-like test isn't conveniently available). That's what I would do for more serious testing. With identical specifications one can compare the z-values, of course, but that's a little too reductive.

Since the likes and tweets are count variables, all that is necessary is to model the processes generating each as the aggregation of discrete events. For this post I assumed a Poisson process; its limitations are discussed below.

I loaded Causal Loop's data into Stata (yes, I could have done it in R, but since the data is in Stata format and I still own Stata, I minimized effort) and ran a series of nested Poisson models: first with only the basic descriptor variables (length, graphics, video, grade level), then adding the indicator variables for the authors, then adding the indicator variables for the topics. The all-variables-included model results:

Determinants of likes and tweets for posts in The Monkey Cage blog
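
For readers without Stata, the same kind of estimation can be sketched in Python with statsmodels; the file name and column names below are my stand-ins for Causal Loop's variables, not the actual dataset.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("monkeycage.csv")  # hypothetical export of Causal Loop's data

# Basic descriptors only; author and topic indicators would be added the same way.
X = sm.add_constant(df[["length", "graphics", "video", "grade_level"]])

likes = sm.GLM(df["likes"], X, family=sm.families.Poisson()).fit()
tweets = sm.GLM(df["tweets"], X, family=sm.families.Poisson()).fit()
print(likes.summary())
print(tweets.summary())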

A few important observations regarding this choice of models:

1. First and foremost, I'm violating the Prime Directive of model-building: I'm unfamiliar with the data. I read the Monkey Cage regularly, so I have an idea of what the posts are, but I didn't explore the data to make sure I understood what each variable meant or what the possible instantiations were. In other words, I acted as a blind data-miner. Never do this! Before building models always make sure you understand what the data mean. My excuse is that I'm not going to take the recommendations seriously and this is a way to pass the morning on Saturday. But even so, if you're one of my students, do what I say, not what I just did.

2. The choice of Poisson process as basis for the count model, convenient as it is, is probably wrong. There's almost surely state dependence in liking and tweeting: if a post is tweeted, then a larger audience (Twitter followers of the person tweeting rather than Monkey Cage readers) gets exposed to it, increasing the probability of other tweets (and also of likes -- generated from the diffusion on Twitter which brings people to the Monkey Cage who then like posts to Facebook). By using Poisson, I'm implicitly assuming a zero-order process and independence between tweets and likes -- which is almost surely not true.

3. I think including the zeros is very important. But my choice of a non-switching model implies that the differences between zero and other number of likes and tweets is only a difference of degree. It is possible, indeed likely, that they are differences of kind or process. To capture this, I'd have to build a switching model, where the determinants of zero likes or tweets were allowed to be separate from the determinants of the number of tweets and likes conditional on their being nonzero.

With all these provisos, here are some possible tongue-in-cheek conclusions from the above models:
  • Joshua Tucker doesn’t influence tweetability, but his authorship decreases likability; ditto for Andrew Gelman and John Sides. Sorry, guys.
  • James Fearon writes tweetable but not likable content.
  • Potpourri is the least tweetable tag and also not likable; International relations is the most tweetable but not likable; Frivolity, on the other hand, is highly likable. That says something about Facebook, no? 
  • Newsletters are tweetable but not likable… again Nerds on Twitter, Airheads on Facebook.
As for Joshua Tucker's hypotheses, I find some support for them from examining the models, but I wouldn't want to commit to support or reject before running some more elaborate tests.

Given my violation of the Prime Directive of model building (make sure you understand the data before you start building models), I wouldn't start docking the -- I'm sure -- lavish pay and benefits afforded by the Monkey Cage to its bloggers based on the numbers above.

Thursday, February 9, 2012

Quantitative thinking: not for everyone. And that's bad.

Not all smart people are quantitative thinkers.

I've noticed that some smart people I know have different views of the world. Not just social, cultural, political, or aesthetic. They really do see the world through different conceptual lenses: mine are quantitative, theirs are qualitative.

Let's keep in mind that these are smart people who, when prompted to do so, can do basic math. But many of them think about the world in general, and most problems in particular, in a dequantified manner or limit their quantitative thinking in ways that don't match their knowledge of math.

Level 1 - Three different categories for quantities

Many people seem to hold the three-level view of numbers: all quantities are divided into three bins: zero, one, many. In a previous post I explain why it's important to drill down into these categories: putting numbers in context requires, first of all, that the numbers are actual numbers, not categorical placeholders.

This tripartite view of the world is particularly bad when applied to probabilistic reasoning, because the world then becomes a three-part proposition: 0 (never), 50-50 (uncertain, which is almost always treated as the maximum entropy case), or 1 (always).

Once, at a conference, I was talking to a colleague from a prestigious school who, despite agreeing that a probability of 0.5 is different from a probability of 0.95, proceeded to argue his point based on an unstated 50-50 assumption. Knowing that $0.5 \neq 0.95$ didn't have any impact in his tripartite view of the world of uncertainty.

The problem with having a discussion with someone who thinks in terms of {zero, one, many} is that almost everything worth discussing requires better granularity than that. But the person who thinks thusly doesn't understand that it is even a problem.



Level 2 - Numbers and rudimentary statistics

Once we're past categorical thinking, things become more interesting to quantitatively focused people; this, by the way, is where a lot of muddled reasoning enters the picture. After all, many colleagues at this level of thinking believe that, by going beyond the three-category view of numbers, they are "great quants," which only proves the Dunning-Kruger effect applies.

For illustration we consider the relationship between two variables, $x$ and $y$, say depth of promotional cut (as a percentage of price) and promotional lift (as a percentage increase in unit sales due to promotion). Yep, a business example; could be politics or any social science (or science for that matter), but business is a neutral field.

At the crudest level, understanding the relationship between $x$ and $y$ can be reduced to determining whether that relationship exists at all; usually this is done by determining whether variation in one, $x$, can predict variation in the other, $y$.  For example, a company could run a contrast experiment ("A-B test" for those who believe Google invented experiments) by having half their stores run a promotion and half not; the data would then be, say:

Sales in stores without promotion: 200,000 units/store
Sales in stores with promotion: 250,000 units/store

Looks like a relationship, right? An apparent 25-percent lift (without knowing the depth of the price cut I can't comment on whether this is good or bad). But what if the average sales for all stores, when there are no promotions in any store, is 240,000 units/store? All this promotion apparently did was discourage some customers in the stores without promotions (the customers know about the promotion in other stores because you cannot stop information from diffusing over social media, for example) and incentivize a few of the discouraged to look for the stores running the promotion.

(A lot of anecdotes used to support public policy make the sort of mistake I just illustrated. There are plenty of other mistakes, too.)

To go beyond the simple observation of numbers and to use statistical tests, we need to have some formulation of the relationship, for example a linear one such as:

$\qquad y = \beta \, x + \epsilon$.

This formulation includes a term $\epsilon$ (called stochastic disturbance) which is the modeler's admission that we don't know everything we'd like to. (All tests have an underlying structure, even non-parametric tests; when people say that there's no structure what they are really saying is that they don't understand how the test works.)

Given some pairs of observations $\{(x_1,y_1), (x_2,y_2),\ldots\}$ , the relationship can be tested by estimating the parameter $\beta$ and determining whether the estimate $\hat \beta$ is significantly different from zero. If it's not, that means that the value of $y$ is statistically independent of $x$ (to the level of the test) and there is no relationship between them -- as far as statistical significance is concerned.
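
As a sketch of what that estimation and test look like (on simulated data, since the promotion example is hypothetical):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 50, size=200)          # depth of promotional cut, in percent
y = 0.8 * x + rng.normal(0, 5, size=200)  # promotional lift, with disturbance epsilon

res = sm.OLS(y, sm.add_constant(x)).fit()
print(res.params)   # intercept and beta-hat
print(res.pvalues)  # is beta-hat significantly different from zero?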

There's a lot to argue about significance testing, some of which I put in this video:


Once we get past simple tables and possibly the prepackaged statistical tests that can be done on these tables -- almost like an incantation, with statistical software taking the place of the magical forces -- few people remain who want to discuss details. But even within that small set, there are many different sub-levels of thinking.



Level 3 - Thinking in models and functions

Let's go back to the linear formulation in $y = \beta \, x + \epsilon$. What this means is that lift $y$ increases with price cut $x$ in a proportional way, independent of the magnitudes of each.

Ok, so what? ask a lot of people whose level of numerical reasoning is being stretched. The "what" is that the effect of a change of price cut from 4 to 5 percent is assumed to be equal to the effect of a change from 45 to 46 percent. And this assumption is probably not true (actually, we have empirical evidence that it is not).

Many people are able to repeat the rationale in the previous paragraph, but don't grok the implications.

The questions of where we go from this simple model are complicated. Let us ignore questions of causality for now and focus on how different people perceive the importance of details in the relationship between $x$ and $y$.

Increasing vs decreasing. Almost everyone who gets to this level of thinking cares about the direction of the effect. At this stage, however, many people forget that functions may be monotonic (increasing or decreasing) over an interval while outside that interval they may become non-monotonic (for example, increasing until a given point and then decreasing).

Convex versus concave. Even when the function is monotonic over the interesting domain, there's a big difference between linear, convex, and concave functions. Some disagreements with very smart people turned out to be over different assumptions regarding this second derivative: implicitly many people act as if the world is either linear or concave (assuming that the effect of adding 1 to 10 is bigger than the effect of adding 1 to 1000). As I pointed out in this post about network topologies and this post about models, combinatorics has a way of creating convexities. There's also a lot of s-shaped relationships in the world, but we'll leave those alone for now.

Functional form. As I illustrated in my post on long tails, two decreasing convex functions (the probability mass functions of the Poisson and Zipf distributions) can have very important differences. Empirical researchers are likely to care more about this than theoretical modelers, but once we reach the stage where we are discussing in these terms (and the group of people who can follow and participate in this discussion) arguments tend to be solved by mathematical inference or model calibration. In other words, leaving personal issues and inconvenient implications aside.

(Needless to say -- but I'll write it anyway -- this is the level of discussion I'd like to have when consequences are important. Alas, it's not very common; certainly not in the political or social sciences arena. In business and economics it's becoming more common and in STEM it's a foundation.)

Elaboration is still possible. I'll illustrate by noting that underlying assumptions (that I never made explicit, mind you) can come back to bite us in the gluteus maximus.

(Non-trivial statistics geekdom follows; skip till after the next picture to avoid some technical points about model building.)

Let's assume that we collect and store the data disaggregate by customer, so that $y_i$ is the quantity (not lift) bought by customer $i$; after all, we can always make aggregate data from disaggregate data but seldom can do the opposite. How would we analyze this data?

First observation: expenditures per customer are greater than zero, always. But for some values of $\epsilon$ our model would predict a negative $y_i$, hence a negative expenditure ($y_i$ times price, and price is a positive number). So our model needs to be tweaked to take into account the hard bound at zero.

If ours were retail stores, where the data collected by the PoS scanners is only available for customers who buy something (in other words, we don't observe $y$ when $y=0$), we would have to use a truncated regression; if we do observe the zeros (as on an online retail site), then a Tobit model will account for the pooling of the probability mass at zero.

Second observation: the number of units bought by any given customer is an integer; we keep treating it as a continuous quantity. Typically regression models and their variants like censored regression and Tobit assume that the stochastic disturbances are Normal variables. That would lead to possible $y_i = 1.35$, which is nonsensical in our new data: $y_i \in \{0,1,2,3,\ldots\}$.

Count models, like Poisson regression (which has its own assumptions), take the discreteness into account and correct the problems introduced by the continuity assumption. In olden days (when? the 50s?) these were hard models to estimate, but now they are commonly included in statistical packages, so there is no reason not to use them.

For illustration, here's what these models look like:

Illustrating model differences: OLS, Tobit, and Poisson
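
Since the chart was an image, here's a simulation sketch of the same contrast (the data-generating process is my invention, for illustration):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
price_cut = rng.uniform(0, 0.5, size=500)
units = rng.poisson(np.exp(0.2 + 2.0 * price_cut))  # integer counts by construction

X = sm.add_constant(price_cut)
ols = sm.OLS(units, X).fit()
poisson = sm.GLM(units, X, family=sm.families.Poisson()).fit()

# OLS happily predicts fractional (and, for other data, negative) unit counts;
# the Poisson regression respects the fact that units is in {0, 1, 2, ...}.
print(ols.predict(X)[:5], poisson.predict(X)[:5])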




Conclusion - why is it so hard to explain these things?

Thinking quantitatively is like a super-power: where others know of phenomena, we know how much of a phenomenon.*

The problem is that this is not like an amplifier super-power, like telescopic vision is to vision, but rather an orthogonal super-power, like the ability to create multiple instances of oneself. It's hard to explain to people without the super-power (people who don't think in numbers, even though they're smart) and it's hard to understand their point of view.

Contrary to the tagline of the television show Numb3rs, not everyone thinks in numbers.

That's a pity.


-- -- -- --
* A tip of the hat to Dilbert creator Scott Adams, via Ilkka Kokkarinen's blog for pointing this out in a post which is now the opening chapter of his book.

Wednesday, December 21, 2011

Powerful problems with power law estimation papers

Perhaps I shouldn't try to make resolutions: I resolved to blog book notes till the end of the year, and instead I'm writing something about estimation.

A power law is a relationship of the form $y = \gamma_0 x^{\gamma_1}$ and can be linearized for estimation using OLS (with a very stretchy assumption on stochastic disturbances, but let's not quibble) into

$\log(y) = \beta_0 + \beta_1 \log(x) +\epsilon$,

from which the original parameters can be trivially recovered:

$\hat\gamma_0 = \exp(\hat\beta_0)$ and $\hat\gamma_1 = \hat\beta_1$.

Power laws are plentiful in Nature, especially when one includes the degree distribution of social networks in a – generous and uncommon, I admit it – definition of Nature. A commonly proposed source of power law degree distributions is preferential attachment in network formation: the probability of a new node $i$ being connected to an old node $j$ is an increasing function of the degree of $j$.

The problem with power laws in the wild is that they are really hard to estimate precisely, and I got very annoyed at the glibness of some articles, which report estimation of power laws in a highly dequantized manner: they don't actually show the estimates or their descriptive statistics, only charts with no error bars.

Here's my problem: it's well-known that even small stochastic disturbances can make parameter identification in power law data very difficult. And yet, that is never mentioned in those papers. This omission, coupled with the lack of actual estimates and their descriptive statistics, is unforgivable. And suspicious.
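
A quick taste of the problem, with simulated data (the true parameters and the noise levels are my choices):

import numpy as np

rng = np.random.default_rng(2)
gamma_0, gamma_1 = 2.0, -1.5            # true power law: y = 2 * x**(-1.5)
x = np.linspace(1, 100, 200)

for noise_sd in (0.0, 0.001, 0.01):     # small additive disturbances on y
    y = gamma_0 * x**gamma_1 + rng.normal(0, noise_sd, size=x.size)
    ok = y > 0                          # log() requires positive values
    beta_1, beta_0 = np.polyfit(np.log(x[ok]), np.log(y[ok]), 1)
    print(noise_sd, round(np.exp(beta_0), 3), round(beta_1, 3))

# Even tiny noise wrecks the recovered exponent, because in the tail the
# true y values are comparable in size to the disturbances.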

Perhaps this needs a couple of numerical examples to clarify; as they say at the end of each season of television shows now:

– To be continued –

Friday, December 2, 2011

Dilbert gets the Correlation-Causation difference wrong

This was the Dilbert comic strip for Nov. 28, 2011:


It seems to imply that even though there's a correlation between the pointy-haired boss leaving Dilbert's cubicle and receiving an anonymous email about the worst boss in the world, there's no causation.

THAT IS WRONG!

Clearly there's causation: PHB leaves Dilbert's cubicle, which causes Wally to send the anonymous email. PHB's implication that he thinks Dilbert sends the email is wrong, but that doesn't mean that the correlation he noticed isn't in this case created by a causal link between leaving Dilbert's cubicle and getting the email.

I think Edward Tufte once said that the statement "correlation is not causation" was incomplete; at least it should read "correlation is not causation, but it sure hints at some relationship that must be investigated further." Or words to that effect.

Saturday, September 17, 2011

Small probabilities, big trouble.

After a long – work-related – hiatus, I'm back to blogging with a downer: the troublesome nature of small probability estimation.

The idea for this post came from a speech by Nassim Nicholas Taleb at Penn. Though the video is a bit rambling, it contains several important points. One that is particularly interesting to me is the difficulty of estimating the probability of rare events.

For illustration, let's consider a Normally distributed random variable $P$ and see what happens when small model errors are introduced. In particular, we want to see how the probability density $f_{P}(\cdot)$ predicted by four different models changes as a function of the distance to zero, $x$. The higher the $x$, the more infrequently events near $P = x$ happen.

The densities are computed in the following table:

[Table: densities $f_{P}(x)$ for the four models and their ratios to the base case]

The first column gives $f_{P}(x)$ for $P \sim \mathcal{N}(0,1)$, the base case. The next column is similar except that there's a 0.1% increase in the variance (10 basis points*). The third column is the ratio of these densities. (These are not probabilities, since $P$  is a continuous variable.)

Two observations jump at us:

1. Near the mean, where most events happen, it's very difficult to separate the two cases: the ratio of the densities up to two standard deviations ($x=2$) is very close to 1.

2. Away from the mean, where events are infrequent (but potentially with high impact), the small error of 10 basis points is multiplied: at highly infrequent events ($x>7$) the density is off by over 500 basis points.

So: it's very difficult to tell the models apart with most data, but they make very different predictions for uncommon events. If these events are important when they happen, say a stock market crash, this means trouble.

Moving on, the fourth column uses $P \sim \mathcal{N}(0.001,1)$, the same 10 basis points error, but in the mean rather than the variance. Column five is the ratio of these densities to the base case.

Comparing column five with column three we see that similarly sized errors in mean estimation have less impact than errors in variance estimation. Unfortunately variance is harder to estimate accurately than the mean (it uses the mean estimate as an input, for one), so this only tells us that problems are likely to happen where they are more damaging to model predictive abilities.

Column six shows the effect of a larger variance (100 basis points off the standard, instead of 10); column seven shows the ratio of this density to the base case.

With an error of 1% in the estimate of the variance it's still hard to separate the models within two standard deviations (for a Normal distribution about 95% of all events fall within two standard deviations of the mean), but the error in density estimates at $x=7$ is 62%.
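
A sketch reproducing the flavor of those columns; note that I'm applying the 10 and 100 basis-point errors to the standard deviation, which is the reading that matches the percentages quoted above:

from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

for x in (1, 2, 4, 7):
    base = normal_pdf(x)
    print(x,
          round(normal_pdf(x, sigma=1.001) / base, 4),  # 10 bp error in sigma
          round(normal_pdf(x, mu=0.001) / base, 4),     # 10 bp error in the mean
          round(normal_pdf(x, sigma=1.01) / base, 4))   # 100 bp error in sigma

# At x = 1 and 2 the ratios are within a fraction of a percent of 1;
# at x = 7 the sigma errors blow up to roughly 5% and 60%.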

Small-probability events are very hard to predict because, most of the time, all the information available is not enough to choose between models that have very close parameters but that predict very different things for infrequent cases.

Told you it was a downer.

-- -- --

* Some time ago I read a criticism of this nomenclature by someone who couldn't see its purpose. The purpose is good communication design: when there's a lot of 0.01% and 0.1% being spoken in a noisy environment, it's a good idea to say "one basis point" or "ten basis points" instead of "point zero one" or "zero point zero one" or "point zero zero one." It's the same reason we say "Foxtrot Uniform Bravo Alpha Romeo" instead of "eff u bee a arr" in audio communication.

NOTE for probabilists appalled at my use of $P$  in $f_{P}(x)$ instead of more traditional nomenclature $f_{X}(x)$ where the uppercase $X$ would mean the variable and the lowercase $x$ the value: most people get confused when they see something like $p=\Pr(x=X)$.

Monday, May 9, 2011

That 81% prediction, it looks good, but needs further elaboration

Bobbing around the interwebs today we find a post about a prediction of UBL's location. A tip of the homburg to Drew Conway, whose mention of it was the first I saw. Now, for the prediction itself.

As impressive as an 81% chance attributed to the actual location of UBL is, it raises three questions. These are important questions for any prediction system after its prediction is realized. Bear in mind that I'm not criticizing the actual prediction model, just the attitude of cheering for the probability without further details.

Yes, 81% is impressive; did the model make other predictions (say the location of weapons caches), and if so were they also congruent with facts? Often models will predict several variables and get some right and others wrong. Other predicted variables can act as quality control and validation. (Choice modelers typically use a hold-out sample to validate calibrated models.) It's hard to validate a model based on a single prediction.

Equally important is the size of the space of possibilities relative to the size of the predicted event. If the space was over the entire world, and the prediction pointed to Abbottabad but not Islamabad, that's impressive; if the space was restricted to Af/Pk and the model predicted the entire Islamabad district, that's a lot less impressive. I predict that somewhere in San Francisco there's a panhandler with a "Why lie, the money's for beer" poster; that's not an impressive prediction. If I predict that the panhandler is on the Market - Valencia intersection, that's impressive.

Selection is the last issue: was this the only location model for UBL or were there hundreds of competing models and we're just seeing the best? In that case it's less impressive that a model gave a high probability to the actual outcome: it's sampling on the dependent variable. For example, when throwing four dice once, getting 1-1-1-1 is very unlikely ($1/6^4 \approx 0.0008$); when throwing four dice 10 000 times, it's very likely that the 1-1-1-1 combination will appear in one of them (that probability is $1-(1- 1/6^4)^{10000} \approx 1$).
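
The dice arithmetic, for the skeptical:

p_all_ones = (1 / 6) ** 4
print(p_all_ones)                      # ~0.00077
print(1 - (1 - p_all_ones) ** 10_000)  # ~0.9996, near certainty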

Rules of model building and inference are not there because statisticians need a barrier to entry to keep the profession profitable. (Though they sure help with paying the bills.) They are there because there's a lot of ways in which one can make wrong inferences from good models.

Usama Bin Laden had to be somewhere; a sufficiently large set of models with large enough isoprobability areas will almost surely contain a model that gives a high probability to the actual location where UBL was, especially if it was allowed to predict the location of the top hundred Al-Qaeda people and it just happened to be right about UBL.

Lessons: 1) the value of a predicted probability $\Pr(x)$ for a known event $x$ can only be understood with the context of the predicted probabilities $\Pr(y)$ for other known events $y$; 2) we must be very careful in defining what $x$ is and what the space $\mathcal{X}: x \in \mathcal{X}$ is; 3) when analyzing the results of a model, one needs to control for the existence of other models [cough] Bayesian thinking [/cough].

Effective model building and evaluation need to take into account the effects of limited reasoning by those reporting model results, or, in simpler terms, make sure you look behind the curtain before you trust the magic model to be actually magical.

Summary of this post: in acrostic!