Tuesday, June 18, 2019

Hidden factor correlation



Correlation is not causation; everyone learns to say that. But if there's a correlation, there's probably some sort of causal relationship hiding somewhere, unless it's a spurious correlation.

If two variables $A$ and $B$ are correlated, the three simplest causal relationships are: $A$ causes $B$; $B$ causes $A$; or $A$ and $B$ are both caused by an unseen factor $C$. There are many more complicated causal structures, but these are the three basic ones.

The third case, where an unseen variable $C$ is the real source of the correlation, is our focus in this post. To illustrate it, let's say $C$ is a standard normal random variable, and $A$ and $B$ are noisy measures of $C$,

$ \qquad A = C + \epsilon_A$ and $ B = C + \epsilon_B$,

where the $\epsilon_i$ are independent draws from a zero-mean normal distribution with $\sigma_{\epsilon} = 0.05$.
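(An aside we can check against the output below: with these parameters, the theoretical slope of a regression of $A$ on $B$ is $\mathrm{Cov}(A,B)/\mathrm{Var}(B) = 1/(1+\sigma_{\epsilon}^2) \approx 0.9975$, since $\mathrm{Cov}(A,B) = \mathrm{Var}(C) = 1$ and $\mathrm{Var}(B) = 1 + \sigma_{\epsilon}^2$.)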

To illustrate, we generate 10,000 draws of $C$ and create the corresponding $A$ and $B$ using R:

hidden_factor = rnorm(10000)                         # C, the unseen common cause
var_A_visible = hidden_factor + 0.05 * rnorm(10000)  # A = C + noise
var_B_visible = hidden_factor + 0.05 * rnorm(10000)  # B = C + noise

Now we can plot $A$ against $B$, and the correlation is obvious:
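A minimal sketch of that plot in base R (the axis labels are my choice):

plot(var_B_visible, var_A_visible, xlab = "B", ylab = "A")  # points hug the diagonal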

And we can regress $A$ on $B$ to quantify the relationship and get test statistics for the estimates, using a linear model,

model_no_control = lm(var_A_visible~var_B_visible)
summary(model_no_control)

With the result:

Call:
lm(formula = var_A_visible ~ var_B_visible)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.271466 -0.047214 -0.000861  0.047400  0.302517 

Coefficients:
               Estimate Std. Error  t value Pr(>|t|)    
(Intercept)   0.0005294  0.0007025    0.754    0.451    
var_B_visible 0.9975142  0.0006913 1442.852  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.07025 on 9998 degrees of freedom
Multiple R-squared:  0.9952, Adjusted R-squared:  0.9952 
F-statistic: 2.082e+06 on 1 and 9998 DF,  p-value: < 2.2e-16

So, both the graph and the model confirm a strong correlation between $A$ and $B$ ($p < 2.2 \times 10^{-16}$). And in many real-life cases, this is used to support the idea that either $A$ causes $B$ or $B$ causes $A$.

Now we proceed to show how the hidden factor is relevant. First, let us plot the residuals, $A-C$ against $B-C$:
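Again, a minimal base-R sketch of that plot:

plot(var_B_visible - hidden_factor, var_A_visible - hidden_factor,
     xlab = "B - C", ylab = "A - C")  # a structureless cloud of noise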

The apparent correlation has now disappeared. And a linear model including the hidden factor confirms this:

model_with_control = lm(var_A_visible~var_B_visible+hidden_factor)
summary(model_with_control)

With the result:

Call:
lm(formula = var_A_visible ~ var_B_visible + hidden_factor)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.18347 -0.03382 -0.00021  0.03410  0.17780 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.0007378  0.0004986   1.480    0.139    
var_B_visible 0.0004082  0.0100560   0.041    0.968    
hidden_factor 0.9997573  0.0100707  99.274   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.04985 on 9997 degrees of freedom
Multiple R-squared:  0.9976, Adjusted R-squared:  0.9976 
F-statistic: 2.072e+06 on 2 and 9997 DF,  p-value: < 2.2e-16

Hidden factors are easy to test for, as seen here, but they are not always apparent. For example, in nutrition papers there's often a hidden factor, how health-conscious an individual is, driving both observables: say, exercising regularly and eating salads; high correlation, but exercising doesn't cause eating salads, and eating salads doesn't cause exercising.

Correlation is not causation, but one can generally find a causal relationship behind a correlation, possibly one that involves hidden factors or more complex structures.

Monday, June 17, 2019

Calculating God?

I don't believe that the God of any earthly religion is the creator of the Universe.

But I really dislike a lazy and innumerate argument commonly used to "prove" the non-existence of God, which can be summarized in the following false dichotomy:
Either there is no God and the universe just 'poofed' into existence, or there's an infinite number of Gods, because the plane of existence for each God has to be created by a higher-level God.
It's a false dichotomy because there are other possibilities: our universe may well have been created by a powerful being from a higher-order universe that itself poofed into existence without a creator. Or maybe that universe did have a creator, whose own universe poofed into existence; or that third universe may have had a creator...

Hey, this looks like dynamic programming. I know dynamic programming.

Let's say that universes are recursively nested until one of them just poofs into existence. Of course we can't see outside our universe, but we can build simple models.

So, our universe either poofed into existence (say, with probability $p$) or it was created by some higher being (with probability $1-p$). Now we iterate the process: the 'level 2' universe either poofed into existence (with some probability $q$) or was created by a being from a 'level 3' universe (with probability $1-q$); and so on.

Time for a simplifying assumption, or as non-mathematicians call it, making things up. Let's assume that all these universes share the poofed/created probabilities, so that for any 'level $k$' universe, it poofed into existence with probability $p$ and was created by a being from a 'level $k+1$' universe with probability $1-p$.

Note that it's still possible to have an infinite number of universes, but with this formulation, the probability of a 'level $k$' universe (with us being 'level 1') being the last level is

$p (1-p)^{k-1}$.

This probability gets small pretty quickly, which suggests the 'infinite regress of universes' argument gets thin very fast.
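To make that concrete, take $p = 0.1$: the probability that 'level 10' is the last one is $0.1 \times 0.9^{9} \approx 0.039$, and for 'level 50' it is $0.1 \times 0.9^{49} \approx 0.0006$.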



Now we can compute the expected number of universes as a function of $p$:

$\mathbb{E}(n) = p + 2(1-p)p + 3 (1-p)^2 p + \ldots + N (1-p)^{N-1} p + \ldots$

or

$\mathbb{E}(n) = \frac{p}{1-p} \sum_{N=1}^{\infty} N (1-p)^N$.

The sum $\sum_{N=1}^{\infty} N (1-p)^N$ equals $(1-p)/p^2$, so

$\mathbb{E}(n) = 1/p$.

(No surprise: with the shared-probability assumption, the number of nested universes is a geometric random variable with success probability $p$, and $1/p$ is its well-known mean.)

Therefore, if we believe that the probability of a universe poofing into existence is 0.1, there are an expected ten universes; for 0.2, five universes; for 0.5, two universes.
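We can sanity-check this with a quick simulation in R; note that rgeom counts the failures before the first success, so the number of universes is that count plus one:

for (p in c(0.1, 0.2, 0.5))
  print(mean(rgeom(100000, p) + 1))  # prints roughly 10, 5, 2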

Very far from 'turtles all the way down.'

Of course, these calculations were unnecessary, because as we know from the revelations of the prophet Terry Pratchett, it's four elephants on the back of the Great A'Tuin swimming in the Sea of Stars.

Wednesday, June 12, 2019

A statistical analysis of reviews of L.A. Finest: audience vs. critics



"If numbers are available, let's use the numbers. If all we have are opinions, let's go with mine." -- variously attributed to a number of bosses.

There's a new police procedural this season, L.A. Finest, and Rotten Tomatoes has done it again: critics and audience appear to be at loggerheads, like with The Orville, Star Trek: Discovery, and the last season of Doctor Who.

But "appear to be" is a dequantified statement. And Rotten Tomatoes has numbers; so, what can these numbers tell us?

Before they can tell us anything, we need to write down our question: first in words, then as a math problem. Then we solve the math problem and translate the solution back into a "words" answer, but now a quantified one.

The question suggested by those numbers (critics: 1 positive review out of 10; audience: 30 positive ratings out of 40) is:
Do the critics and the audience use similar or opposite criteria to rate this show?
One way to answer this question, which would have been feasible in the past when Rotten Tomatoes had user reviews, would be to do text analytics on the reviews themselves. But now the user reviews are gone, so that's no longer possible.

Another way, a simpler and cleaner way, is to use the data above.

To simplify, we'll assume that all ratings are either negative or positive, 0 or 1; unobservable random factors make some people like a show more or less, so these ratings are random variables. For a given person $i$, how much that person likes L.A. Finest is captured in an unobserved parameter $\theta_i$: the probability that person $i$ gives a positive rating.

So, our question above is whether the $\theta_i$ of the critics and the $\theta_i$ of the audience are the same or "opposed." And what is "opposed"? If $i$ and $j$ use opposite criteria, the probability that $i$ gives a 1 is the probability that $j$ gives a 0, so $\theta_i = 1-\theta_j$.

We don't have the individual parameters $\theta_i$ but we can simplify again by assuming that all variation within each group (critics or audience) is random, so we really only need two $\theta$.

We are comparing two situations. Call them: hypothesis zero, $H_0$, meaning the critics and the audience use the same criteria, that is, they have the same $\theta$, call it $\theta_0$; and hypothesis one, $H_1$, meaning the critics use criteria opposite to those of the audience, so if the critics' $\theta$ is $\theta_1$, the audience's $\theta$ is $(1-\theta_1)$.

Yes, I know, we don't have $\theta_0$ or $\theta_1$. We'll get there.

Our "words" question now becomes the following math problem: how much more likely is it that the data we observe is created by $H_1$ versus created by $H_0$, or in a formula: what is the likelihood ratio

$LR = \frac{\Pr(\mathrm{Data}| H_1)}{\Pr(\mathrm{Data}| H_0)} $?

Observation: This is different from the usual statistical test: the usual test is whether the two distributions are different; we are testing for a specific type of difference, opposition. So there are in fact three states of the world: same, opposite, and different-but-not-opposite; we want to compare the likelihood of the first two. If same is much more likely than opposite, we conclude 'same.' If opposite is much more likely than same, we conclude 'opposite.' If same and opposite have similar likelihoods (for some notion of 'similar' we'd have to investigate), we conclude 'different but not opposite.'

Our data is four numbers: number of critics $N_C = 10$, number of positive reviews by critics $k_C = 1$, number of audience members $N_A = 40$, number of positive reviews by audience members $k_A = 30$.

But what about the $\theta_0$ and $\theta_1$?

This is where the lofty field of mathematics gives way to the down and dirty world of estimation. We estimate $\theta$ by maximum likelihood, and the maximum likelihood estimator for the probability of a positive outcome of a binary random variable (called a Bernoulli variable) is the sample mean.

Yep, all those words to say "use the share of 1s as the $\theta$."

Not so fast. True, for $H_0$, we use the share of ones

$\theta_0 = (k_C + k_A)/(N_C + N_A) = 31/50 = 0.62$;

but for $H_1$, we need to address the audience's $1-\theta_1$ by reverse coding the zeros and ones, in other words,

$\theta_1 = (k_C + (N_A - k_A))/(N_C + N_A) = 11/50 = 0.22$.

Yes, those two fractions are "estimation." Maximum likelihood estimation, at that.

Now that we are done with the dirty statistics, we come back to the shiny world of math, by using our estimates to solve the math problem. That requires a small bit of combinatorics and probability theory, all in a single sentence:

If each individual data point is an independent and identically distributed Bernoulli variable, the sum of these data points follows the binomial distribution.

Therefore the desired probabilities, which are joint probabilities of two binomial distributions, one for the critics and one for the audience (with $c(N,k)$ denoting the binomial coefficient "$N$ choose $k$"), are

$\Pr(\mathrm{Data}| H_0) = c(N_C,k_C) (\theta_0)^{k_C} (1- \theta_0)^{N_C- k_C} \times c(N_A,k_A) (\theta_0)^{k_A} (1- \theta_0)^{N_A- k_A}$

and

$\Pr(\mathrm{Data}| H_1) = c(N_C,k_C) (\theta_1)^{k_C} (1- \theta_1)^{N_C- k_C} \times c(N_A,k_A) (1 -\theta_1)^{k_A} (\theta_1)^{N_A- k_A}$.

Replacing the symbols with the estimates and the data we get

$\Pr(\mathrm{Data}| H_0) = 3.222\times 10^{-5}$;
$\Pr(\mathrm{Data}| H_1) = 3.066\times 10^{-2}$.
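These numbers are easy to verify in R with dbinom, the binomial probability mass function (the variable names below are mine):

N_C = 10; k_C = 1   # critics: total reviews, positive reviews
N_A = 40; k_A = 30  # audience: total ratings, positive ratings
theta_0 = (k_C + k_A) / (N_C + N_A)            # 0.62
theta_1 = (k_C + (N_A - k_A)) / (N_C + N_A)    # 0.22
pr_H0 = dbinom(k_C, N_C, theta_0) * dbinom(k_A, N_A, theta_0)      # 3.222e-05
pr_H1 = dbinom(k_C, N_C, theta_1) * dbinom(k_A, N_A, 1 - theta_1)  # 3.066e-02
pr_H1 / pr_H0  # about 951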

We can now compute the likelihood ratio,

$LR = \frac{\Pr(\mathrm{Data}| H_1)}{\Pr(\mathrm{Data}| H_0)} \approx 951$,

and translate that into words to make the statement
It's about 951 times more likely that critics are using criteria opposite to those of the audience than the same criteria.
Isn't that a lot more satisfying than saying they "appear to be at loggerheads"?