Tuesday, June 18, 2019

Hidden factor correlation



Correlation is not causation; everyone learns to say that. But if there's a correlation, there's probably some sort of causal relationship hiding somewhere, unless the correlation is spurious.

If two variables $A$ and $B$ are correlated, the three simplest causal relationships are: $A$ causes $B$; $B$ causes $A$; or $A$ and $B$ are both caused by an unseen factor $C$. There are many more complicated causal structures, but these are the three basic ones.

The third case, where an unseen variable $C$ is the real source of the correlation, is the one we're interested in in this post. To illustrate it, let's say $C$ is a standard normal random variable, and $A$ and $B$ are noisy measures of $C$,

$ \qquad A = C + \epsilon_A$ and $ B = C + \epsilon_B$,

where the $\epsilon_i$ are drawn from a normal distribution with $\sigma_{\epsilon} = 0.05$.
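Under this model the population correlation between $A$ and $B$ follows directly from the variances: $\mathrm{Cov}(A, B) = \mathrm{Var}(C) = 1$ and $\mathrm{Var}(A) = \mathrm{Var}(B) = 1 + \sigma_{\epsilon}^2 = 1.0025$, so

$ \qquad \rho_{AB} = \dfrac{\mathrm{Cov}(A,B)}{\sqrt{\mathrm{Var}(A)\,\mathrm{Var}(B)}} = \dfrac{1}{1.0025} \approx 0.9975$,

which is why the regression slope estimated below comes out very close to 1.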

To illustrate, we generate 10,000 draws of $C$ and create the corresponding 10,000 $A$ and $B$ using R:

# hidden factor C: 10,000 draws from a standard normal
hidden_factor = rnorm(10000)
# A and B: noisy measures of C, with noise sd = 0.05
var_A_visible = hidden_factor + 0.05 * rnorm(10000)
var_B_visible = hidden_factor + 0.05 * rnorm(10000)

Now we can plot $A$ against $B$, and the correlation is obvious.
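The original figure isn't reproduced here, but the scatter plot can be generated with base R graphics. A minimal sketch, regenerating the data first so the block is self-contained (the `set.seed(1)` is an arbitrary choice for reproducibility; the post's exact numbers came from an unseeded run):

```r
# regenerate the simulated data (arbitrary seed; the post's run was unseeded)
set.seed(1)
hidden_factor = rnorm(10000)
var_A_visible = hidden_factor + 0.05 * rnorm(10000)
var_B_visible = hidden_factor + 0.05 * rnorm(10000)

# scatter plot of A against B: the points hug the 45-degree line
plot(var_B_visible, var_A_visible, pch = ".",
     xlab = "B", ylab = "A",
     main = "A vs. B, correlated through the hidden factor C")
abline(0, 1, col = "red")  # reference line A = B
```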

And we can regress $A$ on $B$ to quantify the relationship and obtain test statistics for the estimates, using a linear model:

model_no_control = lm(var_A_visible~var_B_visible)
summary(model_no_control)

With the result:

Call:
lm(formula = var_A_visible ~ var_B_visible)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.271466 -0.047214 -0.000861  0.047400  0.302517 

Coefficients:
               Estimate Std. Error  t value Pr(>|t|)    
(Intercept)   0.0005294  0.0007025    0.754    0.451    
var_B_visible 0.9975142  0.0006913 1442.852  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.07025 on 9998 degrees of freedom
Multiple R-squared:  0.9952, Adjusted R-squared:  0.9952 
F-statistic: 2.082e+06 on 1 and 9998 DF,  p-value: < 2.2e-16

So, both the model and the graph confirm a strong correlation ($p < 0.0001$) between $A$ and $B$. And in many real-life cases, a result like this is used to support the idea that either $A$ causes $B$ or $B$ causes $A$.
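The correlation itself can also be computed directly, and its square matches the R-squared reported above. A quick check, regenerating the data with an arbitrary seed since the post's run was unseeded:

```r
# regenerate the simulated data (arbitrary seed)
set.seed(1)
hidden_factor = rnorm(10000)
var_A_visible = hidden_factor + 0.05 * rnorm(10000)
var_B_visible = hidden_factor + 0.05 * rnorm(10000)

# sample correlation; should be close to the theoretical 1/1.0025 ~ 0.9975
cor(var_A_visible, var_B_visible)
```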

Now we proceed to show how the hidden factor is relevant. First, let us plot the residuals $A - C$ against $B - C$:
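Again the figure isn't reproduced, but the residual plot is straightforward in base R. A sketch under the same arbitrary-seed assumption as before:

```r
# regenerate the simulated data (arbitrary seed; the post's run was unseeded)
set.seed(1)
hidden_factor = rnorm(10000)
var_A_visible = hidden_factor + 0.05 * rnorm(10000)
var_B_visible = hidden_factor + 0.05 * rnorm(10000)

# subtracting the hidden factor leaves only the independent noise terms
resid_A = var_A_visible - hidden_factor  # this is epsilon_A
resid_B = var_B_visible - hidden_factor  # this is epsilon_B

# the cloud is structureless: no correlation remains
plot(resid_B, resid_A, pch = ".",
     xlab = "B - C", ylab = "A - C",
     main = "Residuals after removing the hidden factor")
```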

The apparent correlation has now disappeared. And a linear model including the hidden factor confirms this:

model_with_control = lm(var_A_visible~var_B_visible+hidden_factor)
summary(model_with_control)

With the result

Call:
lm(formula = var_A_visible ~ var_B_visible + hidden_factor)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.18347 -0.03382 -0.00021  0.03410  0.17780 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.0007378  0.0004986   1.480    0.139    
var_B_visible 0.0004082  0.0100560   0.041    0.968    
hidden_factor 0.9997573  0.0100707  99.274   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.04985 on 9997 degrees of freedom
Multiple R-squared:  0.9976, Adjusted R-squared:  0.9976 
F-statistic: 2.072e+06 on 2 and 9997 DF,  p-value: < 2.2e-16

Hidden factors are easy to test for when they are measured, as seen here, but they are not always observable. For example, in nutrition papers there's often a hidden factor, how health-conscious an individual is, that causes both observables: exercising regularly and eating salads are highly correlated, but exercising doesn't cause eating salads, and eating salads doesn't cause exercising.

Correlation is not causation, but behind a correlation one can generally find a causal relationship, possibly one that involves hidden factors or more complex structures.