Tuesday, June 18, 2019

Hidden factor correlation



Correlation is not causation; everyone learns to say that. But if there's a correlation, there's probably some sort of causal relationship hiding somewhere, unless the correlation is spurious.

If two variables $A$ and $B$ are correlated, the three simplest causal relationships are: $A$ causes $B$; $B$ causes $A$; or $A$ and $B$ are both caused by an unseen factor $C$. There are many more complicated causal structures, but these are the three basic ones.

The third case, where an unseen variable $C$ is the real source of the correlation, is the one we're interested in in this post. To illustrate it, let's say $C$ is a standard normal random variable, and $A$ and $B$ are noisy measures of $C$,

$ \qquad A = C + \epsilon_A$ and $ B = C + \epsilon_B$,

where the $\epsilon_i$ are drawn from a normal distribution with $\sigma_{\epsilon} = 0.05$.
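Under this model the population correlation between $A$ and $B$ follows directly from the variances: $\mathrm{Cov}(A, B) = \mathrm{Var}(C) = 1$ and $\mathrm{Var}(A) = \mathrm{Var}(B) = 1 + \sigma_{\epsilon}^2 = 1.0025$, so

$ \qquad \rho_{AB} = \dfrac{\mathrm{Cov}(A,B)}{\sqrt{\mathrm{Var}(A)\,\mathrm{Var}(B)}} = \dfrac{1}{1.0025} \approx 0.9975$,

which is why the regression slope estimated below comes out very close to 1.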

To illustrate, we generate 10,000 draws of $C$ and create the corresponding 10,000 $A$ and $B$ using R:

# hidden factor C: 10,000 draws from a standard normal
hidden_factor = rnorm(10000)
# A and B: noisy measures of C, with noise sd = 0.05
var_A_visible = hidden_factor + 0.05 * rnorm(10000)
var_B_visible = hidden_factor + 0.05 * rnorm(10000)

Now we can plot $A$ against $B$, and the correlation is obvious.
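The original figure isn't reproduced here, but the scatter plot can be generated with base R graphics. A minimal sketch, regenerating the data first so the block is self-contained (the `set.seed(1)` is an arbitrary choice for reproducibility; the post's exact numbers came from an unseeded run):

```r
# regenerate the simulated data (arbitrary seed; the post's run was unseeded)
set.seed(1)
hidden_factor = rnorm(10000)
var_A_visible = hidden_factor + 0.05 * rnorm(10000)
var_B_visible = hidden_factor + 0.05 * rnorm(10000)

# scatter plot of A against B: the points hug the 45-degree line
plot(var_B_visible, var_A_visible, pch = ".",
     xlab = "B", ylab = "A",
     main = "A vs. B, correlated through the hidden factor C")
abline(0, 1, col = "red")  # reference line A = B
```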

And we can regress $A$ on $B$ to quantify the relationship and obtain test statistics for the estimates, using a linear model:

model_no_control = lm(var_A_visible~var_B_visible)
summary(model_no_control)

With the result:

Call:
lm(formula = var_A_visible ~ var_B_visible)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.271466 -0.047214 -0.000861  0.047400  0.302517 

Coefficients:
               Estimate Std. Error  t value Pr(>|t|)    
(Intercept)   0.0005294  0.0007025    0.754    0.451    
var_B_visible 0.9975142  0.0006913 1442.852  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.07025 on 9998 degrees of freedom
Multiple R-squared:  0.9952, Adjusted R-squared:  0.9952 
F-statistic: 2.082e+06 on 1 and 9998 DF,  p-value: < 2.2e-16

So, both the model and the graph confirm a strong correlation ($p < 0.0001$) between $A$ and $B$. And in many real-life cases, a result like this is used to support the idea that either $A$ causes $B$ or $B$ causes $A$.
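The correlation itself can also be computed directly, and its square matches the R-squared reported above. A quick check, regenerating the data with an arbitrary seed since the post's run was unseeded:

```r
# regenerate the simulated data (arbitrary seed)
set.seed(1)
hidden_factor = rnorm(10000)
var_A_visible = hidden_factor + 0.05 * rnorm(10000)
var_B_visible = hidden_factor + 0.05 * rnorm(10000)

# sample correlation; should be close to the theoretical 1/1.0025 ~ 0.9975
cor(var_A_visible, var_B_visible)
```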

Now we proceed to show how the hidden factor is relevant. First, let us plot the residuals $A - C$ against $B - C$:
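Again the figure isn't reproduced, but the residual plot is straightforward in base R. A sketch under the same arbitrary-seed assumption as before:

```r
# regenerate the simulated data (arbitrary seed; the post's run was unseeded)
set.seed(1)
hidden_factor = rnorm(10000)
var_A_visible = hidden_factor + 0.05 * rnorm(10000)
var_B_visible = hidden_factor + 0.05 * rnorm(10000)

# subtracting the hidden factor leaves only the independent noise terms
resid_A = var_A_visible - hidden_factor  # this is epsilon_A
resid_B = var_B_visible - hidden_factor  # this is epsilon_B

# the cloud is structureless: no correlation remains
plot(resid_B, resid_A, pch = ".",
     xlab = "B - C", ylab = "A - C",
     main = "Residuals after removing the hidden factor")
```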

The apparent correlation has now disappeared. And a linear model including the hidden factor confirms this:

model_with_control = lm(var_A_visible~var_B_visible+hidden_factor)
summary(model_with_control)

With the result

Call:
lm(formula = var_A_visible ~ var_B_visible + hidden_factor)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.18347 -0.03382 -0.00021  0.03410  0.17780 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.0007378  0.0004986   1.480    0.139    
var_B_visible 0.0004082  0.0100560   0.041    0.968    
hidden_factor 0.9997573  0.0100707  99.274   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.04985 on 9997 degrees of freedom
Multiple R-squared:  0.9976, Adjusted R-squared:  0.9976 
F-statistic: 2.072e+06 on 2 and 9997 DF,  p-value: < 2.2e-16

Hidden factors are easy to test for when they are measured, as seen here, but they are not always observable. For example, in nutrition papers there's often a hidden factor, how health-conscious an individual is, that causes both observables: exercising regularly and eating salads are highly correlated, but exercising doesn't cause eating salads, and eating salads doesn't cause exercising.

Correlation is not causation, but behind a correlation one can generally find a causal relationship, possibly one that involves hidden factors or more complex structures.