Friday, November 16, 2012

Don't ignore the distribution of estimates!

If forecasts ignore the distribution of estimates, they will be biased.

For example, when computing the probability of purchase using a logit model, we take the estimates for the coefficients in the utility function and use them as the true coefficients, thus:

$P(i) = \Pr(v_i > 0| \beta) = \frac{\exp(v(x_i; \beta))}{1 + \exp(v(x_i; \beta))}$.

But the estimates are themselves random variables with a distribution of their own, so, more correctly, the probability $P(i)$ should be written as

$P(i) = \int \Pr(v_i > 0| \hat\beta) \, dF_{B}(\hat\beta)$.

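Note that $\beta = E[\hat\beta]$. Taking $\beta$ to be a scalar for simplicity, we can split off the term evaluated at $\beta$ and integrate the remainder by parts, using $F_{B}(\hat\beta) - \mathbf{1}\{\hat\beta \ge \beta\}$ (which vanishes at both limits of integration), and get

$P(i) = \Pr(v_i > 0| \beta) -  \int   \frac{\partial  \, \Pr(v_i > 0| \hat\beta)}{\partial \, \hat\beta} \, \left[ F_{B}(\hat\beta) - \mathbf{1}\{\hat\beta \ge \beta\} \right] \, d\hat\beta$.

The term

$-  \int   \frac{\partial  \, \Pr(v_i > 0| \hat\beta)}{\partial \, \hat\beta} \, \left[ F_{B}(\hat\beta) - \mathbf{1}\{\hat\beta \ge \beta\} \right] \, d\hat\beta$

is a bias introduced by ignoring the distribution of $\hat\beta$; it is zero only in special cases, such as when $\hat\beta$ equals $\beta$ with certainty.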
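To see the size of the gap numerically, here is a minimal Python sketch. It rests on assumptions that are not in the argument above: a scalar coefficient, a utility of the form $v = \beta x$, a normal sampling distribution for $\hat\beta$, and made-up values for `beta`, `se`, and `x`. It compares the plug-in probability with a Monte Carlo average over draws of $\hat\beta$.

```python
import math
import random


def logit_prob(beta, x):
    """Logit purchase probability for the assumed utility v = beta * x."""
    return 1.0 / (1.0 + math.exp(-beta * x))


def averaged_prob(beta, se, x, n_draws=200_000, seed=0):
    """Monte Carlo average of logit_prob over beta_hat ~ Normal(beta, se^2).

    The normal sampling distribution is an assumption made for illustration.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        total += logit_prob(rng.gauss(beta, se), x)
    return total / n_draws


# Hypothetical values, chosen only to make the gap visible.
beta, se, x = 1.0, 0.5, 2.0

plug_in = logit_prob(beta, x)          # treats the estimate as the truth
averaged = averaged_prob(beta, se, x)  # integrates over the estimate's distribution

print(f"plug-in:  {plug_in:.4f}")
print(f"averaged: {averaged:.4f}")
print(f"gap:      {averaged - plug_in:.4f}")
```

With these particular values the averaged probability comes out below the plug-in one, because the logit curve is concave over most of the region where the draws land. Shrinking `se` toward zero shrinks the gap.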

(This simple exercise in probability and calculus was the result of having to make this point over and over on the interwebs, despite the fact that it should be basic knowledge to the people involved. Some of whom, ironically, call themselves "data scientists.")

Added Nov 17: A simple illustration at my online scrapbook.