Thursday, June 16, 2011

Data is not information; perfect fit is only good for clothes

(A short post on two trivial matters that come up a lot in online discussions. From now on, I'll just link here.)

Data is an accumulation of measurements of some phenomenon as it happens in the world. Because the world is not a nice clean theoretical construct, part of the data will be what we call noise (or stochastic disturbance).

Suppose we have data on the lift (percentage increase in unit sales) caused by a promotional price cut (as a percentage of price). Lift will be determined by two things, generally speaking:

Deal-proneness of the market. That's the likelihood that something is bought just because it's "on sale," even if there is no price cut. Because this is a strong effect, many places make it illegal to display a sale sign when there's no actual price cut.

Price response of the market. For a variety of reasons (income effect, substitution effect, provisioning effect, time-shifting effect) people buy more stuff the lower its price.

Other factors, beyond the control of marketers, like accidents on the highway serving the large retail spaces where most purchases are made or a cancelled baseball game giving customers more time to shop, can decrease or increase the sales (and therefore the measured lift) by themselves. This is what we call noise.

The following figure shows the difference between data and information:

Image for a blog post on information vs. data

A simple model of lift $y$ as a function of price cut $x$ is $y = a + b \,  x$, where $a$ captures the deal-proneness and $b$ captures the price response. But, as shown above, the noise will appear as model error.
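To make the simple model concrete, here is a minimal simulation sketch in Python with NumPy. The values of `a_true`, `b_true`, and `noise_sd`, and the price cuts in `x`, are made up for illustration and are not taken from any real market. The sketch generates noisy lift data from the linear model, fits $y = a + b \, x$ by least squares, and prints the residuals, which are the noise showing up as model error.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical "true" market parameters, assumed for illustration only
a_true = 5.0     # deal-proneness: lift from the sale sign alone, in %
b_true = 2.0     # price response: extra lift per percentage point of price cut
noise_sd = 3.0   # spread of the stochastic disturbance (highway closures etc.)

x = np.linspace(5, 40, 8)                                   # price cuts, % of price
y = a_true + b_true * x + rng.normal(0, noise_sd, x.size)   # observed lift, %

# Least-squares fit of y = a + b x; np.polyfit returns the highest degree first
b_hat, a_hat = np.polyfit(x, y, deg=1)
residuals = y - (a_hat + b_hat * x)

print(f"a_hat = {a_hat:.2f}, b_hat = {b_hat:.2f}")
print("residuals:", np.round(residuals, 2))
```

The residuals are not zero, and that is fine: they are the model's estimate of what the price cut did not cause.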

A goodness-of-fit test will show that the simple model doesn't capture all of the data. That is a good thing, since the data include noise and noise is something that we don't want to model.

But then a self-proclaimed expert appears. And she has a "better" model; and by better she means that it has higher goodness-of-fit. It's not difficult to come up with such a model: simply fit a polynomial of high enough degree to the data and you get a perfect fit. Here's a recipe for one such model given $N$ data points with distinct price cuts: just fit the following specification (error will be zero, so OLS might balk at it)
\[ y_{i} = \sum_{k=0}^{N-1} \, \beta_{k} \, x_{i}^{k}\]
(This model is meant as an illustration and not to be used for anything. Really: don't use this model!)
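For the curious, here is a small sketch of that recipe in Python with NumPy, again on simulated data with made-up parameters (the intercept 5, slope 2, and noise spread 3 are assumptions, not estimates from anything). It fits both the straight line and the degree $N-1$ polynomial to the same $N$ points and compares their $R^2$. `Polynomial.fit` is used because it rescales $x$ internally, which keeps the high-degree fit numerically stable.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

N = 6
x = np.linspace(5, 30, N)                    # N distinct price cuts, % of price
y = 5.0 + 2.0 * x + rng.normal(0, 3.0, N)    # hypothetical noisy lift data

line = Polynomial.fit(x, y, deg=1)           # the simple model y = a + b x
poly = Polynomial.fit(x, y, deg=N - 1)       # the "perfect fit" specification

def r_squared(model):
    """Share of the variance of y reproduced by the model on the fitted data."""
    ss_res = np.sum((y - model(x)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

print("R^2, line:      ", r_squared(line))
print("R^2, polynomial:", r_squared(poly))
print("largest polynomial residual:", np.max(np.abs(y - poly(x))))
```

The polynomial reproduces every data point up to floating-point error, noise included, which is exactly the problem.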

Perfect fit, though, mixes the effect of the price cut on that leftmost data point with the effect of that day's highway closure (note how the noise puts the point below the fitted line). And that is a bad thing, because while price cuts are under the control of the manager, highway closures are not. And, even if they were, they are not identified in the polynomial: they actually appear as part of the effect of the price cut (in fact the model has no idea that a highway was closed).

Perfect fit is good for clothes but not for models: a model that fits the data perfectly captures the noise as if it were part of the control variables. There are always stochastic disturbances; models must have some way to excise them from the information.
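In a simulation we know the true effect of the price cut, so we can check how much noise each model mistakes for that effect. The sketch below is one way to do it, under assumed parameters (`a_true`, `b_true`, `noise_sd` are invented for illustration): it repeatedly draws a noisy sample, fits both the line and the degree $N-1$ polynomial, and measures how far each fitted curve is from the true, noise-free lift at price cuts a manager might consider next.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)  # fixed seed so the sketch is reproducible

# Hypothetical "true" market, assumed for illustration only
a_true, b_true, noise_sd = 5.0, 2.0, 3.0

N = 6
x = np.linspace(5, 30, N)              # price cuts already tried
x_new = np.linspace(6, 29, 50)         # price cuts under consideration
true_lift = a_true + b_true * x_new    # effect of the price cut alone, no noise

# Squared gap between fitted and true lift, collected per polynomial degree
errors = {1: [], N - 1: []}
for _ in range(500):
    y = a_true + b_true * x + rng.normal(0, noise_sd, N)   # one noisy sample
    for deg in errors:
        fit = Polynomial.fit(x, y, deg=deg)
        errors[deg].append(np.mean((fit(x_new) - true_lift) ** 2))

mse_line = np.mean(errors[1])
mse_poly = np.mean(errors[N - 1])
print(f"mean squared gap to true lift, line:       {mse_line:.2f}")
print(f"mean squared gap to true lift, polynomial: {mse_poly:.2f}")
```

On average the perfect-fit polynomial lands further from the true price-cut effect than the line does, because it treats every disturbance as if the price cut had caused it.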

Use managerial judgement (or subject matter expertise for other applications) instead of simplistic metrics to evaluate models.