Thursday, July 28, 2011

A simple, often overlooked, problem with models

There are just too many possibilities.

Let's say we have one dependent variable, $y$, and ten independent variables, $x_1,\ldots,x_{10}$. How many models can we build? For simplicity let's keep our formulation linear (in the usual sense of the word, that is linear in the coefficients; see footnote).

Inexcusably wrong answer: 11 models.

Wrong answer: 1024 models.

Right-ish answer: $1.8 \times 10^{308}$ models.

Right answer: an infinity of models.

OK, 1024 is the number of models that include each variable at most once, with no interactions. Something like

$ y = \beta_0 + \beta_1 \, x_1 +  \beta_3 \, x_3 + \beta_7 \, x_7$ ,

of which there are $2^{10} = 1024$, one for each subset of the ten variables. (Since the constant $\beta_0$ can be zero by calibration, we'll include it in all models; otherwise we'd have to demean the $y$.)
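To make the count concrete, here is a minimal sketch that enumerates those models as subsets of variable names. The names `variables` and `main_effect_models` are made up for illustration, and the intercept is assumed to be in every model.

```python
from itertools import combinations

# Hypothetical labels for the ten independent variables x1..x10.
variables = [f"x{i}" for i in range(1, 11)]

def main_effect_models(names):
    """Yield every subset of `names`; each subset is one model (intercept implied)."""
    for size in range(len(names) + 1):
        yield from combinations(names, size)

models = list(main_effect_models(variables))
n_models = len(models)          # one model per subset of the ten variables
example = ("x1", "x3", "x7")    # the model written out above
```

The empty tuple is the intercept-only model, and `example` is one of the entries in `models`.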

Once we consider possible interactions among variables, like $x_1 x_7 x_8$ for example, a three-way interaction, there are $2^{10}$ possible terms (every product of a subset of the ten variables, counting the constant as the empty product) and therefore $2^{2^{10}}= 1.8 \times 10^{308}$ possible models with all interactions. For comparison, the number of atoms in the known universe is estimated to be on the order of $10^{80}$.
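Python's arbitrary-precision integers make this easy to check exactly. The sketch below counts the terms and then the models; the variable names are illustrative, and the $10^{80}$ figure for atoms is simply the estimate quoted above.

```python
from math import comb

n_vars = 10

# Each term is a product of a subset of the variables;
# the empty product stands for the constant.
n_terms = sum(comb(n_vars, k) for k in range(n_vars + 1))

# Each model either includes or excludes each term.
n_models = 2 ** n_terms           # exact integer, no overflow

digits = len(str(n_models))       # number of decimal digits
magnitude = digits - 1            # power of ten of the leading digit

atoms_exponent = 80               # estimate quoted in the text
orders_above_atoms = magnitude - atoms_exponent
```

Here `n_terms` is 1024, `n_models` has 309 decimal digits (so it is of order $10^{308}$), and `orders_above_atoms` is 228.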

Of course, each variable can enter the model in a variety of functional forms: $x_1^{2}$, $\log(x_7)$, $\sin(5 \, x_9)$ or $x_3^{-x_{2}/2}$, for example, making it an infinite number of possibilities. (And there can be interactions between these different functions of different variables, obviously.)

(Added on August 11th.) Using polynomial approximations for general functions, say to the fourth degree, the total number of interactions is now $5^{10}=9765625$, as any variable may enter an interaction in one of five orders (0 through 4), and the total number of models is $2^{5^{10}}$ or around $10^{2939746}$. (End of addition.)
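A quick way to check the size of that exponent is to take base-10 logarithms, since $2^{n} = 10^{n \log_{10} 2}$. A minimal sketch, with illustrative variable names:

```python
from math import log10

max_degree = 4    # each variable enters a term with power 0, 1, 2, 3 or 4
n_vars = 10

# Number of distinct terms: one choice of power per variable.
n_terms = (max_degree + 1) ** n_vars

# log10 of the number of models, 2 ** n_terms, without building the huge integer.
exponent = n_terms * log10(2)
```

Working with the logarithm avoids constructing a number with millions of digits just to count them.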

So here's a combinatorial riddle for statisticians: how can you identify a model out of, let's be generous, $1.8 \times 10^{308}$ with data in the exa- or petabyte range? That's almost three hundred orders of magnitude too little, methinks.

The main point is that any non-trivial set of variables can be modeled in a vast number of ways, which means that a limited number of models presented for appreciation (or review) necessarily includes an inordinate amount of judgement from the model-builder.

It's unavoidable, but seldom acknowledged.


Footnote: the "linear in coefficients" point is the following. Take this formulation, which is clearly non-linear in the $x$:

$y = \beta_0 + \beta_1 \, x_1^{1/4} + \beta_2 \, x_1 \, x_7$

It can be made linear very easily with two changes of variables, $z_1 = x_1^{1/4}$ and $z_2 = x_1 \, x_7$, giving $y = \beta_0 + \beta_1 \, z_1 + \beta_2 \, z_2$.

In contrast, the model $y = \alpha \, \sin( \omega \, t )$ cannot be linearized in both coefficients: it is linear in $\alpha$, but $\omega$ sits inside the sine and no change of variables in $t$ pulls it out.
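The change of variables in the first example can be sketched in plain Python: transform $x_1, x_7$ into $z_1, z_2$, then fit by ordinary least squares through the normal equations. Everything here is an assumption for illustration: the "true" coefficients, the noise-free synthetic data, and the helper names `design_row`, `solve` and `ols`.

```python
import random

def design_row(x1, x7):
    """Map raw inputs to the transformed regressors (1, z1, z2)."""
    z1 = x1 ** 0.25      # z1 = x1^(1/4)
    z2 = x1 * x7         # z2 = x1 * x7
    return [1.0, z1, z2]

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols(rows, ys):
    """Ordinary least squares via the normal equations (Z'Z) beta = Z'y."""
    k = len(rows[0])
    ztz = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    zty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(ztz, zty)

# Hypothetical coefficients (beta_0, beta_1, beta_2) and noise-free synthetic data.
true_beta = [2.0, -1.5, 0.5]
rng = random.Random(0)
xs = [(rng.uniform(1.0, 10.0), rng.uniform(1.0, 10.0)) for _ in range(50)]
rows = [design_row(x1, x7) for x1, x7 in xs]
ys = [sum(b * z for b, z in zip(true_beta, row)) for row in rows]

beta_hat = ols(rows, ys)   # estimates of beta_0, beta_1, beta_2
```

Because the model is linear in the $\beta$ once the $z$ are computed, a single linear solve recovers the coefficients. Nothing comparable works for $\omega$ in the sine model, which would need an iterative non-linear fit.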