Wednesday, July 3, 2019

From value to share (in math)



There are a few places online (and in some books) where a kind of magic happens. At some point we go from something like this
\[
v(x) = w_1 x_1 + \ldots + w_K x_K
\]
to, in a single magic jump,
\[
\Pr(\text{Buy } x) = \frac{\exp(v(x))}{1 + \exp(v(x))}.
\]
The $v(x) = w_1 x_1 + \ldots + w_K x_K$ is easy enough to understand: the value of some object $x$ is a weighted sum of the dimensions of $x$; there are $K$ dimensions $x_1, \ldots, x_K$ and associated weights, $w_1, \ldots, w_K$.
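Both ends of the jump are one-liners in code. A minimal sketch in Python, with made-up weights and attributes (the numbers mean nothing, they're just to fix ideas):

```python
import math

# Hypothetical weights and attributes, purely for illustration.
w = [0.8, -0.5, 1.2]   # weights w_1, ..., w_K
x = [1.0, 2.0, 0.5]    # attributes x_1, ..., x_K

# v(x) = w_1 x_1 + ... + w_K x_K
v = sum(w_i * x_i for w_i, x_i in zip(w, x))

# The "magic jump" formula this post derives:
pr_buy = math.exp(v) / (1.0 + math.exp(v))
```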

But then there's something along the lines of "using appropriate assumptions, the probability of a customer purchasing $x$ is given by" that fraction.

This magic jump appears in so many places discussing analytics, data science, big data, machine learning, artificial intelligence, and other fashionable names for statistics (because that's what it is) that we could be forgiven for suspecting that the people making the jump don't know where it comes from.

It's fine not to know where something comes from, as long as one is aware of that. But it's instructive to work through the simple cases, like this one. After all, a personal trainer needn't be a super-athlete, but should at least be able to exercise.

So, how do we go from the weighted sum to those exponentials (called the logit formula)?

First, that formula is only correct if, among other things: (a) the choice is between buying $x$ or buying nothing at all; and (b) the $v(\cdot)$ of buying nothing is set to zero, denoted $v(0) = 0$. Assumption (a) makes this a simple binary choice, and (b) is just a threshold; to move the threshold around we can always add a constant (an intercept term) to $v(x)$.


Now, why isn't the probability either one or zero, then? After all, either $v(x) > 0 = v(0)$ and the customer should always buy or $v(x) < 0 = v(0)$ and the customer should never buy.

That's the second step: there are things that can change the $v(x)$ that we don't observe. Maybe the customer's mood changes and the $w_i$ change with it, for example. We don't know, so we say that the customer optimizes "utility" $u(x)$ instead of value $v(x)$, and the difference is a stochastic disturbance (expensive wording for random stuff) $\epsilon$:
\[
u(x) = v(x) - \epsilon.
\]
(Let's assume that $u(0) = v(0) = 0$. It's not necessary but makes things simpler.) This second step changes the nature of the problem: instead of a decision (buy or not), we can only predict a probability:
\[
\Pr(\text{Buy } x) = \Pr(u(x)>0) = \Pr(v(x)> \epsilon).
\]
Now we can compute that probability: it's just the cumulative distribution function of $\epsilon$ evaluated at the point $v(x)$, denoted $F_{\epsilon}(v(x))$.
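This holds for whatever distribution we end up picking for $\epsilon$. A quick Monte Carlo sketch (using a standard Normal purely as a placeholder, and an arbitrary $v(x) = 0.4$) shows the frequency of $v(x) > \epsilon$ converging to $F_{\epsilon}(v(x))$:

```python
import math
import random

random.seed(0)

def normal_cdf(z):
    # CDF of the standard Normal via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

v = 0.4          # arbitrary value of v(x), just for the check
n = 200_000

# Count how often the disturbance falls below v(x).
hits = sum(1 for _ in range(n) if v > random.gauss(0.0, 1.0))

# hits / n should be close to F_epsilon(v) = normal_cdf(v)
```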


Unfortunately, by their very nature the $\epsilon$ aren't observed, so we need to assume a probability distribution for them, one that's both plausible and computationally convenient.

We could try the Normal distribution, which most people accept without question, so it has the first desirable property. Unfortunately, its $\exp(-z^2/2)$ density has no closed-form integral. What we need is a distribution that looks more or less like a Normal but integrates to a manageable closed form.

Enter the logistic distribution.
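How close is "looks enough like"? A quick numerical check, rescaling the Normal by the constant 1.702 (a classic scale-matching constant from the psychometrics literature), shows the two CDFs never differ by more than about 0.01:

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

# Maximum gap between the standard logistic CDF and a
# scale-matched Normal CDF, scanned over [-6, 6].
max_gap = max(
    abs(logistic_cdf(z) - normal_cdf(z / 1.702))
    for z in (i / 100.0 for i in range(-600, 601))
)
# max_gap is roughly 0.01: the curves nearly coincide
```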


It looks enough like a Normal (it has fatter tails, but that's hard to tell by eye). And its probability density function
\[
f_{\epsilon}(z) = \frac{\exp(- z)}{(1+\exp(-z))^2}
\]
integrates easily to
\[
F_{\epsilon}(v) = \int_{-\infty}^{v} f_{\epsilon}(z) \, dz = \frac{1}{1 + \exp(-v)} = \frac{\exp(v)}{1+\exp(v)},
\]
which, substituted into the formula for the probability of buying, gives the logit formula:
\[
\Pr(\text{Buy } x) = \Pr(v(x)> \epsilon) = F_{\epsilon}(v(x)) = \frac{\exp(v(x))}{1+\exp(v(x))}.
\]
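That integration is easy to double-check numerically; a trapezoid-rule sketch (the lower limit $-30$ stands in for $-\infty$, which is more than enough given how fast the tails decay):

```python
import math

def logistic_pdf(z):
    # f(z) = exp(-z) / (1 + exp(-z))^2
    e = math.exp(-z)
    return e / (1.0 + e) ** 2

def logistic_cdf(v):
    return 1.0 / (1.0 + math.exp(-v))

def integrate_pdf(v, lo=-30.0, steps=100_000):
    # Trapezoid rule: integrate the density from lo up to v.
    h = (v - lo) / steps
    total = 0.5 * (logistic_pdf(lo) + logistic_pdf(v))
    total += sum(logistic_pdf(lo + i * h) for i in range(1, steps))
    return total * h

# integrate_pdf(0.7) agrees with logistic_cdf(0.7) to many decimals
```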
When you know the math, you understand the magic.
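And when you know the math, you can also simulate it. A final Monte Carlo sketch, drawing $\epsilon$ from the standard logistic by inverting its CDF (if $U$ is Uniform(0,1), then $\log(U/(1-U))$ is standard logistic), with the same arbitrary $v(x) = 0.4$:

```python
import math
import random

random.seed(1)

def logit_prob(v):
    return math.exp(v) / (1.0 + math.exp(v))

def logistic_draw():
    # Inverse-CDF sampling: log(U / (1 - U)) is standard logistic.
    u = random.random()
    return math.log(u / (1.0 - u))

v = 0.4          # arbitrary v(x), just for the check
n = 200_000
buys = sum(1 for _ in range(n) if v > logistic_draw())

# buys / n should be close to logit_prob(v)
```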