Thursday, July 25, 2019

Yeah, about that exponential economy...

There's a lot of management and technology writing that refers to "exponential growth," but I think that most of it is a confusion between early life cycle convexity and true exponentials.

Here's a bunch of data points from what looks like exponential growth:


Looks nicely convex, and that red curve is an actual exponential fit to the data,
\[
y = 0.0057 \, \exp(0.0977 \, x)   \qquad  [R^2 = 0.971].
\]
The model explains 97.1% of the variance. I mean, what more proof could one want? A board of directors filled with political apparatchiks? A book by [a ghostwriter for] a well-known management speaker? Fourteen years of negative earnings and a CEO who consumes recreational drugs during interviews?

Alas, those data points aren't proof of an exponential process; rather, they are the output of a logistic process with some minor stochastic disturbances thrown in:
\[
y = \frac{1}{1+\exp(-0.1 \, x+5)} + \epsilon_x \qquad \epsilon_x \sim \text{Normal}(0,0.005).
\]
The logistic process is a convenient way to capture growth behavior where there's a limited potential: early on, the limit isn't very important, so the growth appears to be exponential, but later on there's less and less opportunity for growth so the process converges to the potential. This can be seen by plotting the two together:


This difference is important because — and this has been a constant in the management and technology popular press — in the beginning of new industries, new segments in an industry, and new technologies, unit sales look like the data above: growth, growth, growth. So, the same people who declared the previous ten to twenty s-shaped curves "exponential economies" at their start come out of the woodwork once again to tell us how [insert technology name here] is going to revolutionize everything.
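If you want to reproduce the illusion, here's a minimal sketch: it generates data from the logistic process above (the early, convex portion only) and fits an exponential by ordinary least squares on $\log y$, a shortcut that's close enough to make the point; the exact numbers will differ from the fit quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Logistic process with small Gaussian noise, as in the post:
# y = 1 / (1 + exp(-0.1 x + 5)) + Normal(0, 0.005)
x = np.arange(0, 41)
y = 1.0 / (1.0 + np.exp(-0.1 * x + 5)) + rng.normal(0.0, 0.005, size=x.size)

# Fit y = a * exp(b x) by ordinary least squares on log(y).
mask = y > 0                       # log needs positive values
b, log_a = np.polyfit(x[mask], np.log(y[mask]), 1)
y_hat = np.exp(log_a + b * x[mask])

# R^2 of the "exponential" fit on this convex stretch of a logistic curve.
ss_res = np.sum((y[mask] - y_hat) ** 2)
ss_tot = np.sum((y[mask] - y[mask].mean()) ** 2)
print(f"a = {np.exp(log_a):.4f}, b = {b:.4f}, R^2 = {1 - ss_res/ss_tot:.3f}")
```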

Ironically, knowledge is one of the few things that shows a rate of growth that's proportional to the size of the [knowledge] base. Which would make knowing stuff (like the difference between the convex part of an s-shaped curve and an exponential) a true exponential capability.

But that would require those who talk of "exponential economy" to understand what exponential means.

Friday, July 19, 2019

Fat tails and extremistan - not the same thing



Extremistan and mediocrestan


What, are we making up words, now? (All words are made up. Think about it.)

Extremistan and mediocrestan are characterizations of distributions; a simple way to think about them is that very large events either totally dominate (extremistan) or don't (mediocrestan):

Height is in mediocrestan: if the average height in a room with ten people is 200 cm, that's probably from ten people between 190 and 210 cm tall, not from nine people 100 cm tall and one person 1100 cm tall.
Wealth is in extremistan: if the average wealth in a room with ten people is 2 billion dollars, that's more likely to be one billionaire with 20 billion and nine average-income people than ten billionaires with 2 billion each.

This classification determines whether you can estimate relevant population parameters from samples (mediocrestan yes, extremistan no) and how well-behaved order statistics (maximum, second place, etc.) are (mediocrestan nicely predictable, extremistan not so much).
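A quick illustration of the estimation point, as a minimal sketch: the mediocrestan stand-in is a Normal and the extremistan stand-in is a classical Pareto with tail index 1.1 (that choice is mine, just for illustration). The Normal's running sample mean settles down fast; the Pareto's keeps getting shoved around by rare huge observations.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Mediocrestan stand-in: heights, roughly Normal(170, 10) in cm.
normal_sample = rng.normal(170, 10, size=n)

# Extremistan stand-in (my choice for illustration): classical Pareto with
# minimum 1 and tail index 1.1; numpy's pareto() is Lomax, so add 1.
pareto_sample = 1 + rng.pareto(1.1, size=n)

for k in (100, 1_000, 10_000, 100_000):
    print(f"first {k:>7} observations: "
          f"Normal mean = {normal_sample[:k].mean():7.2f}, "
          f"Pareto mean = {pareto_sample[:k].mean():7.2f}")
```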

There's a fairly common error that people make when they learn about extremistan: they think that because distributions in extremistan have fat tails and are dominated by extreme values, then — and this is the error — distributions that have fat tails, especially those with extreme values, are in extremistan.

Note the error: $a \Rightarrow b$ is being used to assert $b \Rightarrow a$.

As we'll see next, not all fat-tailed and extreme-valued distributions are in extremistan.


A tale of two tails


Let us compare (a) the probability that $n$ similar outcomes of large size $M$ add up to a combined event of size $nM$ (or, equivalently, average to $M$) with (b) the probability that one extreme event of size $nM$ and $n-1$ events of size 0 add up to that same combined event $nM$. If the first is higher than the second, we're in mediocrestan; if the second is higher than the first, we're in extremistan.

For the Normal distribution, the probabilities (a) denoted $P(\text{Similar})$ and (b) denoted $P(\text{Extreme})$ are:
\begin{eqnarray*}
P(\text{Similar}) &=& \frac{1}{(2 \pi)^{n/2}} \exp(- n \, M^2/2) \\
P(\text{Extreme})  &=& \frac{1}{(2 \pi)^{n/2}} \exp( - n^2 \,  M^2/2)
\end{eqnarray*}
It's trivial to see that for the Normal (for any $n \ge 2$) we have
\[
P(\text{Similar}) > P(\text{Extreme}).
\]
Unsurprisingly, with its reference excess kurtosis of 0, the Normal distribution is well inside mediocrestan.

For our fat-tailed, extreme-valued distribution, we'll use the Gumbel distribution, which is also known as Extreme Value Type I. A simple form of this distribution has the following pdf:
\[
f_X(x) = \exp(- x - \exp(-x))
\]
As shown here, its variance is $\pi^2/6$, while the Normal above has variance 1; but since we're comparing within class (Normal with Normal and Gumbel with Gumbel), the difference doesn't matter, and using that pdf as is saves a lot of unnecessary clutter.

For Gumbel we have the following probabilities:
\begin{eqnarray*}
P(\mathrm{Similar}) &=& \exp(-nM - n \, \exp(-M))
\\
P(\mathrm{Extreme}) &=&  \exp(-nM - \exp(-nM) -n+1)
\end{eqnarray*}
Since for large $M$ we have $\exp(-nM) \approx 0$ and $\exp(-M) \approx 0$, it follows that $\exp(-nM) + n - 1 > n \, \exp(-M)$ for any $n \ge 2$, so for the Gumbel we also have
\[
P(\text{Similar}) > P(\text{Extreme}).
\]
The Gumbel distribution belongs in mediocrestan, despite its fat tails and extreme values.
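A minimal numerical check of the two comparisons above, working in logs to avoid underflow (the specific $n$ and $M$ values are just illustrative):

```python
import numpy as np

def log_p_normal(similar, n, M):
    """Log of the joint standard-Normal density for the two scenarios."""
    if similar:   # n outcomes of size M
        return -n / 2 * np.log(2 * np.pi) - n * M**2 / 2
    else:         # one outcome of size n*M, n-1 outcomes of size 0
        return -n / 2 * np.log(2 * np.pi) - n**2 * M**2 / 2

def log_p_gumbel(similar, n, M):
    """Log of the joint Gumbel density, with pdf exp(-x - exp(-x))."""
    if similar:
        return -n * M - n * np.exp(-M)
    else:
        return -n * M - np.exp(-n * M) - (n - 1)

for n, M in [(10, 3.0), (50, 5.0)]:
    print(f"n={n}, M={M}:")
    print(f"  Normal: log P(Similar) = {log_p_normal(True, n, M):9.1f}, "
          f"log P(Extreme) = {log_p_normal(False, n, M):9.1f}")
    print(f"  Gumbel: log P(Similar) = {log_p_gumbel(True, n, M):9.1f}, "
          f"log P(Extreme) = {log_p_gumbel(False, n, M):9.1f}")
```

In both cases $P(\text{Similar})$ beats $P(\text{Extreme})$ by a wide margin.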

This really makes us think about the specialness of scale-independent distributions, where we can bet on a big event to overwhelm all the small events (i.e. an extremistan distribution). Those are the distributions for which a trading strategy of enduring many small losses to capture the one big win can beat a strategy of consistent small wins.


What about the maximum?


In many cases the maximum is more relevant than the mean or median. So, how do fat tails influence the maxima?

When you look at the maximum of something, say the fastest kid in a class, the larger the class, the higher the maximum will be, on average. So the fastest kid in a group of 100 is on average faster than the fastest kid in a group of 10, for example.

In mediocrestan this increase is concave in the number of kids (the difference between the fastest kids in classes of 100 and 200 kids is bigger than the difference between the fastest kids in classes of 1100 and 1200 kids, on average); in extremistan there are no guarantees.

But once again, fat-tailed, extreme-valued distributions (the Gumbel, here scaled to have variance 1) have well-behaved maxima:



This nice concavity (note the logarithmic horizontal scale) makes things predictable; since many real-world metrics are known to be fat-tailed, it's comforting to know that their maxima don't explode all of a sudden.

Note that there's an effect of the extreme value: the maxima are larger and they grow faster, with less concavity than for the Normal.
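Here's a minimal simulation sketch of those maxima (the trial count and group sizes are arbitrary; the Gumbel draws are multiplied by $\sqrt{6}/\pi$ to bring the variance to 1):

```python
import numpy as np

rng = np.random.default_rng(2)
trials = 2_000
scale = np.sqrt(6) / np.pi   # rescales the standard Gumbel to variance 1

print("     n    E[max] Normal    E[max] Gumbel (var 1)")
for n in (10, 100, 1_000, 10_000):
    normal_max = rng.normal(size=(trials, n)).max(axis=1).mean()
    gumbel_max = (scale * rng.gumbel(size=(trials, n))).max(axis=1).mean()
    print(f"{n:>6} {normal_max:15.2f} {gumbel_max:22.2f}")
```

Each tenfold increase in group size adds less and less to the Normal maximum, while the Gumbel maximum grows roughly linearly in $\log n$ and stays larger throughout, which is exactly the pattern in the plot.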


And the point is…?


There are a number of people who assert that all sorts of research and social metrics are unusable because their analysis is based on mediocrestan (either by using sample statistics to estimate population statistics or by assuming regular behavior from order statistics), but — so goes the argument — these real world metrics have fat tails, so they are in extremistan.

The point of the above was to show that this form of argument (usually punctuated with gratuitous insults, expletives, and Mathematica-based math or other forms of using pretend-math to bully one's audience) is wrong, tout court.

Only a small subset of fat-tailed, extreme-valued distributions is in extremistan. For all the rest, we can use our usual tools.

Friday, July 5, 2019

A family has two children. One is a boy. Now, do the math!


Problem


A family has two children. One is a boy. How likely is it that the other child is a boy?


Popular yet wrong solution


"There are four possible cases: two boys, a boy and a girl, a girl and a boy, and two girls. But because one child is a boy, it can't be the last case (two girls), so there are only three cases. Therefore the probability is one-third."

This solution is popular. Among others, Nassim Nicholas Taleb (in a since-deleted tweet), vlogbrother Hank Green in an old SciShow episode (IIRC), probability instructors trying to show bored undergraduates how interesting their class is, and people interviewing job candidates have used this solution.

This solution is fun because it's counter-intuitive; because of that it also looks like a smart solution.

This solution is wrong.

It's wrong because after we use "one is a boy" to eliminate the possibility of a family with two girls, we can no longer divide the probability equally among the remaining three possibilities. Equal division of probability can be used in a case of no information, but not in a case when information has already been used to change the set of possibilities.

The more attentive reader will notice that this is the same error most people make in the Monty Hall three-door problem. As a general rule, it's a bad idea to try to solve math problems by hand-waving.

If it's a math problem, do the math.*


Frequentist approach


Let's say we have a large number of cases, 4000 families for example. That's 1000 for each combination of children: $(B,B), (B,G), (G,B)$, and $(G,G)$. Now we look at all the possibilities where we observe one of the children at random:

1000 $(B,B)$ families yield a total of 1000 boys;
1000 $(B,G)$ families yield a total of 500 boys;
1000 $(G,B)$ families yield a total of 500 boys;
1000 $(G,G)$ families yield a total of 0 boys.

We have a total of 2000 observed boys, and 1000 of these boys come from the case when the family has two boys, $(B,B)$. Half the time we observe a boy the underlying family has two boys; therefore the probability of a second boy is 1/2.

If instead of 4000 we had generic $N$ families, and called them "cases," this argument would be the frequentist derivation of the result. In frequentist parlance, the 2000 total boys are called the "possibles" and the 1000 boys from $(B,B)$ are called the "favorables." The probability is calculated as the ratio of favorables to possibles.

(The frequentist approach is how most people learn about probability and combinatorics.)
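Here's the same counting argument as a minimal Monte Carlo sketch (the family count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n_families = 1_000_000

# Each family has two children; 1 = boy, 0 = girl, equally likely.
families = rng.integers(0, 2, size=(n_families, 2))

# Observe one child at random from each family.
pick = rng.integers(0, 2, size=n_families)
observed = families[np.arange(n_families), pick]

# Keep only the families where the observed child is a boy,
# then ask how often the other child is also a boy.
saw_boy = families[observed == 1]
other_is_boy = saw_boy.sum(axis=1) == 2
print(f"Pr(other child is a boy | observed a boy) ~ {other_is_boy.mean():.3f}")
```

The answer comes out around 0.5, not 1/3.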


Bayesian approach


Frequentist arguments become unwieldy with more elaborate problems, so we can use this puzzle to illustrate a more elegant approach, Bayesian inference.†

First let's call things by their name: $(B,B), (B,G), (G,B)$, and $(G,G)$ are the unobserved states of the world. "One is a boy," which we'll represent by $B$, is an observed event.

Some events are uninformative, for example "one is blond," in that they don't help answer the question. Others like "one is a boy," $B$, are informative, because they help answer the question. But how can we tell?

Event $B$ is informative because it happens with different probabilities in different states of the world; therefore observing $B$ gives information about what states we're more likely to be in:

$\Pr(B|(B,B)) = 1$;
$\Pr(B|(B,G)) = 1/2$;
$\Pr(B|(G,B)) = 1/2$;
$\Pr(B|(G,G)) = 0$.

We don't know the unobserved state of the world (that is, in which of those four states the family in question falls), so in this situation we can assign equal probabilities to all four (we could look up demographics tables and confirm the numbers, but let's keep this simple):

$\Pr((B,B)) = \Pr((B,G)) = \Pr((G,B)) = \Pr((G,G)) = 1/4$.

What we want is the probability of the state $(B,B)$ having observed the event $B$; this is the conditional probability $\Pr((B,B)|B)$, which can be computed using the Bayes formula,

\[
\Pr((B,B)|B) = \frac{\Pr(B|(B,B)) \Pr((B,B))}{\Pr(B)}.
\]
Because the $\Pr(B)$ trips up a lot of people, let's be clear about what it is: it's the probability that you will observe a boy in general, not in this particular case; it's sometimes called the a priori probability or the unconditional probability. This is the probability that if we picked a two-child family at random and then picked one of the children at random, that child would be a boy. It's not "one, because we observe a boy," a common error.

To compute $\Pr(B)$ we must consider all four states of the world and add up ("integrate over the space of states" in expensive wording) the probability of observing a boy in each of these states weighed by the probability of the state itself:

$\begin{array}{rl}\Pr(B) =& \Pr(B|(B,B)) \Pr((B,B)) + \\
 & \Pr(B|(B,G)) \Pr((B,G)) +  \\
& \Pr(B|(G,B)) \Pr((G,B)) + \\
&\Pr(B|(G,G)) \Pr((G,G)) \\
=& 1/2
\end{array}$

(Unsurprisingly, it's 1/2, since half of the children are boys.)

Now we can compute our quantity of interest $\Pr((B,B)|B)$ by plugging the numbers into the Bayes formula. In fact, we can do that for all the states,

$\Pr((B,B)|B) = 1/2$;
$\Pr((B,G)|B) = 1/4$;
$\Pr((G,B)|B) = 1/4$;
$\Pr((G,G)|B) = 0$.

(As they used to say in the Soviet Union, trust but verify: check those numbers to be sure.)
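In the spirit of trust-but-verify, here's a minimal sketch that redoes the computation with exact fractions:

```python
from fractions import Fraction

# States of the world and the prior (equal probability over the four states).
states = ["BB", "BG", "GB", "GG"]
prior = {s: Fraction(1, 4) for s in states}

# Likelihood of observing a boy when one child is picked at random.
likelihood = {"BB": Fraction(1), "BG": Fraction(1, 2),
              "GB": Fraction(1, 2), "GG": Fraction(0)}

# Bayes: posterior(state) = likelihood(state) * prior(state) / Pr(B)
pr_b = sum(likelihood[s] * prior[s] for s in states)
posterior = {s: likelihood[s] * prior[s] / pr_b for s in states}

print("Pr(B) =", pr_b)                       # 1/2
for s in states:
    print(f"Pr({s}|B) = {posterior[s]}")     # 1/2, 1/4, 1/4, 0
```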



If it's a math problem, do the math.




-- -- -- --
* "Do the math" means apply the rules of math, not just the notation and numbers.

† There's a bit of a schism in statistical modeling between frequentists and Bayesians. I'll let you figure out which side I'm on.

Wednesday, July 3, 2019

From value to share (in math)



There are a few places online (and in some books) where a kind of magic happens. At some point we go from something like this
\[
v(x) = w_1 x_1 + \ldots + w_K x_K
\]
to, in a single magic jump,
\[
\Pr(\text{Buy } x) = \frac{\exp(v(x))}{1 + \exp(v(x))}.
\]
The $v(x) = w_1 x_1 + \ldots + w_K x_K$ is easy enough to understand: the value of some object $x$ is a weighted sum of the dimensions of $x$; there are $K$ dimensions $x_1, \ldots, x_K$ and associated weights, $w_1, \ldots, w_K$.

But then there's something along the lines of "using appropriate assumptions, the probability of a customer purchasing $x$ is given by" that fraction.

This magic jump appears in so many places discussing analytics, data science, big data, machine learning, artificial intelligence, and other fashionable ways to refer to statistics (because that's what's there) that we could be forgiven for suspecting that the people making the jump don't know where it comes from.

It's fine not to know where something comes from, as long as one is aware of that. But it's instructive to work through the simple cases, like this one. After all, a personal trainer needn't be a super-athlete, but should at least be able to exercise.

So, how do we go from the weighted sum to those exponentials (called the logit formula)?

First, that formula is only correct if, among other things: (a) the choice is between buying $x$ or nothing at all; and (b) the $v(\cdot)$ for buying nothing is set to zero, denoted $v(0) = 0$. Condition (a) makes this a simple binary choice and (b) is just a threshold; to move it around we can always add a constant $x_0$ to $v(x)$.

Now, why isn't the probability either one or zero, then? After all, either $v(x) > 0 = v(0)$ and the customer should always buy or $v(x) < 0 = v(0)$ and the customer should never buy.

That's the second step: there are things that can change the $v(x)$ that we don't observe. Maybe the customer's mood changes and the $w_i$ change with it, for example. We don't know, so we say that the customer optimizes "utility" $u(x)$ instead of value $v(x)$, and the difference is a stochastic disturbance (expensive wording for random stuff) $\epsilon$:
\[
u(x) = v(x) - \epsilon.
\]
(Let's assume that $u(0) = v(0) = 0$. It's not necessary but makes things simpler.) This second step changes the nature of the problem: instead of a decision (buy or not), we can only predict a probability:
\[
\Pr(\text{Buy } x) = \Pr(u(x)>0) = \Pr(v(x)> \epsilon).
\]
Now we can compute that probability: it's just the cumulative distribution for the $\epsilon$ variable evaluated at the point $v(x)$, denoted $F_{\epsilon}(v(x))$.


Unfortunately, by their very nature the $\epsilon$ aren't known, so we need to assume a probability distribution for $\epsilon$, one that's acceptable and computationally convenient.

We could try the Normal distribution, which most people accept without questioning, so it gets us the first desirable characteristic. Unfortunately, the $\exp(-z^2/2)$ gives it a cumulative distribution with no closed form. What we need is a distribution that looks more or less like a Normal but can be integrated into a manageable form.

Enter the logistic distribution.


It looks enough like a Normal (it has fatter tails, but that's hard to tell by looking). And its probability density function
\[
f_{\epsilon}(z) = \frac{\exp(- z)}{(1+\exp(-z))^2}
\]
integrates easily to
\[
F_{\epsilon}(v) = \int_{-\infty}^{v} f_{\epsilon}(z) \, dz = \frac{1}{1 + \exp(-v)} = \frac{\exp(v)}{1+\exp(v)},
\]
which, when replaced in the formula for probability of buy gives the logit formula:
\[
\Pr(\text{Buy } x) = \Pr(v(x)> \epsilon) = F_{\epsilon}(v(x)) = \frac{\exp(v(x))}{1+\exp(v(x))}.
\]
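And if you'd rather check the jump by simulation, here's a minimal sketch (the weights and attributes are made-up numbers, not anything from a real model): draw the disturbances from a standard logistic and compare the simulated purchase frequency with the logit formula.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical weights and product attributes, purely for illustration.
w = np.array([0.8, -0.3, 1.2])
x = np.array([1.0, 2.0, 0.5])
v = w @ x                                    # v(x), the deterministic value

# Simulate the unobserved disturbances epsilon from a standard logistic
# distribution and count how often u(x) = v(x) - epsilon exceeds u(0) = 0.
eps = rng.logistic(loc=0.0, scale=1.0, size=1_000_000)
simulated = np.mean(v - eps > 0)

# The logit formula derived above.
logit = np.exp(v) / (1 + np.exp(v))

print(f"simulated Pr(Buy) = {simulated:.4f}, logit formula = {logit:.4f}")
```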
When you know the math, you understand the magic.