Non-work posts by Jose Camoes Silva; repurposed in May 2019 as a blog mostly about innumeracy and related matters, though not exclusively.
Showing posts with label Probability.
Wednesday, July 3, 2019
From value to share (in math)
There are a few places online (and in some books) where a kind of magic happens. At some point we go from something like this
\[
v(x) = w_1 x_1 + \ldots + w_K x_K
\]
to, in a single magic jump,
\[
\Pr(\text{Buy } x) = \frac{\exp(v(x))}{1 + \exp(v(x))}.
\]
The $v(x) = w_1 x_1 + \ldots + w_K x_K$ is easy enough to understand: the value of some object $x$ is a weighted sum of the dimensions of $x$; there are $K$ dimensions $x_1, \ldots, x_K$ and associated weights, $w_1, \ldots, w_K$.
But then there's something along the lines of "using appropriate assumptions, the probability of a customer purchasing $x$ is given by" that fraction.
This magic jump appears in so many places discussing analytics, data science, big data, machine learning, artificial intelligence, and other fashionable ways to refer to statistics (because that's what's there) that we could be forgiven for suspecting that the people making the jump don't know where it comes from.
It's fine not to know where something comes from, as long as one is aware of that. But it's instructive to work through the simple cases, like this one. After all, a personal trainer needn't be a super-athlete, but should at least be able to exercise.
So, how do we go from the weighted sum to those exponentials (called the logit formula)?
First, that formula is only correct if, among other things: (a) the choice is between buying $x$ or nothing at all; and (b) the $v(\cdot)$ for buying nothing is set to zero, $v(0) = 0$. Assumption (a) makes this a simple binary choice, and (b) is just a normalization of the threshold; to move it around we can always add a constant $x_0$ to $v(x)$.
Now, why isn't the probability either one or zero, then? After all, either $v(x) > 0 = v(0)$ and the customer should always buy or $v(x) < 0 = v(0)$ and the customer should never buy.
That's the second step: there are things that can change the $v(x)$ that we don't observe. Maybe the customer's mood changes and the $w_i$ change with it, for example. We don't know, so we say that the customer optimizes "utility" $u(x)$ instead of value $v(x)$, and the difference is a stochastic disturbance (expensive wording for random stuff) $\epsilon$:
\[
u(x) = v(x) - \epsilon.
\]
(Let's assume that $u(0) = v(0) = 0$. It's not necessary but makes things simpler.) This second step changes the nature of the problem: instead of a decision (buy or not), we can only predict a probability:
\[
\Pr(\text{Buy } x) = \Pr(u(x)>0) = \Pr(v(x)> \epsilon).
\]
Now we can compute that probability: it's just the cumulative distribution for the $\epsilon$ variable evaluated at the point $v(x)$, denoted $F_{\epsilon}(v(x))$.
Unfortunately, by its very nature $\epsilon$ isn't observed, so we need to assume a probability distribution for it, one that's plausible and computationally convenient.
We could try the Normal distribution, which most people accept without questioning, so it gets us the first desirable characteristic. Unfortunately, the $\exp(-z^2/2)$ in its density means the CDF has no closed form. What we need is a distribution that looks more or less like a Normal but integrates to a manageable form.
Enter the logistic distribution.
It looks enough like a Normal (it has fatter tails, but that's hard to tell by looking). And its probability density function
\[
f_{\epsilon}(z) = \frac{\exp(- z)}{(1+\exp(-z))^2}
\]
integrates easily to
\[
F_{\epsilon}(v) = \int_{-\infty}^{v} f_{\epsilon}(z) \, dz = \frac{1}{1 + \exp(-v)} = \frac{\exp(v)}{1+\exp(v)},
\]
which, when replaced in the formula for probability of buy gives the logit formula:
\[
\Pr(\text{Buy } x) = \Pr(v(x)> \epsilon) = F_{\epsilon}(v(x)) = \frac{\exp(v(x))}{1+\exp(v(x))}.
\]
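As a sanity check on the derivation (a Python sketch of my own, not part of the original argument), we can sample logistic disturbances by inverting the CDF above and confirm that the empirical frequency of $v > \epsilon$ matches the closed-form logit:

```python
import math
import random

def logit_prob(v):
    """Closed-form purchase probability: exp(v) / (1 + exp(v))."""
    return math.exp(v) / (1.0 + math.exp(v))

def simulated_prob(v, n=200_000, seed=0):
    """Estimate Pr(v > eps) with eps ~ logistic(0, 1).

    Inverse-CDF sampling: if u ~ Uniform(0, 1), then
    eps = -log(1/u - 1) has the standard logistic distribution.
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n)
               if v > -math.log(1.0 / rng.random() - 1.0))
    return hits / n
```

With $v(x) = 1$, both `logit_prob(1.0)` and `simulated_prob(1.0)` come out near $0.73$, as the formula predicts.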
When you know the math, you understand the magic.
Tuesday, June 18, 2019
Hidden factor correlation
Correlation is not causation; everyone learns to say that. But if there's a correlation, there's probably some sort of causal relationship hiding somewhere, unless it's a spurious correlation.
If two variables, $A$ and $B$, are correlated, the three simplest causal relationships are: $A$ causes $B$; $B$ causes $A$; or $A$ and $B$ are caused by an unseen factor $C$. There are many more complicated causation relationships, but these are the three basic ones.
The third case, where an unseen variable $C$ is the real source of the correlation, is the one this post is about. To illustrate, let's say $C$ is a standard normal random variable, and $A$ and $B$ are noisy measures of $C$,
$ \qquad A = C + \epsilon_A$ and $ B = C + \epsilon_B$,
where the $\epsilon_i$ are independent draws from a normal distribution with mean zero and $\sigma_{\epsilon} = 0.05$.
To illustrate we generate 10,000 draws of $C$ and create the 10,000 $A$ and $B$ using R:
hidden_factor = rnorm(10000)                           # draws of the unseen C
var_A_visible = hidden_factor + 0.05 * rnorm(10000)    # A = C + eps_A
var_B_visible = hidden_factor + 0.05 * rnorm(10000)    # B = C + eps_B
Now we can plot $A$ against $B$, and the correlation is obvious; a linear model quantifies it:
model_no_control = lm(var_A_visible ~ var_B_visible)   # no control for C
summary(model_no_control)
With the result:
Call:
lm(formula = var_A_visible ~ var_B_visible)
Residuals:
Min 1Q Median 3Q Max
-0.271466 -0.047214 -0.000861 0.047400 0.302517
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0005294 0.0007025 0.754 0.451
var_B_visible 0.9975142 0.0006913 1442.852 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.07025 on 9998 degrees of freedom
Multiple R-squared: 0.9952, Adjusted R-squared: 0.9952
F-statistic: 2.082e+06 on 1 and 9998 DF, p-value: < 2.2e-16
So, both the model and the graph confirm a strong correlation ($p < 0.0001$) between $A$ and $B$. And in many real-life cases, this is used to support the idea that either $A$ causes $B$ or $B$ causes $A$.
Now we proceed to show how the hidden factor is relevant. First, let us plot the residuals, $A-C$ against $B-C$:
The apparent correlation has now disappeared. And a linear model including the hidden factor confirms this:
model_with_control = lm(var_A_visible ~ var_B_visible + hidden_factor)  # control for C
summary(model_with_control)
With the result
Call:
lm(formula = var_A_visible ~ var_B_visible + hidden_factor)
Residuals:
Min 1Q Median 3Q Max
-0.18347 -0.03382 -0.00021 0.03410 0.17780
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0007378 0.0004986 1.480 0.139
var_B_visible 0.0004082 0.0100560 0.041 0.968
hidden_factor 0.9997573 0.0100707 99.274 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04985 on 9997 degrees of freedom
Multiple R-squared: 0.9976, Adjusted R-squared: 0.9976
F-statistic: 2.072e+06 on 2 and 9997 DF, p-value: < 2.2e-16
Hidden factors are easy to test for, as seen here, but they are not always observed. For example, in nutrition papers there's often a hidden factor, how health-conscious an individual is, that more often than not causes both observables (say, exercising regularly and eating salads: high correlation, but exercising doesn't cause eating salads and eating salads doesn't cause exercising).
Correlation is not causation, but generally one can find a causal relationship behind a correlation, possibly one that involves hidden factors or more complex relationships.
Labels:
analytics,
causality,
Probability,
RStats,
statistics
Monday, June 17, 2019
Calculating God?
I don't believe that the God of any earthly religion is the creator of the Universe.
But I really dislike a lazy and innumerate argument commonly used to "prove" the non-existence of God, which can be summarized in the following false dichotomy:
"Either there is no God and the universe just 'poofed' into existence, or there's an infinite number of Gods, because the plane of existence for each God has to be created by a higher-level God."

This is a false dichotomy: it could well be that our universe was created by a powerful being from a higher-order universe, but that universe poofed into existence without a creator. Or maybe it did have a creator, whose universe poofed into existence; or that third universe may have had a creator...
Hey, this looks like dynamic programming. I know dynamic programming.
Let's say that universes are recursively nested until one of them just poofs into existence. Of course we can't see outside our universe, but we can build simple models.
So, our universe either poofed into existence (say with probability $p$) or it was created by some higher being (with probability $1-p$). Now we iterate the process: the 'level 2' universe either poofed into existence (with some probability $q$) or was created by a being from a 'level 3' universe (with probability $1-q$); and so on.
Time for a simplifying assumption, or as non-mathematicians call it, making things up. Let's assume that all these universes share the poofed/created probabilities, so that for any 'level $k$' universe, it poofed into existence with probability $p$ and was created by a being from a 'level $k+1$' universe with probability $1-p$.
Note that it's still possible to have an infinite number of universes, but with this formulation, the probability of a 'level $k$' universe (with us being 'level 1') being the last level is
$p (1-p)^{k-1}$.
This probability gets small pretty quickly, which suggests the 'infinite regress of universes' argument gets thin very fast.
Now we can compute the expected number of universes as a function of $p$:
\[
\mathbb{E}(n) = p + 2(1-p)p + 3(1-p)^2 p + \ldots + N(1-p)^{N-1} p + \ldots
\]
or, factoring,
\[
\mathbb{E}(n) = \frac{p}{1-p} \sum_{N \geq 1} N(1-p)^N .
\]
The sum $\sum_{N \geq 1} N x^N$ equals $x/(1-x)^2$; with $x = 1-p$ that's $(1-p)/p^2$, so
\[
\mathbb{E}(n) = \frac{1}{p}.
\]
Therefore, if we believe that the probability of a universe poofing into existence is 0.1, there are an expected ten universes; for 0.2, five universes; for 0.5, two universes.
Very far from 'turtles all the way down.'
Of course, these calculations were unnecessary, because as we know from the revelations of the prophet Terry Pratchett, it's four elephants on the back of the Great A'Tuin swimming in the Sea of Stars.
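The number of universes in this model is a geometric random variable, so $\mathbb{E}(n) = 1/p$ can also be checked by simulation (a sketch of my own; the function name is mine):

```python
import random

def expected_levels(p, trials=100_000, seed=0):
    """Mean number of nested universes when each level 'poofs' into
    existence with probability p and otherwise has a creator one
    level up."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        levels = 1
        while rng.random() > p:  # this level was created: go up one
            levels += 1
        total += levels
    return total / trials
```

`expected_levels(0.5)` comes out close to 2 and `expected_levels(0.1)` close to 10, matching the $1/p$ result.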
Labels:
Atheism,
atheists,
math,
mathematics,
Probability
Friday, February 24, 2017
If it's a math problem... do the math
Or, The Monty Hall problem: redux.
I recently posted a new video, addressing the Monty Hall problem. The problem is not the puzzle itself, which has been solved ad nauseam by everyone and their vlogbrother.
The video is about what information is. By working through the details of the Monty Hall puzzle, we can learn where information is revealed and how. That is the reason for the video; that and a plea for something so simple and yet so ignored that I'll repeat it again:
If it's a math problem, do the math.
Now, this may seem trivial, but math (and to some extent science, technology, and engineering, to say nothing of business, management, and economics) makes people uncomfortable, even people who say they "love math."
Hence the attempt to solve the problem with anything but computation: by waving hands and verbalizing (very error-prone), or by creating similar problems that might be insightful (but that mostly convince only those who already know and understand the solution).
If all you're interested in is the computations for the solution, they're here:
The point of the video is not this particular table; it's the insights about information on the path to it: how constraints to actions change probabilities and how those relate to information.
For example, from the viewpoint of the contestant, once she picks door 1 (thus giving Monty Hall a choice of door 2 and door 3 to open), the probability that Monty picks either door 2 or door 3 is precisely 1/2; that's calculated in the video, not assumed and not hand-waved. But, as the video then explains, that 50-50 probability isn't equally distributed across different states:
A final remark, from the video as well, is that by having computations one can avoid many time-wasters, who --- not having done any computations themselves and generally having a limited understanding of the whole state-event difference, which is essential to reasoning with conditional probabilities --- are now required to point out where they disagree with the computation, before moving forward with new "ideas."
If it's a math problem... do the math!
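In the spirit of doing the math, here is a small simulation (my own sketch, not the video's table) that checks two numbers discussed above: Monty opens door 2 half the time when the contestant picks door 1, and switching wins two-thirds of the time.

```python
import random

rng = random.Random(42)
n = 100_000
stay_wins = switch_wins = opens_door2 = 0
for _ in range(n):
    car = rng.randrange(3)  # doors indexed 0, 1, 2; contestant picks door 0
    pick = 0
    # Monty opens a door that hides a goat and isn't the contestant's pick
    monty = rng.choice([d for d in range(3) if d != pick and d != car])
    # The switch option is the remaining unopened door
    switch = next(d for d in range(3) if d not in (pick, monty))
    stay_wins += (pick == car)
    switch_wins += (switch == car)
    opens_door2 += (monty == 1)  # index 1 is "door 2" in the post's labeling
```

`stay_wins / n` comes out near $1/3$, `switch_wins / n` near $2/3$, and `opens_door2 / n` near $1/2$, as computed in the video.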
Friday, January 13, 2017
Medical tests and probabilities
You may have heard this one, but bear with me.
Let's say you get tested for a condition that affects ten percent of the population and the test is positive. The doctor says that the test is ninety percent accurate (presumably in both directions). How likely is it that you really have the condition?
[Think, think, think.]
Most people, including most doctors themselves, say something close to $90\%$; they might shade that number down a little, say to $80\%$, because they understand that "the base rate is important."
Yes, it is. That's why one must do computation rather than fall prey to anchor-and-adjustment biases.
Here's the computation for the example above (click for bigger):
One-half: $\Pr(\text{sick}|\text{positive}) = \frac{0.1 \times 0.9}{0.1 \times 0.9 + 0.9 \times 0.1} = 0.5$. That's the probability that you have the condition given the positive test result.
We can get a little more general: if the base rate is $\Pr(\text{sick}) = p$ and the accuracy (assumed symmetric) of the test is $\Pr(\text{positive}|\text{sick}) = \Pr(\text{negative}|\text{not sick}) = r $, then the probability of being sick given a positive test result is
\[ \Pr(\text{sick}|\text{positive}) = \frac{p \times r}{p \times r + (1- p) \times (1-r)}. \]
The following table shows that probability for a variety of base rates and test accuracies (again, assuming that the test is symmetric, that is, the probabilities of a false positive and of a false negative are the same; more about that below).
A quick perusal of this table shows some interesting things, such as the really low probabilities, even with very accurate tests, for the very small base rates (so, if you get a positive result for a very rare disease, don't fret too much, do the follow-up).
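The symmetric-test posterior is a one-line function in any language; this Python sketch (function name mine) reproduces the worked example and the low-base-rate behavior:

```python
def posterior(p, r):
    """Pr(sick | positive) for base rate p and symmetric test
    accuracy r, i.e. Pr(pos|sick) = Pr(neg|not sick) = r."""
    return (p * r) / (p * r + (1 - p) * (1 - r))
```

`posterior(0.1, 0.9)` is exactly one-half, and `posterior(0.001, 0.99)` is only about $0.09$: a positive result on a very rare condition, even from a very accurate test, still leaves you probably healthy.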
There are many philosophical objections to all the above, but as a good engineer I'll ignore them all and go straight to the interesting questions that people ask about that table, for example, how the accuracy or precision of the test works.
Let's say you have a test of some sort, cholesterol, blood pressure, etc; it produces some output variable that we'll assume is continuous. Then, there will be a distribution of these values for people who are healthy and, if the test is of any use, a different distribution for people who are sick. The scale is the same, but, for example, healthy people have, let's say, blood pressure values centered around 110 over 80, while sick people have blood pressure values centered around 140 over 100.
So, depending on the variables measured, the type of technology available, the combination of variables, one can have more or less overlap between the distributions of the test variable for healthy and sick people.
Assuming for illustration normal distributions with equal variance, here are two different tests, the second one being more precise than the first one:
Note that these distributions are fixed by the technology, the medical variables, the biochemistry, etc; the two examples above would, for example, be the difference between comparing blood pressures (test 1) and measuring some blood chemical that is more closely associated with the medical condition (test 2), not some statistical magic made on the same variable.
Note that there are other ways that a test A can be more precise than test B, for example if the variances for A are smaller than for B, even if the means are the same; or if the distributions themselves are asymmetric, with longer tails on the appropriate side (so that the overlap becomes much smaller).
(Note that the use of normal distributions with similar variances above was only for example purposes; most actual tests have significant asymmetries and different variances for the healthy versus sick populations. It's something that people who discover and refine testing technologies rely on to come up with their tests. I'll continue to use the same-variance normals in my examples, for simplicity.)
A second question that interested (and interesting) people ask about these numbers is why the tests are symmetric (the probability of a false positive equal to that of a false negative).
They are symmetric in the examples we use to explain them, since it makes the computation simpler. In reality almost all important preliminary tests have a built-in bias towards the most robust outcome.
For example, many tests for dangerous conditions have a built-in positive bias, since the outcome of a positive preliminary test is more testing (usually followed by relief since the positive was a false positive), while the outcome of a negative can be lack of treatment for an existing condition (if it's a false negative).
To change the test from a symmetric error to a positive bias, all that is necessary is to change the threshold between positive and negative towards the side of the negative:
In fact, if you, the patient, have access to the raw data (you should be able to, at least in the US where doctors treat patients like humans, not NHS cost units), you can see how far off the threshold you are and look up actual distribution tables on the internet. (Don't argue these with your HMO doctor, though, most of them don't understand statistical arguments.)
For illustration, here are the posterior probabilities for a test that has bias $k$ in favor of false positives, understood as $\Pr(\text{positive}|\text{not sick}) = k \times \Pr(\text{negative}|\text{sick})$, for some different base rates $p$ and probability of accurate positive test $r$ (as above):
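Computing the biased-test posterior needs only one extra parameter; a Python sketch (function name mine, using the bias definition just given, $\Pr(\text{positive}|\text{not sick}) = k (1-r)$):

```python
def posterior_biased(p, r, k):
    """Pr(sick | positive) when the false-positive rate is k times
    the false-negative rate: Pr(pos | not sick) = k * (1 - r)."""
    false_pos = k * (1 - r)
    return (p * r) / (p * r + (1 - p) * false_pos)
```

With $k = 1$ this reduces to the symmetric case (one-half for $p = 0.1$, $r = 0.9$); with $k = 2$ the same test yields one-third, lower as promised.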
So, this is good news: if you get a scary positive test for a dangerous medical condition, that test is probably biased towards false positives (because of the scary part) and therefore the probability that you actually have that scary condition is much lower than you'd think, even if you'd been trained in statistical thinking (because that training, for simplicity, almost always uses symmetric tests). Therefore, be a little more relaxed when getting the follow-up test.
There's a third interesting question that people ask when shown the computation above: the probability of someone getting tested to begin with. It's an interesting question because in all these computational examples we assume that the population that gets tested has the same distribution of sick and healthy people as the general population. But the decision to be tested is usually driven by some reason (mild symptoms, hypochondria, job requirement), so the population of those tested may have a higher incidence of the condition than the general population.
This can be modeled by adding elements to the computation, which makes it more cumbersome and detracts from its value in making the point that base rates are very important. But it's a good elaboration, and many models used by doctors over-estimate base rates precisely because they miss this selection into testing. More good news there!
Probabilities: so important to understand, so thoroughly misunderstood.
- - - - -
Production notes
1. There's nothing new above, but I've had to make this argument dozens of times to people and forum dwellers (particularly difficult when they've just received a positive result for some scary condition), so I decided to write a post that I can point people to.
2. [warning: rant] As someone who has railed against the use of spline drawing and quarter-ellipses in other people's slides, I did the right thing and plotted those normal distributions from the actual normal distribution formula. That's why they don't look like the overly-rounded "normal" distributions in some other people's slides: those people make their "normals" with free-hand spline drawing and their exponentials with quarter-ellipses. That's extremely lazy in an age when any spreadsheet, RStats, Matlab, or Mathematica can easily plot the actual curve. The people I mean know who they are. [end rant]
Wednesday, November 9, 2016
Powerlifters vs Gym Rats, take 2
(This is a redo of the numbers in my previous powerlifters vs gym rats post, with assumptions that are less favorable to powerlifters.)
First, since we need some sort of metric to compare athletes, I'll unbiasedly 😀 choose the average of three lifts, bench press, deadlift, and squat, as a percentage of the bodyweight of the athlete. Call that metric $S$.
We'll use a standard Normal for the distribution of this metric, by subtracting the mean (100 percent of bodyweight for non-powerlifters, assuming that the average gym rat can bench, deadlift, and squat their own bodyweight) and dividing by the standard deviation (say 15 percent of bodyweight, using the scientific approach of judging 10 to be too little and 20 to be too much). In other words, for non-powerlifters, $z \doteq (S-100)/15.$
As in the previous post, we'll assume that powerlifters are 1 percent of the gym rats; but instead of the powerlifters having a mean at 2 (in $z$ space, 130 in $S$ space), they only have a one-SD advantage, that is their mean is at 1 (in $z$ space, 115 in $S$ space). In other words
$\qquad z \sim \mathcal{N}(0,1)\qquad $ for non-powerlifters
$\qquad z \sim \mathcal{N}(1,1)\qquad $ for powerlifters
Using these assumptions we can now compute the percentage of powerlifters among the gym population above a given threshold; we can also compute the median score of all athletes who score above that threshold (click for larger):
Note that the conditional median we're using here is lower than the conditional mean, as the conditional distribution is skewed to the right, i.e. has a long right tail. For skewed distributions the median is the more informative choice for a "sense of what we'll see in the gym."*
It's interesting to note that this is the median of the combined distribution of powerlifters and other gym rats, weighted by their proportions in the population above the threshold; the difference between this median and the threshold is therefore a non-monotonic function of the threshold, as the curvature and the weight of each type of athlete's distribution change significantly over the $1-8$ range of the table.
Under these weaker assumptions (pun intended), only when the threshold for inclusion passes 5 standard deviations above the other gym-goers' mean do powerlifters become the majority of the qualifying athletes. Unless the gym is full of football players (that's American football), weightlifters, and strongman competitors, I think these assumptions are too unfavorable to powerlifters.
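For readers who want to check the table, here's a small Python sketch of the share computation, using only the standard library; the $1\%$ powerlifter fraction and the $\mathcal{N}(1,1)$-versus-$\mathcal{N}(0,1)$ distributions are the assumptions above, and the function names are just illustrative:

```python
import math

def tail(z):
    """P(Z > z) for a standard Normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def powerlifter_share(threshold, frac_pl=0.01, mu_pl=1.0):
    """Fraction of athletes above `threshold` (in z space) who are powerlifters."""
    pl = frac_pl * tail(threshold - mu_pl)   # powerlifters ~ N(1, 1)
    gr = (1 - frac_pl) * tail(threshold)     # other gym rats ~ N(0, 1)
    return pl / (pl + gr)

for t in range(1, 9):
    print(t, round(powerlifter_share(t), 3))
```

The share crosses one-half only between thresholds 4 and 5, matching the "majority only past 5 standard deviations" observation above.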
Here are some strong athletes moving metal, for variety (NSFW language):
"While they squat I eat cookies" has to be the most powerlifter-y sentence ever.
Update Nov 11, 2016: Here's the percentage of powerlifters in the population of qualifying athletes for different assumptions about the advantage of powerlifters (i.e. the mean of the powerlifters' distribution in standard deviation units); click for larger:
- - - - - -
* Unless there are CrossFit-ers in the gym, in which case what we typically see in the gym is dangerous, counter-productive nonsense.
Saturday, March 5, 2016
Powerlifters vs Gym Rats - A tale of two means
In my last post I wrote:
For example, some time ago I had a discussion with a friend about strength training. The gist of it was that powerlifters are typically much stronger than the average athlete, but they are also much fewer; because of that, in a typical gym the strongest athlete might not be a powerlifter, but as we get into regional competitions and national competitions, the winner is going to be a powerlifter.
And the explanation, which the friend didn't understand, was "because on the upper tail the difference between means is going to dominate the difference in sizes of the population."
So here's an illustration of what I meant, with pictures and numbers and bad jokes.
First let's make the setup explicit. That's the great power of math and numerical examples, making things explicit. "Powerlifters are typically much stronger than the average athlete" will be operationalized with four assumptions:
A1: There's some composite metric of strength, call it $S$ that we care about and we'll normalize it so that the average gym rat has a mean $\mu(S_{\mathrm{GR}})$ of zero and a variance of $1$.
A2: The distribution of strength within the population of gym rats is Normally distributed.
A3: The distribution of strength in the sub-population of powerlifters is also Normally distributed.
A4: For illustration purposes only, we will assume that powerlifters have a mean $\mu(S_{\mathrm{PL}})$ of 2 and the same variance as the rest of the gym rats.
We operationalize "they are also much fewer" with
A5: For illustration, the number of powerlifters is $1\%$ of gym rats.
(Powerlifters are gym rats, so the distribution for $S_{\mathrm{GR}}$ includes these $1\%$, balanced by CrossFit people, who bring down the mean strength and IQ in the gym while raising the insurance premiums. Watch Elgintensity to understand.)
The following figure shows the distributions:
When we look at the people in a gym with above-average strength, that is people with $S_{\mathrm{GR}}>0$, we find that one-half of all gym rats have that, and $98\%$ of all powerlifters have that: $\Pr(S_{\mathrm{GR}}>0) = 0.5$ and $\Pr(S_{\mathrm{PL}}>0) = 0.98$. This is illustrated in the next figure:
Powerlifters are over-represented among the above-average-strength athletes, at approximately twice their share of the general population, but they are still only about $2\%$ of the total, as their over-representation multiplies their $1\%$ base rate.
As we become more selective, the over-representation goes up. For athletes that are at least one standard deviation above the mean, we have:
with $\Pr(S_{\mathrm{GR}}>1) = 0.16$ and $\Pr(S_{\mathrm{PL}}>1) = 0.84$. Powerlifters are over-represented 5-fold, so about $5\%$ of the total athletes in this category.
When we become more and more selective, for example when we compute the number of gym rats that have at least as much strength as the average powerlifter, $\Pr(S_{\mathrm{GR}}>2)$, we get
with $\Pr(S_{\mathrm{GR}}>2) = 0.023$ and $\Pr(S_{\mathrm{PL}}>2) = 0.5$, a 22-fold over-representation, meaning that of every six athletes in this category, one is a powerlifter. (Yes, one out of six, not one out of five. See if you can figure out why; if not, look at the solution for $S>6$ below and you'll understand. Or not, but that's a different problem.)
And as we look at subsets of stronger and stronger athletes, the over-representation of powerlifters becomes higher and higher: $\Pr(S_{\mathrm{GR}}>3) = 0.00135$ and $\Pr(S_{\mathrm{PL}}>3) = 0.159$, a $118$-fold ratio. There will be a few more powerlifters in this group than other gym rats; another way to say that is that powerlifters will be a little bit more than one-half of all gym rats that are at least one standard deviation stronger than the average powerlifter.
The ratios grow exponentially with increasing values for strength (the rare correct use of "exponentially" as they are ratios of Normal distribution tail probabilities; see below).
For $S>4$ the ratio is $718$, for $S>5$ the ratio is $4700$, and for $S>6$ the ratio is $32\,100$; in other words, there will be one non-powerlifter per group of $322$ gym rats with strength greater than 6 standard deviations above the mean of all gym rats.
This is what the effect of the differences in the tails of Normals always implies: eventually the small size of the better population (powerlifters) will be irrelevant as the higher mean will dominate.
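Those tail probabilities and ratios are easy to verify with a few lines of Python (standard library only; the means of $0$ and $2$ and the unit variances are assumptions A1-A5):

```python
import math

def tail(z):
    """P(Z > z) for a standard Normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def overrepresentation(t, mu_pl=2.0):
    """Ratio P(S_PL > t) / P(S_GR > t) under assumptions A1-A5."""
    return tail(t - mu_pl) / tail(t)

for t in range(0, 7):
    print(t, round(overrepresentation(t), 1))
```

The printed ratios reproduce the figures quoted in the post: roughly 2-fold at the mean, 5-fold at one standard deviation, then 22, 118, 718, 4700, and about 32,100 at six standard deviations.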
See? That wasn't complicated at all.
-- -- -- --
For the mathematically inclined (strangely themselves over-represented in the set of powerlifters...)
Note that the ratio of probability density functions for the two Normal distributions in the post, for realizations of strength $S = x$ is
\[
\frac{f_{S}(x|\mu_{S}=2)}{f_{S}(x|\mu_{S}=0)}= \frac{e^{-(x-2)^2/2}}{e^{-x^2/2}}= e^{2x-2}
\]
which grows unbounded with $x$; no matter how small the fraction of powerlifters, say $\epsilon$, there's always a minimal $\bar S$ beyond which that ratio becomes greater than $1/\epsilon$, which means that at some point above $\bar S$ the ratio of the remaining tails themselves becomes greater than $1/\epsilon$. (It's very easy to calculate $\bar S$ and I have done so; I'll leave it as an exercise for the dedicated reader...)
Oh, that's the rare occurrence of the correct use of "exponentially," which is usually incorrectly treated as a synonym for "convex."
Labels:
math,
mathematics,
Probability,
statistics
Wednesday, March 2, 2016
Acalculia, innumeracy, or numerophobia?
I think there's an epidemic of number-induced brain paralysis going around.
There are quite a few examples of quant questions in interviews creating the mental equivalent of a frozen operating system (including this post by Sprezzaturian), but I think that there's something beyond that, something that applies in social situations and that affects people who should know better.
Here's a simple example. What is the orbital speed of the International Space Station, roughly? No, don't google it, calculate it. Orbital period is about 90 minutes, altitude (distance to ground) about 400 km, Earth radius about 6370 km.
Seriously, this question stumps people with university degrees, including some in the life sciences who have necessarily taken college-level science courses.
And what college-level math do you need to answer it? The formula for the circumference of a circle of radius $r$. Yes, $2\times\pi\times r$. The orbital velocity in km/h is the number of kilometers per orbit ($2\times\pi\times (6370+400)$) divided by the time per orbit in hours ($1\frac{1}{2}$), that is, around $28\,000$ km/h, which is close to the actual value, $27\,600$ km/h. (The orbit is an ellipse and takes a bit more than 90 minutes.)
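For anyone who wants to check the arithmetic, the whole back-of-the-envelope calculation fits in a few lines of Python:

```python
import math

radius_km = 6370 + 400   # Earth radius plus ISS altitude
period_h = 1.5           # ~90 minutes per orbit
# orbital speed = circumference of the (roughly circular) orbit / period
speed_kmh = 2 * math.pi * radius_km / period_h
print(round(speed_kmh))  # roughly 28 000 km/h
```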
Can it possibly be ignorance, innumeracy? Is it plausible that college-educated professionals don't know the circumference formula? Nope, they can recite the formula when prompted.
Or is it acalculia? That they have a mental inability to do calculation? Nope, they can compute exactly how much I owe on the lunch bill for the extra crème brûlée and the expensive entrée.
No, I think it's a mild case of numerophobia, a mental paralysis created by the appearance of an unexpected numerical challenge in everyday life. This is a problem, as most of the world can be perceived more deeply if one thinks like a quant all the time; many strange "paradoxes" become obvious when seen through the lens of numerical (or parametric) thinking.
For example, some time ago I had a discussion with a friend about strength training. The gist of it was that powerlifters are typically much stronger than the average athlete, but they are also much fewer; because of that, in a typical gym the strongest athlete might not be a powerlifter, but as we get into regional competitions and national competitions, the winner is going to be a powerlifter.
"That's because on the upper tail the difference between means is going to dominate the difference in sizes of the population." That quoted sentence is what I said. I might as well have said "boo-blee-gaa-gee in-a-gadda-vida hidee-hidee-hidee-oh" for all the comprehension. The friend is an engineer. A numbers person. But apparently, numbers are work-domain only.
The awesome power of quant thinking is being blocked by this strange social numerophobia. We must fight it. Liberate your inner quant; learn to love numbers in all areas of life.
Everything is numbers.
Sunday, January 24, 2016
Big numbers, big confusion; small numbers, bigger confusion.
I can make something almost impossible happen. And so can you.
Let's start by defining what "almost impossible" means. Less than one-in-a-trillion chance? How about less than one in a trillion-trillion chance? One in a trillion-trillion-trillion chance?
OK, let's take a breath here. What's this trillion-trillion and trillion-etc stuff?
(In my observation, economists say million, billion, trillion, and all their audiences hear is "big number." Innumeracy over scale has bad consequences when applied to public policy.)
One trillion is 1,000,000,000,000. (Yes, I'm using American billions, pretty much like everyone else now does.) This is written as $10^{12}$. A trillion trillion is $10^{12} \times 10^{12} = 10^{24}$ and a trillion-trillion-trillion is $10^{36}$, a one followed by thirty-six zeros.
To put that number in perspective, the age of the Earth is about $4.5$ billion years, or about $1.42 \times 10^{17}$ seconds. That's 142,000 trillion seconds. Note that this is much smaller than a trillion-trillion seconds (it's about one seven-millionth of a trillion-trillion), let alone a trillion-trillion-trillion. If you had seven million planets the same age as the Earth, and you picked at random one specific second in the history of one specific planet, you'd have about a one in a trillion-trillion chance of picking this precise second on this planet. A one in a trillion-trillion-trillion chance is one trillion times smaller than that.
So, something that has a one in a trillion-trillion-trillion chance of happening has to be a very low probability event. Shall we call anything less likely than that "almost impossible"? We shall.
So, here's how I make something almost impossible happen, over and over again, and you can too: shuffle a deck of cards.
Using only 52 cards (no jokers), there are $52! = 52\times 51 \times \ldots \times 2$ possible card shuffles, and $52! \approx 8.1 \times 10^{67}$. That number is $8.1\times 10^{31}$ times bigger than a trillion-trillion-trillion.
And yet, every card shuffle produces an event with 1-in-$8.1 \times 10^{67}$ probability. You and I can generate scores of these "almost impossible" events using a simple deck of cards.
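The count itself is a one-liner in Python, if you want to see the number:

```python
import math

shuffles = math.factorial(52)  # number of possible orderings of a 52-card deck
print(f"{shuffles:.2e}")       # about 8.07e67
# far beyond the one-in-a-trillion-trillion-trillion bar set above
print(shuffles > 10**36)
```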
(A little thinking will lead an attentive reader to the solution to this apparent paradox. It's not a paradox. I will post a solution here in a few days, if I remember :-)
-- -- -- --
Just for fun, a simple brain teaser:
Imagine you have two decks of 52 cards (blue and red); what has more possible combinations, shuffling the two decks together and dividing into two piles of 52 cards by separating in the middle of the full shuffled two decks, or shuffling the two decks separately each into its pile?
(Yes, it's obvious for anyone conversant with combinatorics, but apparently not everyone is conversant with combinatorics. Common answer: "it's the same.")
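(Spoiler ahead.) A quick Python check of the two counts, treating all 104 cards as distinguishable since the decks have different backs:

```python
import math

together = math.factorial(104)      # shuffle both decks as one 104-card stack, then cut
separate = math.factorial(52) ** 2  # shuffle each 52-card deck on its own
print(together > separate)          # the combined shuffle has far more orderings
print(f"{together / separate:.2e}") # the gap is exactly C(104, 52), about 1.6e30
```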
Monday, November 26, 2012
How misleading "expected value" can be
The expression "expected value" can be highly misleading.
I was just writing some research results and used the expression "expected value" in relation to a discrete random walk of the form
$x[n+1] = \left\{ \begin{array}{ll}
x[n] + 1 & \qquad \text{with prob. } 1/2 \\
& \\
x[n] -1 & \qquad \text{with prob. } 1/2
\end{array}\right. $ .
This random walk is a martingale, so
$E\big[x[n+1]\big|x[n]\big] = x[n]$.
But from the above formula it's clear that it's never the case that $x[n+1] = x[n]$. Therefore, saying that $x[n+1]$'s expected value is $x[n]$ is misleading — in the sense that a large number of people may expect the event $x[n+1] = x[n]$ to occur rather frequently.
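A quick simulation sketch makes both points concrete: the increments average out to zero, yet no single step ever leaves $x[n]$ unchanged:

```python
import random

random.seed(1)
# increments of the random walk: +1 or -1, each with probability 1/2
steps = [random.choice((-1, 1)) for _ in range(100_000)]

# the expected increment is (close to) zero...
print(abs(sum(steps)) / len(steps))
# ...but x[n+1] == x[n] never happens: every increment is +/-1
print(all(s != 0 for s in steps))
```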
Mathematical language may share words with daily usage, but the meaning can be very different.
----
Added Nov 27: In the random walk above, for any odd $k$, $x[n+k] \neq x[n]$. On the other hand, here's an example of a martingale where $x[n+1] = x[n]$ happens with probability $p$, just for illustration:
$x[n+1] = \left\{ \begin{array}{ll}
x[n] + 1 & \qquad \text{with prob. } (1-p)/2 \\
& \\
x[n] & \qquad \text{with prob. } p \\
& \\
x[n] -1 & \qquad \text{with prob. } (1-p)/2
\end{array}\right. $ .
(Someone asked if it was possible to have such a martingale, which makes me fear for the future of the world. Also, I'm clearly going for popular appeal in this blog...)
Labels:
mathematics,
Probability
Thursday, January 19, 2012
A tale of two long tails
Power law (Zipf) long tails versus exponential (Poisson) long tails: mathematical musings with important real-world implications.
There's a lot of talk about long tails, both in finance (where fat tails, a/k/a kurtosis, turn hedging strategies into a false sense of safety) and in retail (where some people think they just invented niche marketing). I leave finance to people with better salaries (I mean, brainpower), and focus only on retail for my examples.
A lot of money can be made serving the customers on the long tail; that much we already knew from decades of niche marketing. The question is how much, and for this there are quite a few considerations; I will focus on the difference between exponential decay (Poisson) long tails and hyperbolic decay (power law) long tails and how that difference would impact different emphasis on long tail targeting (that is, how much to invest going after these niche customers), say for a bookstore.
A Poisson distribution over $N\ge 0$ with parameter $\lambda$ has probability mass function:
$ \Pr(N=n|\lambda) =\frac{\lambda^{n}\, e^{-\lambda}}{n!}$.
A discrete power law (Zipf) distribution for $N\ge 1$ with parameter $s$ is given by:
$ \Pr(N=n|s) =\frac{n^{-s}}{\zeta(s)},$
where $\zeta(s)$ is the Riemann zeta function; note that it's only a scaling factor given $s$.
A couple of observations:
1. Because the power law has $\Pr(N=0|s)=0$, I'll actually use a Poisson + 1 process for the exponential long tail. This essentially means that the analysis would be restricted to people who buy at least one book. This assumption is not as bad as it might seem: (a) for brick-and-mortar retailers, this data is only collected when there's an actual purchase; (b) the process of buying a book at all -- which includes going to the store -- may be different from the process of deciding whether to buy a given book or the number of books to buy.
2. Since I'm not calibrating the parameters of these distributions on client data (which is confidential), I'm going to set these parameters to equalize the means of the two long tails. There are other approaches, for example setting them to minimize a measure of distance, say the Kullback-Leibler divergence or the mean square error, but the equal means is simpler.
The following diagram compares a Zipf distribution with $s=3$ (which makes $\mu=1.37$) and a 1 + Poisson process with $\lambda=0.37$ (click for larger):
The important data is the grey line, which maps into the right-side logarithmic scale: for all the visually impressive differences in the small numbers $N$ on the left, the really large ratios happen in the long tail. This is one of the issues a lot of probabilists point out to practitioners: it's really important to understand the behavior at the small probability areas of the distribution support, especially if they represent -- say -- the possibility of catastrophic losses in finance or the potential for the customers who buy large numbers of books.
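The tail comparison in the figure can be reproduced with a short Python sketch (standard library only; $s=3$ and $\lambda=0.37$ are the calibration above, with $\zeta(3)$ approximated by a partial sum):

```python
import math

S, LAM = 3, 0.37
zeta3 = sum(n ** -S for n in range(1, 200_000))  # partial sum, ~1.2020569

def p_zipf(n):
    """Zipf(s=3) probability of buying n books, n >= 1."""
    return n ** -S / zeta3

def p_pois1(n):
    """1 + Poisson(0.37) probability of buying n books, n >= 1."""
    k = n - 1
    return math.exp(-LAM) * LAM ** k / math.factorial(k)

# both calibrated to (almost) the same mean...
mean_zipf = sum(n * p_zipf(n) for n in range(1, 200_000))
print(round(mean_zipf, 2), 1 + LAM)
# ...but the power law dominates in the tail
print(p_zipf(10) / p_pois1(10))  # millions-to-one by n = 10
```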
An aside, from Seth Godin, about the importance of the heavy user segment in bookstores:
Amazon and the Kindle have killed the bookstore. Why? Because people who buy 100 or 300 books a year are gone forever. The typical American buys just one book a year for pleasure. Those people are meaningless to a bookstore. It's the heavy users that matter, and now officially, as 2009 ends, they have abandoned the bookstore. It's over.
To illustrate the importance of even the relatively small ratios for a few books, this diagram shows the percentage of purchases categorized by size of purchase:
Yes, the large number of customers who buy a small number of books still gets a large percent of the total, but each of these is not a good customer to have: elaborating on Seth's post, these one-book customers are costly to serve, typically will buy a heavily-discounted best-seller and are unlikely to buy the high-margin specialized books, and tend to be followers, not influencers of what other customers will spend money on (so there are no spillovers from their purchase).
The small probabilities have been ignored long enough; finance is now becoming wary of kurtosis, and marketing should go back to its roots and merge niche marketing with big data, instead of trying to reinvent the well-known wheel.
Lunchtime addendum: The differences between the exponential and the power law long tail are reproduced, to a smaller extent, across different power law regimes:
Note that the logarithmic scale implies that the increasing vertical distances with $N$ are in fact increasing probability ratios.
- - - - - - - - -
Well, that plan to make this blog more popular really panned out, didn't it? :-)
Friday, December 2, 2011
Dilbert gets the Correlation-Causation difference wrong
This was the Dilbert comic strip for Nov. 28, 2011:
It seems to imply that even though there's a correlation between the pointy-haired boss leaving Dilbert's cubicle and receiving an anonymous email about the worst boss in the world, there's no causation.
THAT IS WRONG!
Clearly there's causation: PHB leaves Dilbert's cubicle, which causes Wally to send the anonymous email. PHB's inference that Dilbert sent the email is wrong, but that doesn't mean the correlation he noticed isn't, in this case, created by a causal link between leaving Dilbert's cubicle and getting the email.
I think Edward Tufte once said that the statement "correlation is not causation" was incomplete; at least it should read "correlation is not causation, but it sure hints at some relationship that must be investigated further." Or words to that effect.
Labels:
causality,
Dilbert,
Probability,
statistics
Thursday, November 24, 2011
Data cleaning or cherry-picking?
Sometimes there's a fine line between data cleaning and cherry-picking your data.
My new favorite example of this is based on something Nassim Nicholas Taleb said at a talk at Penn (starting at 32 minutes in): that 92% of all kurtosis for silver in the last 40 years of trading could be traced to a single day; 83% of stock market kurtosis could also be traced to one day in 40 years.
One day in forty years is about 1/14,600 of all data. Such a disproportionate effect might lead some "outlier hunters" to discard that one data point. After all, there are many data butchers (not scientists if they do this) who create arbitrary rules for outlier detection (say, more than four standard deviations away from the mean) and use them without thinking.
In the NNT case, however, that would be counterproductive: the whole point of measuring kurtosis (or, in his argument, the problem that kurtosis is not measurable in any practical way) is to hedge against risk correctly. Underestimating kurtosis will create ineffective hedges, so disposing of the "outlier" will undermine the whole point of the estimation.
In a recent research project I removed one data point from the analysis, deeming it an outlier. But I didn't do it just because it was several standard deviations from the mean. I flagged it because its aggregate behavior was five standard deviations above the mean; I then examined the disaggregate data and confirmed that the behavior was anomalous: the experimental subject had clicked several times on links and immediately clicked back, without even looking at the linked pages. This temporally disaggregate behavior, not the aggregate measure of total clicks, was the reason I deemed the datum an outlier and excluded it from the analysis.
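A minimal sketch of that two-step process (the data, names, and thresholds are entirely hypothetical; the point is that the aggregate flag only triggers inspection, while the disaggregate behavior makes the call):

```python
from statistics import mean, stdev

def flag_outliers(totals, k=5.0):
    """Step 1: flag subjects whose aggregate measure is k sigma above the mean."""
    m, s = mean(totals.values()), stdev(totals.values())
    return [subj for subj, v in totals.items() if v > m + k * s]

# Hypothetical click totals per experimental subject.
totals = {f"subj{i}": 10 + (i % 5) for i in range(40)}
totals["subj40"] = 300  # the anomalous subject

for subj in flag_outliers(totals):
    # Step 2 (manual): inspect the disaggregate, per-click records for each
    # flagged subject before deciding whether to exclude the data point.
    print(f"inspect disaggregate records for {subj}")
```

The flagging rule never decides anything by itself; it only tells the analyst where to look.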
Data cleaning is an important step in data analysis. We should take care to ensure that it's done correctly.
Sunday, November 13, 2011
Vanity Fair bungles probability example
There's an interesting article about Danny Kahneman in Vanity Fair, written by Michael Lewis. Kahneman's book Thinking, Fast and Slow is an interesting review of the state of decision psychology and well worth reading, as is the Vanity Fair article.
But the quiz attached to that article is an example of how not to popularize technical content.
This example, question 2, is wrong:
A team of psychologists performed personality tests on 100 professionals, of which 30 were engineers and 70 were lawyers. Brief descriptions were written for each subject. The following is a sample of one of the resulting descriptions:
Jack is a 45-year-old man. He is married and has four children. He is generally conservative, careful, and ambitious. He shows no interest in political and social issues and spends most of his free time on his many hobbies, which include home carpentry, sailing, and mathematics.
What is the probability that Jack is one of the 30 engineers?
A. 10–40 percent
B. 40–60 percent
C. 60–80 percent
D. 80–100 percent
If you answered anything but A (the correct response being precisely 30 percent), you have fallen victim to the representativeness heuristic again, despite having just read about it.
No. Most people have knowledge beyond what is in the description; so, starting from the appropriate prior probabilities, $p(law) = 0.7$ and $p(eng) = 0.3$, they update them with the fact that engineers like math more than lawyers, $p(math|eng) \gg p(math|law)$. For illustration consider
$p(math|eng) = 0.5$; half the engineers have math as a hobby.
$p(math|law) = 0.001$; one in a thousand lawyers has math as a hobby.
Then the posterior probabilities (once the description is known) are given by
$p(eng|math) = \frac{ p(math|eng) \times p(eng)}{p(math)}$
$p(law|math) = \frac{ p(math|law) \times p(law)}{p(math)}$
with $p(math) = p(math|eng) \times p(eng) + p(math|law) \times p(law)$. In other words, with the conditional probabilities above,
$p(eng|math) = 0.995$
$p(law|math) = 0.005$
Note that even if engineers as a rule don't like math (only a small minority does), the probability is still much higher than 0.30, as long as the minority of engineers who like math is larger than the minority of lawyers who do*:
$p(math|eng) = 0.25$ implies $p(eng|math) = 0.991$
$p(math|eng) = 0.10$ implies $p(eng|math) = 0.977$
$p(math|eng) = 0.05$ implies $p(eng|math) = 0.955$
$p(math|eng) = 0.01$ implies $p(eng|math) = 0.811$
$p(math|eng) = 0.005$ implies $p(eng|math) = 0.682$
$p(math|eng) = 0.002$ implies $p(eng|math) = 0.462$
Yes, that last case is a two-to-one ratio of engineers who like math to lawyers who like math; and it still falls outside the 10–40 percent category.
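These posteriors take only a few lines of Python to verify (a sketch of the Bayes-rule arithmetic above; the function name is mine):

```python
def posterior_engineer(p_math_given_eng, p_math_given_law, p_eng=0.3):
    """Bayes rule: probability Jack is an engineer, given his math hobby."""
    p_law = 1.0 - p_eng
    # Total probability of the math hobby across both professions.
    p_math = p_math_given_eng * p_eng + p_math_given_law * p_law
    return p_math_given_eng * p_eng / p_math

# One in a thousand lawyers likes math; vary the share of engineers who do.
for p_me in [0.5, 0.25, 0.10, 0.05, 0.01, 0.005, 0.002]:
    print(f"p(math|eng) = {p_me}: p(eng|math) = {posterior_engineer(p_me, 0.001):.3f}")
```

Changing the 0.3 prior or the lawyers' rate is a one-argument tweak, which makes it easy to probe how robust the conclusion is.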
I understand the representativeness heuristic, which mistakes $p(math|eng)/p(math|law)$ for $p(eng|math)/p(law|math)$, ignoring the base rates, but there's no reason to give up the inference process if some data in the description is actually informative.
-- -- -- --
* This example shows the elucidative power of working through some numbers. One might be tempted to say "ok, there's some updating, but it will probably still fall under the 10-40pct category" or "you may get large numbers with a disproportionate example like one-half of the engineers and one-in-a-thousand lawyers, but that's just an extreme case." Once we get some numbers down, these two arguments fail miserably.
Numbers are like examples, personas, and prototypes: they force assumptions and definitions out in the open.
Sunday, September 18, 2011
Probability interlude: from discrete events to continuous time
Lunchtime fun: the relationship between Bernoulli and Exponential distributions.
Let's say the probability of Joe getting a coupon for Pepsi in any given time interval $\Delta t$, say a month, is given by $p$. This probability depends on a number of things, such as intensity of couponing activity, quality of targeting, Joe not throwing away all junk mail, etc.
For a given integer number of months, $n$, we can easily compute the probability, $P$, of Joe getting at least one coupon during the period, which we'll call $t$, as
$P(n) = 1 - (1-p)^n$.
Since the period $t$ is $t= n \times \Delta t$, we can write that as
$P(t) = 1 - (1-p)^{\frac{t}{\Delta t}}.$
Or, allowing $t$ to take values other than integer multiples of $\Delta t$ (one of a bunch of assumptions we'll wave through),
$P(t) = 1- \exp\left(t \times \frac{\log (1-p)}{\Delta t}\right).$
Note that $\log (1-p)<0$. Defining $r = - \log (1-p) /\Delta t$, we get
$P(t) = 1 - \exp (- r t)$.
And that is the relationship between the Bernoulli distribution and the Exponential distribution.
We can now build continuous-time analyses of couponing activity. Continuous analysis is much easier to do than discrete analysis. Also, though most simulators are, by computational necessity, discrete, building them on continuous-time models is usually simpler and makes them easier to explain to the managers who use them.
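A quick sanity check in Python (a sketch; the monthly coupon probability is made up) confirms that the two forms agree exactly at integer numbers of months, while the exponential form also answers questions about fractional periods:

```python
import math

def p_at_least_one_discrete(p, n):
    """P(at least one coupon) after n periods, each with hit probability p."""
    return 1.0 - (1.0 - p) ** n

def p_at_least_one_continuous(p, t, dt=1.0):
    """Exponential-form equivalent, with rate r = -log(1-p)/dt."""
    r = -math.log(1.0 - p) / dt
    return 1.0 - math.exp(-r * t)

p = 0.2  # hypothetical probability of a coupon in any given month
for n in [1, 3, 6, 12]:
    print(n, p_at_least_one_discrete(p, n), p_at_least_one_continuous(p, n))
```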
Labels:
mathematics,
Probability
Saturday, September 17, 2011
Small probabilities, big trouble.
After a long – work-related – hiatus, I'm back to blogging with a downer: the troublesome nature of small probability estimation.
The idea for this post came from a speech by Nassim Nicholas Taleb at Penn. Though the video is a bit rambling, it contains several important points. One that is particularly interesting to me is the difficulty of estimating the probability of rare events.
For illustration, let's consider a Normally distributed random variable $P$ and see what happens when small model errors are introduced. In particular, we want to see how the probability density $f_{P}(\cdot)$ predicted by four different models changes as a function of the distance to zero, $x$. The higher the $x$, the more infrequently the event $P = x$ happens.
The densities are computed in the following table (click for larger):

The first column gives $f_{P}(x)$ for $P \sim \mathcal{N}(0,1)$, the base case. The next column is similar except that there's a 0.1% increase in the variance (10 basis points*). The third column is the ratio of these densities. (These are not probabilities, since $P$ is a continuous variable.)
Two observations jump at us:
1. Near the mean, where most events happen, it's very difficult to separate the two cases: the ratio of the densities up to two standard deviations ($x=2$) is very close to 1.
2. Away from the mean, where events are infrequent (but potentially with high impact), the small error of 10 basis points is multiplied: at highly infrequent events ($x>7$) the density is off by over 500 basis points.
So: it's very difficult to tell the models apart with most data, but they make very different predictions for uncommon events. If these events are important when they happen, say a stock market crash, this means trouble.
Moving on, the fourth column uses $P \sim \mathcal{N}(0.001,1)$, the same 10 basis points error, but in the mean rather than the variance. Column five is the ratio of these densities to the base case.
Comparing column five with column three we see that similarly sized errors in mean estimation have less impact than errors in variance estimation. Unfortunately variance is harder to estimate accurately than the mean (it uses the mean estimate as an input, for one), so this only tells us that problems are likely to happen where they are more damaging to model predictive abilities.
Column six shows the effect of a larger variance (100 basis points off the standard, instead of 10); column seven shows the ratio of this density to the base case.
With an error of 1% in the estimate of the variance it's still hard to separate the models within two standard deviations (for a Normal distribution about 95% of all events fall within two standard deviations of the mean), but the error in density estimates at $x=7$ is 62%.
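The table's ratio columns can be sketched in a few lines of Python. One caveat: I'm applying the 10 and 100 basis-point errors to the standard deviation rather than the variance, which is the reading most consistent with the ~500 bps figure at $x=7$; treat the exact numbers as a reconstruction, not the post's original spreadsheet.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Normal(mu, sigma^2) distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Density ratios of slightly mis-specified models to the N(0,1) base case.
for x in [0, 1, 2, 4, 7]:
    r_small = normal_pdf(x, sigma=1.001) / normal_pdf(x)  # 10 bps error
    r_large = normal_pdf(x, sigma=1.01) / normal_pdf(x)   # 100 bps error
    print(f"x = {x}: ratio (10 bps) = {r_small:.4f}, ratio (100 bps) = {r_large:.4f}")
```

Near the mean both ratios are indistinguishable from 1; at $x=7$ the small error is already around 5% and the larger one around 60%.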
Small-probability events are very hard to predict because most of the time the available information is not enough to choose between models with very close parameters, yet those models predict very different things for infrequent events.
Told you it was a downer.
-- -- --
* Some time ago I read a criticism of this nomenclature by someone who couldn't see its purpose. The purpose is good communication design: when there's a lot of 0.01% and 0.1% being spoken in a noisy environment, it's a good idea to say "one basis point" or "ten basis points" instead of "point zero one" or "zero point zero one" or "point zero zero one." It's the same reason we say "Foxtrot Uniform Bravo Alpha Romeo" instead of "eff u bee a arr" in audio communication.
NOTE for probabilists appalled at my use of $P$ in $f_{P}(x)$ instead of more traditional nomenclature $f_{X}(x)$ where the uppercase $X$ would mean the variable and the lowercase $x$ the value: most people get confused when they see something like $p=\Pr(x=X)$.