Wednesday, August 28, 2019

Fun with numbers - August 28, 2019

Air conditioning FTW!


Part of what I do is training people, under the fancy name of "executive education," and to match the fancy name we tend to get nice AV equipment, color handouts, markers that work, fancy chairs, and climate-controlled rooms.

The least remarkable of those is also probably the most important. The air conditioning, not the markers.

For a large-ish event with around 60 people in a comfortably large room, with about 50% of heat losses to the exterior, the temperature would increase by almost 15 °C in a 90-minute session:
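The calculation itself isn't shown here, but a back-of-the-envelope sketch along these lines, with assumed figures that are not from the original post (roughly 100 W of heat per person, a 20 m × 15 m × 3 m room, standard properties of air), lands in that neighborhood:

```python
# Back-of-the-envelope: occupants heating a room with no air conditioning.
# Assumed figures (not from the original post): ~100 W of heat per person,
# a 20 m x 15 m x 3 m room, standard properties of dry air.
people = 60
heat_per_person = 100.0      # W per person, rough sensible + latent figure
duration = 90 * 60           # seconds in a 90-minute session
fraction_retained = 0.5      # ~50% of the heat is lost to the exterior

room_volume = 20 * 15 * 3    # m^3
air_density = 1.2            # kg/m^3
air_specific_heat = 1005.0   # J/(kg K)

energy_retained = people * heat_per_person * duration * fraction_retained  # J
heat_capacity = room_volume * air_density * air_specific_heat              # J/K

print(f"Temperature rise: {energy_retained / heat_capacity:.1f} °C")  # ~15 °C
```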


Okay, there are a lot of approximations in that calculation, but even a 10 °C increase means that either it was too cold at the beginning or too warm at the end. So hurrah for air conditioning.


Seriously, how can people fall for this?


Today The Tesla Promotion Network, Electrek, posted a news item about batteries. Apparently "[a] startup that spun out of Cambridge University claims a battery breakthrough that can charge an electric car in just six minutes."

That phrasing is unclear: how much charge? Even the slowest chargers in my neighborhood (6 kW) will give any electric car battery some charge in 6 minutes (0.6 kWh). Most people will read that to mean they could give a full charge to, say, a Tesla Model 3 in six minutes.

Say the Tesla has an 80 kWh battery; charging it in 6 minutes requires an average power of 800 kW; even at 480 V, that would mean a current of almost 1700 A. That would be a really interesting current to see in a lithium-ion battery, or, as the fire department calls it, "the initiating event of the fire."

For comparison, a typical gas pump can deliver 3 l/s of gasoline, which at 34 MJ/l means it transfers energy at a rate of over 100 MW, or around 125 times the charging rate of that fictional battery.
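For the skeptical, the arithmetic behind both of those numbers fits in a few lines (the inputs are the same ones quoted above):

```python
# Charging power and current for the claim above, plus the gas pump comparison.
battery_capacity_kwh = 80.0
charge_time_h = 6 / 60                      # six minutes
charge_power_kw = battery_capacity_kwh / charge_time_h
current_a = charge_power_kw * 1000 / 480    # at 480 V

pump_flow_l_per_s = 3.0
gasoline_mj_per_l = 34.0
pump_power_mw = pump_flow_l_per_s * gasoline_mj_per_l   # MJ/s is MW

print(f"Average charging power: {charge_power_kw:.0f} kW")   # 800 kW
print(f"Current at 480 V: {current_a:.0f} A")                # ~1667 A
print(f"Gas pump: {pump_power_mw:.0f} MW, about "
      f"{pump_power_mw * 1000 / charge_power_kw:.0f}x the fictional charger")
```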



Not math: some infrastructure




Friday, August 9, 2019

Big Data without Serious Theory has a Big Problem

There's a common procedure to look for relationships in data, one that leads to problems when applied to "big data."

Let's say we want to check whether two variables are related: if changes in the value of one can be used to predict changes in the value of the other. There's a procedure for that:

Take the two variables, compute a metric called a correlation, and check whether that correlation is above a threshold from a table. If the value is above the threshold, say they're related, publish a paper, and have its results mangled by your institution's PR department and misrepresented by the mass media. This leads to random people on social media ascribing the most nefarious of motivations to you, your team, your institution, and the international Communist conspiracy to sap and impurify all of our precious bodily fluids.

The threshold used depends on a number of things, the most visible of which is called the significance level. It's common in many human-related applications (social sciences, medicine, market research) to choose a 95% significance level.

At the 95% level, if we have 2 variables with significant correlation, the probability that that correlation is spurious, in other words that it comes from uncorrelated variables, is 5%.

(More precisely, that 95% means that if two uncorrelated variables were subjected to the computation that yields the correlation, the probability that the result would be above the threshold is 5%. But that's close enough, in the case of simple correlation, to saying that the probability of the correlation being spurious is 5%.)
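That statement is easy to check numerically. A minimal sketch (not part of the original post) that draws many pairs of independent variables and counts how often the correlation comes out "significant" at the 95% level:

```python
# How often do two independent variables show a "significant" correlation?
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_points, n_trials, false_positives = 100, 10_000, 0

for _ in range(n_trials):
    x = rng.normal(size=n_points)
    y = rng.normal(size=n_points)      # independent of x by construction
    _, p_value = pearsonr(x, y)
    false_positives += p_value < 0.05  # "significant" at the 95% level

print(f"Spuriously 'significant' correlations: {false_positives / n_trials:.1%}")
# -> about 5%, as the parenthetical above says
```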

The problem is when we have more variables.

If we have 3 variables, there are 3 possible pairs, so the probability of a spurious correlation is $1-0.95^3 = 0.143$.

If we have 4 variables, there are 6 possible pairs, so the probability of a spurious correlation is $1-0.95^6 = 0.265$.

If we have 5 variables, there are 10 possible pairs, so the probability of a spurious correlation is $1-0.95^{10} = 0.401$.

Let's pause for a moment and note that we just computed a 40% probability of a spurious correlation between 5 independent (non-correlated) variables. Five variables isn't exactly the giant datasets that go by the moniker "big data."

What about better significance? 99%, 99.5%? A little better, for small numbers, but even at 99.5%, all it takes is a set with 15 variables and we're back to a 40% probability of a spurious correlation. And these are not Big Data numbers, not by a long shot.
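The pattern generalizes: with $V$ variables there are $V(V-1)/2$ pairs, so at significance level $s$ the probability of at least one spurious correlation is $1 - s^{V(V-1)/2}$. A short sketch to reproduce the numbers above:

```python
# Probability of at least one spurious correlation among V independent
# variables at significance level s (reproducing the numbers above).
def p_spurious(n_vars: int, significance: float = 0.95) -> float:
    pairs = n_vars * (n_vars - 1) // 2
    return 1 - significance ** pairs

for v in (3, 4, 5):
    print(f"{v} variables at 95%:    {p_spurious(v):.3f}")     # 0.143, 0.265, 0.401
print(f"15 variables at 99.5%: {p_spurious(15, 0.995):.3f}")   # ~0.41
```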


But it's okay, one would think, for there's a procedure in the statistics toolbox that has been developed specifically for avoiding over-confidence. It's called validation with a hold-out sample.

(That approximately 0.00% of published social science and medicine results (though not business or market research, huzzah!) use that procedure is a minor quibble that we shall ignore. It wouldn't make a difference in large sets, anyway.)

The idea is simple: we hold some of the data out (hence the "hold-out" sample, also known as the validation sample) and compute our correlations on the remaining data (called the calibration sample). Say we find that variables $A$ and $B$ are correlated in the calibration sample. Then we take the validation sample and determine whether $A$ and $B$ are correlated there as well.

If they are, and the correlation is similar to that of the calibration sample, we're a little more confident in the result; after all, if each of these correlations is 95% significant, then the probability of both together being spurious is $0.05^2 = 0.0025$.

(Note that that "similar" has entire books written about it, but for now let's just say it has to be in the same direction: $A$ and $B$ have to be positively correlated in both samples or negatively correlated in both samples, not positively in one and negatively in the other.)
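As a concrete, if minimal, sketch of that procedure, assuming the data is a NumPy array with one row per observation and the two variables of interest in the first two columns:

```python
# Hold-out validation for a single pair of variables A and B.
# Assumes `data` is a NumPy array with one row per observation and
# columns 0 and 1 holding A and B; all names here are illustrative.
import numpy as np
from scipy.stats import pearsonr

def validated_correlation(data, holdout_fraction=0.3, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_holdout = int(len(data) * holdout_fraction)
    validation, calibration = data[idx[:n_holdout]], data[idx[n_holdout:]]

    r_cal, p_cal = pearsonr(calibration[:, 0], calibration[:, 1])
    r_val, p_val = pearsonr(validation[:, 0], validation[:, 1])

    # "Similar" here is the simple version from above: significant in both
    # samples and correlated in the same direction.
    confirmed = p_cal < alpha and p_val < alpha and np.sign(r_cal) == np.sign(r_val)
    return r_cal, r_val, confirmed

# With independent columns, confirmation should be rare.
fake = np.random.default_rng(1).normal(size=(600, 2))
print(validated_correlation(fake))
```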

Alas, as the number of variables increases, so does the number of possible spurious correlations. In fact it grows so fast that even very strict significance levels can lead to having, for example, ten spurious correlations.

And when you test each of those spurious correlations with a hold-out set, the probability of at least one appearing significant is not negligible. For example (explore the table to get an idea of how bad things get):


These numbers should scare us, as many of the results presented as the great advantages of big data for understanding the many dimensions of the human experience are drawn from sets with thousands of variables. And, as said above, almost none of them are ever validated with a hold-out sample.

They do serve as good foundations for stories, though, since human brains are great machines for weaving narratives around correlations, spurious or not.

- - - - -

As an addendum, here are the expected number of spurious correlations and the probability that at least one of those correlations passes a validation test for some numbers of variables, as a function of the significance level. Just to drive the point home.
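That table isn't reproduced here, but it's easy to regenerate with the same back-of-the-envelope approximations used above (pairs treated as independent, a spurious pair "replicating" in the hold-out with probability $(1-s)^2$, as in the $0.05^2$ example):

```python
# Expected number of spurious correlations, and a rough probability that at
# least one of them also passes a hold-out validation test.
# Approximations: pairs treated as independent; a pair "replicates" with
# probability (1 - significance)^2, as in the 0.05^2 = 0.0025 example above.
def spurious_summary(n_vars, significance):
    alpha = 1 - significance
    pairs = n_vars * (n_vars - 1) // 2
    expected_spurious = pairs * alpha
    p_one_replicates = 1 - (1 - alpha ** 2) ** pairs
    return expected_spurious, p_one_replicates

for n_vars in (10, 100, 1000):
    for s in (0.95, 0.99, 0.995):
        e, p = spurious_summary(n_vars, s)
        print(f"{n_vars:4d} variables @ {s:.3f}: "
              f"expected spurious = {e:8.1f}, P(one survives validation) = {p:.3f}")
```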



The point being that Big Data without Serious Theory has a Big Problem.


Sunday, August 4, 2019

A = B, B = C, but A ≠ C. Depending on N, of course!

Statistics are weird. But it all makes sense.

Let's say we have 3 variables and 600 measurements, in other words 600 data points with three dimensions or a 600-row matrix with three columns; or this chart:



These are simulated data, drawn from three Normal distributions with variance 1 and means $\mu_A = 0, \mu_B = 0.25,$ and $\mu_C = 0.5$. The distributions are:


There's considerable overlap between these distributions, so any single point is insufficient to determine any relationships between $A$, $B$, and $C$.
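The chart isn't reproduced here, but the data behind it is easy to simulate; a sketch:

```python
# Simulate the three variables: Normal, variance 1, means 0, 0.25, and 0.5.
import numpy as np

rng = np.random.default_rng(42)
n = 600
means = {"A": 0.0, "B": 0.25, "C": 0.5}
data = np.column_stack([rng.normal(loc=mu, scale=1.0, size=n)
                        for mu in means.values()])   # 600 rows, 3 columns

print(data.shape)          # (600, 3)
print(data.mean(axis=0))   # close to 0, 0.25, 0.5 -- but not exactly
```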

Let's say we want to have "99.9 percent confidence" in our assertions. What does that "99.9 percent confidence" mean? The statistical meaning is that there's at most 0.1 percent chance that if the data were generated by "the null hypothesis" (which in our case is that any two distributions are the same) we'd see a test statistic above the critical value.

Test statistic?! Critical value?!

Yes. A test statistic is something we'll compute from the data (a 'statistic') in order to test it. In our case it'll be the difference of the empirical, or sample, means divided by the standard error of those empirical means.

If that number, a summary of the difference between the samples, scaled to deal with the randomness, is greater than some value --- the critical value --- we trust that it's too big to be the result of the random variation. At least we trust it to be too big 99.9 percent of the time.

Okay, that statistics refresher aside, what can we tell from a sample of 25 points? Here are the distributions of those means:



Note how the distributions of the means are much narrower than the distributions of the data. (The means don't change.) That's the effect of averaging over 25 data points. The variance of the mean is $1/25$ and the standard deviation of the mean, called the standard error to avoid confusion with the standard deviation of the data, is $1/\sqrt{25}$.

Someone with a vague memory of a long-forgotten statistics class may recall seeing a $\sigma/\sqrt{n-1}$ in this context and try to argue that the 25 should be a 24. And they'd be right if we were estimating the population standard deviation from the sample data; but we're not. Our data is simulated, therefore we know the true standard deviation (and hence the standard error), and we're using that knowledge to simplify things like this. Another thing it simplifies is the next item: the critical value.

Knowing the distributions lets us bypass one of the most overused jokes in statistics, the Student T ("how do you make it? Boil water, steep the student in it for 3 minutes"; student tea, get it?). More seriously, when the standard error is estimated from the sample data, the critical value is derived from a Student's T distribution; in our case we'll pick one derived from the Normal distribution, which has the advantage of not depending on the size of the sample (or, as it's called in statistics, the number of degrees of freedom in the estimation).

Now for the critical value. We're going to choose a single-sided test, so when we say $A \neq B$, we're really testing for $A < B$.

So how do we test whether some estimate of $(\mu_B - \mu_A)/(\sigma/\sqrt{n})$ is statistically greater than zero? We test the difference in empirical means, $M_B - M_A$, instead of $\mu_B - \mu_A$; since $M_A$ and $M_B$ are averages of Normal variables, their difference is a Normal variable; and, under the null hypothesis, dividing that difference by the standard error makes it a random variable that is normally distributed with mean zero and variance one, a standard Normal variable, usually denoted by $Z$, which is why this is sometimes called a z-test.

Observation: we're dividing the difference of the means by $\sigma/\sqrt{n}$; with $\sigma=1$, we're multiplying that difference by $\sqrt{n}$.

All we need to do now is determine whether $(M_B - M_A)\, \sqrt{25}$ is above or below the point $z^{*}$ in a standard Normal distribution where $F_{Z}(z^{*}) = .999$. That point is the critical value.


(If that scaled difference falls into the blue shaded area, then we can't reject the possibility that it was generated by randomness, rather than by an actual difference, at the probability that we selected; in the diagram it's 0.99, but for our purposes in this post it will be 0.999.)

Thanks to the miracle of computers we no longer need to look up critical values in books of statistical tables, like the peasants of old. Using, for example, the inverse normal distribution function of Apple Numbers, we learn that $F_Z(z^*) = 0.999$ implies $z^{*} = 3.09$.
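The same lookup, for anyone without Apple Numbers handy (using SciPy's inverse of the standard Normal CDF):

```python
# Critical value z* such that F_Z(z*) = 0.999 for a standard Normal.
from scipy.stats import norm

z_star = norm.ppf(0.999)
print(f"z* = {z_star:.2f}")   # 3.09
```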

So, with a sample of 25 data points, what can we conclude?

Between $A$ and $B$ our test statistic is $0.25 \times 5 = 1.25$, well below 3.09. So, $A = B$.
Between $B$ and $C$ our test statistic is $0.25 \times 5 = 1.25$, well below 3.09. So, $B = C$.
Between $A$ and $C$ our test statistic is $0.5 \times 5 = 2.5$, still below 3.09. So, $A = C$.

That was a lot of work to prove what we could see from the picture with the distributions for the average of 25 points: too much overlap and no way to tell the three variables apart from a sample of 25 points.

Ah, but we're leveling up! 100 points. Here are the distributions for the averages of 100-point samples:


Between $A$ and $B$ our test statistic is $0.25 \times 10 = 2.5$, well below 3.09. So, $A = B$.
Between $B$ and $C$ our test statistic is $0.25 \times 10 = 2.5$, well below 3.09. So, $B = C$.
Between $A$ and $C$ our test statistic is $0.5 \times 10 = 5$,  well above 3.09. So, $A \neq C$.

Now this is the weird case that gets people confused: $A = B$, $B = C$, but $A \neq C$! Equality is no longer transitive. And it does depend on $N$.

But wait, there's more. 300 more data points, and we get the 400 point case, with the following distributions:


Between $A$ and $B$ our test statistic is $0.25 \times 20 = 5$, well above 3.09. So, $A \neq B$.
Between $B$ and $C$ our test statistic is $0.25 \times 20 = 5$, well above 3.09. So, $B \neq C$.
Between $A$ and $C$ our test statistic is $0.5 \times 20 = 10$, well above 3.09. So, $A \neq C$.

Equality is again transitive. So there's only a small range of $N$ for which statistics are weird. (Not hard to figure out that range: consider it an entertaining puzzle.) This gets more complicated if those variables have different variances.
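For the record, the nine comparisons above boil down to a few lines, using the post's scaled statistic $(M_j - M_i)\sqrt{n}$ with the known means and unit variances plugged in rather than estimated from samples:

```python
# The nine comparisons above: (difference in means) * sqrt(n) vs. z* = 3.09.
from math import sqrt

means = {"A": 0.0, "B": 0.25, "C": 0.5}
z_star = 3.09
pairs = [("A", "B"), ("B", "C"), ("A", "C")]

for n in (25, 100, 400):
    for lo, hi in pairs:
        stat = (means[hi] - means[lo]) * sqrt(n)
        verdict = f"{lo} != {hi}" if stat > z_star else f"{lo} = {hi}"
        print(f"n = {n:3d}: {lo} vs {hi}: statistic = {stat:5.2f} -> {verdict}")
```

At $n = 100$ the two adjacent comparisons fall below the critical value while the outer one clears it, which is exactly the non-transitive case above.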

One has to be careful when drawing inferences about equality (real equality) from statistically non-significant differences, especially with small data sets and test statistics close to the critical values.