
Friday, November 15, 2019

Fun with numbers for November 15, 2019

How many test rigs for a successful product at scale?


From the last Fun with Numbers:


This is a general comment on how new technologies are presented in the media: usually something that is either a laboratory test rig or, at best, a proof-of-concept technology demonstration gets hailed as a revolutionary product, ready to take the world by storm and be deployed at scale.

Consider how many "a lot of" is, as a function of the success probabilities at each stage:
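The chart's arithmetic is a one-liner; here's a rough R sketch with made-up stage probabilities (illustrative only, not the ones in the chart):

p_stage <- c(rig_to_demo = 0.5,        # test rig -> proof-of-concept demo
             demo_to_product = 0.25,   # demo -> sellable product
             product_to_scale = 0.2)   # product -> deployed at scale

p_total <- prod(p_stage)               # chance one rig survives every stage: 0.025
1 / p_total                            # expected rigs per product at scale: 40, with these made-up numbers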


Yep, notwithstanding all good intentions in the world, there's a lot of work to be done behind the scenes before a test rig becomes a product at scale, and many of the candidates are eliminated along the way.



Recreational math: statistics of the maximum draw of N random variables


At the end of a day of mathematical coding, and since RStudio was already open (it almost always is), I decided to check whether running 1000 versus 10,000 iterations of simulated maxima (drawing N samples from a standard Normal distribution and computing the maximum, repeated either 1000 or 10,000 times) makes a difference. (Yes, an elaboration on the third part of this blog post.)

Turns out, not a lot of difference:


Workflow: BBEdit (IMNSHO the best editor for coding) --> RStudio --> Numbers (for pretty tables) --> Keynote (for layout); yes, I'm sure there's an R package that does layouts, but this workflow is WYSIWYG.

The R code is basically two nested for-loops, the built-in functions max and rnorm doing all the heavy lifting.
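Something along these lines (a sketch, not necessarily the exact script):

# Draw N standard Normal samples, take the maximum, repeat, and compare
# summaries across numbers of iterations.
set.seed(20191115)                 # any seed will do
N <- 10000                         # draws per simulated maximum
for (iterations in c(1000, 10000)) {
  maxima <- numeric(iterations)
  for (i in 1:iterations) {
    maxima[i] <- max(rnorm(N))
  }
  cat(iterations, "iterations: mean =", round(mean(maxima), 4),
      " sd =", round(sd(maxima), 4), "\n")
}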

Added later: since I already had the program parameterized, I decided to run a 100,000 iteration simulation to see what happens. Turns out, almost nothing worth noting:


Adding a couple of extra lines of code, we can iterate over the number of iterations, so for now here's a summary of the preliminary results (to be continued later, possibly):


And a couple of even longer simulations (all for the maximum of 10,000 draws):


Just for fun, here's the theoretical probability that the maximum of $N$ draws (powers of ten in this example) is greater than some given $x$:
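For $N$ independent standard Normal draws, that probability is simply
\[
\Pr(\max_N > x) = 1 - \Phi(x)^N ,
\]
where $\Phi(\cdot)$ is the standard Normal CDF.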




More fun with Solar Roadways


Via EEVblog on twitter, the gift that keeps on giving:


This Solar Roadways installation is in Sandpoint, ID (48°N). Solar Roadways claims its panels can be used to clear the roads by melting the snow… so let's do a little recreational numerical thermodynamics, like one does.

Average solar radiation level for Idaho in November: 3.48 kWh per m$^2$ per day or 145 W/m$^2$ average power. (This is solar radiation, not electrical output. But we'll assume that Solar Roadways has perfectly efficient solar panels, for now.)

Density of fallen snow (lowest estimate, much lower than fresh powder): 50 kg/m$^3$ via the University of British Columbia.

Energy needed to melt 1 cm of snowfall (per m$^2$): 50 [kg/m^3] $\times$ 0.01 [m/cm] $\times$ 334 [kJ/kg] (enthalpy of fusion for water) = 167 kJ/m$^2$ ignoring the energy necessary to raise the temperature, as it's usually much lower than the enthalpy of fusion (at 1 atmosphere and 0°C, the enthalpy of fusion of water is equal to the energy needed to raise the temperature of the resulting liquid water to approximately 80°C).

So, with perfect solar panels and perfect heating elements, in fact with no energy loss anywhere whatsoever, Solar Roadways could deal with a snowfall of 3.1 cm per hour (= 145 $\times$ 3600 / 167,000) as long as the panel and surroundings (and snow) were at 0°C.
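For anyone who wants to play with the assumptions, the whole calculation fits in a few lines of R:

irradiance   <- 145        # W/m^2, average solar radiation for Idaho in November
snow_density <- 50         # kg/m^3, low-end estimate for fallen snow
fusion       <- 334e3      # J/kg, enthalpy of fusion of water

energy_per_cm <- snow_density * 0.01 * fusion       # J per m^2 per cm of snowfall: 167 kJ
irradiance * 3600 / energy_per_cm                   # ~3.1 cm of snowfall per hour, best case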

Just multiply that 3.1 cm/hr by the efficiency coefficient to get more realistic estimates. Remember that the snow, the panels, and the surroundings have to be at 0°C for these numbers to work. Colder doesn't just make it harder; small changes can make it impossible (because the energy doesn't go into the snow, it goes into the surroundings).



Another week, another Rotten Tomatoes vignette


This time for the movie Midway (the 2019 movie, not the 1976 classic Midway):


Critics and audience are 411,408,053,038,500,000 (411 quadrillion) times more likely to use opposite criteria than same criteria.

Recap of model: each individual has a probability $\theta_i$ of liking the movie/show; we simplify by having only two possible cases, critics and audience using the same $\theta_0$ or critics using a $\theta_1$ and audience using a $\theta_A = 1-\theta_1$. We estimate both cases using the four numbers above (percentages and number of critics and audience members), then compute a likelihood ratio of the probability of those ratings under $\theta_0$ and $\theta_1$. That's where the 411 quadrillion times comes from: the probability of a model using $\theta_1$ generating those four numbers is 411 quadrillion times the probability of a model using $\theta_0$ generating those four numbers. (Numerical note: for accuracy, the computations are made in log-space.)
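For concreteness, here's a minimal R sketch of that computation; the counts are placeholders (the real ones are in the screenshot), and the estimation step here is plain maximum likelihood, one way of reading "estimate both cases using the four numbers":

n_c <- 100;  p_c <- 0.40   # critics: count and fraction positive (placeholders)
n_a <- 5000; p_a <- 0.95   # audience: count and fraction positive (placeholders)
k_c <- round(n_c * p_c); k_a <- round(n_a * p_a)     # positive ratings

# Log-likelihood of the four numbers for given critic/audience probabilities;
# binomial coefficients are omitted because they cancel in the ratio.
loglik <- function(theta_c, theta_a) {
  k_c * log(theta_c) + (n_c - k_c) * log(1 - theta_c) +
  k_a * log(theta_a) + (n_a - k_a) * log(1 - theta_a)
}

theta_0 <- (k_c + k_a) / (n_c + n_a)           # same criteria: one common theta
theta_1 <- (k_c + (n_a - k_a)) / (n_c + n_a)   # opposite criteria: theta for critics, 1 - theta for audience

log_LR <- loglik(theta_1, 1 - theta_1) - loglik(theta_0, theta_0)
log_LR        # kept in log-space for accuracy
exp(log_LR)   # the "X times more likely" number (may overflow for very large samples)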



Google gets fined and YouTubers get new rules


Via EEVBlog's EEVblab #67, we learn that due to non-compliance with COPPA, YouTube got fined 170 million dollars and had to change some rules for content (having to do with children-targeted videos):


Backgrounder from The Verge here; or directly from the FTC: "Google and YouTube Will Pay Record $170 Million for Alleged Violations of Children’s Privacy Law." (Yes, technically it's Alphabet now, but like Boaty McBoatface, the name everyone knows is Google. Even the FTC uses it.)

According to Statista: "In the most recently reported fiscal year, Google's revenue amounted to 136.22 billion US dollars. Google's revenue is largely made up by advertising revenue, which amounted to 116 billion US dollars in 2018."

170 MM / 136,220 MM =  0.125 %

2018 had 31,536,000 seconds, so that 170 MM corresponds to 10 hours, 57 minutes of revenue for Google. 

Here's a handy visualization:






Engineering, the key to success in sporting activities


Bowling 2.0 (some might call it cheating, I call it winning via superior technology) via Mark Rober:


I'd like a tool wall like his but it doesn't go with minimalism.



No numbers: recommendation success but product design fail.



Nerdy, pro-engineering products are a good choice for Amazon to recommend to me, but unfortunately many of them suffer from a visual form of "The Igon Value Problem."

Friday, November 1, 2019

Fun with numbers for November 1, 2019

Fast-charging batteries


From the web site that hangs off of the brand equity of the very prestigious journal Science: "New charging technique could power an electric car battery in 10 minutes."

Congratulations to the team improving battery technology. But:

I. According to the news, this is a technology demonstration, though that might be inaccurate (the original report makes it a testing rig, which is one step farther back from a final product). There's a lot of work to do (and many avenues for failure) before this becomes a deployable product, much less at scale.


II. Charging a 75 kWh battery (AFAIK, the smallest battery in a Tesla car) in 10 minutes requires a charging power of 450 kW. Even using 480 V as the charging voltage, that's still a 937.5 A current; those cables will need some serious heft, and any impurities in the contacts will be a serious fire hazard.

III. A typical gas pump moves about 3 l of gasoline per second. Gasoline has around 34 MJ/l energy density, so that pump has a power rating of 102 MW, 227 times the power throughput of the new charging technology. Even if the distance-per-energy efficiency of internal combustion engines is lower than that of electric motors, that's a big difference. Also, you can buy Reese's peanut butter cups at gas stations.
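The arithmetic behind II and III, in a few lines of R:

charge_kW <- 75 / (10 / 60)        # 75 kWh in 10 minutes = 450 kW
charge_kW * 1000 / 480             # 937.5 A at 480 V
pump_MW   <- 3 * 34                # 3 l/s times 34 MJ/l = 102 MW
pump_MW * 1000 / charge_kW         # ~227 times the charger's power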



More fun with Rotten Tomatoes



Watchmen (HBO series) shows that sometimes when data changes, the conclusions change.


Despite the caterwauling of many in the comic-book nerd community (not that I would know, as I don't belong… okay, I occasionally might take a look, but I'm not a comic book nerd… not since the early 70s…), data show that it's much more likely that the critics and the audience are using similar criteria for their evaluation of Joker than opposite criteria.

How much more likely? Glad you asked:

210,565,169,600,721,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 times more likely.

Ah, the power of parameterized models: you set them once, you can nerd out on them till the end of time. (I haven't watched either the show or the movie. Maybe when they get to Netflix or Amazon Prime.)


Added Nov 3: Haven't watched it yet, but Rotten Tomatoes data shows that critics are 1,361,188 times more likely to be using the same criteria as the audience than opposite criteria to evaluate "For All Mankind."



Some progress in nuclear fusion?



Some simple physics:
1 kg mass = 9E16 J of energy ($E = mc^2$)
Coal has 30 MJ/kg specific energy
10E6 kg of coal contain 3E14 J (assuming Bloomberg meant using combustion)
So fusion is supposed to have 1/300 efficiency relative to pure mass-energy conversion?

Kudos. Now, get to it!



Shredded Sports Science eats an apple


Shredded Sports Science has a video making fun of people who know even less about fitness and nutrition than the "experts" in those "sciences," where he takes a bite of an apple and says "one rep," another bite, "two reps," the joke being on Chris Heria of Thenx.


Huh, the quant says, I wonder how the numbers will go…

Let's say a warm-up set of 100 kg squats and the total vertical path is 1 m. How much energy does one rep use, just for the mechanical work?

Naïve physics neophyte: huh, zero, the rep starts and ends at the same point.

No. The mechanics of the rep are different on the way down and on the way up: assuming that the weight moves at constant speed most of the time, the downward movement requires the body to provide work to counteract the acceleration of gravity, so we can approximate the total work by 2 $\times$ 100 $\times$ 9.8 $\times$ 1 = 1960 J.

Note that this is just the mechanical part. Muscles have less than 100% efficiency and that efficiency changes as fatigue increases, hence the heat (heat, and to a smaller degree, changes to the mix of waste products of muscle contraction, represent losses in efficiency).

The other side of the coin is the chemical energy in that apple, which is measured by the magic ['delusion' or 'deception' also work here] of mistaking the simple process of combustion for the very complex processes of digestion and respiration. But let's pretend…

Apples are basically 1/3 sugar and 2/3 water, with some esters and ester aldehydes for taste and aroma, so for a small bite, say 15 g of apple, we get 5 g of sugar; that's 20 kCal or ~84,000 J.

Shredded Sports Science's little joke would point to a combined digestion, respiration, and muscle contraction efficiency of 2.33%.
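In R, for the skeptical:

work_per_rep <- 2 * 100 * 9.8 * 1   # J of mechanical work per rep, up plus down
bite_energy  <- 5 * 4 * 4184        # J: 5 g of sugar at 4 kCal/g, 4184 J/kCal
work_per_rep / bite_energy          # ~0.023, i.e. ~2.3%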

Evolution would have selected this biochemical parameterization right out of the gene pool.



Fun with energy



Talk about counting calories in a way that matters. (From the BP energy stats 2019; and yes, their tables are in MtOE, not calories, but unit changes are trivial, except maybe for gymbros.)



Bay Area versus Europe


With the return of Silicon Valley on HBO, there's a lot of hating on the Bay Area going around, so here's a thought in numbers…



Wednesday, October 9, 2019

Fun with numbers for October 9, 2019

Rotten Tomatoes and Batwoman


The day after the pilot, a familiar pattern emerges:


Using the same math as these two previous posts, it's 198,134,550 (almost 200 million) times more likely that the critics are using the opposite criteria to those of the audience than that they're both using the same criteria.

A couple of days later, more data is available:


This data makes the case even more stark: it's now 2,924,953,580,108 (almost three trillion!) times more likely that the critics are using the opposite criteria to those of the audience than that they're both using the same criteria.

And today (with a tip of the Homburg to local vlogging nerd Nerdrotics), it's even worse:


With this data (ah, the joys of reusable models, even if "model" is a bit of a stretch for something so simple, relatively speaking) we get that it's now 11,028,450,795,963,200 (eleven quadrillion!) times more likely that the critics are using the opposite criteria to those of the audience than that they're both using the same criteria.

For what it's worth, I liked the pilot, despite my nit-picking it on twitter:




Aerobics: The Paper that Started The Craze.



Here are three explanations that all match the data:

I. The official story: running develops cardiovascular endurance. This is the story that led to the aerobics explosion, to jogging, and to all sorts of "cardio" nonsense. Note that this story is isomorphic to "playing basketball makes people taller."

II. The selection effect story: people with good cardiovascular systems can run faster than those without. This is the "tall people are better at basketball than short people" version of the story.

III. The athletes are better at both story: people who have athletic builds (strong muscles, large thoracic capacity, low body fat) are better at both running and cardiovascular fitness because of that athleticism.

Most likely the result is a combination of these three effects, or in expensive words, the three variables (cardiovascular fitness, muscular development, and running ability) are jointly endogenous. Note also the big excerpt from Body By Science at the end of this post.

Let's take a closer look at that table:


Not that I'm questioning Cooper's data (okay, I am), but isn't it strange that there are no cases where, say, a runner with a distance of 1.27 mi had a VO2max of 33.6? That the discrete categories on one side map into non-overlapping categories on the other? No boundary errors? That's an unlikely scenario.

Also, no data about the distribution of the 115 research subjects over the five categories. That would be interesting to know, since the bins for the distance categories are clearly selected at fixed distance intervals, not as representatives of the distribution of subjects. (It would be extremely suspicious if the same number of subjects happened to fall into each category. But if they don't, that's informative and important to the interpretation of the data.)

I know this was the 60s; on the other hand, the 60s were the first real golden age of large-scale data processing (with those "computer" things) and a market research explosion.

One of the factors that confounds these "cardio" results is that training for a specific test makes you better at that test. Another is that strengthening the muscles that are used in a specific motion makes that motion less demanding and therefore puts less strain on the cardiovascular system.

This excerpt from Body By Science illustrates both of these confounds:




Grant Sanderson (3 Blue 1 Brown) on prime number spirals





A late addition: Elon Musk promises PowerPacks for CA



Which brings up two thoughts:

a. Is "just waiting on permits" the new "funding secured"?
 
b. Each powerpack has 210 kWh capacity, so one charges ~3 Teslas, assuming they're low on charge but not zero. (Typical tank truck ~ 11,000 gal tops-up 733 x 15 gal gas tanks. Just FYI)

Wednesday, October 2, 2019

When nutrition and fitness studies attempt science with naive statistics

A little statistics knowledge is a dangerous thing.

(Inspired by an argument on Twitter about a paper on intermittent fasting, which exposes the problem of blind trust in "studies" when those studies are held to lower statistical standards than market research has used since at least the 70s.)

Given an hypothesis, say "people using intermittent fasting lose weight faster than controls even when calories are equated," any market researcher worth their bonus and company Maserati would design a within-subjects experiment. (For what it's worth, here's a doctor suggesting within-subject experiments on muscle development.)

Alas, market researchers aren't doing fitness and nutrition studies, mostly because market researchers like money and marketing is where the market research money is (also, politics, which is basically marketing).

So, these fitness and nutrition studies tend to be between-subjects: take a bunch of people, assign them to control and treatment groups, track some variables, do some first-year undergraduate statistics, publish paper, get into fights on Twitter.

What's wrong with that?

People's responses to treatments aren't all the same, so the variance of those responses, alone, can make effects that exist at the individual level disappear when aggregated by naive statistics.

Huh?

If everyone loses weight faster on intermittent fasting, but some people just lose it a little bit faster and some people lose it a lot faster, that difference in response (to fasting) will end up making the statistics look like there's no effect. And what's worse, the bigger the differences between different people in the treatment group, the more likely the result is to be non-significant.

Warning: minor math ahead.

Let's say there are two conditions, control and treatment, $C$ and $T$. For simplicity there are two segments of the population: those who have a strong response $S$ and those who have a weak response $W$ to the treatment. Let the fraction of $W$ be represented by $w \in [0,1]$.

Our effect is measured by a random variable $x$, which is a function of the type and the condition. We start with the simplest case, no effect for anyone in the control condition:

$x_i(S,C) = x_i(W,C) = 0$.

By doing this our statistical test becomes a simple t-test of the treatment condition and we can safely ignore the control subsample.

For the treatment conditions, we'll consider that the $W$ part of the population has a baseline effect normalized to 1,

$x_i(W,T) = 1$.

Yes, no randomness. We're building the most favorable case to detect the effect and will show that population heterogeneity alone can hide that effect.

We'll consider that the $S$ part of the population has an effect size that is a multiple of the baseline, $M$,

$x_i(S,T) = M$.

Note that with any number of test subjects, if the populations were tested separately the effect would be significant, as there's no error. We could add some random factors, but that would only complicate the point, which is that even in the most favorable case (no error, both populations show a positive effect), the heterogeneity in the population hides the effect.

(If you slept through your probability course in college, skip to the picture.)

If our experiment has $N$ subjects in the treatment condition, the expected effect size is

$\bar x = w + (1-w) M$

with a standard error (the standard deviation of the sample mean) of

$\sigma_{\bar x} =  (M-1) \,\sqrt{\frac{w(1-w)}{N}} $.

(Note that because we actually know the mean, this being a probabilistic model rather than a statistical estimation, we see $N$ where most people would expect $N-1$.)

So, the test statistic is

$t = \bar x/\sigma_{\bar x} = \frac{w + (1-w) M}{(M-1) \,\sqrt{\frac{w(1-w)}{N}}}$.

It may look complicated, but it's basically a three parameter analytical function, so we can easily see what happens to significance with different $w,M,N$, which is our objective.

Because we're using a probabilistic model where all quantities are known, the test statistic is distributed Normal(0,1), so the critical value for, say, 0.95 confidence, single-sided, is given by $\Phi^{-1}(0.95) = 1.645$.

To start simply, let's fix $N= 20$ (say a convenience sample of undergraduates, assuming a class size of 40 and half of them in the control group). Now we can plot $t$ as a function of $M$ and $w$:
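A small R sketch of that computation (the grid of $w$ and $M$ values is just for illustration):

t_stat <- function(w, M, N = 20) {
  x_bar <- w + (1 - w) * M                  # expected effect size
  se    <- (M - 1) * sqrt(w * (1 - w) / N)  # standard error
  x_bar / se
}

w <- seq(0.80, 0.99, by = 0.01)             # fraction of weak responders
M <- c(5, 10, 20, 50, 100)                  # strong responders' effect multiple
t_grid <- outer(w, M, t_stat)
dimnames(t_grid) <- list(paste0("w=", w), paste0("M=", M))
round(t_grid, 2)                            # entries below 1.645 read as "no effect"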


(The seemingly-high magnitudes of $M$ and $w$ are an artifact of not having any randomness in the model. We wanted this to be simple, so that's the trade-off.)

Recall that in our model both sub-populations respond to the treatment and there's no randomness in that response. And yet, for a small enough fraction of the $S$ population and a large enough multiplier effect $M$, our super-simple, extremely-favorable model shows non-significant effects using a single-sided test (the most favorable test, and we're using the lowest acceptable significance for most journals, 95%, also most favorable choice).

Let's be clear what that "non-significant effects" means: it means that a naive statistician would look at the results and say that the treatment shows no difference from the control, in the words of our example, that people using intermittent fasting don't lose weight faster than the controls.

This, even though everyone in our model loses weight faster when intermittent fasting.

Worse, the results are less and less significant the stronger the effect on the $S$ population relative to the $W$ population. In other words, the faster the weight loss of the highly-responsive subpopulation relative to the less-responsive subpopulation, when both are losing weight with intermittent fasting, the more the naive statistics shows intermittent fasting to be ineffectual at producing weight loss.

Market researchers have known about this problem for a very long time. Nutrition and fitness practices (can't bring myself to call them sciences) are now repeating errors from the 50s-60s.

That's not groovy!

Monday, September 9, 2019

Fun with numbers for Sep 9, 2019


So, this might become a thing, blogging augmented tweets.

Rotten Tomatoes is at it again



Dave Chappelle apparently has a Netflix special that has critics and audience at loggerheads. Being a little more quantitative, we can say that the critics are 862,712 times more likely to be using criteria opposite to those of the audience than the same criteria. The logic is in this post.

There are two differences between that blog post and this calculation that are worth mentioning:

1. Computing $c(12352,124)$ without loading special numerical packages that can handle large numbers is beyond the capabilities of most mathematical software, so we use a trick: as we're only interested in a likelihood ratio, and those combinations appear in the numerator and denominator, we know that in the end they'll cancel out, so we ignore them altogether.

2. Small probabilities raised to a large exponent quickly get to the precision limits of the floating point representations; to deal with that we make our calculations in log-space. So instead of computing $0.01^{124}$, which would be well below the 1E-99 ($10^{-99}$) limit for most numerical software, and be treated as zero, we compute $124 \times \log(0.01)$, do all the operations in this log space and at the end we exponentiate the result.
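The trick in miniature, in R:

log_p <- 124 * log(0.01)   # natural log of 0.01^124, about -571: perfectly manageable
log_p
# do all the sums and differences in log-space; exponentiate (if at all) only at the end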

Used Apple Numbers (in lieu of RStudio) for this one, was surprised to learn that LOG is $\log_{10}(\cdot)$ despite Numbers also having a LOG10 function. Oh, well, no problem as long as one is careful:




Tesla bull tries to praise superchargers, arithmetic and hilarity ensue


Sooooo I tried the 250kW charger for the first time and I think I'm in loooooove 🥰 -- Went from 19% to 60% I kid you not in just 5 mins. Thanks @elonmusk @Tesla 🙏

Your battery capacity is 51 kWh?!
5 minutes = 300 s
250 kW * 300 s = 75 MJ
75 MJ = (60%-19%) * Capacity, or
Capacity = 183 MJ = 51 kWh
I thought TSLA batteries started at 75 kWh?! 🤔 Possible explanations:

1. Tesla bull is exaggerating; if it took 10 minutes, or the starting point was close to 40%, that would point to a 100 kWh battery.

2. Tesla software is lying to the car owner, making the numbers look rosier than they actually are.

3. Battery has lost capacity, which happens to batteries because of the underlying principles (two main chemical reactions, one exoelectric, one endoelectric; but secondary, parasitical reactions exist that lower the battery capacity over time). Even for a 75 kWh battery that would be a very big loss (1/3), unless his charging cycles are deep and irregular (that kills batteries faster).



Counting calories is like Enron accounting


Just thinking logically, here, if you were bailing water out of a boat, would you keep adding water in?

Okay, so if you're trying to lose weight by using body fat for energy, why would you eat carbs, whose sole nutrition value is as energy? Why eat when not hungry? *

(Most of the arguments I have about calories are with people who for some reason want others to eat carbs. Counting calories biases you towards choosing carbs over fat, since fat is more energy-dense.)

But more to the point, the whole foundation of calorie counting is Enron-like accounting, where some things are counted (more or less), some things are estimated, and many other things are sort-of, kind-of assumed away in some "basic metabolic energy needs" or other ways of saying "let's assume everyone has the same basic efficiency in chemical energy extraction and mechanical power production."


The lack of accounting for energy lost as heat, which anyone out of shape who's ever jogged with an athletic friend can tell you varies a lot with the person, is the most obvious Enron-like accounting.  Higher body effort for the same mechanical output is reflected in heat loss, and differences in that heat loss can be (as calculated in that figure) in the 100-200 kCal/hour range.

That's the same difference as the mechanical energy difference between jogging and walking.  One hour at the low end of that difference (remember, this is just heat, the mechanical energy is the same for both people) every other day is equivalent to 2 kg of fat extra per year if we believe in the basic model of calories-in calories-out.

Now,  how much difference can there be in unmeasured chemical energy output? Depends on the person and the diet, but note that on a 2500 kCal/day diet a systemic difference of 2%, that is 50 kCal/day, is equivalent to 2 kg of fat extra per year if we believe in the basic model of calories-in calories-out.

Can different people with the same general diet show a 2% difference? Yep. For example, a paper called "Energy content of stools in normal healthy controls and patients with cystic fibrosis," by Murphy, Wootton, Bond, and Jackson in Archives of Disease in Childhood (1991) [thanks PubMed], includes data about the controls' intake and stool. Here are the computations for the first 5 healthy controls:



Yeah, just like that, if CICO were true, these people, on the exact same diet, would show a 3 kg per year weight gain difference. 30 kg per decade.

So, whenever people start talking about calories, be aware that they might be looking for a way to say "you're overweight because of your moral failings; if only you were as virtuous as I am!"

- - - - -
* 1. Carbs are delicious, even addictive. Just be aware of the trade-off: they slow down body fat loss and make you hungrier faster. Because controlling appetite is key to fat loss, that second part is much more damaging than the first. Any "diet" that requires constant attention and self-control is going to fail for normal people with normal lives in normal society: just look around you.

2. There's a situation when I'll eat even though I'm not hungry: if I know I'll become hungry later when no high-protein food will be available and the hunger will be inconvenient or require iron will to avoid eating institutional carbs-and-fat food. Usually this situation can be avoided by taking high-protein foods like Biltong (no, it's not jerky; yes, it's worth the price) or hard-boiled eggs with you, but there are situations when that's socially unacceptable.



Nerding out with science fiction




The book is Dream of the Iron Dragon by Robert Kroese. Highly recommended science fiction.

At $c/3$, each kg of mass in the ship has kinetic energy of about 5.5 PJ, the equivalent of a large thermonuclear weapon (roughly 1.3 megatons of TNT), or about 87 times the Hiroshima explosion. The relativistic increase in mass is small (around 6%, of course), but that velocity-squared, that's the big deal. (At these speeds we have to use the relativistic formula for KE, the one with $mc^2$ in the numerator.)

A table of temporal dilation (it's a highly non-linear transformation):
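In R, the dilation factor for a few speeds (illustrative values, not necessarily the ones in the table):

beta  <- c(1/3, 0.9, 0.99, 0.999, 0.9995)   # v/c
gamma <- 1 / sqrt(1 - beta^2)               # hours in the resting frame per hour of ship time
data.frame(beta, dilation = round(gamma, 2))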


At 99.95% of the speed of light, one hour of ship time would be 31 hours, 37 minutes, and 48 seconds in the resting frame. At that speed, each kilogram of mass in the ship would have 2.75 exajoules of kinetic energy or, in big boom terms, about 13 times the energy of the largest hydrogen bomb explosion (the Tsar Bomba at 210 PJ or 50 Mt TNT).

Another excerpt of the same book, non-numeric, but very dear to anyone who's ever worked in a large bureaucratic organization:


#NerdWhoMe

Friday, August 9, 2019

Big Data without Serious Theory has a Big Problem

There's a common procedure to look for relationships in data, one that leads to problems when applied to "big data."

Let's say we want to check whether two variables are related: if changes in the value of one can be used to predict changes in the value of the other. There's a procedure for that:

Take the two variables, compute a metric called a correlation, check whether that correlation is above a threshold from a table. If the value is above the threshold, then say they're related, publish a paper, and have its results mangled by your institution's PR department and misrepresented by mass media. This leads to random people on social media ascribing the most nefarious of motivations to you, your team, your institution, and the international Communist conspiracy to sap and impurify all of our precious bodily fluids.

The threshold used depends on a number of things, the most visible of which is called the significance level. It's common in many human-related applications (social sciences, medicine, market research) to choose a 95% significance level.

At the 95% level, if we have 2 variables with significant correlation, the probability that that correlation is spurious, in other words that it comes from uncorrelated variables, is 5%.

(More precisely, that 95% means that if two uncorrelated variables were subjected to the computation that yields the correlation, the probability that the result would be above the threshold is 5%. But that's close enough, in the case of simple correlation, to saying that the probability of the correlation being spurious is 5%.)

The problem is when we have more variables.

If we have 3 variables, there are 3 possible pairs, so the probability of a spurious correlation is $1-0.95^3 = 0.143$.

If we have 4 variables, there are 6 possible pairs, so the probability of a spurious correlation is $1-0.95^6 = 0.265$.

If we have 5 variables, there are 10 possible pairs, so the probability of a spurious correlation is $1-0.95^{10} = 0.401$.

Let's pause for a moment and note that we just computed a 40% probability of a spurious correlation between 5 independent (non-correlated) variables. Five variables isn't exactly the giant datasets that go by the moniker "big data."

What about better significance? 99%, 99.5%? A little better, for small numbers, but even at 99.5%, all it takes is a set with 15 variables and we're back to a 40% probability of a spurious correlation. And these are not Big Data numbers, not by a long shot.
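The computation, in R, for any number of variables and any significance level:

p_spurious <- function(k, alpha = 0.05) 1 - (1 - alpha)^choose(k, 2)   # k variables, choose(k, 2) pairs

p_spurious(3)            # 0.143
p_spurious(5)            # 0.401
p_spurious(15, 0.005)    # ~0.41: the 99.5% significance case above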


But it's okay, one would think, for there's a procedure in the statistics toolbox that has been developed specifically for avoiding over-confidence. It's called validation with a hold-out sample.

(That about 0.00% of all social science and medicine published results (though not business or market research, huzzah!) use that procedure is a minor quibble that we shall ignore. It wouldn't make a difference in large sets, anyway.)

The idea is simple: we hold some of the data out (hence the "hold-out" sample, also known as the validation sample) and compute our correlations on the remaining data (called the calibration sample). Say we find that variables $A$ and $B$ are correlated in the calibration sample. Then we take the validation sample and determine whether $A$ and $B$ are correlated there as well.

If they are, and the correlation is similar to that of the calibration sample, we're a little more confident in the result; after all, if each of these correlations is 95% significant, then the probability of both together being spurious is $0.05^2 = 0.0025$.

(Note that that "similar" has entire books written about it, but for now let's just say it has to be in the same direction, so if $A$ and $B$ have to be positively or negatively correlated in both, not positively in one and negatively in the other.)

Alas, as the number of variables increases, so does the number of possible spurious correlations. In fact it grows so fast, even very strict significance levels can lead to having, for example, ten spurious correlations.

And when you test each of those spurious correlations with a hold-out set, the probability of at least one appearing significant is not negligible. For example (explore the table to get an idea of how bad things get):
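A sketch of that logic (my reconstruction, not necessarily the exact formula behind the table below): the expected number of spurious "significant" correlations among $k$ independent variables is $\alpha \, k(k-1)/2$, and, treating that expectation as a count, the probability that at least one of them also clears the threshold in the hold-out sample is roughly $1-(1-\alpha)^{\alpha k(k-1)/2}$. In R:

spurious_then_validated <- function(k, alpha = 0.05) {
  count <- choose(k, 2) * alpha          # expected spurious correlations in calibration
  c(expected_spurious = count,
    p_at_least_one_validates = 1 - (1 - alpha)^count)
}

spurious_then_validated(100)             # 100 variables at 95% significance
spurious_then_validated(1000, 0.005)     # 1000 variables at 99.5% significance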


These numbers should scare us, as many results being presented as the great advantages of big data for understanding the many dimensions of the human experience are drawn from sets with thousands of variables. And as said above, almost none is ever validated with a hold-out sample.

They serve as good foundations for stories, but human brains are great machines for weaving narratives around correlations, spurious or not.

- - - - -

As an addendum, here are the expected number of spurious correlations and the probability that at least one of those correlations passes a validation test for some numbers of variables, as a function of the significance level. Just to drive the point home.



The point that Big Data without Serious Theory has a Big Problem.


Sunday, August 4, 2019

A = B, B = C, but A ≠ C. Depending on N, of course!

Statistics are weird. But it all makes sense.

Let's say we have 3 variables and 600 measurements, in other words 600 data points with three dimensions or a 600-row matrix with three columns; or this chart:



These are simulated data, drawn from three Normal distributions with variance 1 and means $\mu_A = 0, \mu_B = 0.25,$ and $\mu_C = 0.5$. The distributions are:


There's considerable overlap between these distributions, so any single point is insufficient to determine any relationships between $A$, $B$, and $C$.

Let's say we want to have "99.9 percent confidence" in our assertions. What does that "99.9 percent confidence" mean? The statistical meaning is that there's at most 0.1 percent chance that if the data were generated by "the null hypothesis" (which in our case is that any two distributions are the same) we'd see a test statistic above the critical value.

Test statistic?! Critical value?!

Yes. A test statistic is something we'll compute from the data (a 'statistic') in order to test it. In our case it'll be the difference of the empirical, or sample, means divided by the standard error of those empirical means.

If that number, a summary of the difference between the samples, scaled to deal with the randomness, is greater than some value --- the critical value --- we trust that it's too big to be the result of the random variation. At least we trust it to be too big 99.9 percent of the time.

Okay, that statistics refresher aside, what can we tell from a sample of 25 points? Here are the distributions of those means:



Note how the distributions of the means are much narrower than the distributions of the data. (The means don't change.) That's the effect of averaging over 25 data points. The variance of the mean is $1/25$ and the standard deviation of the mean, called the standard error to avoid confusion with the standard deviation of the data, is $1/\sqrt{25}$.

Someone with a vague memory of a long-forgotten statistics class may recall seeing a $\sigma/\sqrt{n-1}$ in this context and try to argue that 25 should be a 24. And they'd be right if we were estimating the population standard deviation from the sample data; but we're not. Our data is simulated, therefore we know the standard deviation (and hence the standard error) and we're using that to simplify things like this. Another one of which is the next one: the critical value.

Knowing the distributions lets us bypass one of the most overused jokes in statistics, the Student T ("how do you make it? Boil water, steep the student in it for 3 minutes"; student tea, get it?). More seriously, when the standard error is estimated from the sample data, the critical value is derived from a Student's T distribution; in our case we'll pick one derived from the Normal distribution, which has the advantage of not depending on the size of the sample (or, as it's called in statistics, the number of degrees of freedom in the estimation).

Now for the critical value. We're going to choose a single-sided test, so when we say $A \neq B$, we're really testing for $A < B$.

So how do we test whether some estimate of $(\mu_B - \mu_A)/(\sigma/\sqrt{n})$ is statistically greater than zero? We test the difference in empirical means, $M_B - M_A$, instead of $\mu_B - \mu_A$; since $M_A$ and $M_B$ are averages of Normal variables, their difference is a Normal variable; and dividing the difference by the standard error makes it a random variable that is normally distributed with mean zero and variance one, a standard Normal variable, usually denoted by $Z$, hence this sometimes is called a z-test.

Observation: we're dividing the difference of the means by $\sigma/\sqrt{n}$; with $\sigma=1$, we're multiplying that difference by $\sqrt{n}$.

All we need to do now is determine whether $(M_B - M_A)\, \sqrt{25}$ is above or below the point $z^{*}$ in a standard Normal distribution where $F_{Z}(z^{*}) = .999$. That point is the critical value.


(If that scaled difference falls into the blue shaded area, then we can't reject the possibility that it was generated by randomness, instead of actual difference, with the probability that we selected; in the diagram it's 0.99, but for our purposes in this post it will be 0.999.)

Thanks to the miracle of computers we no longer need to look up critical values in books of statistical tables, like the peasants of old. Using, for example, the inverse normal distribution function of Apple Numbers, we learn that $F_Z(z^*) = 0.999$ implies $z^{*} = 3.09$.

So, with a sample of 25 data points, what can we conclude?

Between $A$ and $B$ our test statistic is $0.25 \times 5 = 1.25$, well below 3.09. So, $A = B$.
Between $B$ and $C$ our test statistic is $0.25 \times 5 = 1.25$, well below 3.09. So, $B = C$.
Between $A$ and $C$ our test statistic is $0.5 \times 5 = 2.5$, still below 3.09. So, $A = C$.

That was a lot of work to prove what we could see from the picture with the distributions for the average of 25 points: too much overlap and no way to tell the three variables apart from a sample of 25 points.

Ah, but we're leveling up! 100 points. Here are the distributions for the averages of 100-point samples:


Between $A$ and $B$ our test statistic is $0.25 \times 10 = 2.5$, well below 3.09. So, $A = B$.
Between $B$ and $C$ our test statistic is $0.25 \times 10 = 2.5$, well below 3.09. So, $B = C$.
Between $A$ and $C$ our test statistic is $0.5 \times 10 = 5$,  well above 3.09. So, $A \neq C$.

Now this is the weird case that gets people confused: $A = B$, $B = C$, but $A \neq C$! Equality is no longer transitive. And it does depend on $N$.

But wait, there's more. 300 more data points, and we get the 400 point case, with the following distributions:


Between $A$ and $B$ our test statistic is $0.25 \times 20 = 5$, well above 3.09. So, $A \neq B$.
Between $B$ and $C$ our test statistic is $0.25 \times 20 = 5$, well above 3.09. So, $B \neq C$.
Between $A$ and $C$ our test statistic is $0.5 \times 20 = 10$, well above 3.09. So, $A \neq C$.

Equality is again transitive. So there's only a small range of $N$ for which statistics are weird. (Not hard to figure out that range: consider it an entertaining puzzle.) This gets more complicated if those variables have different variances.
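All nine comparisons in a few lines of R (using the true differences in means, as in the text):

crit  <- qnorm(0.999)                                   # critical value, 3.09
diffs <- c(A_vs_B = 0.25, B_vs_C = 0.25, A_vs_C = 0.5)  # true differences in means, sigma = 1
for (n in c(25, 100, 400)) {
  z <- diffs * sqrt(n)                                  # test statistics
  cat("n =", n, ":", paste(names(diffs), ifelse(z > crit, "different;", "equal;")), "\n")
}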

One has to be careful when drawing inferences about equality (real equality) from statistically non-significant differences, especially with small data sets and test values close to the critical values.




Thursday, July 25, 2019

Yeah, about that exponential economy...

There's a lot of management and technology writing that refers to "exponential growth," but I think that most of it is a confusion between early life cycle convexity and true exponentials.

Here's a bunch of data points from what looks like exponential growth:


Looks nicely convex, and that red curve is an actual exponential fit to the data,
\[
y = 0.0057 \, \exp(0.0977 \, x)   \qquad  [R^2 = 0.971].
\]
Model explains 97.1% of variance. I mean, what more proof could one want? A board of directors filled with political apparatchiks? A book by [a ghostwriter for] a well-known management speaker? Fourteen years of negative earnings and a CEO that consumes recreational drugs during interviews?

Alas, those data points aren't proof of an exponential process, rather, they are the output of a logistic process with some minor stochastic disturbances thrown in:
\[
y = \frac{1}{1+\exp(-0.1 \, x+5)} + \epsilon_x \qquad \epsilon_x \sim \text{Normal}(0,0.005).
\]
The logistic process is a convenient way to capture growth behavior where there's a limited potential: early on, the limit isn't very important, so the growth appears to be exponential, but later on there's less and less opportunity for growth so the process converges to the potential. This can be seen by plotting the two together:
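Here's a sketch of the whole exercise in R; the range of $x$ is a guess at what the chart covers, everything else follows the formulas above:

set.seed(42)
x <- 1:35                                                  # early part of the life cycle (assumed range)
y <- 1 / (1 + exp(-0.1 * x + 5)) + rnorm(length(x), 0, 0.005)    # logistic process plus noise

fit <- nls(y ~ a * exp(b * x), start = list(a = 0.005, b = 0.1)) # exponential fit
coef(fit)                                                  # compare with 0.0057 and 0.0977 above

plot(1:100, 1 / (1 + exp(-0.1 * (1:100) + 5)), type = "l", lty = 2,
     xlab = "x", ylab = "y")                               # the logistic, over a longer horizon
points(x, y, pch = 19)                                     # the "growth, growth, growth" data
curve(coef(fit)["a"] * exp(coef(fit)["b"] * x), add = TRUE, col = "red")  # the exponential fit, diverging later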


This difference is important because — and this has been a constant in the management and technology popular press — in the beginning of new industries, new segments in an industry, and new technologies, unit sales look like the data above: growth, growth, growth. So, the same people who declared the previous ten to twenty s-shaped curves "exponential economies" at their start come out of the woodwork once again to tell us how [insert technology name here] is going to revolutionize everything.

Ironically, knowledge is one of the few things that shows a rate of growth that's proportional to the size of the [knowledge] base. Which would make knowing stuff (like the difference between the convex part of an s-shaped curve and an exponential) a true exponential capability.

But that would require those who talk of "exponential economy" to understand what exponential means.

Friday, July 19, 2019

Fat tails and extremistan - not the same thing



Extremistan and mediocrestan


What, are we making up words, now? (All words are made up. Think about it.)

Extremistan and mediocrestan are characterizations of distributions; a simple way to think about them is that very large events either totally dominate (extremistan) or don't (mediocrestan):

Height is in mediocrestan: if the average height in a room with ten people is 200 cm, that's probably from ten people between 190 and 210 cm tall and not nine people 100 cm tall and one person 1100 cm tall.  
Wealth is in extremistan: if the average wealth in a room with 10 people is 2 billion dollars, that's more likely to be one billionaire with 20 billion and nine average income people than ten billionaires with 2 billion each.

This classification determines whether you can estimate relevant population parameters from samples (mediocrestan yes, extremistan no) and how well-behaved order statistics (maximum, second place, etc) are (mediocrestan nicely predictable, extremistan not so much).

There's a fairly common error that people make when they learn about extremistan: they think that because distributions in extremistan have fat tails and are dominated by extreme values, then — and this is the error — distributions that have fat tails, especially those with extreme values, are in extremistan.

Note the error: $a \Rightarrow b$ is being used to assert $b \Rightarrow a$.

As we'll see next, not all fat-tailed and extreme-valued distributions are in extremistan.


A tale of two tails


Let us compare (a) the probability that $n$ similar outcomes of large size $M$ add up to a combined event of size $nM$ (or, equivalently, average to $M$) with (b) the probability that an extreme event of size $nM$ plus $n-1$ events of size 0 add up to that same combined event $nM$. If the first is higher than the second, we're in mediocrestan; if the second is higher than the first, we're in extremistan.

For the Normal distribution, the probabilities (a) denoted $P(\text{Similar})$ and (b) denoted $P(\text{Extreme})$ are:
\begin{eqnarray*}
P(\text{Similar}) &=& \frac{1}{(2 \pi)^{n/2}} \exp(- n \, M^2/2) \\
P(\text{Extreme})  &=& \frac{1}{(2 \pi)^{n/2}} \exp( - n^2 \,  M^2/2)
\end{eqnarray*}
It's trivial to see that for the Normal we have
\[
P(\text{Similar}) > P(\text{Extreme}).
\]
Unsurprisingly enough, with its reference excess kurtosis of 0, the Normal distribution is well inside mediocrestan.

For our fat-tailed, extreme-valued distribution, we'll use the Gumbel distribution, which is also known as Extreme Value Type I. A simple form of this distribution has the following pdf:
\[
f_X(x) = \exp(- x - \exp(-x))
\]
As shown here, its variance is $\pi^2/6$, while the Normal above has variance 1, but since we're comparing within class (Normal with Normal and Gumbel with Gumbel), that makes no difference and saves a lot of unnecessary clutter if we just use that pdf as is.

For Gumbel we have the following probabilities:
\begin{eqnarray*}
P(\mathrm{Similar}) &=& \exp(-nM - n \, \exp(-M))
\\
P(\mathrm{Extreme}) &=&  \exp(-nM - \exp(-nM) -n+1)
\end{eqnarray*}
Since for large $M$ we have $\exp(-nM) \approx 0$ and $\exp(-M) \approx 0$, it follows that $\exp(-nM) + n - 1 > n \, \exp(-M)$, so for the Gumbel we also have
\[
P(\text{Similar}) > P(\text{Extreme}).
\]
The Gumbel distribution belongs in mediocrestan, despite its fat tails and extreme values.

Really makes us think about the specialness of scale independent distributions, where we can bet on a big event to overwhelm all the small events (i.e. an extremistan distribution). Those are the distributions for which a trading strategy of enduring many small losses to capture the one big win can beat a strategy of consistent small wins.


What about the maximum?


In many cases the maximum is more relevant than the mean or median. So, how do fat tails influence the maxima?

When you look at the maximum of something, say the fastest kid in a class, the larger the class, the higher the maximum will be, on average. So the fastest kid in a group of 100 is on average faster than the fastest kid in a group of 10, for example.

In mediocrestan this increase is concave in the number of kids (the difference between the fastest kids in classes of 100 and 200 kids is bigger than the difference between the fastest kids in classes of 1100 and 1200 kids, on average); in extremistan there are no guarantees.

But once again, fat tails and extreme value distributions (the Gumbel, here scaled to have variance 1) have well-behaved maxima:
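A simulation sketch of that comparison (the range of $N$ and the number of repetitions are chosen for speed, not to match the original chart):

set.seed(1)
r_gumbel <- function(n) -log(-log(runif(n))) * sqrt(6) / pi   # Gumbel draws, rescaled to variance 1

avg_max <- function(rdist, N, reps = 1000) mean(replicate(reps, max(rdist(N))))

N <- 10^(1:4)
data.frame(N,
           normal = sapply(N, function(n) avg_max(rnorm, n)),
           gumbel = sapply(N, function(n) avg_max(r_gumbel, n)))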



This nice concavity (note the logarithmic horizontal scale) makes things predictable; since many real-world metrics are known to be fat-tailed, it's comforting to know that their maxima don't explode all of a sudden.

Note that there's an effect of the extreme value: the maxima are larger and they grow faster, with less concavity than for the Normal.


And the point is…?


There are a number of people who assert that all sorts of research and social metrics are unusable because their analysis is based on mediocrestan (either by using sample statistics to estimate population statistics or by assuming regular behavior from order statistics), but — so goes the argument — these real world metrics have fat tails, so they are in extremistan.

The point of the above was to show that this form of argument (usually punctuated with gratuitous insults, expletives, and Mathematica-based math or other forms of using pretend-math to bully one's audience) is wrong, tout court.

Only a small subset of fat-tailed, extreme-valued distributions is in extremistan. For all the rest, we can use our usual tools.

Friday, July 5, 2019

A family has two children. One is a boy. Now, do the math!


Problem


A family has two children. One is a boy. How likely is it that the other child is a boy?


Popular yet wrong solution


"There are four possible cases: two boys, a boy and a girl, a girl and a boy, and two girls. But because one child is a boy, it can't be the last case (two girls), so there are only three cases. Therefore the probability is one-third."

This solution is popular. Among others, Nassim Nicholas Taleb (on a since deleted tweet), vlogbrother Hank Green in an old SciShow episode (IIRC), probability instructors trying to show how interesting their class is to bored undergraduates, and people interviewing job candidates have used this solution.

This solution is fun because it's counter-intuitive; because of that it also looks like a smart solution.

This solution is wrong.

It's wrong because after we use "one is a boy" to eliminate the possibility of a family with two girls, we can no longer divide the probability equally among the remaining three possibilities. Equal division of probability can be used in a case of no information, but not in a case when information has already been used to change the set of possibilities.

The more attentive reader will notice that this is the same error most people make in the Monty Hall three-door problem. As a general rule, it's a bad idea to try to solve math problems by hand-waving.

If it's a math problem, do the math.*


Frequentist approach


Let's say we have a large number of cases, 4000 families for example. That's 1000 for each combination of children: $(B,B), (B,G), (G,B)$, and $(G,G)$. Now we look at all the possibilities where we observe one of the children at random:

1000 $(B,B)$ families yield a total of 1000 boys;
1000 $(B,G)$ families yield a total of 500 boys;
1000 $(G,B)$ families yield a total of 500 boys;
1000 $(G,G)$ families yield a total of 0 boys.

We have a total of 2000 observed boys, and 1000 of these boys come from the case when the family has two boys, $(B,B)$. Half the time we observe a boy, the underlying family has two boys; therefore the probability of a second boy is 1/2.

If instead of 4000 we had generic $N$ families, and called them "cases," this argument would be the frequentist derivation of the result. In frequentist parlance, the 2000 total boys are called the "possibles" and the 1000 boys from $(B,B)$ are called the "favorables." The probability is calculated as the ratio of favorables to possibles.

(The frequentist approach is how most people learn about probability and combinatorics.)


Bayesian approach


Frequentist arguments become unwieldy with more elaborate problems, so we can use this puzzle to illustrate a more elegant approach, Bayesian inference.†

First let's call things by their name: $(B,B), (B,G), (G,B)$, and $(G,G)$ are the unobserved states of the world. "One is a boy," which we'll represent by $B$, is an observed event.

Some events are uninformative, for example "one is blond," in that they don't help answer the question. Others like "one is a boy," $B$, are informative, because they help answer the question. But how can we tell?

Event $B$ is informative because it happens with different probabilities in different states of the world; therefore observing $B$ gives information about what states we're more likely to be in:

$\Pr(B|(B,B)) = 1$;
$\Pr(B|(B,G)) = 1/2$;
$\Pr(B|(G,B)) = 1/2$;
$\Pr(B|(G,G)) = 0$.

We don't know the unobserved state of the world (that is, in which of those four states the family in question falls), so in this situation we can assign equal probabilities to all four (we could look up demographics tables and confirm the numbers, but let's keep this simple):

$\Pr((B,B)) = \Pr((B,G)) = \Pr((G,B)) = \Pr((G,G)) = 1/4$.

What we want is the probability of the state $(B,B)$ having observed the event $B$; this is the conditional probability $\Pr((B,B)|B)$, which can be computed using the Bayes formula,

\[
\Pr((B,B)|B) = \frac{\Pr(B|(B,B)) \Pr((B,B))}{\Pr(B)}.
\]
Because the $\Pr(B)$ trips up a lot of people, let's be clear about what it is: it's the probability that you will observe a boy in general, not in this particular case; sometimes called the a-priori probability or the unconditional probability. This is the probability that if we picked a two-child family at random and then picked one of the children at random, that child would be a boy. It's not "one, because we observe a boy," a common error.

To compute $\Pr(B)$ we must consider all four states of the world and add up ("integrate over the space of states" in expensive wording) the probability of observing a boy in each of these states weighed by the probability of the state itself:

$\begin{array}{rl}\Pr(B) =& \Pr(B|(B,B)) \Pr((B,B)) + \\
 & \Pr(B|(B,G)) \Pr((B,G)) +  \\
& \Pr(B|(G,B)) \Pr((G,B)) + \\
&\Pr(B|(G,G)) \Pr((G,G)) \\
=& 1/2
\end{array}$

(Unsurprisingly, it's 1/2, since half of the children are boys.)

Now we can compute our quantity of interest $\Pr((B,B)|B)$ by replacing the numbers in the Bayes formula. In fact, we can do that for all the states,

$\Pr((B,B)|B) = 1/2$;
$\Pr((B,G)|B) = 1/4$;
$\Pr((G,B)|B) = 1/4$;
$\Pr((G,G)|B) = 0$.

(As they used to say in the Soviet Union, trust but verify: check those numbers to be sure.)
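Or simulate it, under the same "observe one of the children at random" reading of the problem:

set.seed(7)
n <- 1e6
child1 <- sample(c("B", "G"), n, replace = TRUE)
child2 <- sample(c("B", "G"), n, replace = TRUE)
observed <- ifelse(runif(n) < 0.5, child1, child2)       # observe one child at random

saw_boy <- observed == "B"
mean(child1[saw_boy] == "B" & child2[saw_boy] == "B")    # ~0.5, not 1/3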



If it's a math problem, do the math.




-- -- -- --
* "Do the math" means apply the rules of math, not just the notation and numbers.

† There's a bit of a schism in statistical modeling between frequentists and Bayesians. I'll let you figure out which side I'm on.