Saturday, May 30, 2020

Off the cutting-room floor: Fitness "science," a collection of bad thinking.

(Off the cutting-room floor: writing that didn't make it into the book. The references to numbered pitfalls below are to the book's pitfalls.)


People who run regularly are fitter (have better cardiovascular function) than people who don't. Based on that observed regularity, early fitness gurus helped start the jogging craze in the mid-1960s.

But there's an obvious problem with the causality: are people fitter because they run (running causes fitness) or are people able to run because they're fit (fitness causes running)?

We've been socially conditioned to accept the first implication without question, but as we've learned from pitfall 4 (hidden factors in correlations), these questions are more complex than they appear. In particular, as we know from pitfalls 2 and 3, results from self-selected samples and truncated data can be very deceptive.

What was the data, how was it modeled, and how do the results influence decisions?


When the data isn't representative


Because they didn't like the self-selected sample of runners, many aspiring fitness researchers ran what they thought were controlled experiments: they took a group of sedentary people willing to participate in the experiment and divided them into two groups; half kept their sedentary life, half took up running. After some period, the runners' fitness was compared with that of the sedentary group. Runners were fitter.

However, people with the best intentions to get in shape might not follow through, and many of these experiments had drop-out rates of more than 50% on the runner side (more than half the people who started the experiment abandoned it).

These studies traded the hidden-factor correlation and self-selection of people who don't run to begin with for the hidden-factor correlation and self-selection of people who didn't finish the experiment on the runner side.

Eventually some researchers ran controlled experiments with appropriate samples, tracking all participants (even the ones who dropped out) and analyzing the data properly. These studies showed smaller effect sizes than those from self-selected samples, of course, but the results supported the relationship between running and fitness.



When information isn't what it appears


The controlled experiments showed that there was causality from running to being fit. Are we done?

In a paper that revolutionized the exercise world, a team led by Izumi Tabata showed that short but intense workouts could deliver the same gains as long sessions of low-intensity exercise like jogging.  (Yes, gym Tabatas are named after him, even though most of them don't follow the protocol of the paper. The paper is: "Metabolic profile of high intensity intermittent exercises," published in 1997 in the journal Medicine & Science in Sports & Exercise.)

This raised another issue: high-intensity exercise might work through muscle strengthening. After all, if the muscles get stronger, it takes less effort to do the same task, and that lower effort puts less stress on the cardiovascular system, so the person appears fitter.

This is a controversy still going on in the fitness industry and sports medicine. We'll sidestep it because we only care about the modeling implications: if this is a hidden mediating factor correlation (again pitfall 4!), then it can be tested to some extent by measuring the factor itself: muscle strength. (We'll leave it at that, because that's the modeling insight.)
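
A purely illustrative sketch in R (simulated data and made-up coefficients, not a claim about actual exercise physiology): if muscle strength is the mediator, the apparent effect of exercise on fitness should shrink once measured strength is in the model.

set.seed(1)
n        <- 500
exercise <- rbinom(n, 1, 0.5)              # 1 = high-intensity program, 0 = control
strength <- 10 + 3 * exercise + rnorm(n)   # strength responds to exercise
fitness  <- 20 + 2 * strength + rnorm(n)   # fitness responds only through strength here
coef(lm(fitness ~ exercise))               # exercise looks like it drives fitness
coef(lm(fitness ~ exercise + strength))    # the exercise coefficient collapses toward zero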

We could ask whether it makes a difference which process (direct fitness effects or indirect effects via muscle strength) is at work, and with that we enter decision analysis.


When decisions don't follow from information


Why does the causal process make a difference? If it's a hidden mediating factor of muscle strength, we can get the muscles stronger with Tabatas, and if it's a direct effect on fitness, we can get that fitness with Tabatas too. It's Tabatas in both cases, so why make a fuss?

Ah, but Tabatas (and similar techniques) aren't the only way to strengthen muscles, and they are high-impact activities; if the mediated (muscle-strength) relationship is the true one, then the elderly and those recovering from injury might get the same fitness gains from slow-movement weight training, but if the direct relationship is the true one, they can't.

Knowing which model is right makes a difference to the elderly and those recovering from injury. That's worth some fuss.

We won't take a position on this, because the book is not about fitness, but it's clear that knowing which model describes reality, the direct causality from cardio exercise to cardiovascular fitness or the indirect causality from cardio exercise to muscle strength to cardiovascular fitness, can make a lot of difference for people who can't do high-impact exercise.

On a separate issue, given Tabata et al's results, why do people jog?

This brings us back to pitfall 10 and the difference between decision-making models and decision-support models. Determining whether to jog based on the effectiveness of jogging for cardiovascular fitness uses the model as the sole driver of decision making.

But even if we understand that Tabatas are a better form of cardiovascular exercise, and we take the result of the model as information for our decision, we might use other criteria to make the decision. We might enjoy the activity; or use it to socialize with coworkers; or understand that running is a useful skill for life and, like all skills, needs training.

We make the decisions, not the models. Well, not always the models.

Saturday, May 9, 2020

I'm writing a short book

Greetings, carbon-based lifeforms,

No, I haven't given up blogging; it's just on hold while I'm writing a short book putting together some thoughts on what people do wrong with numbers, data, and models. It'll have a lot of pictures and be priced to move.

It'll include material from some blog posts and executive education materials, reworked a bit, of course, but there's a lot of extra work involved.

Here are some of the pictures, work in progress (as usual, click for bigger):

(Adapted from this blog post.)

 (Adapted from exec-ed materials, not a blog post.)

(Adapted from this blog post.)

(Adapted from this blog post.)

Live long and prosper,

JCS

- - - - - -

Yes, the greeting is a paraphrase of the title of an AC Clarke book of collected essays.  Call it a homage.

Sunday, April 12, 2020

Random clusters: how we make too much of coincidences

Understanding randomness and coincidence


If we flip a fair coin 1000 times, how likely is it that we see a sequence of 10 tails somewhere?

Give that problem a try, then read on.

We start by computing the probability that a sequence of ten flips is all tails:

$\Pr(\text{10 tails in a row}) = 2^{-10} = 1/1024$

There are 991 sequences of 10 flips in 1000 flips, so we might be tempted to say that the probability of at least one sequence of 10 tails is 991/1024.

This is obviously the wrong answer, because if the question had been for a sequence of 10 tails in 2000 flips the same logic would yield a probability of 1991/1024, which is greater than 1.

(People make this equivalence between event disjunction and probability addition all the time; it's wrong every single time they do it. The rationale, inasmuch as there's one, is that "and" translates to multiplication, so they expect "or" to translate to addition; it doesn't: in general $\Pr(A \text{ or } B) = \Pr(A) + \Pr(B) - \Pr(A \text{ and } B)$, and dropping the last term is only valid when the events can't happen together.)

The correct probability calculation starts by asking for the probability that none of those 991 ten-flip sequences is all tails, in other words,

$(1 - \Pr(\text{10 tails in a row}))^{991}$.

This is the probability of the complement of the event we want (at least one sequence of 10 tails in 1000 flips), which makes the probability of the event we want

$1 - (1 - \Pr(\text{10 tails in a row}))^{991} = 0.62$.

62% of the time one of these 1000-flip sequences will show, somewhere, a sequence of 10 tails.
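
For anyone who wants to check the arithmetic, the same calculation as a two-line R sketch:

p_run <- 2^-10          # probability that a given 10-flip sequence is all tails
1 - (1 - p_run)^991     # probability of at least one such sequence in 1000 flips, about 0.62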

Now consider Bob, a journalist, who sees those 10 tails and hurries home to write about the "one in a thousand" event that he just witnessed, demanding a coin fairness validation authority, and 4 trillion dollars in QE for the Fed, 1.5 trillion dollars for politically connected businesses, 1100 pages of legislation containing every legislator's wishlist, and an app on everyone's phone and an embedded chip in everyone's wrist to track their vaccine statu… I mean to check for fairness in coin flips.

And this brings us to the problem of random clustering.


Random clusters


Say there's an event that happens with probability 14.5/100,000 per person-year (that's the flu death rate for California in 2018, according to the CDC).

What's the probability that in a given year we see a cluster of more than 30 events in a single week (roughly 50% above the average, as it turns out later) in the San Francisco Bay Area (population 7,000,000 for this example)? How about more than 40 events (roughly twice the average)?

Try it for yourself.

Done?

Sure you don't want to try to do it first?

Okay, here we go.

First we need to see what the distribution of the number of events per week is. Assuming that the 14.5/100,000 per person-year is distributed uniformly over the 52 weeks (which it isn't, but by choosing this we make the case for random clusters stronger, because the deaths happen mostly during flu season and the average during flu season is higher), the individual event probability is

$p = 14.5/(100,000 \times 52) = 0.0000027885$ events per person-week.

A population of 7,000,000 is much bigger than 30, so we use the Poisson distribution for the number of events/week. (The Poisson distribution is unbounded, but the error we make by assuming it is trivial, given that the cluster sizes we care about are so small compared to the size of the population.)

If the probability of an individual event per person-week is $p$, then the average number of events per week $\mu$ (which is also the parameter of the Poisson distribution, helpfully) is $\mu = 7000000 \times p = 19.51923$.

Our number of events per week, a random variable $P$, will be distributed Poisson with parameter $\mu$; the probability that a given week has more than $N$ events is $(1-F_P(N))$, where $F_P(N)$ is the c.d.f. of $P$ evaluated at $N$; hence the probability of a year with at least one week where the number of events (we'll call that the cluster size) exceeds $N$ is

$1- F_P(N)^{52}$ where $P \sim \mathrm{Poisson}(19.52)$.

We can now use the miracle of computers and spreadsheets (or the R programming language, which is my preference) to compute and plot the probability that there's a cluster of more than $N$ events in some week of the year:
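
For reference, a minimal R sketch of that computation (the rate and population figures are the ones used above; the range of $N$ is just for plotting):

p  <- 14.5 / (100000 * 52)        # individual event probability per person-week
mu <- 7e6 * p                     # average events per week, about 19.52
N  <- 20:45                       # candidate cluster sizes
p_cluster <- 1 - ppois(N, mu)^52  # Pr(at least one week in the year exceeds N events)
plot(N, p_cluster, type = "b", xlab = "cluster size N",
     ylab = "Pr(at least one week exceeds N)")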


 So, from randomness alone, we expect that with 40% probability there'll be one week with a cluster 50% larger than average in the Bay Area, and with 0.08% probability one that's twice as large.

That 0.08% looks very small… until we realize that there are thousands of townships in the US, and if each of them has a similar probability (different in detail because populations and death rates are different, but generally in the same order of magnitude, so we'll just use the same number for illustration), the aggregate number for, say, 1000 townships will be

$1- (1-0.0008)^{1000} = 0.55$.

Any cherry-picking person can, with some effort, find one such cluster in at least half the years. And that's the problem: everything said here is driven by pure probability, but the clusters are always interpreted as being caused by some other process.

"After all, could randomness really cause such outliers?"

Yes. Randomness was the only process assumed above, all numbers were driven by the same process, and those so-called outliers were nothing but the result of the same processes that create the averages.

Remember that the next time people point at scary numbers like 50% or 100% above average.



Mathematical note (illustration, not proof)


A small numerical illustration in support of the assumption above that the various townships would have similar (not equal) probabilities, showing the cluster probabilities for different $\mu$ as a function of the cluster size relative to the average:
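
A small R sketch of that kind of computation (the weekly averages $\mu$ are made up for illustration):

mus <- c(15, 20, 25)                 # hypothetical weekly averages, same ballpark as above
rel <- seq(0.25, 1.50, by = 0.25)    # cluster size relative to the weekly average
prob_cluster <- function(mu) 1 - ppois(ceiling((1 + rel) * mu), mu)^52
round(sapply(mus, prob_cluster), 4)  # rows follow rel; one column per mu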


Sunday, April 5, 2020

An irritating logical error that permeates the COVID-19 discussion

Apparently many people in influential media outlets don't know that the proposition
"most patients who die of COVID19 have [a given precondition]"
doesn't imply the proposition
"most COVID19 patients with [a given precondition] die."
This is a logic error; it's made in all sorts of news outlets and social media right now. 

Don't take my word that this is a logic error; here's a counter-example:
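
One counter-example of this kind, with purely made-up numbers (an R sketch, not real COVID19 data):

patients_with_precondition <- 1000   # hypothetical COVID19 patients with the precondition
deaths_with_precondition   <- 18     # deaths among them
deaths_total               <- 20     # all deaths in this made-up population
deaths_with_precondition / deaths_total                 # 0.9: most patients who die have the precondition
deaths_with_precondition / patients_with_precondition   # 0.018: yet very few patients with it die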


Saturday, April 4, 2020

Sampling on the dependent variable

I think "sampling on the dependent variable" is about to beat "ignoring hidden factor correlation" as the most common data analysis error in the wild.

Walter is a very popular professor of physics at the Boston Institute of Technology [name of institution cleverly disguised for legal reasons], who teaches a 400-student class in the largest classroom at BIT. The other 20 professors of physics are very boring, so at any given time they have on average 5 students each in their classes.

A journalist stands at the door of the physics department and asks every fourth student how full their physics classes are. The results are as follows:

- 100 students say that their class was completely full;
- 25 students say that their class was mostly empty.

This is reported as "4 out of 5 classes are full at BIT; new building needed to address the lack of space, new faculty must be hired urgently."

Did you see the error? It's subtle.

Here's a visualization of the process to help see it:
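
A quick R simulation of the survey, using the class sizes above, shows the same distortion numerically:

class_sizes <- c(400, rep(5, 20))                         # Walter's class plus the 20 boring ones
class_of    <- rep(seq_along(class_sizes), class_sizes)   # each student's class
surveyed    <- sample(class_of, length(class_of) / 4)     # "every fourth student" as a random quarter
mean(class_sizes[surveyed] >= 400)                        # share of surveyed students in a full class, about 0.8
mean(class_sizes >= 400)                                  # share of classes that actually are full, about 0.05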


In fact, only one class is full. The problem is that the likelihood of a student being in the sample (a random sample of students coming out of the building) is proportional to the variable of interest (the number of students in the class); in other words, the journalist is sampling on the dependent variable.

The more full a class is, the more over-represented that class will be in the sample of students.

This looks like some rare error, the kind of thing that would only happen to hapless journalists, except that it happens all the time and in serious circumstances.

Consider the case of health authorities trying to determine the seriousness of a condition, namely how many of the people with the condition die. They could count the cases that get tested and compute the fraction of those that die. (That's what most of the preliminary COVID-19 case fatality rate numbers in the media are.)

And that's the same error that the journalist made.

In this case, the dependent variable is not size of class, it's seriousness of disease, and the sampling problem is not with the number of students in a class, it's with the people who choose to get tested. These people choose to get tested (or get tested at a hospital when admitted) because they have symptoms that make them take the trouble.

In other words, the more serious the level of the disease a patient P has, the more likely P will be tested (sampling on the dependent variable, again), and more of these tested patients will die than if the testing were done on a random sample of the population.

(This is different from the truncation argument made in this previous post. Truncation is also a type of sampling on the dependent variable; a form that is easier to correct, as the non-truncated part of the sample distribution is the same as the population distribution up to scaling.)

To illustrate the effect of different degrees of sampling on the dependent variable, let us consider the case of a uniformly distributed variable (in the population) and different degrees of sampling:
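
A small R sketch of those degrees of sampling, with a simulated population uniform on (0, 1):

set.seed(1)
x <- runif(1e5)                                       # population variable; its mean is 0.5
for (k in 0:4) {
  smp <- sample(x, 1e4, replace = TRUE, prob = x^k)   # sampling weight proportional to x^k
  cat("weight x^", k, ": sample mean =", round(mean(smp), 3), "\n")
}

With correct sampling (the $x^0$ case) the sample mean stays near 0.5; with weights $x$ through $x^4$ it climbs toward roughly 0.67, 0.75, 0.80, and 0.83, while the population mean never moves.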


Let's consider two persons, A with $x_A=0.2$ and B with $x_B=0.4$. With correct (random) sampling, A and B would have equal chances of being in the sample. With sampling proportional to $x$, B would be twice as likely as A to be in the sample, biasing the sample average upwards relative to the population; with sampling proportional to $x^2$, B would be four times as likely, which would bias the sample average even more.

To put this $x^2$ in the context of, for example, COVID-19 tests, a sampling proportional to $x^2$ means that people in a group X with symptoms twice as bad as people in a group Y will be four times more likely to seek treatment (and be tested). Basically, each person in group X will be counted four times more often in the statistics than each person in group Y (the group with less serious symptoms).

(The other degrees, $x^3$ and $x^4$, capture cases where people avoid the hospital unless their symptoms are serious or very serious.)

As we can see from the charts in the image above, the more distortion of the underlying population distribution the sampling process creates, the higher the sample average, and all the while the population average stays at a constant 1/2.

Sampling on the dependent variable: something to keep in mind when people talk about dire situations in the news.

Saturday, March 21, 2020

Fun with numbers for March 21, 2020

Recycling some tweets on the third day of California shelter-in-place. Weather is nice:




I really don't like these "flatten the curve" diagrams posing as science



Maybe it’s just me, but this diagram strikes me as a number of unsupported, unquantified statements presented as if it were some sort of quantitative model based on real data:
  1. Axes have labels but no scales… so all we can measure is the relative magnitudes. Is that high peak at (D,A) 1%, 10%, 25%, or 90% of the population? Does it happen in a week, a month, or a year?
  2. A/B = 415/110 so this undefined intervention lowers the peak by 73.5%. How many patients is that? Do these measures really slow down infection rates this much? Assuming that there’s no change to recovery speed, that’s a 4-fold reduction from an unidentified intervention.
  3. E/D = 475/280 so this undefined intervention delays the peak by 70%. So if D is a month, this delays the peak a further three weeks, not long enough for a vaccine; if D is a year, that’s another 8 months, presumably enough. 
  4. B is still greater than C, so what happens when the slowed-down process crosses over the health system capacity? Rationing/triage, or does this mean bodies littering the streets? The overage is (B-C)/C = (110-83)/83, or 33% over capacity, but what happens depends on absolute numbers, not relative ones; since there are no numbers, there’s no real meaning.

All the calculations above are just to show that if we’re to take a chart seriously we need real numbers and real details; the above figure is just a qualitative “let’s hope this works to convince people to wash hands and stay away from others” masquerading as a technical model.

BY ALL MEANS, WASH YOUR HANDS, DON’T TOUCH YOUR FACE, AND STAY AWAY FROM OTHERS, because that makes sense. I've been doing it for as long as I can remember.



The information we're getting is preliminary and we're treating it as dogma


From a study of Italian testing:


The internal consistency of this test is 75% (25% of the time the test doesn’t agree with itself in retesting); this doesn’t mean that the test is 75% accurate, because that’s measured relative to the underlying condition. This is an upper bound on the accuracy of the test, since we know that at least 25% of the time it's inaccurate for sure. (Sample size appears small, but for Medicine this is almost their version of "big data.")


A more general point about COVID19 testing


It's easy to show that missing covariates lead to panic-inducing overestimates. The following numbers are not COVID19 data, just an illustration:


Sometimes I despair of what people try to do with small amounts of data, and then the sarcasm comes out:
How can anyone deny this calamity?! In less than two months the entire population of the Earth will test positive. 
In 100 days, over 8 trillion people will test positive. That's 5 times the total number of humans who've ever lived!!!! 



TSLA twitter, always good for a laugh



No matter what the stock does or at what price it's trading, Ross always says "buy." One wonders how he charges 2-and-20 to his clients to give advice of this quality.



Richard "Hamster" Hammond drives a Tesla Model X



And gets very excited at adding one mile every few seconds at a Tesla Supercharger. (We can see on the touchscreen that the Supercharger is delivering 65 kW and Tesla claims 310 Wh/mi,* so that would average out at about 17 seconds per mile of range.) Not to be a spoilsport, but a gas pump adds about 26 miles of range per second (3 l/s in a 35 MPG car).

Then there's a small blur fail that reveals Hammond isn't really driving under the speed limit:


That's okay, Mr. Hammond, no one else is either.

- - - - -
* If you believe that number, you're exactly the kind of investor I'm targeting with a new product structured mostly with 2020 pandemic cat bonds; act now, supplies are limited.**

** CYA statement: this is a facetious offer, expressing derision for Tesla's number, not a proffer of a tradable security structured from out-of-the-money cat bonds.



Some videos to watch while the economy tanks around us



Grant Sanderson of 3blue1brown gave a talk at Berkeley about having people engage with math. The gist is that people want relevance and/or a story. That's good advice, but I think 3B1B's problem is that his audience is self-selected. In other words, that's how you engage an audience that's predisposed to look for and watch math videos. Still, good points.



Experimentboy is back, with thermal cameras. Very fun stuff.



PhysicsGirl suggests fun experiments to keep us from losing our minds while we wait to be moved to FEMA camps or be turned into Soylent Green.


YouTube affords the überdorks amongst us the opportunity to watch talks waaaaay above our expertise, something that in real life would be embarrassing, not to mention logistically difficult. So here are some links to:

Caltech. MIT-West, as some people who went to a technical school in Massachusetts call it.

Stanford Institute for Theoretical Physics. Fair warning: Susskind eats cookies when he talks, so there's spraying in some videos (all Susskind videos, really).

Institute for Advanced Studies at Princeton.

Nasa Jet Propulsion Laboratory.

Art talks at Le Louvre, the Musée d'Orsay, the British Museum, the Smithsonian Institution, and the Museum of Fine Arts in Boston, a small town in a hard-to-spell state.




Live long and prosper.

Sunday, March 15, 2020

Fun with geekage while social distancing for March 15, 2020

(I'm trying to get a post out every week, as a challenge to produce something intellectual outside of work. Some* of this is recycled from Twitter, as I tend to send things there first.)


Multicriteria decision-making gets a boost from Covid-19



A potential upside (among many downsides) of the coronavirus covid-19 event is that some smart people will realize that there's more to life choices than a balance between efficiency and convenience and will build [for themselves if not the system] some resilience.

In a very real sense, it's possible that PG&E's big fire last year and the follow-up blackouts saved a lot of people from the worst of the new flu season: after last Fall, many local non-preppers stocked up on N95 masks and home essentials because of the chaos PG&E had wrought in Northern California.



Anecdotal evidence is a bad source for estimates: coin flips


Having some fun looking at small-numbers effects on estimates or how unreliable anecdotal evidence really can be as a source of estimates.

The following is a likelihood ratio of various candidate estimates versus the maximum likelihood estimate of the probability of heads, given the number of flips of a balanced coin and the number of heads observed; because there's an odd number of flips, even the most balanced outcome is not 50-50:
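
Here is an R sketch of the same kind of computation (the five-flip, three-head sample is made up for illustration):

n <- 5; k <- 3                                     # a made-up tiny sample: 3 heads in 5 flips
p_hat  <- k / n                                    # maximum likelihood estimate of Pr(heads)
p_cand <- seq(0.1, 0.9, by = 0.1)                  # candidate estimates
lr <- dbinom(k, n, p_cand) / dbinom(k, n, p_hat)   # likelihood ratio versus the MLE
round(setNames(lr, p_cand), 2)

Even candidates far from the 0.6 maximum likelihood estimate keep a sizable share of the likelihood; five flips just don't carry much information.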


This is an extreme example of small numbers, but it captures the problem of using small samples, or in the limit, anecdotes, to try to estimate quantities. There's just not enough information in the data.

This is the numerical version of the old medicine research paper joke: "one-third of the sample showed marked improvement; one-third of the sample showed no change; and the third rat died."

Increasing sample size makes for better information, but can also exacerbate the effect of a few errors:


Note that the number of errors necessary to get the "wrong" estimate goes up: 1 (+1/2), 3, 6.



Context! Numbers need to be in context!



I'm looking at this pic and asking myself: what is the unconditional death rate for each of these categories? I.e., if you're 80 today in China, how likely is it that you don't reach March 15, 2021, from all causes?

Because that'd be relevant context, I think.



Estimates vs decisions: why some smart people did the wrong thing regarding Covid-19



On a side note, while some people choose to lock themselves at home for social distancing, I prefer to find places outdoors where there's no one else. For example: a hike on the Eastern span of the Bay Bridge, where I was the only person on the 3.5 km length of the bridge (the only person on the pedestrian/bike path, that is).




How "Busted!" videos corrupt formerly-good YouTube channels


Recently saw a "Busted!" video from someone I used to respect and another based on it from someone I didn't; I feel stupider for having watched the videos, even though I did it to check on a theory. (Both channels complain about demonetization repeatedly.) The theory:


Many of these "Busted!" videos betray a lack of understanding (or fake a lack of understanding for video-making reasons) of how the new product/new technology development process goes; they look at lab rigs or technology demonstrations and point out shortcomings of these rigs as end products. For illustration, here's a common problem (the opposite problem) with media portrayal of these innovations:


It's not difficult to "Bust!" media nonsense, but what these "Busted!" videos do is ascribe the media nonsense to the product/technology designers or researchers, to generate views, comments, and Patreon donations. This is somewhere between ignorance/laziness and outright dishonesty.

In the name of "loving science," no less!



Johns Hopkins visualization makes pandemic look worse than it is



Not to go all Edward Tufte on Johns Hopkins, but the size of the bubbles on this site makes the epidemic look much worse than it is: Spain, France, and Germany are completely covered by bubbles, while their cases are
0.0167 % for Spain
0.0070 % for Germany
0.0067 % for France
of the population.



Cumulative numbers increase; journalists flabbergasted!



At some point someone should explain to journalists that cumulative deaths always go up, it's part of the definition of the word "cumulative." Then again, maybe it's too quantitative for some people who think all numbers ending in "illions" are the same scale.



Stanford Graduate School of Education ad perpetuates stereotypes about schools of education


If this is real, then someone at Stanford needs to put their ad agency "in review." (Ad world-speak for "fired with prejudice.")





Never give up; never surrender.


- - - - -
* All.