Wednesday, October 2, 2019

When nutrition and fitness studies attempt science with naive statistics

A little statistics knowledge is a dangerous thing.

(Inspired by an argument on Twitter about a paper on intermittent fasting, which exposes the problem of blind trust in "studies" when those studies are held to lower statistical standards than market research has used since at least the 1970s.)

Given an hypothesis, say "people using intermittent fasting lose weight faster than controls even when calories are equated," any market researcher worth their bonus and company Maserati would design a within-subjects experiment. (For what it's worth, here's a doctor suggesting within-subject experiments on muscle development.)

Alas, market researchers aren't doing fitness and nutrition studies, mostly because market researchers like money and marketing is where the market research money is (also, politics, which is basically marketing).

So, these fitness and nutrition studies tend to be between-subjects: take a bunch of people, assign them to control and treatment groups, track some variables, do some first-year undergraduate statistics, publish paper, get into fights on Twitter.

What's wrong with that?

People's responses to treatments aren't all the same, so the variance of those responses, alone, can make effects that exist at the individual level disappear when aggregated by naive statistics.

Huh?

If everyone loses weight faster on intermittent fasting, but some people lose it only a little faster and some people lose it a lot faster, that difference in response (to fasting) will end up making the statistics look like there's no effect. And what's worse, the bigger the differences in response across people in the treatment group, the more likely the result is to be non-significant.
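To make that concrete, here's a tiny Python sketch (the numbers are invented for illustration, not taken from any study): everyone in the treatment group loses weight, but one very strong responder inflates the group variance enough that a standard two-sample t-test comes back non-significant.

```python
# Illustrative numbers only: every treated subject responds, yet the
# strong responder's outsized response makes the group difference
# look "non-significant" in a naive between-subjects test.
import numpy as np
from scipy import stats  # requires scipy >= 1.6 for the `alternative` argument

control = np.zeros(20)                      # no effect in the control group
treatment = np.array([1.0] * 19 + [50.0])   # 19 weak responders, 1 strong responder

# One-sided two-sample t-test: is the treatment mean greater than the control mean?
t_stat, p_value = stats.ttest_ind(treatment, control, alternative="greater")
print(f"t = {t_stat:.2f}, one-sided p = {p_value:.3f}")
# Prints roughly t = 1.41, p = 0.08: above the usual 0.05 cutoff,
# even though literally everyone in the treatment group lost weight.
```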

Warning: minor math ahead.

Let's say there are two conditions, control and treatment, $C$ and $T$. For simplicity there are two segments of the population: those who have a strong response $S$ and those who have a weak response $W$ to the treatment. Let the fraction of $W$ be represented by $w \in [0,1]$.

Our effect is measured by a random variable $x$, which is a function of the type and the condition. We start with the simplest case, no effect for anyone in the control condition:

$x_i(S,C) = x_i(W,C) = 0$.

By doing this, our statistical test becomes a simple one-sample test of the treatment group against zero, and we can safely ignore the control subsample.

For the treatment conditions, we'll consider that the $W$ part of the population has a baseline effect normalized to 1,

$x_i(W,T) = 1$.

Yes, no randomness. We're building the most favorable case to detect the effect and will show that population heterogeneity alone can hide that effect.

We'll consider that the $S$ part of the population has an effect size that is a multiple of the baseline, $M$,

$x_i(S,T) = M$.

Note that with any number of test subjects, if the populations were tested separately the effect would be significant, as there's no error. We could add some random factors, but that would only complicate the point, which is that even in the most favorable case (no error, both populations show a positive effect), the heterogeneity in the population hides the effect.

(If you slept through your probability course in college, skip to the picture.)

If our experiment has $N$ subjects in the treatment condition, the expected effect size is

$\bar x = w + (1-w) M$

with a standard error (the standard deviation of the sample mean) of

$\sigma_{\bar x} =  (M-1) \,\sqrt{\frac{w(1-w)}{N}} $.

(Note that because we actually know the mean, this being a probabilistic model rather than a statistical estimation, we see $N$ where most people would expect $N-1$.)
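For readers who want the intermediate step: in the treatment condition each subject's outcome is a two-point random variable, taking the value 1 with probability $w$ and $M$ with probability $1-w$, so its variance is

$\mathrm{Var}(x) = E[x^2] - (E[x])^2 = w + (1-w)M^2 - \big(w + (1-w)M\big)^2 = w(1-w)(M-1)^2$,

and dividing by $N$ and taking the square root gives the $\sigma_{\bar x}$ above.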

So, the test statistic is

$t = \bar x/\sigma_{\bar x} = \frac{w + (1-w) M}{(M-1) \,\sqrt{\frac{w(1-w)}{N}}}$.

It may look complicated, but it's basically a three-parameter analytical function, so we can easily see what happens to significance for different $w$, $M$, and $N$, which is our objective.

Because we're using a probabilistic model where all the quantities are known rather than estimated, the reference distribution for the test statistic is the standard Normal, so the critical value for, say, 0.95 confidence, single-sided, is given by $\Phi^{-1}(0.95) = 1.645$.
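As a quick sanity check (with numbers picked for illustration, not taken from any study): with $w = 0.95$, $M = 50$, and $N = 20$,

$t = \frac{0.95 + 0.05 \times 50}{49\sqrt{0.95 \times 0.05/20}} \approx \frac{3.45}{2.39} \approx 1.44 < 1.645$,

so the effect would be declared non-significant even though, in the model, every single subject responds to the treatment.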

To start simply, let's fix $N = 20$ (say a convenience sample of undergraduates, assuming a class size of 40 and half of them in the control group). Now we can plot $t$ as a function of $M$ and $w$:
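(In case you want to reproduce the plot yourself, here's a minimal matplotlib sketch; the grid ranges for $M$ and $w$ are my choices for display, not anything canonical.)

```python
# Minimal sketch of the plot: t as a function of M and w, with N = 20.
# The grid ranges below are illustrative choices.
import numpy as np
import matplotlib.pyplot as plt

N = 20
w = np.linspace(0.5, 0.99, 200)      # fraction of weak responders
M = np.linspace(2.0, 100.0, 200)     # effect multiplier for strong responders
W, Mgrid = np.meshgrid(w, M)

t = (W + (1 - W) * Mgrid) / ((Mgrid - 1) * np.sqrt(W * (1 - W) / N))

plt.contourf(Mgrid, W, t, levels=50)
plt.colorbar(label="t statistic")
plt.contour(Mgrid, W, t, levels=[1.645], colors="red")  # significance boundary
plt.xlabel("M (effect multiplier for strong responders)")
plt.ylabel("w (fraction of weak responders)")
plt.title("t as a function of M and w (N = 20)")
plt.show()
```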


(The seemingly high magnitudes of $M$ and $w$ involved are an artifact of not having any randomness in the model. We wanted this to be simple, so that's the trade-off.)

Recall that in our model both sub-populations respond to the treatment, and there's no randomness in that response. And yet, for a small enough fraction of the $S$ population and a large enough multiplier effect $M$, our super-simple, extremely favorable model shows non-significant effects using a single-sided test (the most favorable test), at the lowest significance level most journals accept, 95% (also the most favorable choice).

Let's be clear about what "non-significant effects" means: a naive statistician would look at the results and say that the treatment shows no difference from the control; in the words of our example, that people using intermittent fasting don't lose weight faster than the controls.

This, even though everyone in our model loses weight faster when intermittent fasting.

Worse, the results become less and less significant the stronger the effect on the $S$ population relative to the $W$ population. In other words, the faster the highly-responsive subpopulation loses weight relative to the less-responsive one (when both are losing weight with intermittent fasting), the more strongly the naive statistics suggest that intermittent fasting is ineffectual at producing weight loss.
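One can see this directly from the formula: holding $w$ and $N$ fixed, $t$ is decreasing in $M$ and approaches a floor of

$\lim_{M \to \infty} t = \sqrt{\frac{N(1-w)}{w}}$,

so with the illustrative values $N = 20$ and $w = 0.95$, increasing $M$ only pushes $t$ down toward $\sqrt{20 \times 0.05/0.95} \approx 1.03$, well below the 1.645 cutoff: making the strong responders respond even more strongly makes the result look less significant, not more.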

Market researchers have known about this problem for a very long time. Nutrition and fitness practices (can't bring myself to call them sciences) are now repeating errors from the 50s-60s.

That's not groovy!