I think "sampling on the dependent variable" is about to beat "ignoring hidden factor correlation" as the most common data analysis error in the wild.
Walter is a very popular professor of physics at the Boston Institute of Technology [name of institution cleverly disguised for legal reasons], who teaches a 400-student class in the largest classroom at BIT. The other 20 professors of physics are very boring, so at any given time they have on average 5 students each in their classes.
A journalist stands at the door of the physics department and asks every fourth student how full their physics classes are. The results are as follows:
- 100 students say that their class was completely full;
- 25 students say that their class was mostly empty.
This is reported as "4 out of 5 classes are full at BIT; new building needed to address the lack of space, new faculty must be hired urgently."
Did you see the error? It's subtle.
Here's a visualization of the process to help see it:
In fact, only one class is full. The problem is that the likelihood of a student being in the sample (a random sample of students coming out of the building) is proportional to the variable of interest (the number of students in the class); in other words, the journalist is sampling on the dependent variable.
The more full a class is, the more over-represented that class will be in the sample of students.
This looks like some rare error, the kind of thing that would only happen to hapless journalists, except that it happens all the time and in serious circumstances.
Consider the case of health authorities trying to determine the seriousness of a condition, namely how many of the people with the condition die. They could count the cases that get tested and compute the fraction of those that die. (That's what most of the preliminary COVID-19 case fatality rate numbers in the media are.)
And that's the same error that the journalist made.
In this case, the dependent variable is not the size of the class, it's the seriousness of the disease, and the sampling problem is not with the number of students in a class, it's with the people who choose to get tested. These people choose to get tested (or get tested at a hospital when admitted) because they have symptoms that make them take the trouble.
In other words, the more serious the level of the disease a patient P has, the more likely P will be tested (sampling on the dependent variable, again), and more of these tested patients will die than if the testing were done on a random sample of the population.
(This is different from the truncation argument made in this previous post. Truncation is also a type of sampling on the dependent variable; a form that is easier to correct, as the non-truncated part of the sample distribution is the same as the population distribution up to scaling.)
To illustrate the effect of different degrees of sampling on the dependent variable, let us consider the case of a uniformly distributed variable (in the population) and different degrees of sampling:
Let's consider two persons, A with $x_A=0.2$ and B with $x_B=0.4$. With correct, random sampling, A and B would have equal chances of being in the sample. With sampling proportional to $x$, B would be twice as likely to be in the sample as A, biasing the sample average upwards relative to the population; with sampling proportional to $x^2$, B would be four times as likely to be in the sample, which would bias the sample average even more.
To put this $x^2$ in the context of, for example, COVID-19 tests, a sampling proportional to $x^2$ means that people in a group X with symptoms twice as bad as people in a group Y will be four times more likely to seek treatment (and be tested). Basically, each person in group X will be counted four times more often in the statistics than each person in group Y (the group with less serious symptoms).
(The other degrees, $x^3$ and $x^4$ capture cases where people avoid the hospital unless their symptoms are serious or very serious.)
As we can see from the charts in the image above, the more distortion of the underlying population distribution the sampling process creates, the higher the sample average, and all the while the population average stays at a constant 1/2.
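For anyone who wants to play with this, here's a minimal R sketch (not the code behind the charts; the population and sample sizes are assumptions) that draws a uniform population and samples it with probability proportional to $x^k$:

```r
# Sampling on the dependent variable: uniform population, weights proportional to x^k
set.seed(42)
pop <- runif(100000)          # population: uniform on [0,1], mean 0.5

for (k in 0:4) {
  # probability of ending up in the sample proportional to x^k
  # (k = 0 is correct random sampling)
  smp <- sample(pop, size = 10000, replace = TRUE, prob = pop^k)
  cat(sprintf("k = %d   sample average = %.3f\n", k, mean(smp)))
}
```

Under weights proportional to $x^k$, the sample average climbs from 1/2 toward $(k+1)/(k+2)$, while the population average never moves.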
Sampling on the dependent variable: something to keep in mind when people talk about dire situations in the news.
Sunday, March 15, 2020
Fun with geekage while social distancing for March 15, 2020
(I'm trying to get a post out every week, as a challenge to produce something intellectual outside of work. Some* of this is recycled from Twitter, as I tend to send things there first.)
Multicriteria decision-making gets a boost from Covid-19
A potential upside (among many downsides) of the coronavirus covid-19 event is that some smart people will realize that there's more to life choices than a balance between efficiency and convenience and will build [for themselves if not the system] some resilience.
In a very real sense, it's possible that PG&E's big fire last year and follow-up blackouts saved a lot of people the worst of the new flu season: after last Fall, many local non-preppers stocked up on N95 masks and home essentials because of what chaos PG&E had wrought in Northern California.
Anecdotal evidence is a bad source for estimates: coin flips
Having some fun looking at small-numbers effects on estimates or how unreliable anecdotal evidence really can be as a source of estimates.
The following is a likelihood ratio of various candidate estimates versus the maximum likelihood estimate for the probability of heads given a number of throws and heads of a balanced coin; because there's an odd number of flips, even the most balanced outcome is not 50-50:
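A minimal R sketch of that computation (the flip counts used in the actual charts aren't in the text, so the 3-heads-in-5-flips example below is just an assumption for illustration):

```r
# Likelihood ratio of candidate estimates of Pr(heads) vs the maximum
# likelihood estimate, for k heads in n flips
p_grid <- seq(0.01, 0.99, by = 0.01)
lr_curve <- function(n, k) dbinom(k, n, p_grid) / dbinom(k, n, k / n)

# example: 3 heads in 5 flips (n odd, so even the most balanced outcome isn't 50-50)
plot(p_grid, lr_curve(n = 5, k = 3), type = "l",
     xlab = "candidate Pr(heads)", ylab = "likelihood ratio vs MLE")
# with so few flips, a wide range of candidates is almost as likely as the MLE
```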
This is an extreme example of small numbers, but it captures the problem of using small samples, or in the limit, anecdotes, to try to estimate quantities. There's just not enough information in the data.
This is the numerical version of the old medicine research paper joke: "one-third of the sample showed marked improvement; one-third of the sample showed no change; and the third rat died."
Increasing sample size makes for better information, but can also exacerbate the effect of a few errors:
Note that the number of errors necessary to get the "wrong" estimate goes up: 1 (+1/2), 3, 6.
Context! Numbers need to be in context!
I'm looking at this pic and asking myself: what is the unconditional death rate for each of these categories? That is, if you're 80 today in China, how likely is it that you don't reach March 15, 2021, by all causes?
Because that'd be relevant context, I think.
Estimates vs decisions: why some smart people did the wrong thing regarding Covid-19
On a side note, while some people choose to lock themselves at home for social distancing, I prefer to find places outdoors where there's no one else. For example: a hike on the Eastern span of the Bay Bridge, where I was the only person on the 3.5 km length of the bridge (the only person on the pedestrian/bike path, that is).
How "Busted!" videos corrupt formerly-good YouTube channels
Recently saw a "Busted!" video from someone I used to respect and another based on it from someone I didn't; I feel stupider for having watched the videos, even though I did it to check on a theory. (Both channels complain about demonetization repeatedly.) The theory:
Many of these "Busted!" videos betray a lack of understanding (or fake a lack of understanding for video-making reasons) of how the new product/new technology development process goes; they look at lab rigs or technology demonstrations and point out shortcomings of these rigs as end products. For illustration, here's a common problem (the opposite problem) with media portrayal of these innovations:
It's not difficult to "Bust!" media nonsense, but what these "Busted!" videos do is ascribe the media nonsense to the product/technology designers or researchers, to generate views, comments, and Patreon donations. This is somewhere between ignorance/laziness and outright dishonesty.
In the name of "loving science," no less!
Johns Hopkins visualization makes pandemic look worse than it is
Not to go all Edward Tufte on Johns Hopkins, but the size of the bubbles on this site makes the epidemic look much worse than it is: Spain, France, and Germany are completely covered by bubbles, while their cases, as a fraction of the population, are:
- 0.0167 % for Spain
- 0.0070 % for Germany
- 0.0067 % for France
Cumulative numbers increase; journalists flabbergasted!
At some point someone should explain to journalists that cumulative deaths always go up; it's part of the definition of the word "cumulative." Then again, maybe that's too quantitative for some people who think all numbers ending in "illions" are the same scale.
Stanford Graduate School of Education ad perpetuates stereotypes about schools of education
If this is real, then someone at Stanford needs to put their ad agency "in review." (Ad world-speak for "fired with prejudice.")
Never give up; never surrender.
- - - - -
* All.
Sunday, March 1, 2020
Fun with COVID-19 Numbers for March 1, 2020
NOTA BENE: The Coronavirus COVID-2019 is a serious matter and we should be taking all reasonable precautions to minimize contagion and stay healthy. But there's a lot of bad quantitative thinking that's muddling the issue, so I'm collecting some of it here.
Death Rate I: We can't tell, there's no good data yet.
This was inspired by a tweet by Ted Naiman, MD, whose Protein-to-Energy ratio analysis of food I credit for at least half of my weight loss (the other half I credit P. D. Mangan, for the clearest argument for intermittent fasting, which convinced me); so this is not about Dr Naiman's tweet, just that his was the tweet I saw with a variation of this proposition:
"COVID-19 is 'like the flu,' except the death rate is 30 to 50 times higher."
But here's the problem with that proposition: we don't have reliable data to determine that. Here are two simple arguments that cast some doubt on the proposition:
⬆︎ How the death rate could be higher: government officials and health organizations under-report the number of deaths in order to contain panic or to minimize criticism of government and health organizations; also possible that some deaths from COVID-19 are attributed to conditions that were aggravated by COVID-19, for example being reported as deaths from pneumonia.
⬇︎ How the death rate could be lower: people with mild cases of COVID-19 don't report them and treat themselves with over-the-counter medication (to avoid getting taken into forced quarantine, for example), hence there's a bias in the cases known to the health organizations, towards more serious cases, which are more likely to die.
How much we believe the first argument applies depends on how much we trust the institutions of the countries reporting, and... you can draw your own conclusions!
To illustrate the second argument, consider the incentives of someone with flu-like symptoms and let's rate their seriousness or aversiveness, $a$, as a continuous variable ranging from zero (no symptoms) to infinity (death). We'll assume that the distribution of $a$ is an exponential, to capture thin tails, and to be simple let's make its parameter $\lambda =1$.
Each sick patient will have to decide whether to seek treatment other than over-the-counter medicine, but depending on the health system that might come with a cost (being quarantined at home, being quarantined in "sick wards," for example); let's call that cost, in the same scale of aversiveness, $c$.
What we care about is how the average aversiveness that is reported changes with $c$. Note that if everyone reported their $a$, that average would be $1/\lambda = 1$, but what we observe is a self-selected subset, so we need $E[a | a > c]$, which we can compute easily, given the exponential distribution, as
\[
E[a | a > c]
=
\frac{\int_{c}^{\infty} a \, f_A(a) da }{1 - F_A(c)}
=
\frac{\left[ - \exp(-a)(a+1)\right]^{\infty}_{c}}{\exp(-c)}
= c + 1
\]
Note that the probability of being reported is $\Pr(a>c) = \exp(-c)$, so as the cost of reporting goes up, a vanishingly small percentage of cases are reported, but their severity increases [linearly, but that's an artifact of the simple exponential] with the cost. That's the self-selection bias in the second argument above.
A plot for $c$ between zero (everyone reports their problems) and 5 (the cost of reporting is so high that only the sickest 0.67% risk reporting their symptoms to the authorities):
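Here's a minimal R sketch of that plot, using the closed-form results above with $\lambda = 1$ (it reproduces the idea, not necessarily the original chart):

```r
# Self-selection with exponential(1) aversiveness:
# mean severity among reported cases E[a | a > c] = c + 1, reporting probability exp(-c)
c_grid <- seq(0, 5, by = 0.01)
reported_mean  <- c_grid + 1
reported_share <- exp(-c_grid)

par(mfrow = c(1, 2))
plot(c_grid, reported_mean, type = "l",
     xlab = "cost of reporting, c", ylab = "mean severity among reported cases")
abline(h = 1, lty = 2)   # population mean severity stays at 1 throughout
plot(c_grid, reported_share, type = "l",
     xlab = "cost of reporting, c", ylab = "fraction of cases reported")
```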
Remember that for all cases in this plot the average aversiveness/seriousness doesn't change: it's fixed at 1, and everyone has the disease, with around 63% of the population having less than the average aversiveness/seriousness. But, if the cost of reporting is, for example, equal to twice the aversiveness of the average (in other words, people dislike being put in involuntary quarantine twice as much as they dislike the symptoms of the average seriousness of the disease), only the sickest 13.5% of people will look for help from the authorities/health organizations, who will report a seriousness of 3 (three times the average seriousness of the disease in the general population).*
With mixed incentives for all parties involved, it's difficult to trust the current reported numbers.
Death Rate II: Using the data from the Diamond Princess cruise ship.
A second endemic problem is arguing about small differences in the death rate, based on small data sets. Many of these differences are indistinguishable statistically, and to be nice to all flavors of statistical testing we're going to compute likelihood ratios, not rely on simple point estimate tests.
The Diamond Princess cruise ship is as close as one gets to a laboratory experiment in COVID-19, but there's a small-numbers problem: we'll only get good estimates when we have large-scale, high-quality data. Thanks to @Clarksterh on Twitter for the idea.
Using data from Wikipedia for Feb 20, there were 634 confirmed infections (328 asymptomatic) aboard the Diamond Princess and as of Feb 28 there were 6 deaths among those infections. The death rate is 6/634 = 0.0095.
(The ship's population isn't representative of the general population, being older and richer, but that's not what's at stake here. This is about fixating on the point estimates and small differences thereof. There's also a delay between the diagnosis and the death, so these numbers might be off by a factor of two or three.)
What we're doing now: using $d$ as the death rate, $d = 0.0095$ is the maximum likelihood estimate, so it will give the highest probability for the data, $\Pr(\text{6 dead out of 634} | d = 0.0095)$. Below, we calculate and plot the likelihood ratio between that probability and the computed probability of the data for other candidate death rates, $d_i$.**
\[LR(d_i) = \frac{\Pr(\text{6 dead out of 634} | d = 0.0095)}{\Pr(\text{6 dead out of 634} | d = d_i)}\]
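A minimal R sketch of that computation, done in log space for numerical accuracy (the combinatorial term cancels in the ratio, as noted in the footnote below):

```r
# Likelihood ratio of the MLE death rate vs other candidate rates,
# for 6 deaths out of 634 confirmed infections
n <- 634; k <- 6
d_mle <- k / n                                            # ~0.0095

lr <- function(d_i) exp(dbinom(k, n, d_mle, log = TRUE) -
                        dbinom(k, n, d_i,  log = TRUE))

d_grid <- seq(0.001, 0.03, by = 0.0005)
plot(d_grid, lr(d_grid), type = "l", log = "y",
     xlab = "candidate death rate", ylab = "LR vs MLE")
lr(0.001)   # seasonal-flu-like death rate
lr(0.03)    # the often-quoted 3% death rate
```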
We can't reject any rates between 0.5% and 1.5% with any confidence (okay, some people using single-sided point tests with marginal significance might narrow that a bit, but let's not rehash old fights here), and that's a three-fold range. And there are still a lot of issues with the data.
On the other hand...
It's easy to see that the COVID-19 death rate is much higher than that of the seasonal flu (0.1%): using the data from the Diamond Princess, the $LR(0.001) = 3434.22$, which should satisfy both the most strong-headed frequentists and Bayesians that these two rates are different. Note that $LR(0.03) = 510.01$, which also shows that with the data above the Diamond Princess invalidates the 3% death rate. (Again, noting that the numbers might be off by a factor of two or three in either direction due to the delay in diagnosing the infection and between diagnosis and recovery or death.)
As with most of these analyses, disaggregate clinical data will be necessary to establish these rates, which we're estimating from much less reliable [aggregate] epidemiological data.
Stay safe: wash hands, don't touch your face, avoid unnecessary contact with other people.
- - - - -
* A friend pointed out that there are some countries or subcultures where hypochondria is endemic and that would lead to underestimation of the seriousness of the disease; this model ignores that, but anecdotally I've met people who get doctor's appointments because they have DOMS and want the doctor to reassure them that it's normal, prescribe painkillers and anti-inflammatories, and other borderline psychotic behavior...
** We're just computing the binomial here, no assumptions beyond that:
$\Pr(\text{6 dead out of 634} | d = d_i) = C(634,6) \, d_i^6 (1-d_i)^{628}$,
and since we use a ratio the big annoying combinatorials cancel out.
Wednesday, February 12, 2020
Contagion, coronavirus, and charlatans
This post is an illustration of a simple epidemiological model and why some of the ad-hoc modeling of coronavirus that some charlatans are spreading on social media platforms is a nonsensical distraction.
A simple model for infectious diseases, the SIR-1 model (also known as the Kermack-McKendrick model), is too simple for the coronavirus, but contains some of the basic behavior of any epidemic.
Math of contagion: the SIR-1 model
The model uses a fixed population, with no deaths, no natural immunity, no latent period for the disease (when a person is exposed but not infectious; not to be mistaken for what happens with the coronavirus, where people are infectious but asymptomatic), and a simple topology (the population is in a single homogeneous pool, instead of different cities and countries sparsely connected).
There are three states that a given individual can be in: susceptible (fraction of the population in this state represented by $S$), infectious (fraction represented by $I$), and recovered (fraction represented by $R$); recovered means immune, so there isn't recurrence of an infection.
There are two parameters: $\beta$, the contagiousness of the disease, and $\gamma$, the recovery rate. To illustrate using discretized time, $\beta= 0.06$ means that any infectious individual has a 6% chance of infecting another individual in the next period (say, a day); $\gamma= 0.03$ means that any infectious individual has a 3% chance of recovering in the next period.
The dynamics of the model are described by three differential equations:
$\dot S = - \beta S I$;
$\dot I = (\beta S - \gamma) I$;
$\dot R = \gamma I$.
The ratio $R_0 = \beta/\gamma$ is critical to the behavior of an epidemic: if lower than one, the infection dies off without noticeable expansion, if much higher than one, it becomes a large epidemic.
There is no analytic solution to the differential equations, but they're easy enough to simulate and to fit data to. Here are some results for a discretized, 200-period simulation for some values of the parameters $(\beta, \gamma)$, starting with an initial infected population of 1%.
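Here's a minimal R sketch of such a discretized simulation; this is my quick translation of the equations above, not the original code, and the plotting details are assumptions:

```r
# Discretized SIR-1 simulation: S, I, R as fractions of a fixed population
sir_sim <- function(beta, gamma, i0 = 0.01, periods = 200) {
  S <- numeric(periods); I <- numeric(periods); R <- numeric(periods)
  S[1] <- 1 - i0; I[1] <- i0; R[1] <- 0
  for (t in 2:periods) {
    new_inf <- beta  * S[t - 1] * I[t - 1]   # dot S = -beta S I
    new_rec <- gamma * I[t - 1]              # dot R = gamma I
    S[t] <- S[t - 1] - new_inf
    I[t] <- I[t - 1] + new_inf - new_rec
    R[t] <- R[t - 1] + new_rec
  }
  data.frame(t = 1:periods, S = S, I = I, R = R)
}

# example: R0 = beta/gamma = 2
out <- sir_sim(beta = 0.06, gamma = 0.03)
matplot(out$t, as.matrix(out[, c("S", "I", "R")]), type = "l", lty = 1,
        xlab = "period", ylab = "fraction of population")
legend("right", legend = c("S", "I", "R"), col = 1:3, lty = 1)
```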
First, a model with an $R_0=2$, illustrating the three processes:
Note that although a large percentage of the population is eventually infected (if we continue to run the model, the cumulative infected fraction converges to the epidemic's final size, roughly 80% for $R_0 = 2$, not 100%), the number of people infectious at a given time (and presumably also feeling the symptoms of the disease) is much lower, and this is a very important metric, as the number of people sick at a given time determines how effectively health providers can deal with the disease.
Next, a model of runaway epidemic (the $R_0 = 24$ is beyond any epidemic I've known; used here only to make the point in a short 200 periods):
In this case, the number of sick people grows very fast, which makes it difficult for the health system to cope with the disease, plus the absence of the sick people from the workforce leads to second-order problems, including stalled production, insufficient logistics to distribute needed supplies, and lack of services and support for necessary infrastructure.
Finally, a model closer to non-epidemic diseases, like the seasonal flu (as opposed to epidemic flu), though the $(\beta,\gamma)$ are too high for that disease; this was necessary for presentation purposes, in order to make the 200-period chart more than three flat lines.
Note how low the number of people infected at any given time is, which is why these things tend to die off instead of growing into epidemics: once people start taking precautions, $\beta$ drops below $\gamma$, which makes $R_0 < 1$, the condition for the disease to eventually die off.
The problem with estimating ad-hoc models
One of the problems with ignoring the elements of these epidemiological models and calibrating statistical models on early data can be seen when we take the first example above ($\beta=0.06,\gamma=0.03$) and use the first 50 data points to calibrate a statistical model for forecasting the evolution of the epidemic:
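As a rough sketch of that failure mode, reusing the sir_sim function from the previous sketch, and assuming, for illustration only, that the "statistical model" is a simple exponential-growth fit (the post doesn't say which model was actually calibrated):

```r
# Fit a naive exponential-growth model to the first 50 periods of the
# beta = 0.06, gamma = 0.03 run, then extrapolate out to period 200
out   <- sir_sim(beta = 0.06, gamma = 0.03)
cases <- 1 - out$S                      # cumulative infected fraction
early <- 1:50

fit  <- lm(log(cases[early]) ~ early)   # log-linear, i.e. exponential growth
pred <- exp(predict(fit, newdata = data.frame(early = 1:200)))

plot(out$t, cases, type = "l", ylim = c(0, 1.05),
     xlab = "period", ylab = "cumulative infected fraction")
lines(1:200, pmin(pred, 1.05), lty = 2)  # the early-data fit blows straight past 100%
```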
As a general rule of thumb, models for processes that follow a S-shaped curve are extremely difficult to calibrate on early data; any data set that doesn't extend at least some periods into the concave region of the model is going to be of questionable value, especially if there are errors in measurement (as is always the case).
Consider that the failure of that estimation is for the simplest model (SIR-1), without the complexities of topology (multiple populations in different locations, each with a $(\beta,\gamma)$ of their own, connected by a network of transportation with different levels of quarantine and preventative measures, etc.), possible obfuscation of some data due to political concerns, misdiagnosis and under-reporting due to latency, changes to the $\beta$ and $\gamma$ as people's behavior adapts and health services adapt, and many other complications of a real-world epidemic including second-order effects on health services and essential infrastructure, which change people's behavior as well.
No, that forecasting error comes simply from that rule of thumb, that until the process passes the inflection point, it's almost certain that estimates based on aggregate numbers (as opposed to clinical measures of $\beta$ and $\gamma$, based on analysis of clinical cases; these are what epidemiologists use, by the way) will give nonsensical predictions.
But those nonsensical predictions get retweets, YouTube video views, and SuperChat money.
Friday, November 15, 2019
Fun with numbers for November 15, 2019
How many test rigs for a successful product at scale?
This is a general comment on how new technologies are presented in the media: usually something that is either a laboratory test rig or at best a proof-of-concept technology demonstration is hailed as a revolutionary product ready to take the world and be deployed at scale.
Consider how many test rigs "a lot of" turns out to be, as a function of the success probabilities at each stage:
Yep, notwithstanding all good intentions in the world, there's a lot of work to be done behind the scenes before a test rig becomes a product at scale, and many of the candidates are eliminated along the way.
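A back-of-the-envelope sketch of that count, assuming independent stages; the stage names and probabilities below are made up for illustration, since the post's actual figures were in the chart:

```r
# Expected number of test rigs per product at scale, assuming independent
# success probabilities at each stage (illustrative numbers only)
p_stage <- c(rig_to_prototype     = 0.25,
             prototype_to_product = 0.30,
             product_to_scale     = 0.20)

rigs_needed <- 1 / prod(p_stage)   # expected rigs per successful product at scale
rigs_needed                        # 1 / (0.25 * 0.30 * 0.20) ~ 67
```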
Recreational math: statistics of the maximum draw of N random variables
At the end of a day of mathematical coding, and since RStudio was already open (it almost always is), I decided to check whether running 1000 iterations versus 10000 iterations of simulated maxima (drawing N samples from a standard normal distribution and computing the maximum, repeated either 1000 times or 10000 times) makes a difference. (Yes, an elaboration on the third part of this blog post.)
Turns out, not a lot of difference:
Workflow: BBEdit (IMNSHO the best editor for coding) --> RStudio --> Numbers (for pretty tables) --> Keynote (for layout); yes, I'm sure there's an R package that does layouts, but this workflow is WYSIWYG.
The R code is basically two nested for-loops, the built-in functions max and rnorm doing all the heavy lifting.
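Something along these lines; this is a minimal reconstruction from the description, not the original code:

```r
# Simulated maxima: draw N standard normals, take the max, repeat n_iter times
set.seed(1)
N_draws <- 10000
for (n_iter in c(1000, 10000)) {
  maxima <- numeric(n_iter)
  for (i in 1:n_iter) {
    maxima[i] <- max(rnorm(N_draws))   # max and rnorm do the heavy lifting
  }
  cat(sprintf("iterations = %5d  mean of maxima = %.3f  sd = %.3f\n",
              n_iter, mean(maxima), sd(maxima)))
}
```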
Added later: since I already had the program parameterized, I decided to run a 100,000 iteration simulation to see what happens. Turns out, almost nothing worth noting:
Adding a couple of extra lines of code, we can iterate over the number of iterations, so for now here's a summary of the preliminary results (to be continued later, possibly):
And a couple of even longer simulations (all for the maximum of 10,000 draws):
Just for fun, the theoretical probability that the maximum of $N$ independent standard normal draws (powers of ten for $N$ in this example) is greater than some given $x$ is
\[
\Pr\left(\max_{i \le N} x_i > x\right) = 1 - \Phi(x)^N .
\]
More fun with Solar Roadways
Via EEVblog on twitter, the gift that keeps on giving:
This Solar Roadways installation is in Sandpoint, ID (48°N). Solar Roadways claims its panels can be used to clear the roads by melting the snow… so let's do a little recreational numerical thermodynamics, like one does.
Average solar radiation level for Idaho in November: 3.48 kWh per m$^2$ per day or 145 W/m$^2$ average power. (This is solar radiation, not electrical output. But we'll assume that Solar Roadways has perfectly efficient solar panels, for now.)
Density of fallen snow (lowest estimate, much lower than fresh powder): 50 kg/m$^3$ via the University of British Columbia.
Energy needed to melt 1 cm of snowfall (per m$^2$): 50 [kg/m^3] $\times$ 0.01 [m/cm] $\times$ 334 [kJ/kg] (enthalpy of fusion for water) = 167 kJ/m$^2$ ignoring the energy necessary to raise the temperature, as it's usually much lower than the enthalpy of fusion (at 1 atmosphere and 0°C, the enthalpy of fusion of water is equal to the energy needed to raise the temperature of the resulting liquid water to approximately 80°C).
So, with perfect solar panels and perfect heating elements, in fact with no energy loss anywhere whatsoever, Solar Roadways could deal with a snowfall of 3.1 cm per hour (= 145 $\times$ 3600 / 167,000) as long as the panel and surroundings (and snow) were at 0°C.
Just multiply that 3.1 cm/hr by the efficiency coefficient to get more realistic estimates. Remember that the snow, the panels, and the surroundings have to be at 0°C for these numbers to work. Colder doesn't just make it harder; small changes can make it impossible (because the energy doesn't go into the snow, it goes into the surrounding area instead).
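The same arithmetic in a few lines of R, with an efficiency knob to play with; the 15% figure below is just a guess, not a measured value:

```r
# Snow-melt capacity of a perfect solar panel at 0 degrees C
solar_power  <- 145      # W/m^2, average solar radiation for Idaho in November
snow_density <- 50       # kg/m^3, low-end estimate for fallen snow
fusion       <- 334e3    # J/kg, enthalpy of fusion of water

melt_rate_cm_per_hr <- function(efficiency = 1) {
  energy_per_cm <- snow_density * 0.01 * fusion       # J per m^2 per cm of snow
  efficiency * solar_power * 3600 / energy_per_cm     # cm of snowfall per hour
}

melt_rate_cm_per_hr(1)     # ~3.1 cm/hr with no losses anywhere
melt_rate_cm_per_hr(0.15)  # a guessed, more realistic end-to-end efficiency
```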
Another week, another Rotten Tomatoes vignette
This time for the movie Midway (the 2019 movie, not the 1976 classic Midway):
Critics and audience are 411,408,053,038,500,000 (411 quadrillion) times more likely to use opposite criteria than same criteria.
Recap of model: each individual has a probability $\theta_i$ of liking the movie/show; we simplify by having only two possible cases, critics and audience using the same $\theta_0$ or critics using a $\theta_1$ and audience using a $\theta_A = 1-\theta_1$. We estimate both cases using the four numbers above (percentages and number of critics and audience members), then compute a likelihood ratio of the probability of those ratings under $\theta_0$ and $\theta_1$. That's where the 411 quadrillion times comes from: the probability of a model using $\theta_1$ generating those four numbers is 411 quadrillion times the probability of a model using $\theta_0$ generating those four numbers. (Numerical note: for accuracy, the computations are made in log-space.)
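A minimal R sketch of that computation; the counts and percentages below are placeholders, not the actual Midway ratings (those were in the image):

```r
# Log-space likelihood ratio for the two models: critics and audience sharing
# one theta_0, vs critics at theta_1 and audience at 1 - theta_1
lr_opposite_vs_same <- function(n_crit, k_crit, n_aud, k_aud) {
  # same-criteria model: pooled MLE for the common theta_0
  theta0 <- (k_crit + k_aud) / (n_crit + n_aud)
  ll_same <- dbinom(k_crit, n_crit, theta0, log = TRUE) +
             dbinom(k_aud,  n_aud,  theta0, log = TRUE)
  # opposite-criteria model: MLE of theta_1 under the constraint theta_A = 1 - theta_1
  theta1 <- (k_crit + (n_aud - k_aud)) / (n_crit + n_aud)
  ll_opp <- dbinom(k_crit, n_crit, theta1,     log = TRUE) +
            dbinom(k_aud,  n_aud,  1 - theta1, log = TRUE)
  exp(ll_opp - ll_same)   # how many times more likely "opposite" is than "same"
}

# placeholder example: 100 critics at 42% positive, 5000 audience members at 92%
lr_opposite_vs_same(n_crit = 100, k_crit = 42, n_aud = 5000, k_aud = 4600)
```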
Google gets fined and YouTubers get new rules
Via EEVBlog's EEVblab #67, we learn that due to non-compliance with COPPA, YouTube got fined 170 million dollars and had to change some rules for content (having to do with children-targeted videos):
Backgrounder from The Verge here; or directly from the FTC: "Google and YouTube Will Pay Record $170 Million for Alleged Violations of Children’s Privacy Law." (Yes, technically it's Alphabet now, but like Boaty McBoatface, the name everyone knows is Google. Even the FTC uses it.)
According to Statista: "In the most recently reported fiscal year, Google's revenue amounted to 136.22 billion US dollars. Google's revenue is largely made up by advertising revenue, which amounted to 116 billion US dollars in 2018."
170 MM / 136,220 MM = 0.125 %
2018 had 31,536,000 seconds, so that 170 MM corresponds to 10 hours, 57 minutes of revenue for Google.
Here's a handy visualization:
Still not as egregious as Facebook's toy fine for its part in the Cambridge Analytica data mining snafu:
Engineering, the key to success in sporting activities
Bowling 2.0 (some might call it cheating, I call it winning via superior technology) via Mark Rober:
I'd like a tool wall like his but it doesn't go with minimalism.
No numbers: recommendation success but product design fail.
Nerdy, pro-engineering products are a good choice for Amazon to recommend to me, but unfortunately many of them suffer from a visual form of "The Igon Value Problem."
Wednesday, October 2, 2019
When nutrition and fitness studies attempt science with naive statistics
A little statistics knowledge is a dangerous thing.
(Inspired by an argument on Twitter about a paper on intermittent fasting, which exposes the problem of blind trust in "studies" when such studies are held to lower statistical standards than market research has used since at least the 70s.)
Given an hypothesis, say "people using intermittent fasting lose weight faster than controls even when calories are equated," any market researcher worth their bonus and company Maserati would design a within-subjects experiment. (For what it's worth, here's a doctor suggesting within-subject experiments on muscle development.)
Alas, market researchers aren't doing fitness and nutrition studies, mostly because market researchers like money and marketing is where the market research money is (also, politics, which is basically marketing).
So, these fitness and nutrition studies tend to be between-subjects: take a bunch of people, assign them to control and treatment groups, track some variables, do some first-year undergraduate statistics, publish paper, get into fights on Twitter.
What's wrong with that?
People's responses to treatments aren't all the same, so the variance of those responses, alone, can make effects that exist at the individual level disappear when aggregated by naive statistics.
Huh?
If everyone loses weight faster on intermittent fasting, but some people just lose it a little bit faster and some people lose it a lot faster, that difference in response (to fasting) will end up making the statistics look like there's no effect. And what's worse, the bigger the differences between different people in the treatment group, the more likely the result is to be non-significant.
Warning: minor math ahead.
Let's say there are two conditions, control and treatment, $C$ and $T$. For simplicity there are two segments of the population: those who have a strong response $S$ and those who have a weak response $W$ to the treatment. Let the fraction of $W$ be represented by $w \in [0,1]$.
Our effect is measured by a random variable $x$, which is a function of the type and the condition. We start with the simplest case, no effect for anyone in the control condition:
$x_i(S,C) = x_i(W,C) = 0$.
By doing this our statistical test becomes a simple t-test of the treatment condition and we can safely ignore the control subsample.
For the treatment conditions, we'll consider that the $W$ part of the population has a baseline effect normalized to 1,
$x_i(W,T) = 1$.
Yes, no randomness. We're building the most favorable case to detect the effect and will show that population heterogeneity alone can hide that effect.
We'll consider that the $S$ part of the population has an effect size that is a multiple of the baseline, $M$,
$x_i(S,T) = M$.
Note that with any number of test subjects, if the populations were tested separately the effect would be significant, as there's no error. We could add some random factors, but that would only complicate the point, which is that even in the most favorable case (no error, both populations show a positive effect), the heterogeneity in the population hides the effect.
(If you slept through your probability course in college, skip to the picture.)
If our experiment has $N$ subjects in the treatment condition, the expected effect size is
$\bar x = w + (1-w) M$
with a standard error (the standard deviation of the sample mean) of
$\sigma_{\bar x} = (M-1) \,\sqrt{\frac{w(1-w)}{N}} $.
(Note that because we actually know the mean, this being a probabilistic model rather than a statistical estimation, we see $N$ where most people would expect $N-1$.)
So, the test statistic is
$t = \bar x/\sigma_{\bar x} = \frac{w + (1-w) M}{(M-1) \,\sqrt{\frac{w(1-w)}{N}}}$.
It may look complicated, but it's basically a three parameter analytical function, so we can easily see what happens to significance with different $w,M,N$, which is our objective.
Because we're using a probabilistic model where all quantities are known, the test statistic is distributed Normal(0,1), so the critical value for, say, 0.95 confidence, single-sided, is given by $\Phi^{-1}(0.95) = 1.645$.
To start simply, let's fix $N= 20$ (say a convenience sample of undergraduates, assuming a class size of 40 and half of them in the control group). Now we can plot $t$ as a function of $M$ and $w$:
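A minimal R sketch of that plot (contour lines rather than whatever the original chart used; the grids for $w$ and $M$ are assumptions):

```r
# t statistic for the heterogeneous-response model, N = 20 treatment subjects
t_stat <- function(w, M, N = 20) {
  (w + (1 - w) * M) / ((M - 1) * sqrt(w * (1 - w) / N))
}

w_grid <- seq(0.05, 0.95, by = 0.01)
M_grid <- seq(2, 50, by = 0.5)
t_surface <- outer(w_grid, M_grid, t_stat)

# the region below the one-sided 95% critical value 1.645 is "non-significant"
contour(w_grid, M_grid, t_surface, levels = c(1.645, 2, 5, 10),
        xlab = "fraction of weak responders, w", ylab = "effect multiple, M")
```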
(The seemingly-high magnitudes of $M$ and $w$ are an artifact of not having any randomness in the model. We wanted this to be simple, so that's the trade-off.)
Recall that in our model both sub-populations respond to the treatment and there's no randomness in that response. And yet, for a small enough fraction of the $S$ population and a large enough multiplier effect $M$, our super-simple, extremely-favorable model shows non-significant effects using a single-sided test (the most favorable test, and we're using the lowest acceptable significance for most journals, 95%, also most favorable choice).
Let's be clear what that "non-significant effects" means: it means that a naive statistician would look at the results and say that the treatment shows no difference from the control, in the words of our example, that people using intermittent fasting don't lose weight faster than the controls.
This, even though everyone in our model loses weight faster when intermittent fasting.
Worse, the results are less and less significant the stronger the effect on the $S$ population relative to the $W$ population. In other words, the faster the weight loss of the highly-responsive subpopulation relative to the less-responsive subpopulation, when both are losing weight with intermittent fasting, the more the naive statistics shows intermittent fasting to be ineffectual at producing weight loss.
Market researchers have known about this problem for a very long time. Nutrition and fitness practices (can't bring myself to call them sciences) are now repeating errors from the 50s-60s.
That's not groovy!
Monday, September 23, 2019
Weight loss, moving averages, and cheating.
If you don't measure it, you can't manage it. — Old management consulting saying.
Like most people who do HIT, I track all my gym data; unlike most people who do HIT, I use fancy-pants (technical term) algorithms to analyse it. (Hey, if you gots thems, you uses thems.)
I also weigh myself almost every morning, and log the weight. But since I noticed clothes getting looser, I didn't actually analyse the weight data. There it was, in a nice .csv file, and nothing ever done to it. Not even a chart.
To be fair, when you lose 22 percent of your bodyweight in 7 months, you don't really need a chart… What?! Who are you and what have you done with Jose Silva?? Data without analysis is just wasted bits!
So, I decided to see if there was any insight there. And, lo and behold, there was.
The problem with bodyweight data is that it's too noisy. Small things, like whether you've been to the throne prior to weighing or had a few gallons of caffeine delivery liquids earlier can make your weight vary a lot. And bathroom scales, even digital ones, can be quite a bit off depending on how your weight is distributed over the platform.
One way to deal with any noisy sequence data is to take moving averages: by averaging a data point with its neighbors, idiosyncratic noise is cancelled out and only information stays. Of course that's only correct if the spectrum of the noise fits within the bandwidth of the averaging. So we simplify (in less engineering-centric terms: cheat) and filter the filtered data, or in our case, average the averages.
An average of this type in a time series is called a moving average; say we want to make a five-datapoint moving average, called MA(5). We start by averaging data points 1 through 5, that's our first result; then 2 through 6, for the second, etc. This creates a new time series, the first-order moving average MA(5). Then we take this new time series and use the same process again, which creates a second-order moving average.
(We could choose a different number of datapoints to average, but I'm staying with 5, a number I pulled out of my ars… a number that my old and trustworthy time series textbook uses as "a good starting point." Note also that these averages may be weighted averages to express a specific filtering function. We'll be using simple, or unweighted, averages.*)
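A minimal R sketch of the smoothing step, using stats::filter for the averaging; the weight series below is fake data standing in for the real log:

```r
# Second-order moving average: a 5-point moving average applied twice
ma5 <- function(x) as.numeric(stats::filter(x, rep(1/5, 5), sides = 2))

# fake daily weight log standing in for the real .csv (trend plus noise)
w <- 100 - 0.05 * (1:200) + rnorm(200, sd = 0.6)

smoothed <- ma5(ma5(w))        # MA(5,5); NAs appear at the ends of the series
plot(w, type = "p", pch = 20, cex = 0.5, xlab = "day", ylab = "weight (kg)")
lines(smoothed, lwd = 2)
```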
For privacy, and because it's actually useful as we'll soon see, we'll plot the data normalized to the time period: in other words, the heaviest weight will map to 1 and the lightest weight to 0:
The dashed line is a linear model fit to the data in the chart (that is the second-order moving average) and it can be used as a reference for analysis. And that reference shows quite clearly that things went a bit wrong around July.
July, July… yep: due to social pressure and a few events where I had to be present, carbs were consumed, or as I like to call it, cheating on the eating rules** was potentiated by outside events. (Always blame the environment.)
Now, the use of normalized results shines when we use the quartiles: how long it took to lose the first 25 percent of the weight, the second, the third, etc:
And this shows what the cost of cheating was: lost time in the weight loss progression. Because of the cheating, there was a delay of about seven weeks to return to the trend.
As a result, I just threw out my last rice cakes. What am I doing eating carbs? If the general idea underlying fat loss is to have the body burn it for energy, there's no [nutrition] point in eating carbs, whose only nutrition function is as energy.
I didn't know about those lost seven weeks until I ran this simple data analysis. Now I adhere to my eating rules more closely, seeing what a waste of time it was to cheat. Quantitatively, not qualitatively.
Data without analysis is just a waste of bits.
- - - - - - -
* Since we're using unweighted averages, we could have simply turned that second-order process into a first-order weighted process. (You can always do this, minus the "simply" in the first sentence.) It's easy to see that the second-order MA(5,5) above is equivalent to a first-order MA(9) with weights
\[
[0.04,0.08,0.12,0.16,0.2,0.16,0.12,0.08,0.04],
\] which is a triangular smoothing function. If we wanted to be fancy about it, we'd use Gaussian kernels with support over the full data set and variable bandwidth, and tweak said bandwidth just enough to get rid of unsightly noise in the data. Yes, by eye… or using information criteria, if we were really really overthinking a simple weight loss analysis.
** The eating rules, derived from P.D. Mangan and Ted Naiman, MD:
1. Eat only when hungry, not peckish or bored. This usually means around 18 hours of fasting daily, for me. Sometimes I eat only one meal in a day (OMAD), though usually I eat two, one of which is a protein shake or Greek yogurt mixed with protein powder.
2. When hungry eat protein (in the culinary sense: meat, fish, eggs, mostly; Greek yogurt with whey protein supplements on occasion, protein shakes when 'needs must'), minimize added fat, and avoid all carbs. Don't count calories or macros or any other pretend-science metric; those only lead you astray.
3. Cheat only at Michelin-starred restaurants and only on someone else's expense account. (This rule is my personal addition. These occasional gourmet cheats are enough to keep life interesting, gastronomically speaking. As for the someone else's expense account, I use mine for important things, like computers, software, and books, not hospitality.)
These worked for me, because they solve the only problem that really matters in fat loss: they're easy to adhere to; note how I'm never hungry for long, as when I'm hungry — and only then — I eat.
Friday, August 9, 2019
Big Data without Serious Theory has a Big Problem
There's a common procedure to look for relationships in data, one that leads to problems when applied to "big data."
Let's say we want to check whether two variables are related: if changes in the value of one can be used to predict changes in the value of the other. There's a procedure for that:
Take the two variables, compute a metric called a correlation, check whether that correlation is above a threshold from a table. If the value is above the threshold, then say they're related, publish a paper, and have its results mangled by your institution's PR department and misrepresented by mass media. This leads to random people on social media ascribing the most nefarious of motivations to you, your team, your institution, and the international Communist conspiracy to sap and impurify all of our precious bodily fluids.
The threshold used depends on a number of things, the most visible of which is called the significance level. It's common in many human-related applications (social sciences, medicine, market research) to choose a 95% significance level.
At the 95% level, if we have 2 variables with significant correlation, the probability that that correlation is spurious, in other words that it comes from uncorrelated variables, is 5%.
(More precisely, that 95% means that if two uncorrelated variables were subjected to the computation that yields the correlation, the probability that the result would be above the threshold is 5%. But that's close enough, in the case of simple correlation, to saying that the probability of the correlation being spurious is 5%.)
The problem is when we have more variables.
If we have 3 variables, there are 3 possible pairs, so the probability of a spurious correlation is $1-0.95^3 = 0.143$.
If we have 4 variables, there are 6 possible pairs, so the probability of a spurious correlation is $1-0.95^6 = 0.265$.
If we have 5 variables, there are 10 possible pairs, so the probability of a spurious correlation is $1-0.95^{10} = 0.401$.
Let's pause for a moment and note that we just computed a 40% probability of a spurious correlation among 5 independent (non-correlated) variables. Five variables is nowhere near the giant datasets that go by the moniker "big data."
What about better significance? 99%, 99.5%? A little better, for small numbers, but even at 99.5%, all it takes is a set with 15 variables and we're back to a 40% probability of a spurious correlation. And these are not Big Data numbers, not by a long shot.
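(These numbers take a line of R to reproduce; choose(n, 2) counts the variable pairs:)
n_vars <- c(3, 4, 5, 15)
round(1 - 0.95^choose(n_vars, 2), 3)     # 0.143 0.265 0.401 0.995 at the 95% level
round(1 - 0.995^choose(15, 2), 3)        # 0.409, the 99.5%, 15-variable case above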
But it's okay, one would think, for there's a procedure in the statistics toolbox that has been developed specifically for avoiding over-confidence. It's called validation with a hold-out sample.
(That roughly 0.00% of published social science and medicine results (though not business or market research, huzzah!) use that procedure is a minor quibble that we shall ignore. It wouldn't make a difference in large sets, anyway.)
The idea is simple: we hold some of the data out (hence the "hold-out" sample, also known as the validation sample) and compute our correlations on the remaining data (called the calibration sample). Say we find that variables $A$ and $B$ are correlated in the calibration sample. Then we take the validation sample and determine whether $A$ and $B$ are correlated there as well.
If they are, and the correlation is similar to that of the calibration sample, we're a little more confident in the result; after all, if each of these correlations is 95% significant, then the probability of both together being spurious is $0.05^2 = 0.0025$.
(Note that that "similar" has entire books written about it, but for now let's just say it has to be in the same direction: $A$ and $B$ have to be positively correlated in both samples or negatively correlated in both, not positively in one and negatively in the other.)
Alas, as the number of variables increases, so does the number of possible spurious correlations. In fact it grows so fast that even very strict significance levels can leave us with, for example, ten expected spurious correlations.
And when you test each of those spurious correlations with a hold-out set, the probability of at least one appearing significant is not negligible. For example (explore the table to get an idea of how bad things get):
These numbers should scare us, as many of the results presented as the great advantages of big data for understanding the many dimensions of the human experience are drawn from sets with thousands of variables. And, as said above, almost none of them are ever validated with a hold-out sample.
Such results serve as good foundations for stories, of course; human brains are great machines for weaving narratives around correlations, spurious or not.
- - - - -
As an addendum, here are the expected number of spurious correlations and the probability that at least one of those correlations passes a validation test for some numbers of variables, as a function of the significance level. Just to drive the point home.
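(A back-of-the-envelope version of that table in R, under the simplifying assumption that each calibration-sample false positive has, independently, the same probability of clearing the threshold again in the validation sample:)
spurious_summary <- function(n_vars, alpha) {
  pairs             <- choose(n_vars, 2)
  expected_spurious <- pairs * alpha                       # false positives expected in calibration
  p_one_validates   <- 1 - (1 - alpha)^expected_spurious   # at least one also passes validation
  data.frame(n_vars, alpha, expected_spurious, p_one_validates)
}
spurious_summary(n_vars = c(10, 100, 1000), alpha = 0.005)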
The point that Big Data without Serious Theory has a Big Problem.
Labels: big data, Probability, statistics, theory
Sunday, August 4, 2019
A = B, B = C, but A ≠ C. Depending on N, of course!
Statistics are weird. But it all makes sense.
Let's say we have 3 variables and 600 measurements, in other words 600 data points with three dimensions or a 600-row matrix with three columns; or this chart:
These are simulated data, drawn from three Normal distributions with variance 1 and means $\mu_A = 0, \mu_B = 0.25,$ and $\mu_C = 0.5$. The distributions are:
There's considerable overlap between these distributions, so any single point is insufficient to determine any relationships between $A$, $B$, and $C$.
Let's say we want to have "99.9 percent confidence" in our assertions. What does that "99.9 percent confidence" mean? The statistical meaning is that there's at most 0.1 percent chance that if the data were generated by "the null hypothesis" (which in our case is that any two distributions are the same) we'd see a test statistic above the critical value.
Test statistic?! Critical value?!
Yes. A test statistic is something we'll compute from the data (a 'statistic') in order to test it. In our case it'll be the difference of the empirical, or sample, means divided by the standard error of those empirical means.
If that number, a summary of the difference between the samples, scaled to deal with the randomness, is greater than some value --- the critical value --- we trust that it's too big to be the result of the random variation. At least we trust it to be too big 99.9 percent of the time.
Okay, that statistics refresher aside, what can we tell from a sample of 25 points? Here are the distributions of those means:
Note how the distributions of the means are much narrower than the distributions of the data. (The means don't change.) That's the effect of averaging over 25 data points. The variance of the mean is $1/25$ and the standard deviation of the mean, called the standard error to avoid confusion with the standard deviation of the data, is $1/\sqrt{25}$.
Someone with a vague memory of a long-forgotten statistics class may recall seeing a $\sigma/\sqrt{n-1}$ in this context and argue that the 25 should be a 24. They'd be right if we were estimating the standard deviation from the sample data; but we're not. Our data is simulated, so we know $\sigma$ exactly (and therefore the standard error), and we're using that knowledge to simplify things like this. Another such simplification is the next item: the critical value.
Knowing the distributions lets us bypass one of the most overused jokes in statistics, the Student T ("how do you make it? Boil water, steep the student in it for 3 minutes"; student tea, get it?). More seriously, when the standard error is estimated from the sample data, the critical value is derived from a Student's T distribution; in our case we'll pick one derived from the Normal distribution, which has the advantage of not depending on the size of the sample (or, as it's called in statistics, the number of degrees of freedom in the estimation).
Now for the critical value. We're going to choose a single-sided test, so when we say $A \neq B$, we're really testing for $A < B$.
So how do we test whether some estimate of $(\mu_B - \mu_A)/(\sigma/\sqrt{n})$ is statistically greater than zero? We test the difference in empirical means, $M_B - M_A$, instead of $\mu_B - \mu_A$; since $M_A$ and $M_B$ are averages of Normal variables, their difference is a Normal variable; and dividing the difference by the standard error makes it a random variable that is normally distributed with mean zero and variance one, a standard Normal variable, usually denoted by $Z$, which is why this is sometimes called a z-test.
Observation: we're dividing the difference of the means by $\sigma/\sqrt{n}$; with $\sigma=1$, we're multiplying that difference by $\sqrt{n}$.
All we need to do now is determine whether $(M_B - M_A)\, \sqrt{25}$ is above or below the point $z^{*}$ in a standard Normal distribution where $F_{Z}(z^{*}) = .999$. That point is the critical value.
(If that scaled difference falls into the blue shaded area, then we can't reject the possibility that it was generated by randomness, instead of an actual difference, at the probability we selected; in the diagram that probability is 0.99, for our purposes in this post it will be 0.999.)
Thanks to the miracle of computers we no longer need to look up critical values in books of statistical tables, like the peasants of old. Using, for example, the inverse normal distribution function of Apple Numbers, we learn that $F_Z(z^*) = 0.999$ implies $z^{*} = 3.09$.
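(Or, in R, it's a one-liner:)
qnorm(0.999)     # 3.090232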
So, with a sample of 25 data points, what can we conclude?
Between $A$ and $B$ our test statistic is $0.25 \times 5 = 1.25$, well below 3.09. So, $A = B$.
Between $B$ and $C$ our test statistic is $0.25 \times 5 = 1.25$, well below 3.09. So, $B = C$.
Between $A$ and $C$ our test statistic is $0.5 \times 5 = 2.5$, still below 3.09. So, $A = C$.
That was a lot of work to prove what we could see from the picture with the distributions for the average of 25 points: too much overlap and no way to tell the three variables apart from a sample of 25 points.
Ah, but we're leveling up! 100 points. Here are the distributions for the averages of 100-point samples:
Between $A$ and $B$ our test statistic is $0.25 \times 10 = 2.5$, well below 3.09. So, $A = B$.
Between $B$ and $C$ our test statistic is $0.25 \times 10 = 2.5$, well below 3.09. So, $B = C$.
Between $A$ and $C$ our test statistic is $0.5 \times 10 = 5$, well above 3.09. So, $A \neq C$.
Now this is the weird case that gets people confused: $A = B$, $B = C$, but $A \neq C$! Equality is no longer transitive. And it does depend on $N$.
But wait, there's more. 300 more data points, and we get the 400 point case, with the following distributions:
Between $A$ and $B$ our test statistic is $0.25 \times 20 = 5$, well above 3.09. So, $A \neq B$.
Between $B$ and $C$ our test statistic is $0.25 \times 20 = 5$, well above 3.09. So, $B \neq C$.
Between $A$ and $C$ our test statistic is $0.5 \times 20 = 10$, well above 3.09. So, $A \neq C$.
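(A sketch of all nine tests in R, using the known sigma of 1 and the true mean gaps, 0.25 and 0.5, as stand-ins for the observed differences:)
z_star    <- qnorm(0.999)                          # critical value, about 3.09
mean_gaps <- c(A_vs_B = 0.25, B_vs_C = 0.25, A_vs_C = 0.50)
for (n in c(25, 100, 400)) {
  z <- mean_gaps * sqrt(n)                         # test statistics at this sample size
  print(z > z_star)                                # TRUE means "statistically different"
}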
Equality is again transitive. So there's only a small range of $N$ for which statistics are weird. (Not hard to figure out that range: consider it an entertaining puzzle.) This gets more complicated if those variables have different variances.
One has to be careful with drawing inferences about equality (real equality) from statistically non-significant differences. Especially when the data sets are small and the test values are close to the critical values.
Labels: Probability, statistics
Wednesday, July 3, 2019
From value to share (in math)
There are a few places online (and in some books) where a kind of magic happens. At some point we go from something like this
\[
v(x) = w_1 x_1 + \ldots + w_K x_K
\]
to, in a single magic jump,
\[
\Pr(\text{Buy } x) = \frac{\exp(v(x))}{1 + \exp(v(x))}.
\]
The $v(x) = w_1 x_1 + \ldots + w_K x_K$ is easy enough to understand: the value of some object $x$ is a weighted sum of the dimensions of $x$; there are $K$ dimensions $x_1, \ldots, x_K$ and associated weights, $w_1, \ldots, w_K$.
But then there's something along the lines of "using appropriate assumptions, the probability of a customer purchasing $x$ is given by" that fraction.
This magic jump appears in so many places discussing analytics, data science, big data, machine learning, artificial intelligence, and other fashionable ways to refer to statistics (because that's what's there) that we could be forgiven for suspecting that the people making the jump don't know where it comes from.
It's fine not to know where something comes from, as long as one is aware of that. But it's instructive to work through the simple cases, like this one. After all, a personal trainer needn't be a super-athlete, but should at least be able to exercise.
So, how do we go from the weighted sum to those exponentials (called the logit formula)?
First, that formula is only correct if, among other things: (a) the choice is between buying $x$ or nothing at all; and (b) the $v(\cdot)$ for buying nothing is set to be zero, denote that as $v(0) =0$. Okay, the (a) makes this a simple choice and (b) is just a threshold; to move it around we can always have a constant $x_0$ added to the $v(x)$.
Now, why isn't the probability either one or zero, then? After all, either $v(x) > 0 = v(0)$ and the customer should always buy or $v(x) < 0 = v(0)$ and the customer should never buy.
That's the second step: there are things that can change the $v(x)$ that we don't observe. Maybe the customer's mood changes and the $w_i$ change with it, for example. We don't know, so we say that the customer optimizes "utility" $u(x)$ instead of value $v(x)$, and the difference is a stochastic disturbance (expensive wording for random stuff) $\epsilon$:
\[
u(x) = v(x) - \epsilon.
\]
(Let's assume that $u(0) = v(0) = 0$. It's not necessary but makes things simpler.) This second step changes the nature of the problem: instead of a decision (buy or not), we can only predict a probability:
\[
\Pr(\text{Buy } x) = \Pr(u(x)>0) = \Pr(v(x)> \epsilon).
\]
Now we can compute that probability: it's just the cumulative distribution for the $\epsilon$ variable evaluated at the point $v(x)$, denoted $F_{\epsilon}(v(x))$.
Unfortunately, by their own nature the $\epsilon$ aren't known, so we need to assume a probability distribution for $\epsilon$, one that's acceptable and is computationally convenient.
We could try the Normal distribution, which most people accept without questioning, so it gets us the first desirable characteristic. Unfortunately the $\exp(-z^2/2)$ makes integrating it difficult. What we need is a distribution that looks more or less like a Normal but can be integrated into a manageable form.
Enter the logistic distribution.
It looks enough like a Normal (it has fat tails but it's hard to tell by looking). And its probability density function
\[
f_{\epsilon}(z) = \frac{\exp(- z)}{(1+\exp(-z))^2}
\]
integrates easily to
\[
F_{\epsilon}(v) = \int_{-\infty}^{v} f_{\epsilon}(z) \, dz = \frac{1}{1 + \exp(-v)} = \frac{\exp(v)}{1+\exp(v)},
\]
which, when replaced in the formula for probability of buy gives the logit formula:
\[
\Pr(\text{Buy } x) = \Pr(v(x)> \epsilon) = F_{\epsilon}(v(x)) = \frac{\exp(v(x))}{1+\exp(v(x))}.
\]
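(In R that CDF is the built-in plogis(), so the whole computation is a couple of lines; the weights and attributes below are made up for illustration:)
w <- c(0.8, -0.3, 1.1)      # hypothetical weights w_1 ... w_K
x <- c(1.0,  2.0, 0.5)      # hypothetical attributes x_1 ... x_K
v <- sum(w * x)             # v(x), the weighted sum
plogis(v)                   # Pr(Buy x) = exp(v(x)) / (1 + exp(v(x)))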
When you know the math, you understand the magic.
Tuesday, June 18, 2019
Hidden factor correlation
Correlation is not causation; everyone learns to say that. But if there's a correlation, there's probably some sort of causal relationship hiding somewhere, unless it's a spurious correlation.
If two variables $A$ and $B$ are correlated, the three simplest causal relationships are: $A$ causes $B$; $B$ causes $A$; or $A$ and $B$ are both caused by an unseen factor $C$. There are many more complicated causal structures, but these are the three basic ones.
The third case, where an unseen variable $C$ is the real source of the correlation, is the one this post is about. To illustrate it, let's say $C$ is a standard normal random variable, and $A$ and $B$ are noisy measures of $C$,
$ \qquad A = C + \epsilon_A$ and $ B = C + \epsilon_B$,
where the $\epsilon_i$ are drawn from a normal distribution with $\sigma_{\epsilon} = 0.05$.
To illustrate, we generate 10,000 draws of $C$ and create the corresponding 10,000 values of $A$ and $B$ using R:
hidden_factor = rnorm(10000)                          # the unseen common cause C
var_A_visible = hidden_factor + 0.05 * rnorm(10000)   # A = C + noise
var_B_visible = hidden_factor + 0.05 * rnorm(10000)   # B = C + noise
Now we can plot $A$ and $B$, and the correlation is obvious
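(The scatterplot itself is a single line, using the variables just created:)
plot(var_B_visible, var_A_visible)    # points hug the 45-degree line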
model_no_control = lm(var_A_visible~var_B_visible)
summary(model_no_control)
With the result:
Call:
lm(formula = var_A_visible ~ var_B_visible)
Residuals:
Min 1Q Median 3Q Max
-0.271466 -0.047214 -0.000861 0.047400 0.302517
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0005294 0.0007025 0.754 0.451
var_B_visible 0.9975142 0.0006913 1442.852 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.07025 on 9998 degrees of freedom
Multiple R-squared: 0.9952, Adjusted R-squared: 0.9952
F-statistic: 2.082e+06 on 1 and 9998 DF, p-value: < 2.2e-16
So, both the model and the graph confirm a strong correlation ($p < 0.0001$) between $A$ and $B$. And in many real-life cases, this is used to support the idea that either $A$ causes $B$ or $B$ causes $A$.
Now we proceed to show how the hidden factor is relevant. First, let us plot the residuals, $A-C$ against $B-C$:
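(Again one line, reusing the variables from above:)
plot(var_B_visible - hidden_factor, var_A_visible - hidden_factor)   # a shapeless noise cloud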
The apparent correlation has now disappeared. And a linear model including the hidden factor confirms this:
model_with_control = lm(var_A_visible~var_B_visible+hidden_factor)
summary(model_with_control)
With the result
Call:
lm(formula = var_A_visible ~ var_B_visible + hidden_factor)
Residuals:
Min 1Q Median 3Q Max
-0.18347 -0.03382 -0.00021 0.03410 0.17780
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0007378 0.0004986 1.480 0.139
var_B_visible 0.0004082 0.0100560 0.041 0.968
hidden_factor 0.9997573 0.0100707 99.274 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04985 on 9997 degrees of freedom
Multiple R-squared: 0.9976, Adjusted R-squared: 0.9976
F-statistic: 2.072e+06 on 2 and 9997 DF, p-value: < 2.2e-16
Hidden factors are easy to test for, as seen here, but they are not always apparent. For example, in nutrition papers there's often a hidden factor, how health-conscious an individual is, that more often than not drives both observables (say, exercising regularly and eating salads: high correlation, but exercising doesn't cause eating salads and eating salads doesn't cause exercising).
Correlation is not causation, but generally one can find a causal relationship behind a correlation, possibly one that involves hidden factors or more complex relationships.
Labels: analytics, causality, Probability, RStats, statistics