Wednesday, February 12, 2020

Contagion, coronavirus, and charlatans

This post illustrates a simple epidemiological model and shows why the ad-hoc coronavirus modeling that some charlatans are spreading on social media platforms is a nonsensical distraction.


Math of contagion: the SIR-1 model


A simple model for infectious diseases, the SIR-1 model (also known as the Kermack-McKendrick model), is too simple for the coronavirus, but contains some of the basic behavior of any epidemic.

The model uses a fixed population, with no deaths, no natural immunity, no latent period for the disease (when a person is exposed but not infectious; not to be mistaken for what happens with the coronavirus, where people are infectious but asymptomatic), and a simple topology (the population is in a single homogeneous pool, instead of different cities and countries sparsely connected).

There are three states that a given individual can be in: susceptible (fraction of the population in this state represented by $S$), infectious (fraction represented by $I$), and recovered (fraction represented by $R$); recovered means immune, so there's no recurrence of the infection.

There are two parameters: $\beta$, the contagiousness of the disease, and $\gamma$, the recovery rate. To illustrate using discretized time, $\beta= 0.06$ means that any infectious individual has a 6% chance of infecting another individual in the next period (say, a day); $\gamma= 0.03$ means that any infectious individual has a 3% chance of recovering in the next period.

The dynamics of the model are described by three differential equations:

$\dot S = - \beta S I$;
$\dot I = (\beta S - \gamma) I$;
$\dot R = \gamma I$.

The ratio $R_0 = \beta/\gamma$ is critical to the behavior of an epidemic: if it's lower than one, the infection dies off without noticeable expansion; if it's much higher than one, it becomes a large epidemic.
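This threshold follows directly from the second equation above: the infectious fraction grows only while

$\dot I = (\beta S - \gamma) I > 0$, that is, while $S > \gamma/\beta = 1/R_0$;

at the start of an outbreak $S \approx 1$, so the infection can only take off when $R_0 > 1$.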

There is no analytic solution to the differential equations, but they're easy enough to simulate and to fit data to. Here are some results for a discretized, 200-period simulation for some values of the parameters $(\beta, \gamma)$, starting with an initial infected population of 1%.
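Here's a minimal sketch of such a simulation in Python (a simple forward-Euler discretization; the function name and defaults are just for illustration):

```python
# Minimal forward-Euler simulation of the SIR model described above.
# S, I, R are fractions of a fixed population; beta and gamma are the
# per-period contagiousness and recovery rate.

def simulate_sir(beta, gamma, i0=0.01, periods=200):
    s, i, r = 1.0 - i0, i0, 0.0
    history = [(s, i, r)]
    for _ in range(periods):
        new_infections = beta * s * i   # dS/dt = -beta * S * I
        new_recoveries = gamma * i      # dR/dt = gamma * I
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

# First example below: beta = 0.06, gamma = 0.03, so R0 = 2.
trajectory = simulate_sir(beta=0.06, gamma=0.03)
peak_infectious = max(i for _, i, _ in trajectory)
print(f"Peak simultaneous infectious fraction: {peak_infectious:.3f}")
```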

First, a model with an $R_0=2$, illustrating the three processes:


Note that although a large percentage of the population is eventually infected (running the model longer, the recovered fraction keeps growing toward its final value, around 80% for these parameters), the number of people infectious at any given time (and presumably also feeling the symptoms of the disease) is much lower. This is a very important metric, because the number of people sick at a given time determines how effectively health providers can deal with the disease.

Next, a model of a runaway epidemic ($R_0 = 24$ is beyond any epidemic I've known; it's used here only to make the point within a short 200 periods):


In this case, the number of sick people grows very fast, which makes it difficult for the health system to cope with the disease. The absence of sick people from the workforce also leads to second-order problems, including stalled production, insufficient logistics to distribute needed supplies, and lack of services and support for necessary infrastructure.

Finally, a model closer to non-epidemic diseases, like the seasonal flu (as opposed to epidemic flu), though the $(\beta,\gamma)$ are too high for that disease; they were raised for presentation purposes, so that the 200-period chart is more than three flat lines.


Note how low the number of people infectious at any time is. This is why these diseases tend to die off instead of growing into epidemics: once people start taking precautions, $\beta$ becomes smaller than $\gamma$, which leads to $R_0 < 1$, the condition for the disease to eventually die off.
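As a quick check with the simulation sketch from above (the parameter values here are my own, chosen only to get $R_0 < 1$):

```python
# R0 < 1: the infectious fraction only shrinks, so there is no epidemic.
fading = simulate_sir(beta=0.02, gamma=0.03)  # R0 = 2/3
print(f"Infectious fraction after 200 periods: {fading[-1][1]:.4f}")  # well below the initial 1%
```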


The problem with estimating ad-hoc models


One of the problems with ignoring the elements of these epidemiological models and calibrating statistical models on early data can be seen when we take the first example above ($\beta=0.06,\gamma=0.03$) and use the first 50 data points to calibrate a statistical model for forecasting the evolution of the epidemic:


As a general rule of thumb, models for processes that follow an S-shaped curve are extremely difficult to calibrate on early data; any data set that doesn't extend at least some periods into the concave region of the model is going to be of questionable value, especially if there are errors in measurement (as is always the case).
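To make this concrete, here is a sketch of such an early-data calibration, reusing the simulate_sir function from above. The post doesn't specify which statistical model was fitted, so purely as an illustration I'm assuming a plain exponential-growth fit (ordinary least squares on the log of cumulative infections):

```python
import math

# Cumulative infections (I + R) from the first 50 periods of the R0 = 2 run.
early = [i + r for _, i, r in simulate_sir(beta=0.06, gamma=0.03)[:50]]

# Ordinary least-squares fit of log(cumulative infections) against time,
# i.e. an exponential-growth model (a stand-in for an ad-hoc curve fit;
# the post doesn't specify the model actually used).
xs = list(range(len(early)))
ys = [math.log(v) for v in early]
n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum((x - x_mean) ** 2 for x in xs)
intercept = y_mean - slope * x_mean

# Extrapolating to period 200 overshoots wildly: the fit has no notion of the
# susceptible pool running out, so it predicts far more than 100% infected.
forecast_200 = math.exp(intercept + slope * 200)
print(f"Exponential fit forecast at period 200: {forecast_200:.1f} (as a fraction of the population)")
```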

Consider that this estimation failure happens with the simplest model (SIR-1). It ignores the complexities of topology (multiple populations in different locations, each with a $(\beta,\gamma)$ of its own, connected by a transportation network with different levels of quarantine and preventative measures, etc.), possible obfuscation of some data due to political concerns, misdiagnosis and under-reporting due to latency, changes to $\beta$ and $\gamma$ as people's behavior and health services adapt, and many other complications of a real-world epidemic, including second-order effects on health services and essential infrastructure, which change people's behavior as well.

No, that forecasting error comes simply from that rule of thumb: until the process passes the inflection point, it's almost certain that estimates based on aggregate numbers (as opposed to clinical measures of $\beta$ and $\gamma$, based on the analysis of clinical cases; these are what epidemiologists use, by the way) will give nonsensical predictions.

But those nonsensical predictions get retweets, YouTube video views, and SuperChat money.