Understanding randomness and coincidence
If we flip a fair coin 1000 times, how likely is it that we see a sequence of 10 tails somewhere?
Give that problem a try, then read on.
We start by computing the probability that a sequence of ten flips is all tails:
$\Pr(\text{10 tails in a row}) = 2^{-10} = 1/1024$
There are 991 sequences of 10 flips in 1000 flips, so we might be tempted to say that the probability of at least one sequence of 10 tails is 991/1024.
This is obviously the wrong answer, because if the question had been for a sequence of 10 tails in 2000 flips the same logic would yield a probability of 1991/1024, which is greater than 1.
(People make this equivalence between event disjunction and probability addition all the time; it's wrong whenever the events can overlap, as they do here. The rationale, inasmuch as there is one, is that "and" translates to multiplication, so they expect "or" to translate to addition; it doesn't, except for mutually exclusive events.)
A better calculation starts by asking for the probability that none of those 991 ten-flip windows is all tails. Treating the windows as if they were independent (they overlap, so they aren't, and this overstates the final answer somewhat, but it serves the point being made), that probability is
$(1 - \Pr(\text{10 tails in a row}))^{991}$.
This is the probability of the complement of the event we want (at least one run of 10 tails in 1000 flips), which makes the probability of the event we want
$1 - (1 - \Pr(\text{10 tails in a row}))^{991} \approx 0.62$.
So, by this estimate, about 62% of the time a 1000-flip sequence will show, somewhere, a run of 10 tails.
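As a sanity check, here's a minimal R sketch (my own addition, not part of the original argument) that computes the independence approximation above and also brute-force simulates the flips; the simulated figure comes out lower, roughly 0.38, precisely because the overlapping windows aren't independent, but either way the "one in a thousand" pattern shows up in a large share of trials.

```r
p_window <- 2^-10                        # P(a given 10-flip window is all tails)
1 - (1 - p_window)^991                   # independence approximation: ~0.62

# Monte Carlo check: how often does a run of 10+ tails appear in 1000 fair flips?
set.seed(42)
has_run <- replicate(10000, {
  flips <- sample(c("H", "T"), 1000, replace = TRUE)
  r <- rle(flips)                        # run-length encoding of the flip sequence
  any(r$values == "T" & r$lengths >= 10)
})
mean(has_run)                            # roughly 0.38, lower because the windows overlap
```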
Now consider Bob, a journalist, who sees those 10 tails and hurries home to write about the "one in a thousand" event he just witnessed, demanding a coin-fairness validation authority, 4 trillion dollars in QE for the Fed, 1.5 trillion dollars for politically connected businesses, 1100 pages of legislation containing every legislator's wishlist, and so on.
And this brings us to the problem of random clustering.
Random clusters
Say there's an event that happens with probability 14.5/100,000 per person-year (that's the flu death rate for California in 2018, according to the CDC).
What's the probability that in a given year we see a cluster of more than 30 events in a single week (that turns out to be more than 50% above the weekly average) in the San Francisco Bay Area (population 7,000,000 for this example)? How about more than 40 events (more than twice the average)?
Try it for yourself.
Done?
Sure you don't want to try to do it first?
Okay, here we go.
First we need to see what the distribution of the number of events per week is. Assuming that the 14.5/100,000 per person-year is distributed uniformly over the 52 weeks (which it isn't; the deaths happen mostly during flu season, when the weekly average is higher, so a uniform assumption understates how easily clusters appear and only makes the case for random clusters stronger), the individual event probability is
$p = 14.5/(100,000 \times 52) = 0.0000027885$ events per person-week.
With a population of 7,000,000 and a tiny per-person probability, the weekly number of events is binomial with a huge $n$ and a small $p$, which is well approximated by a Poisson distribution. (The Poisson distribution is unbounded while the true count can't exceed the population, but the error this introduces is trivial, given that the cluster sizes we care about are so small compared to the size of the population.)
If the probability of an individual event per person-week is $p$, then the average number of events per week is $\mu = 7{,}000{,}000 \times p = 19.51923$, which, helpfully, is also the parameter of the Poisson distribution.
Our number of events per week, a random variable $P$, is Poisson-distributed with parameter $\mu$; the probability of a week with more than $N$ events is $1-F_P(N)$, where $F_P(N)$ is the c.d.f. of $P$ evaluated at $N$. Hence the probability of a year with at least one week where the number of events (we'll call that the cluster size) exceeds $N$ is
$1- F_P(N)^{52}$ where $P \sim \mathrm{Poisson}(19.52)$.
We can now use the miracle of computers and spreadsheets (or the R programming language, which is my preference) to compute and plot the probability that some week in the year shows a cluster bigger than $N$ events:
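Here is a minimal R sketch of that computation; the range of $N$ and the plotting details are my own choices.

```r
# Weekly average, assuming the yearly rate is spread uniformly over 52 weeks.
mu <- 7e6 * 14.5 / (1e5 * 52)            # ~19.52 events per week

# P(at least one week in the year has a cluster bigger than N),
# treating the 52 weekly counts as independent Poisson(mu) draws.
N <- 20:45
p_cluster <- 1 - ppois(N, mu)^52

plot(N, p_cluster, type = "b", log = "y",
     xlab = "cluster size N (some week has more than N events)",
     ylab = "P(at least one such week in a year)")

p_cluster[N == 30]                       # ~0.40
p_cluster[N == 40]                       # ~0.0008
```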
So, from randomness alone, there is about a 40% probability of at least one week with a cluster more than 50% larger than average somewhere in the Bay Area's year, and about a 0.08% probability of one more than twice as large.
That 0.08% looks very small… until we realize that there are thousands of townships in the US, and if each of them has a similar probability (different in detail, because populations and death rates differ, but generally of the same order of magnitude, so we'll just use the same number for illustration), the aggregate probability for, say, 1000 townships will be
$1- (1-0.0008)^{1000} = 0.55$.
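In R, the same arithmetic looks like this (reusing the illustrative 0.08% for every township):

```r
p_township <- 0.0008                     # illustrative per-township probability from above
1 - (1 - p_township)^1000                # ~0.55
```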
Any cherry-picking person can, with some effort, find at least one such cluster in half of all years. And that's the problem: everything said here is driven by pure randomness, yet the clusters are invariably interpreted as the product of some other process.
"After all, could randomness really cause such outliers?"
Yes. Randomness was the only process assumed above; every number was driven by it, and those so-called outliers are nothing but the result of the same process that creates the averages.
Remember that the next time people point at scary numbers like 50% or 100% above average.
Mathematical note (illustration, not proof)
A small numerical illustration of the assumption above: that the various townships, with their different values of $\mu$, have similar (not equal) probabilities of a cluster of a given size relative to their own weekly average.
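Here is a rough R sketch of one way to generate such an illustration; the $\mu$ values and the cluster-size ratios are arbitrary picks of mine, standing in for townships with somewhat different weekly averages, not a reproduction of any original table.

```r
# P(a week's count reaches at least r times that township's own average),
# for a few hypothetical weekly averages mu and relative cluster sizes r.
mus    <- c(15, 20, 25)
ratios <- seq(1.1, 2.0, by = 0.1)
tbl <- sapply(mus, function(mu) 1 - ppois(ceiling(ratios * mu) - 1, mu))
dimnames(tbl) <- list(as.character(ratios), paste0("mu=", mus))
round(tbl, 5)
```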