
Sunday, April 12, 2020

Random clusters: how we make too much of coincidences

Understanding randomness and coincidence


If we flip a fair coin 1000 times, how likely is it that we see a sequence of 10 tails somewhere?

Give that problem a try, then read on.

We start by computing the probability that a sequence of ten flips is all tails:

$\Pr(\text{10 tails in a row}) = 2^{-10} = 1/1024$

There are 991 sequences of 10 flips in 1000 flips, so we might be tempted to say that the probability of at least one sequence of 10 tails is 991/1024.

This is obviously the wrong answer, because if the question had been for a sequence of 10 tails in 2000 flips the same logic would yield a probability of 1991/1024, which is greater than 1.

(People make this equivalence between event disjunction and probability addition all the time; it's wrong every single time they do it. The rationale, inasmuch as there's one, is that "and" translates to multiplication, so they expect "or" to translate to addition; it doesn't.)

The correct approach starts by asking for the probability that none of those 991 ten-flip windows is all tails. Treating the windows as if they were independent (they overlap, so they're not, but it's a serviceable first approximation), that probability is

$(1 - \Pr(\text{10 tails in a row}))^{991}$.

This is the probability of the complement of the event we want (at least one sequence of 10 tails in 1000 flips), which makes the probability of the event we want

$1 - (1 - \Pr(\text{10 tails in a row}))^{991} \approx 0.62$.

The overlap makes the windows positively correlated (a single run of eleven tails covers two windows at once), so this overstates the true probability; the qualitative conclusion survives, though: a large fraction of the time, a 1000-flip sequence will show, somewhere, a run of 10 tails.
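Because the 991 windows overlap, the independence-based number is only an approximation, and a quick Monte Carlo check is worthwhile (the post's plots use R; this is an equivalent sketch in Python):

```python
import random

def has_tail_run(flips, run=10):
    """True if the flip sequence (True = tails) contains `run` consecutive tails."""
    streak = 0
    for tails in flips:
        streak = streak + 1 if tails else 0
        if streak >= run:
            return True
    return False

def p_run(n_flips=1000, run=10, trials=5_000, seed=1):
    """Monte Carlo estimate of P(at least one run of `run` tails in `n_flips` flips)."""
    rng = random.Random(seed)
    hits = sum(
        has_tail_run((rng.random() < 0.5 for _ in range(n_flips)), run)
        for _ in range(trials)
    )
    return hits / trials
```

Running this gives roughly 0.38: the overlapping windows are positively correlated, so treating them as independent overstates the probability, but long runs are still common enough to fool a casual observer.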

Now consider Bob, a journalist, who sees those 10 tails and hurries home to write about the "one in a thousand" event that he just witnessed, demanding a coin fairness validation authority, and 4 trillion dollars in QE for the Fed, 1.5 trillion dollars for politically connected businesses, 1100 pages of legislation containing every legislator's wishlist, and an app on everyone's phone and an embedded chip in everyone's wrist to track their vaccine statu… I mean to check for fairness in coin flips.

And this brings us to the problem of random clustering.


Random clusters


Say there's an event that happens with probability 14.5/100,000 per person-year (that's the flu death rate for California in 2018, according to the CDC).

What's the probability that in a given year we see a cluster of more than 30 events in a single week (about 50% above the weekly average, as we'll see) in the San Francisco Bay Area (population 7,000,000 for this example)? How about more than 40 events (about twice the average)?

Try it for yourself.

Done?

Sure you don't want to try to do it first?

Okay, here we go.

First we need the distribution of the number of events per week. Assuming that the 14.5/100,000 per person-year rate is distributed uniformly over the 52 weeks (it isn't, but this choice makes the case for random clusters stronger: the deaths happen mostly during flu season, so the in-season weekly average is higher than our uniform one), the individual event probability is

$p = 14.5/(100,000 \times 52) = 0.0000027885$ events per person-week.

A population of 7,000,000 is much bigger than 30, so we use the Poisson distribution for the number of events/week. (The Poisson distribution is unbounded, but the error we make by assuming it is trivial, given that the cluster sizes we care about are so small compared to the size of the population.)

If the probability of an individual event per person-week is $p$, then the average number of events per week given a Poisson distribution (which is also the parameter of the Poisson distribution, helpfully) is $\mu = 7{,}000{,}000 \times p = 19.51923$.

Our number of events per week, a random variable $P$, will be distributed Poisson with parameter $\mu$; the probability that a given week has more than $N$ events is $(1-F_P(N))$, where $F_P(N)$ is the c.d.f. of $P$ evaluated at $N$; hence the probability of a year with at least one week where the number of events (we'll call that the cluster size) exceeds $N$ is

$1- F_P(N)^{52}$ where $P \sim \mathrm{Poisson}(19.52)$.

We can now use the miracle of computers and spreadsheets (or the R programming language, which is my preference) to compute and plot the probability that there's a cluster size of at least $N$ events on any week in one year:
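A self-contained sketch of that computation in Python (summing the Poisson p.m.f. directly rather than importing a stats package; the text's numbers correspond to the probability of strictly more than $N$ events), including the thousand-township aggregate used below:

```python
import math

def poisson_cdf(n, mu):
    """P(X <= n) for X ~ Poisson(mu), by direct summation of the p.m.f."""
    term = math.exp(-mu)
    total = term
    for k in range(1, n + 1):
        term *= mu / k
        total += term
    return total

rate = 14.5 / 100_000      # deaths per person-year
pop = 7_000_000
mu = pop * rate / 52       # expected events per week, ~19.52

def p_cluster_year(n, mu=mu, weeks=52):
    """P(at least one week in the year with more than n events)."""
    return 1 - poisson_cdf(n, mu) ** weeks

p30 = p_cluster_year(30)              # ~0.40
p40 = p_cluster_year(40)              # ~0.0008
p_anywhere = 1 - (1 - p40) ** 1000    # ~0.54 across ~1000 similar townships
```

This reproduces the 40% and 0.08% figures in the text, and the roughly-half-the-years aggregate across a thousand townships.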


So, from randomness alone, we expect that with 40% probability there'll be one week with a cluster 50% larger than average somewhere in the Bay Area, and with 0.08% probability one that's twice as large.

That 0.08% looks very small… until we realize that there are thousands of townships in the US. If each of them has a similar probability (different in detail, because populations and death rates differ, but generally of the same order of magnitude, so we'll use the same number for illustration), the aggregate probability for, say, 1000 townships will be

$1- (1-0.0008)^{1000} = 0.55$.

Any cherry-picking person can, with some effort, find at least one such cluster in half the years. And that's the problem: everything said here is driven by pure probability, but the clusters are always interpreted as if caused by some other process.

"After all, could randomness really cause such outliers?"

Yes. Randomness was the only process assumed above, all numbers were driven by the same process, and those so-called outliers were nothing but the result of the same processes that create the averages.

Remember that the next time people point at scary numbers like 50% or 100% above average.



Mathematical note (illustration, not proof)


A small numerical illustration supporting the assumption above that townships with different $\mu$ have similar (not equal) cluster probabilities, as a function of cluster size relative to the average:
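That illustration can be sketched as follows (in Python, though the post favors R): for a few values of $\mu$, the yearly probability that some week exceeds a given multiple of the weekly average:

```python
import math

def poisson_sf(n, mu):
    """P(X > n) for X ~ Poisson(mu), by direct summation of the p.m.f."""
    term = math.exp(-mu)
    cdf = term
    for k in range(1, n + 1):
        term *= mu / k
        cdf += term
    return 1.0 - cdf

def p_year_exceeds(multiple, mu, weeks=52):
    """P(at least one week in the year with more than multiple*mu events)."""
    return 1.0 - (1.0 - poisson_sf(round(multiple * mu), mu)) ** weeks

for mu in (10.0, 19.5, 30.0):
    print(f"mu={mu:4.1f}  1.5x avg: {p_year_exceeds(1.5, mu):.3f}  "
          f"2x avg: {p_year_exceeds(2.0, mu):.5f}")
```

The probabilities fall as $\mu$ grows (relative fluctuations shrink with population), but they stay within a couple of orders of magnitude for townships of comparable size, which is all the argument above needs.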


Sunday, March 15, 2020

Fun with geekage while social distancing for March 15, 2020

(I'm trying to get a post out every week, as a challenge to produce something intellectual outside of work. Some* of this is recycled from Twitter, as I tend to send things there first.)


Multicriteria decision-making gets a boost from Covid-19



A potential upside (among many downsides) of the coronavirus COVID-19 event is that some smart people will realize that there's more to life choices than a balance between efficiency and convenience, and will build [for themselves if not the system] some resilience.

In a very real sense, it's possible that PG&E's big fire last year and follow-up blackouts saved a lot of people the worst of the new flu season: after last Fall, many local non-preppers stocked up on N95 masks and home essentials because of the chaos PG&E had wrought in Northern California.



Anecdotal evidence is a bad source for estimates: coin flips


Having some fun looking at small-numbers effects on estimates or how unreliable anecdotal evidence really can be as a source of estimates.

The following is the likelihood ratio of various candidate estimates versus the maximum likelihood estimate of the probability of heads, given the number of flips and heads of a balanced coin; because there's an odd number of flips, even the most balanced outcome is not 50-50:
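The flip count behind the original figure isn't recoverable, so here's a sketch of the same computation with a hypothetical 7-flip sample (the combinatorial term cancels in the ratio, as in the other likelihood-ratio examples on this blog):

```python
import math

def log_lik(p, heads, flips):
    """Binomial log-likelihood kernel (the combinatorial term cancels in ratios)."""
    return heads * math.log(p) + (flips - heads) * math.log(1 - p)

def likelihood_ratio(p, heads, flips):
    """Likelihood of candidate p relative to the MLE heads/flips (always <= 1)."""
    return math.exp(log_lik(p, heads, flips) - log_lik(heads / flips, heads, flips))

# Hypothetical data: 4 heads in 7 flips. The MLE is 4/7 = 0.571, yet a fair
# coin is almost as likely an explanation of the data.
lr_fair = likelihood_ratio(0.5, 4, 7)   # ~0.93
```

With seven flips, nearly any candidate probability between 0.3 and 0.8 is almost as likely as the MLE: there's just not enough information in the data.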


This is an extreme example of small numbers, but it captures the problem of using small samples, or in the limit, anecdotes, to try to estimate quantities. There's just not enough information in the data.

This is the numerical version of the old medicine research paper joke: "one-third of the sample showed marked improvement; one-third of the sample showed no change; and the third rat died."

Increasing sample size makes for better information, but can also exacerbate the effect of a few errors:


Note that the number of errors necessary to get the "wrong" estimate goes up: 1 (+1/2), 3, 6.



Context! Numbers need to be in context!



I'm looking at this pic and asking myself: what is the unconditional death rate for each of these categories; i.e., if you're 80 today in China, how likely is it you don't reach March 15, 2021, from all causes?

Because that'd be relevant context, I think.



Estimates vs decisions: why some smart people did the wrong thing regarding Covid-19



On a side note, while some people choose to lock themselves at home for social distancing, I prefer to find places outdoors where there's no one else. For example: a hike on the Eastern span of the Bay Bridge, where I was the only person on the 3.5 km length of the bridge (the only person on the pedestrian/bike path, that is).




How "Busted!" videos corrupt formerly-good YouTube channels


Recently saw a "Busted!" video from someone I used to respect and another based on it from someone I didn't; I feel stupider for having watched the videos, even though I did it to check on a theory. (Both channels complain about demonetization repeatedly.) The theory:


Many of these "Busted!" videos betray a lack of understanding (or fake a lack of understanding for video-making reasons) of how the new product/new technology development process goes; they look at lab rigs or technology demonstrations and point out shortcomings of these rigs as end products. For illustration, here's a common problem (the opposite problem) with media portrayal of these innovations:


It's not difficult to "Bust!" media nonsense, but what these "Busted!" videos do is ascribe the media nonsense to the product/technology designers or researchers, to generate views, comments, and Patreon donations. This is somewhere between ignorance/laziness and outright dishonesty.

In the name of "loving science," no less!



Johns Hopkins visualization makes pandemic look worse than it is



Not to go all Edward Tufte on Johns Hopkins, but the size of the bubbles on this site makes the epidemic look much worse than it is: Spain, France, and Germany are completely covered by bubbles, while their cases are
0.0167 % for Spain
0.0070 % for Germany
0.0067 % for France
of the population.



Cumulative numbers increase; journalists flabbergasted!



At some point someone should explain to journalists that cumulative deaths always go up, it's part of the definition of the word "cumulative." Then again, maybe it's too quantitative for some people who think all numbers ending in "illions" are the same scale.



Stanford Graduate School of Education ad perpetuates stereotypes about schools of education


If this is real, then someone at Stanford needs to put their ad agency "in review." (Ad world-speak for "fired with prejudice.")





Never give up; never surrender.


- - - - -
* All.

Sunday, March 1, 2020

Fun with COVID-19 Numbers for March 1, 2020

NOTA BENE: The Coronavirus COVID-2019 is a serious matter and we should be taking all reasonable precautions to minimize contagion and stay healthy. But there's a lot of bad quantitative thinking that's muddling the issue, so I'm collecting some of it here.


Death Rate I: We can't tell, there's no good data yet.


This was inspired by a tweet by Ted Naiman, MD, whose Protein-to-Energy ratio analysis of food I credit for at least half of my weight loss (the other half I credit P. D. Mangan, for the clearest argument for intermittent fasting, which convinced me); so this is not about Dr Naiman's tweet, just that his was the tweet I saw with a variation of this proposition:

"COVID-19 is 'like the flu,' except the death rate is 30 to 50 times higher."

But here's the problem with that proposition: we don't have reliable data to determine that. Here are two simple arguments that cast some doubt on the proposition:

⬆︎ How the death rate could be higher: government officials and health organizations under-report the number of deaths in order to contain panic or to minimize criticism of government and health organizations; it's also possible that some deaths from COVID-19 are attributed to conditions that COVID-19 aggravated, for example being recorded as deaths from pneumonia.

⬇︎ How the death rate could be lower: people with mild cases of COVID-19 don't report them and treat themselves with over-the-counter medication (to avoid getting taken into forced quarantine, for example), hence there's a bias in the cases known to the health organizations, towards more serious cases, which are more likely to die.

How much we believe the first argument applies depends on how much we trust the institutions of the countries reporting, and... you can draw your own conclusions!

To illustrate the second argument, consider the incentives of someone with flu-like symptoms and let's rate their seriousness or aversiveness, $a$, as a continuous variable ranging from zero (no symptoms) to infinity (death). We'll assume that the distribution of $a$ is an exponential, to capture thin tails, and to be simple let's make its parameter $\lambda =1$.

Each sick patient will have to decide whether to seek treatment other than over-the-counter medicine, but depending on the health system that might come with a cost (being quarantined at home, being quarantined in "sick wards," for example); let's call that cost, in the same scale of aversiveness, $c$.

What we care about is how the average aversiveness that is reported changes with $c$. Note that if everyone reported their $a$, that average would be $1/\lambda = 1$, but what we observe is a self-selected subset, so we need $E[a | a > c]$, which we can compute easily, given the exponential distribution, as

\[
E[a | a > c]
=
\frac{\int_{c}^{\infty} a \, f_A(a) da }{1 - F_A(c)}
=
\frac{\left[ - \exp(-a)(a+1)\right]^{\infty}_{c}}{\exp(-c)}
= c + 1
\]
Note that the probability of being reported is $\Pr(a>c) = \exp(-c)$, so as the cost of reporting goes up, a vanishingly small percentage of cases are reported, but their severity increases [linearly, but that's an artifact of the simple exponential] with the cost. That's the self-selection bias in the second argument above.
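A quick numerical check of the $E[a \mid a > c] = c + 1$ result (which is just the memorylessness of the exponential), sketched in Python:

```python
import random

def reported_severity(c, n=200_000, seed=7):
    """Monte Carlo estimate of (E[a | a > c], P(a > c)) for a ~ Exp(1):
    only patients whose aversiveness exceeds the reporting cost c are seen."""
    rng = random.Random(seed)
    draws = [rng.expovariate(1.0) for _ in range(n)]
    kept = [a for a in draws if a > c]
    return sum(kept) / len(kept), len(kept) / n

mean2, frac2 = reported_severity(2.0)   # mean ~3.0, fraction reporting ~0.135
```

With $c = 2$ this reproduces the numbers discussed below: about 13.5% of cases report, with average reported severity around 3, while the population average stays fixed at 1.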

A plot for $c$ between zero (everyone reports their problems) and 5 (the cost of reporting is so high that only the sickest 0.67% risk reporting their symptoms to the authorities):


Remember that for all cases in this plot the average aversiveness/seriousness doesn't change: it's fixed at 1, and everyone has the disease, with around 63% of the population having less than the average aversiveness/seriousness. But, if the cost of reporting is, for example, equal to twice the aversiveness of the average (in other words, people dislike being put in involuntary quarantine twice as much as they dislike the symptoms of the average seriousness of the disease), only the sickest 13.5% of people will look for help from the authorities/health organizations, who will report a seriousness of 3 (three times the average seriousness of the disease in the general population).*

With mixed incentives for all parties involved, it's difficult to trust the current reported numbers.


Death Rate II: Using the data from the Diamond Princess cruise ship.


A second endemic problem is arguing about small differences in the death rate, based on small data sets. Many of these differences are indistinguishable statistically, and to be nice to all flavors of statistical testing we're going to compute likelihood ratios, not rely on simple point estimate tests.

The Diamond Princess cruise ship is as close as one gets to a laboratory experiment in COVID-19, but there's a small-numbers problem: we'll only get good estimates when we have large-scale, high-quality data. Thanks to @Clarksterh on Twitter for the idea.

Using data from Wikipedia for Feb 20, there were 634 confirmed infections (328 asymptomatic) aboard the Diamond Princess and as of Feb 28 there were 6 deaths among those infections. The death rate is 6/634 = 0.0095.

(The ship's population isn't representative of the general population, being older and richer, but that's not what's at stake here. This is about fixating on the point estimates and small differences thereof. There's also a delay between the diagnosis and the death, so these numbers might be off by a factor of two or three.)

What we're doing now: using $d$ as the death rate, $d = 0.0095$ is the maximum likelihood estimate, so it will give the highest probability for the data, $\Pr(\text{6 dead out of 634} | d = 0.0095)$. Below, we calculate and plot the likelihood ratio between that probability and the computed probability of the data for other candidate death rates, $d_i$.**

\[LR(d_i) = \frac{\Pr(\text{6 dead out of 634} | d = 0.0095)}{\Pr(\text{6 dead out of 634} | d = d_i)}\]
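A sketch of that computation in Python (the post uses R); since the binomial coefficient cancels in the ratio, we only need the log-likelihood kernel:

```python
import math

DEATHS, N = 6, 634
d_hat = DEATHS / N   # maximum likelihood estimate, ~0.0095

def log_lik(d):
    """Binomial log-likelihood, dropping the C(634, 6) term (it cancels in ratios)."""
    return DEATHS * math.log(d) + (N - DEATHS) * math.log(1 - d)

def LR(d):
    """How many times more probable the data are under d_hat than under rate d."""
    return math.exp(log_lik(d_hat) - log_lik(d))
```

LR(0.001) comes out around 3400 and LR(0.03) around 510, matching the values below (small differences come from rounding the MLE to 0.0095); LR(0.005) is under 3, which is why rates between 0.5% and 1.5% can't be told apart.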


We can't reject any rates between 0.5% and 1.5% with any confidence (okay, some people using single-sided point tests with marginal significance might narrow that a bit, but let's not rehash old fights here), and that's a three-fold range. And there are still a lot of issues with the data.

On the other hand...

It's easy to see that the COVID-19 death rate is much higher than that of the seasonal flu (0.1%): using the data from the Diamond Princess, $LR(0.001) = 3434.22$, which should satisfy both the most strong-headed frequentists and Bayesians that these two rates are different. Note that $LR(0.03) = 510.01$, which shows that the Diamond Princess data also rules out a 3% death rate. (Again, noting that the numbers might be off by a factor of two or three in either direction due to the delay in diagnosing the infection and between diagnosis and recovery or death.)

As with most of these analyses, disaggregate clinical data will be necessary to establish these rates, which we're estimating from much less reliable [aggregate] epidemiological data.



Stay safe: wash hands, don't touch your face, avoid unnecessary contact with other people. 



- - - - - 

* A friend pointed out that there are some countries or subcultures where hypochondria is endemic and that would lead to underestimation of the seriousness of the disease; this model ignores that, but anecdotally I've met people who get doctor's appointments because they have DOMS and want the doctor to reassure them that it's normal, prescribe painkillers and anti-inflammatories, and other borderline psychotic behavior...


** We're just computing the binomial here, no assumptions beyond that:

$\Pr(\text{6 dead out of 634} | d = d_i) = C(634,6) \, d_i^6 (1-d_i)^{628}$,

and since we use a ratio the big annoying combinatorials cancel out.

Saturday, February 8, 2020

Fun with numbers for February 8, 2020

Some collected twitterage and other nerditude from the interwebs.

Converting California to EVs: we're going to need a bigger boat grid


I like how silent electric vehicles are, but if California is to convert a significant number of FF cars to electric (50-80%), its grid will need to deliver 11-18% more energy (we already import around 1/3 of that energy and our grid is not exactly underutilized).




Playing around with diffusion models to avoid thinking about coronavirus


Playing around with some diffusion models of infection, not really sophisticated enough to deal with the topological complexities of coronavirus given air travel but better than people who believe you get that virus from drinking too much Corona beer… 🤯




Better choose winners of the past or the new thing?


Based on the following tweet by TJIC, author of Prometheus Award winning hard scifi books (first, second) about homesteading the Moon, with uplifted (genetically engineered, intelligent) Dogs and sentient AI,


I decided to create a simple model and just run with it. For laughs only.

We need to have some sort of metric of quality, $x$, and we'll assume that since people can stop reading a novel if it's too bad, $x \ge 0$. We also know Sturgeon's law, that 90% of everything is dross, so we'll need a distribution with most of its mass at low quality. For now we're okay with the exponential distribution $f_X(x) = \lambda \exp(-\lambda x)$, and we'll go with $\lambda = 1$ to start.

Instead of changing the average quality of the novels for different years, we'll change the sample size from which the winners are chosen; what we're interested in is, therefore, $M(x)_N = E\left[\max\{x_1,\ldots,x_N\}\right]$ for different values of $N$, the number of novels. For $N$ ranging from 10 to 100,000 in powers of ten, we can use a simple simulation to find those $M(x)_N$:
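A sketch of that simulation in Python (instead of simulating all $N$ draws, we can sample the maximum directly by inverting its c.d.f., $(1-e^{-x})^N$):

```python
import math
import random

def mean_max_exp(n, reps=20_000, seed=3):
    """Estimate E[max of n i.i.d. Exp(1)] by inverse-c.d.f. sampling of the max:
    the max M has P(M <= x) = (1 - exp(-x))**n, so M = -log(1 - u**(1/n))."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        u = rng.random()
        # numerically stable form of -log(1 - u**(1/n))
        total += -math.log(-math.expm1(math.log(u) / n))
    return total / reps

for n in (10, 100, 1000, 10_000, 100_000):
    exact = sum(1.0 / k for k in range(1, n + 1))   # E[max] is exactly H_n
    print(n, round(mean_max_exp(n), 3), round(exact, 3))
```

Estimates differ a little run to run; the exact values are the harmonic numbers $H_N \approx \ln N + 0.577$, which is why the results below grow so slowly with $N$.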


The results are

$M(x)_{10} = 2.899432$
$M(x)_{100} = 5.230011$
$M(x)_{1000} = 7.512119$
$M(x)_{10000} =  9.750539$
$M(x)_{100000} =   12.122326$

Let's say there are between 100 and 1000 scifi novels worthy of that name in any given year of the last 100 years. So, unless the new novels have on average between 5.2 and 7.5 times the average quality of those in the previous 100 years, one is better off picking a winner at random from those 100 years than a random new novel.

(Yes, there's a lot of nonsense in this model, but the idea is just to show that when most of the quality distribution sits at the low end, as Sturgeon's law implies (and this one isn't even that extreme), randomly picking past winners is a better choice than randomly picking new novels, even if quality improved a bit relative to the past.)



No numbers, just Bay Area seamanship





Live long and prosper.

Monday, January 6, 2020

Fun with numbers and geekage for January 6, 2020

Money and Death on Vox, a bad infographic


I saw this on Twitter, apparently it's an infographic (or, in the parlance of those who want information graphical design to be, well, informative, a "chartoon") from a Vox article:


To begin with, these bubble diagrams, when correctly dimensioned (when they represent the data in an accurate graphical form), make comparisons difficult. Can you tell from that chart which cancer, breast or prostate, is more over-funded?

To add to that, this infographic isn't correctly dimensioned; it uses geometry to tell a lie (probably unwittingly), and that lie can be quantified with a lie factor:


The lie factor is the ratio of the perceived relative size of the geometric objects (for circles: areas) to the relative magnitude of the numbers (the money and deaths): you could fit eighteen of the COPD deaths circles inside the heart disease deaths circle, though the number of heart disease deaths is just a bit over four times that of COPD.
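The arithmetic, with the two ratios quoted above ("a bit over four times" taken as roughly 4.25):

```python
# Lie factor = perceived relative size (circle areas) / actual relative magnitude.
area_ratio = 18.0     # COPD-deaths circles that fit inside the heart-disease circle
deaths_ratio = 4.25   # heart disease deaths vs. COPD deaths ("a bit over four times")
lie_factor = area_ratio / deaths_ratio   # ~4.2: the chart exaggerates about fourfold
```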

I would have thought that decades after Edward Tufte made this point in The Visual Display of Quantitative Information, we'd no longer see this problem, but I was mistaken.



The infographic is used to make the point that donations are not correlated with deadliness, by showing what's effectively only a comparison of two rank orders. A better way to compare these two numbers would be to compute how much money is donated for each death or how many people die for each donated dollar, or both:


Note how easy the comparisons become and how two clear clusters appear in this format. That's the purpose of information graphical design, to make the insights in the data visible, not to decorate articles as a dash of color.



An anniversary of sorts: my Rotten Tomatoes analysis model is one year old.



On Dec 31, 2018, I watched a Nerdrotics video where Gary made the qualitative case for critics and audiences on Rotten Tomatoes using opposite criteria to evaluate certain TV shows. Out of curiosity, I decided to check that with data. That led to a few entertaining hours doing all sorts of complicated things until I settled on a very simple model, which I quickly coded into a spreadsheet, for extra convenience, and a number of fun tweets ensued, like the latest one:


The model:

Step 1: Treat all ratings as discretized into $\{0,1\}$. Denote the number of critics and audience members respectively by $N_C$ and $N_A$ and their number of likes (1s) by $L_C$ and $L_A$.

Step 2: Operationalize the hypotheses as probabilities. Under 'same criteria,' the probability of critics and audience liking is denoted $\theta_0$; under 'opposite criteria,' probability of critics liking is denoted $\theta_1$, and since the audience has opposite criteria, their probability of liking is $1-\theta_1$.

Step 3: Using the data and the operationalization, get estimates for $\theta_0$ and $\theta_1$. Notation-wise we should call them $\hat \theta_0$ and $\hat \theta_1$ but we're going to keep calling them $\theta_0$ and $\theta_1$.

Step 4: Compute the likelihood ratio of the hypotheses (how much more probable 'opposite' is than 'same'), by computing

$LR = \frac{\theta_1^{L_C} \, (1-\theta_1)^{N_C - L_C}} {\theta_0^{L_C} \, (1-\theta_0)^{N_C - L_C}} \, \frac{(1- \theta_1)^{L_A} \, \theta_1^{N_A - L_A}}{\theta_0^{L_A} \, (1-\theta_0)^{N_A - L_A}} $

(For numerical reasons this is done in log-space.) The reason I use likelihood ratios is to get rid of the large combinatorics (note their absence from that formula), which in many cases are beyond the numerical reach of software without installing special packages:
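The four steps, sketched in Python (the original lives in a spreadsheet); the model above doesn't spell out how $\theta_0$ and $\theta_1$ are estimated, so the pooled maximum likelihood estimates below are my assumption:

```python
import math

def log_likelihood_ratio(L_C, N_C, L_A, N_A):
    """Log of LR('opposite criteria' vs 'same criteria').

    theta0: common like-probability under 'same criteria';
    theta1: critics' like-probability under 'opposite criteria'
    (the audience then likes with probability 1 - theta1).
    Both are pooled maximum likelihood estimates under their hypothesis.
    """
    theta0 = (L_C + L_A) / (N_C + N_A)
    theta1 = (L_C + (N_A - L_A)) / (N_C + N_A)

    def ll(p_crit, p_aud):
        # log-likelihood of the likes/dislikes, combinatorial terms dropped
        return (L_C * math.log(p_crit) + (N_C - L_C) * math.log(1 - p_crit)
                + L_A * math.log(p_aud) + (N_A - L_A) * math.log(1 - p_aud))

    return ll(theta1, 1 - theta1) - ll(theta0, theta0)
```

A positive log-LR favors 'opposite criteria': for example, critics at 80/100 likes against an audience at 200/1000 strongly favors 'opposite', while critics at 80/100 against an audience at 800/1000 favors 'same'.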




Going to the Moon... Done, moving on.


☹️ Let's just let the numbers speak for themselves:




Sainsbury's bans veggie bags


In the UK, which is in England, they keep banning things:


To be fair to Sainsbury's, they probably see this as a monetization opportunity under the cover of social responsibility (objections will be socially costly for those objecting), so probably not a bad business decision, irritating though it might be.

(I use a backpack as a shopping bag, and have been doing so for a long time, before there was any talk of bans or charging for bags. Because it's more practical to carry stuff on your back than in your hands. But I agree with Sam Bowman, this is starting to be too much anti-consumer.)


Gas for a 5-mile drive in a 25 MPG car yields about 1.8 kg of CO2. A 4 g polyethylene bag has a 24 g CO2 footprint. So someone who walks to a local store [me] could use 74 plastic bags and still have a lower footprint than someone who drives to a strip-mall supermarket.
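The arithmetic behind those numbers (8.9 kg of CO2 per gallon of gasoline is the standard EPA figure; the 24 g-per-bag footprint is as given above):

```python
# CO2 comparison: one short car trip vs. single-use polyethylene bags.
miles, mpg = 5, 25
kg_co2_per_gallon = 8.9                    # EPA figure for gasoline
drive_kg = miles / mpg * kg_co2_per_gallon # ~1.78 kg for the round trip leg
bag_g = 24                                 # footprint of one 4 g bag, from the text
bags_equivalent = drive_kg * 1000 / bag_g  # ~74 bags
```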



Engineer watches Rogue One, critique ensues



Typically, switches with overarching functions (say, "master switches") will have some sort of mechanical barrier to accidental movement, for example you have to lift them or press a button to unlock them before moving; sometimes they have locking affordances so that only authorized people (with the key or the code) can move them. There were none of these basic precautions here.

Apparently this switch controlling the entire facility's communications was located on the side of the taxiway for one of the landing pads, for... reasons? (Well, there's a reason: to get the drama of the pilot linking the cable and then the sacrifice of the two other fighters.)

And as for the final fight on top of the tower…


Consider that even if there was some reason the antenna was in some way dependent on actuators located on these pontoons, the controls for those actuators need not be near the actuators. It would make more sense for them to be near the central column anyway, just like the controls for a ship's engine are in the engine control room and act electrically on the actuators in the engine room (where there are backup electric controls and also mechanical access to the actuators themselves).



Big box gyms playing their usual pricing games of this season



(It's not hard to identify 24HourFitne…, ahem, the Big Box franchise from the name of the plans, but this is not a franchise-specific problem, it's a "all big box gyms and many smaller gyms that copy their policies" problem.)

And of course gyms want resolutioners to sign up for a year, as they know most of them will drop out soon:




Book buying, a personal history



So many books, so little time. But at least the wait is much shorter now.



Linkage


Unlike all the CYA statements people add to their various social media accounts to emphasize that which should be obvious — that retweeting and commenting is not an endorsement, much less a blanket endorsement of the entire sub-topology of what is being retweeted or commented on — these links are my endorsement of the content linked:

Plants can improve your work life — Phys.org

This may be a transcendent year for SpaceX — Ars Technica.

The World's Largest Science Experiment — Physics Girl on YouTube (video)

Metal Mayhem - with Andrew Szydlo — Royal Institution on YouTube (video)

The Hacksmith is taking a social media break. (Instagram.)

And showing that sports are much better when you replace them with engineering, here's Destin 'Smarter Every Day' Sandlin:






Live long and prosper.

Tuesday, December 10, 2019

Analysis paralysis vs precipitate decisions

Making good decisions includes deciding when you should make the decision.

There was a discussion on Twitter where Tanner Guzy (whose writings/tweets about clothing provide a counterpoint to the stuffier subforums of The Style Forum and the traditionalist The London Lounge) expressed a common opinion that is, alas, too reductive:


The truth is out there... ahem, is more complicated than that:


Making a decision without enough information is precipitate and usually leads to wrong decisions, in that even if the outcome turns out well it's because of luck; relying on luck is not a good foundation for decision-making. The thing to do is continue to collect information until the risk of making a decision is within acceptable parameters.

(If a decision has to be made by a certain deadline, then the risk parameters should work as a guide to whether it's better to pass on the opportunities afforded by the decision or to risk making the decision based on whatever information is available at that time.)

Once enough information has been obtained to make the decision risk acceptable, the decision-maker should commit to the appropriate course of action. If the decision-maker keeps postponing the decision and waiting for more information, that's what is correctly called "analysis paralysis."

Let us clarify some of these ideas with numerical examples, using a single yes/no decision for simplicity. Say our question is whether to short the stock of a company that's developing aquaculture farms in the Rub' al Khali.

Our quantity of interest is the probability that the right choice is "yes," call it $p(I_t)$ where the $I_t$ is the set of information available at time $t$. At time zero we'll have $p(I_0) = 0.5$ to represent a no-information state.

Because we can hedge the decision somewhat, there's a defined range of probabilities for which the risk is unacceptable (say from 0.125 to 0.875 for our example), but outside of that range the decision can be taken: if the probability is consistently above 0.875 it's safe to choose yes, if it's below 0.125 it's safe to choose no.

Let's say we have some noisy data; there's one bit of information out there, $T$ (for true), which is either zero or one (zero means the decision should be no, one that it should be yes), but each data event is a noisy representation of $T$; call it $E_i$, where $i$ indexes the data events, defined as

$E_i = T $ with probability $1 - \epsilon$  and

$E_i = 1-T $ with probability $\epsilon$,

where $\epsilon$ is the probability of an error. These data events could be financial analysts reports, feasibility analyses of aquaculture farms in desert climates, political stability in the area that might affect industrial policies, etc. As far as we're concerned, they're either favorable (if 1) or unfavorable (if 0) to our stock short.

Let's set $T=1$ for illustration, in other words, "yes" is the right choice (as seen by some hypothetical being with full information, not the decision-maker). In the words of the example decision, $T=1$ means it's a good idea to short the stock of companies that purport to build aquaculture farms in the desert (the "yes" decision).

The decision-maker doesn't know that $T=1$, and uses as a starting point the no-knowledge position, $p(I_0) = 0.5$.

The decision-maker collects information until such a time as the posterior probability is clearly outside the "zone of unacceptable risk," here the middle 75% of the probability range. Probabilities are updated using Bayes's rule assuming that the decision-maker knows the $\epsilon$, in other words the reliability of the data sources:

$p(I_{k+1} | E_{k+1} = 1) = \frac{ (1- \epsilon) \times p(I_k)}{(1- \epsilon) \times p(I_k) + \epsilon \times (1- p(I_k))}$  and

$p(I_{k+1} | E_{k+1} = 0) = \frac{ \epsilon \times p(I_k)}{  \epsilon \times p(I_k) + (1- \epsilon) \times (1- p(I_k)) }$.
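In code, one Bayes update is just a likelihood-weighted renormalization. A minimal Python sketch (the function and variable names are mine, not from the post):

```python
def update(p, event, eps):
    """One Bayes update of p = Pr(T = 1) after observing a data event.

    event is 1 (favorable) or 0 (unfavorable); eps is the error
    probability of the data source.
    """
    # Likelihood of the observed event under T = 1 and under T = 0
    like_t1 = (1 - eps) if event == 1 else eps
    like_t0 = eps if event == 1 else (1 - eps)
    return like_t1 * p / (like_t1 * p + like_t0 * (1 - p))
```

For example, starting from $p(I_0) = 0.5$ with $\epsilon = 0.3$, a favorable event moves the posterior to 0.7 and a subsequent unfavorable event moves it back to 0.5, as symmetry requires.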

For our first example, let's have $\epsilon=0.3$, a middle-of-the-road case. Here's an example (the 21 data events are in blue, but we can only see the ones because the zeros have zero height):


We get twenty-one reports and analyses; some (1, 4, 6, 8, 9, 13, 14, and 21) are negative (they say we shouldn't short the stock), while the others are positive; this data is used to update the probability, in red, and that probability is used to drive the decision. (Note that event 21 would be irrelevant as the decision would have been taken before that.)

In this case, making a decision before the 17th data event would be precipitate; for better resilience, one should wait for at least two more events without the probability re-entering the zone of unacceptable risk before committing to a yes. Making the decision only after event 19, then, isn't a case of analysis paralysis.

Another example, still with $\epsilon=0.3$:


In this case, committing to yes after event 13 would be precipitate, whereas after event 17 would be an appropriate time.

If we now consider cases with lower noise, $\epsilon=0.25$, we can see that decisions converge to the "yes" answer faster and also why one should not commit as soon as the first data event brings the posterior probability outside of the zone of unacceptable risk:



If we now consider cases with higher noise, $\epsilon=0.4$, we can see that it takes longer for the information to converge (longer than the 21 events depicted) and therefore a responsible decision-maker would wait to commit to the decision:



In the last example, the decision-maker might take a gamble after data event 18, but to be sure, the commitment should come only after a couple of events in which the posterior probability stays outside the zone of unacceptable risk.

Deciding when to commit to a decision is as important as the decision itself; precipitate decisions come from committing too soon, analysis paralysis from a failure to commit when appropriate.
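The whole setup can be simulated end-to-end. This is my own sketch in Python, not the author's code; the thresholds, $\epsilon$, and the "wait for a couple of consecutive events outside the zone" rule follow the text above:

```python
import random

def simulate(T=1, eps=0.3, n_events=21, lo=0.125, hi=0.875,
             patience=2, seed=0):
    """Sequentially update Pr(T = 1) from noisy events; commit only after
    `patience` consecutive posteriors outside the [lo, hi] risk zone."""
    rng = random.Random(seed)
    p, streak = 0.5, 0
    for k in range(1, n_events + 1):
        # Each data event reports T correctly with probability 1 - eps
        event = T if rng.random() < 1 - eps else 1 - T
        like_t1 = (1 - eps) if event == 1 else eps
        like_t0 = eps if event == 1 else (1 - eps)
        p = like_t1 * p / (like_t1 * p + like_t0 * (1 - p))
        streak = streak + 1 if (p > hi or p < lo) else 0
        if streak >= patience:
            return ("yes" if p > hi else "no"), k, p
    return ("undecided", n_events, p)
```

With noiseless data ($\epsilon = 0$) the rule commits after exactly two events; with noisy data, the commitment time varies with the realized sequence, which is what the figures above illustrate.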

Friday, November 22, 2019

Fun with numbers for November 22, 2019

How lucky can asteroid miners be?



So, I was speed-rereading Orson Scott Card's First Formic War books (as one does; the actual books, not the comics, BTW), and took issue with the luck involved in noticing the first formic attack ship.

Call it the "how lucky can you get?" issue.

Basically, the miner ship El Cavador (literally "The Digger" in Castilian) in the Kuiper belt had to be incredibly lucky to see the formic ship, since the ship wasn't in the plane of the ecliptic and therefore could be anywhere in the space between 30 AU (4,487,936,130 km) and 55 AU (8,227,882,905 km) from the Sun.

The volume of space between radii $r_1$ and $r_2$, for $r_2 < r_1$, is $\frac{4}{3}\, \pi (r_1^3 - r_2^3)$, so the volume between 30 and 55 AU is about $1.95 \times 10^{30}$ cubic kilometers.

Let's say the formic ship is as big as the area of Manhattan with 1 km height, i.e. 60 km$^3$. What the hay, let's add a few other boroughs and make it 200 km$^3$. Then, it occupies a fraction of about $10^{-28}$ of that space.

To put that fraction into perspective, the odds of winning each of the various lotteries in the US are around 1 in 300 million or so; the probability of the formic ship being in a specific ship-sized region of the volume is slightly lower than the probability of winning three lotteries and rolling three sixes with three dice, all together.

What if the ship was as big as the Earth, or it could be detected within a ball of the radius of the Earth? Earth's volume is close to 1 trillion cubic kilometers, so the fraction is about $5.1 \times 10^{-19}$; much more likely: roughly as likely as winning two lotteries and drawing a red king from a deck of cards, simultaneously.

Let us be a little more generous with the discoverability of the formic ship. Let's say it's discoverable within a light-minute; that is, all El Cavador has to do is observe a ball with a 1 light-minute radius that happens to contain the formic ship. In this case, the odds are significantly better: about 1 in 80 million. Note that one light-minute is about 1/3 the distance between the Sun and Mercury, so this is a very large ball.

If we make an even more generous assumption of discoverability within one light-hour, the odds are about 1 in 370. But this is a huge ball: if centered on the Sun it would go past the orbit of Jupiter, with a radius about 1.4 times the distance between the Sun and Jupiter. And that's still less than a 0.3% chance of detecting the ship.

Okay, it's a suspension-of-disbelief thing. With most space opera there are a lot of things that need to happen so that the story isn't "alien ship detected, alien weapon deployed, human population terminated, aliens occupy the planet, the end." So, the miners on El Cavador got lucky and, consequently, a series of novels exploring sociology more than science or engineering could be written.

Still, the formic wars are pretty good space opera, so one forgives these things.



Using Tribonacci numbers to measure Rstats performance on the iPad


Fibonacci numbers are defined by $F(1) = F(2)= 1$ and $F(n) = F(n-1) + F(n-2)$ for $n>2$. A variation, "Tribonacci" numbers ("tri" for three), uses $T(1) = T(2) = T(3) = 1$ and $T(n) = T(n-1) + T(n-2) + T(n-3)$ for $n>3$. These are easy enough to compute with a loop or, for that matter, a spreadsheet:


(Yes, the sequence gets very close to an exponential. There's a literature on it and everything.)

Because of the triple recursion, these numbers are also a simple way to test the speed of a given platform. (The triple recursion forces a large number of function calls and if-then-else decisions, which strains the interpreter; obviously, an optimizing compiler might transform the recursion into a loop.)

For example, to test the R front end on the iPad nano-reviewed in a previous FwN, we can use this code:


Since it runs remotely on a server, it wasn't quite as fast as on my programming rig, but at least it wasn't too bad.

Note that there's a combinatorial explosion of function calls, for example, these are the function calls for $T(7)$:


There's probably a smart mathematical formula for the total number of function calls in the full recursive formulation; being an engineer, I decided to let the computer do the counting for me, with this modified code:


And the results of this code (prettified on a spreadsheet, but computed by RStudio):


For $T(30)= 20,603,361$ there are 30,905,041 function calls. This program is a good test of function call execution speed.
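The post's code (shown as images, in R) isn't reproduced here, but the call-counting idea is simple enough to sketch in Python:

```python
def trib(n, counter):
    """Naive recursive Tribonacci; counter[0] tallies every function call."""
    counter[0] += 1
    if n <= 3:
        return 1
    return trib(n - 1, counter) + trib(n - 2, counter) + trib(n - 3, counter)

counter = [0]
print(trib(20, counter), counter[0])   # 46499 69748
```

Incidentally, a quick induction on the recursion shows that the "smart mathematical formula" exists: the total number of calls is $(3T(n)-1)/2$, which matches the counted result, since $(3 \times 20{,}603{,}361 - 1)/2 = 30{,}905{,}041$.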


Charlie's Angels and Rotten Tomatoes



Since the model is parameterized, all I need to compute one of these is to enter the audience and critic numbers and percentages. Interesting how the critics and the audience are in agreement in the 2019 remake, though the movie hasn't fared too well in the theaters. (I'll watch it when it comes to Netflix, Amazon Prime, or Apple TV+, so I can't comment on the movie itself; I liked the 2000 and 2003 movies, as comedies that they were.)



Late entry: more fun with Tesla



15-40 miles of range, at TSLA's 300 Wh/mile, is 4.5 kWh to 12 kWh. Say 12 hours of sunlight, so we're talking 375 to 1000 W of solar panels. For typical solar panels mounted at appropriate angles (150 W/m$^2$), that's 2.5 to 6.7 square meters of solar panels…

Yeah, right!
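For the record, the arithmetic above as a sketch (the 300 Wh/mile, 12 hours of sunlight, and 150 W/m$^2$ figures come from the paragraph above; the function name is mine):

```python
def solar_needs(miles, wh_per_mile=300, sun_hours=12, w_per_m2=150):
    """Daily energy (kWh), required panel power (W), and panel area (m^2)
    to recover a given daily driving range from solar panels."""
    kwh = miles * wh_per_mile / 1000
    watts = kwh * 1000 / sun_hours
    return kwh, watts, watts / w_per_m2

for miles in (15, 40):
    kwh, watts, area = solar_needs(miles)
    print(f"{miles} mi/day: {kwh} kWh, {watts:.0f} W of panels, {area:.1f} m^2")
```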



No numbers: some Twitterage from last week


Smog over San Francisco, like it's 1970s Hell-A


Misrepresenting nuclear with scary images


Snarky, who, me?



Alien-human war space opera – a comprehensive theory