
Sunday, December 1, 2019

Fun with Numbers for December 1, 2019

007: GoldenEye gets an orbit right


I was reading the book 007: GoldenEye and noticed that Xenia Onatopp's description doesn't match Famke Janssen's looks; oh, and also this:


At first glance, the book appears to be playing fast and loose with orbits; after all, the ISS, which orbits at around 400 km altitude, is also on a roughly 90-minute orbit. So, let us check the numbers.

The first step is computing the acceleration of gravity $g_{100}$ at 100 km altitude. Using Newton's formula we could compute it from first principles (radius and mass of the Earth, gravitational constant... too many things to look up), or we can take the precomputed surface value $g = 9.8$ m/s$^2$ and scale it by the ratio of Newton's formula at the two radii (using 6370 km as the radius of the Earth):

$ g_{100} = 9.8 \times \left(\frac{6370}{6470}\right)^2 = 9.5$ m/s$^2$

This acceleration has to match the centripetal acceleration of a circular orbit with radius 6470 km, $a = v^2/r = g_{100}$, yielding an orbital speed of 7.84 km/s.

The circumference of a great circle at 100 km altitude is $2 \times \pi \times 6470$ km = 40,652 km, giving a total orbit time of about 5185 s, or 1 hour, 26 minutes, and 25 seconds. So close enough to ninety minutes for a general.
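The same computation in R, using the same rounded constants:

g_surface = 9.8                            # m/s^2 at the surface
r_orbit = (6370 + 100) * 1000              # orbital radius in meters
g_100 = g_surface * (6370e3 / r_orbit)^2   # ~9.5 m/s^2
v = sqrt(g_100 * r_orbit)                  # ~7840 m/s
2 * pi * r_orbit / v                       # ~5185 s, about 86 minutes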

So, yes, GoldenEye's orbit makes sense(-ish), even though it's much lower than that of the ISS, which also has a roughly 90-minute orbital period (92 minutes, on a very mildly elliptical orbit).

On the other hand, a 100 km orbit would graze the atmosphere (it's inside the thermosphere layer, near the bottom) and therefore lose energy over time, so not a great orbit to place an orbital weapon masquerading as a piece of space debris, because you can't boost up "space debris."

Here are the circular orbital periods for different altitudes. Because of the approximations ($g = 9.8$ m/s$^2$, Earth radius 6370 km), the error grows with altitude; it's most obvious for the GEO orbit (in yellow), but even there it's under 2 minutes 38 seconds, so all the lower orbits are off by less than that:
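The chart applies the same formula across altitudes; in R, something like this (the GPS and GEO altitudes are the usual round numbers):

altitude_km = c(100, 200, 400, 1000, 20200, 35786)  # LEO through GPS and GEO
r = (6370 + altitude_km) * 1000                     # orbital radius, meters
g = 9.8 * (6370e3 / r)^2                            # gravity at altitude
period_min = 2 * pi * r / sqrt(g * r) / 60          # circular orbit period, minutes
round(period_min, 1)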




There's no True(x) function for the internet (or anywhere else)



(Ignore the bad grammar, it was a long day.)

What happens if we feed the [putative social media lie-detector] function $\mathrm{TRUE}(x)$ the statement $x=$"the set of all sets that don't contain themselves contains itself"?

Let's take a short detour to the beginning of the last century...

Most sets one encounters in everyday math don't contain themselves: the set of real numbers $\mathbb{R}$ doesn't contain itself, neither does the set $\{$chocolate, Graham cracker, marshmallow$\}$, for example. So one could collect all these sets that don't contain themselves into a set $S$, the set of all sets that don't contain themselves. So far so good, until we ask whether $S$ contains itself.

Well, one would reason, let's say $S$ doesn't contain itself; then $S$ is a set that doesn't contain itself, which means it's one of the sets in $S$. Oops.

Maybe if we start from the other side: say $S$ contains itself; but in that case $S$ is a set that contains itself, and doesn't belong in $S$.

This is Russell's paradox, and it shows that there are statements with no possible truth value; no $\mathrm{TRUE}(x)$ function, for the internet or anywhere else, can handle them.



On the price of micro-SD cards


Browsing Amazon for Black Friday deals (I saved 100% on Black Friday with coupon code #DontBuyUnnecessaryStuff and you can too), I saw these micro-SD cards:


Instead of buying them, I decided to analyze their prices, first computing the average cost per GB (as seen above) and then realizing that there's a fixed component to the price apart from the cost per GB, which a simple linear model captures:
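The actual prices are in the screen capture; to show the shape of the analysis, here's a sketch with placeholder prices (only the capacity tiers are real):

capacity_gb = c(32, 64, 128, 256, 512)
price_usd = c(7, 10, 17, 30, 60)        # placeholders, not the observed prices
price_usd / capacity_gb                 # average cost per GB drops with capacity
summary(lm(price_usd ~ capacity_gb))    # intercept = fixed component, slope = cost per GB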




All the electricity California needs is about 6 kilos of antimatter


I was reading a report on how much it costs to decommission (properly) a wind farm and realized that if we just had some antimatter lying around (!), California's energy needs could be met with a small quantity of it.
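The arithmetic, assuming roughly 280 TWh of electricity per year for California (my round number, not from the report), and remembering that annihilation also consumes an equal mass of ordinary matter:

c_light = 2.998e8            # m/s
e_year = 280e12 * 3600       # ~280 TWh/year (assumed), in joules
e_year / (2 * c_light^2)     # kg of antimatter: ~5.6, call it 6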


Okay, antimatter is a bit dangerous, so how about we develop that cold fusion people keep talking about? Here:


(Divide that by an efficiency factor if you feel like it.)



Relativity misconceptions and the reason I restarted blogging


I was listening to a podcast with Hans G Schantz, author of The Hidden Truth trilogy (so far… fans eagerly await the fourth installment; highly recommended), and he had to correct the podcast host on what I've noticed is a very common misconception: that "near" the speed of light relativistic effects are very large.

Which is true, for an appropriate understanding of "near."

Time dilation, space contraction, and mass increase are all governed by the function $\gamma(v) = (1 -(v/c)^2)^{-1/2}$, which is very non-linear. For the type of effects people typically think about, like tenfold changes, we're talking about speeds near $0.995 c$; for effects that would be noticeable in small objects or short durations, one needs to go significantly above that:
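In R, for concreteness (speeds as fractions of $c$):

gamma = function(v) 1 / sqrt(1 - v^2)     # Lorentz factor, v as a fraction of c
gamma(c(0.5, 0.9, 0.99, 0.995, 0.9999))   # ~1.15, 2.29, 7.09, 10, 71
sqrt(1 - 1 / 10^2)                        # speed needed for a tenfold effect: ~0.995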


Interestingly, the decision to restart blogging (first under the new name "Fun with numbers," then back to Boetius's admonition to keep one's thoughts to oneself) was due to a number of calculations I had been tweeting about relativistic effects in the Torchship trilogy by Karl K Gallagher (also highly recommended). Here are some examples, from Twitter:



And it's always heartwarming to see an author who keeps the science fiction human: in a universe with mass-to-energy converters, wormhole travel, rampaging artificial intelligences, and AI-made elements beyond oganesson (118), there's a place for the problem-solving power of a wrench:





Computerphile has a simple data analysis course on YouTube using R



Link to the playlist here.
Download RStudio here.



Another promising lab rig that I hope will become a product at scale



The Phys.org article is here and the actual Science Advances paper is here.

Strictly speaking, what the paper describes is a successful laboratory test rig, but let's be generous and consider it a successful tech demo, also known in the low-tech world as a proof of concept. Not every successful lab test rig becomes a successful tech demo, but successful rigs make that jump at a much higher rate than lab rigs in general (successful and otherwise) do, so it's not that big a leap in the technology development process.

Friday, November 22, 2019

Fun with numbers for November 22, 2019

How lucky can asteroid miners be?



So, I was speed-rereading Orson Scott Card's First Formic War books (as one does; the actual books, not the comics, BTW), and took issue with the luck involved in noticing the first formic attack ship.

Call it the "how lucky can you get?" issue.

Basically, the miner ship El Cavador (literally "The Digger" in Castilian), in the Kuiper Belt, had to be incredibly lucky to spot the formic ship: it wasn't in the plane of the ecliptic, and therefore could be anywhere in the space between 30 AU (4,487,936,130 km) and 55 AU (8,227,882,905 km) from the Sun.

The volume of the spherical shell between radii $r_2$ and $r_1$, with $r_2 < r_1$, is $\frac{4}{3}\, \pi \left(r_1^3 - r_2^3\right)$, so the volume between 30 and 55 AU is about $1.95 \times 10^{30}$ cubic kilometers.

Let's say the formic ship is as big as Manhattan with 1 km of height, i.e. 60 km$^3$. What the hay, let's add a few other boroughs and make it 200 km$^3$. Then it occupies a fraction of about $1 \times 10^{-28}$ of that space.

To put that fraction into perspective: the odds of winning each of the various lotteries in the US are around 1 in 300 million or so; the probability of the formic ship being at a specific point of the volume is about one-tenth of the probability of winning three lotteries and rolling two sixes with a pair of dice, all together.

What if the ship were as big as the Earth, or could be detected within a ball the radius of the Earth? Earth's volume is close to 1 trillion cubic kilometers, so the fraction is about $5.5 \times 10^{-19}$; much more likely: roughly as likely as winning two lotteries and drawing the king of hearts from a deck of cards, simultaneously.

Let us be a little more generous with the discoverability of the formic ship. Let's say it's discoverable within a light-minute; that is, all El Cavador has to do is observe a ball with a 1-light-minute radius that happens to contain the formic ship. In this case the odds are significantly better: about 1 in 80 million. Note that one light-minute is about 1/3 of the distance between the Sun and Mercury, so this is a very large ball.

If we make an even more generous assumption of discoverability within one light-hour, the odds become about 1 in 370. But this is a huge ball: centered on the Sun, it would extend past the orbit of Jupiter, with a radius about 1.4 times the Sun-Jupiter distance. And that's still well under a 1% chance of detecting the ship.
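The arithmetic for all the cases, in R:

au = 149597871                                   # km
shell = 4/3 * pi * ((55 * au)^3 - (30 * au)^3)   # ~1.95e30 km^3
shell / 200                                      # ~1e28 to 1 against a 200 km^3 ship
shell / 1.083e12                                 # ~1.8e18 to 1 for an Earth-sized ball
shell / (4/3 * pi * (60 * 299792.458)^3)         # ~80 million to 1, light-minute ball
shell / (4/3 * pi * (3600 * 299792.458)^3)       # ~370 to 1, light-hour ball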

Okay, it's a suspension of disbelief thing. With most space opera, there are a lot of things that need to happen so that the story isn't "alien ship detected, alien weapon deployed, human population terminated, aliens occupy the planet, the end." So the miners on El Cavador got lucky and, consequently, a series of novels exploring sociology more than science or engineering can be written.

Still, the formic wars are pretty good space opera, so one forgives these things.



Using Tribonacci numbers to measure Rstats performance on the iPad


Fibonacci numbers are defined by $F(1) = F(2)= 1$ and $F(n) = F(n-1) + F(n-2)$ for $n>2$. A variation, "Tribonacci" numbers ("tri" for three), uses $T(1) = T(2) = T(3) = 1$ and $T(n) = T(n-1) + T(n-2) + T(n-3)$ for $n>3$. These are easy enough to compute with a loop or, for that matter, a spreadsheet:
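As a loop in R, for example:

tribonacci = function(n) {          # iterative version: linear time
  t3 = c(1, 1, 1)                   # the three seeds
  if (n <= 3) return(t3[n])
  for (i in 4:n) t3 = c(t3[2], t3[3], sum(t3))
  t3[3]
}
tribonacci(30)                      # 20603361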


(Yes, the sequence gets very close to an exponential. There's a literature on it and everything.)

Because of the triple recursion, these numbers are also a simple way to test the speed of a given platform. (The triple recursion forces a large number of function calls and if-then-else decisions, which strains the interpreter; obviously an optimizing compiler might transcode the recursion into a for-loop.)

For example, to test the R front end on the iPad nano-reviewed in a previous FwN, we can use this code:
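(The code itself is in the screen capture; it would have been along these lines, a deliberately naive recursion plus a timer:)

trib = function(n) {                # naive triple recursion, deliberately slow
  if (n <= 3) return(1)
  trib(n - 1) + trib(n - 2) + trib(n - 3)
}
system.time(trib(30))               # stresses function calls and branching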


Since it runs remotely on a server, it wasn't quite as fast as on my programming rig, but at least it wasn't too bad.

Note that there's a combinatorial explosion of function calls; for example, these are the function calls for $T(7)$:


There's probably a smart mathematical formula for the total number of function calls in the full recursive formulation; being an engineer, I decided to let the computer do the counting for me, with this modified code:
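(Again, the original code is an image; the idea is a global counter bumped on every call, something like:)

calls = 0
trib_counted = function(n) {
  calls <<- calls + 1               # increment the global call counter
  if (n <= 3) return(1)
  trib_counted(n - 1) + trib_counted(n - 2) + trib_counted(n - 3)
}
trib_counted(30)                    # 20603361
calls                               # 30905041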


And the results of this code (prettified on a spreadsheet, but computed by RStudio):


For $T(30)= 20,603,361$ there are 30,905,041 function calls, so this program is a good test of function-call execution speed. (There is, in fact, a simple formula: with these seeds the number of calls is $(3\,T(n)-1)/2$.)


Charlie's Angels and Rotten Tomatoes



Since the model is parameterized, all I need to do to compute one of these is enter the audience and critic numbers and percentages. Interesting how the critics and the audience agree on the 2019 remake, though the movie hasn't fared too well in theaters. (I'll watch it when it comes to Netflix, Amazon Prime, or Apple TV+, so I can't comment on the movie itself; I liked the 2000 and 2003 movies, as the comedies that they were.)



Late entry: more fun with Tesla



15-40 miles of range, at TSLA's 300 Wh/mile, is 4.5 kWh to 12 kWh. Say 12 hours of sunlight, so we're talking 375 to 1000 W of solar panels. For typical solar panels mounted at appropriate angles (150 W/m$^2$), that's 2.5 to 6.7 square meters of solar panels…
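In R, for the record:

range_miles = c(15, 40)
kwh = range_miles * 300 / 1000      # at 300 Wh/mile: 4.5 to 12 kWh
watts = kwh * 1000 / 12             # over 12 hours of sun: 375 to 1000 W
watts / 150                         # at 150 W/m^2: 2.5 to 6.7 square meters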

Yeah, right!



No numbers: some Twitterage from last week


Smog over San Francisco, like it's 1970s Hell-A


Misrepresenting nuclear with scary images


Snarky, who, me?



Alien-human war space opera – a comprehensive theory





Saturday, November 9, 2019

Fun with numbers for November 9, 2019

Science illustrations made by people without quantitative sensibility


From a tweet I saw retweeted by someone I follow (lost the reference), this is supposed to be a depiction of the Chicxulub impact:


My first impression (soon confirmed by minor geometry) was that the impact was too big; yes, the meteor was big for a meteor (ask the dinosaurs…), but the Earth is really, really big compared to meteors. Something that created such a large explosion on impact wouldn't just kill 75% of the species on Earth; it would probably kill everything on the planet down to the last replicating protein structure, boil the oceans, and poison the atmosphere for millions of years.

Think Vorlon planet-killer, not Centauri mass driver. 🤓

Using a graphical estimation method (fit a circle over that segment of the Earth to get its radius in pixels, so that we can translate pixels into kilometers), we can see that this is an overestimation of at least 6-fold in linear dimensions (the actual crater diameter is ~150 km):
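The method, with placeholder pixel measurements (the real ones came from fitting over the image):

earth_radius_px = 900               # fitted circle radius in pixels (placeholder)
km_per_px = 6370 / earth_radius_px  # scale: kilometers per pixel
crater_px = 130                     # crater diameter in pixels (placeholder)
crater_px * km_per_px               # implied diameter ~920 km vs the actual ~150 km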


A 6-fold increase in linear dimensions implies a 216-fold increase in volume (and therefore mass); using the estimated energy of the actual impact from Wikipedia, the energy of the impact above would be between $2.81 \times 10^{26}$ and $1.25 \times 10^{28}$ J, or up to around 22 billion times the explosive power of the largest H-bomb ever detonated, the Tsar Bomba.

The area of the Earth is 510.1 million square kilometers, so that's 43 Tsar Bombas per square kilometer, which is a lot, considering that the one Tsar Bomba actually detonated had a complete-destruction radius in excess of 60 km (an area of 11,310 square kilometers) and caused partial destruction (of weaker structures) at distances beyond 100 km (an area of 31,416 square kilometers). And, again, that's 43 of those per square kilometer; so, yeah, that would probably have been the end of all life as we know it on Earth, and I wouldn't be here blogging about it.

A more accurate measurement, using a bit of trigonometry (though still using Eye 1.0 for the tangents):


Because of the eye-based estimation, it's a good idea to do some sensitivity analysis:



(Results are slightly different for the measured case because of full-precision calculation, as opposed to the dropped digits in the original hand-calculator-and-sticky-notes calculation.)

It gets worse. In some depictions we see the meteor, and it's rendered at the size of a planetoid (using the graphical method here too, because it's quick and accurate enough):


To be clear on the scale: that image is 442 pixels wide; the actual Chicxulub meteor at the same scale as the Earth would be 1 to 7 pixels wide, smaller than the dots in the dotted lines.

For additional context, the diameter of the Moon is 3,474 km, so the meteor in the image above is almost 1/3 the diameter of the Moon (28% to be more accurate) and that impact crater is over 1/2 the diameter of the Moon (60% to be more accurate).



Solar energy density in context



2 square kilometers for 100 MW nameplate capacity… and they're in the shade in that photo, so not producing anything at the moment.

Capacity factor for solar is [for obvious reasons] hard-bounded at 50%. For California, our solar CF is 26%; let's give Peter Mayle's Provence a slightly better CF of 30%, and those 2 square km of non-dispatchable capacity become about 1/20 of a single Siemens SGT-9000H (which fits in 1200 square meters with plenty of space to spare for admin offices and a break room, and works 24/7).
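The comparison in numbers (the turbine rating below is my round figure):

avg_output_mw = 100 * 0.30          # 100 MW nameplate at 30% CF, from 2 km^2
sgt9000h_mw = 590                   # approximate rating of one large gas turbine (assumed)
avg_output_mw / sgt9000h_mw         # ~1/20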




Nano-review of R Programming Compiler for the iPad



Basics: available on the iOS App Store; uses a remote server to run the code, so a net connection is required. Free for the baseline, but seven dollars for plots and package support, which I paid. The extended keyboard is very helpful given the limitations of the iPad keyboard. (Also runs on the iPhone and the iPod touch, though I haven't used it on those yet.)

I wouldn't use it to develop code or even to run serious models, but if there's a need to do a quick simulation or analysis (or even as a matrix calculator), it's better than Numbers. Can also be used offline to browse (and edit) code, though not to run it.

The programmer-joke code snippet in the above screen capture ran instantly over free lobby internet at a hotel conference center, so the service is pretty efficient for these small tasks, which are what I'd be using it for.



Some retailers plan to eat the losses from tariffs


From Bain and Company on Twitter:


My comment (on Twitter): yeah, these are well-behaved cost and demand functions, so when a tariff is added to the cost, typically the quantity drops and the price rises, unless there's some specific strategic reason to incur short-term opportunity costs.

Rationale (from any Econ 101 course, but I felt like drawing my own, just for fun):


Note that Bain's breakpoint at 50% of the tariff is the solution to the problem under linear demand with constant marginal cost, but other shapes of demand can make that number much bigger; for example, this exponential leads to 74% (numbers rounded in the diagram but not in the computation):
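To make the linear benchmark concrete, here's a sketch (illustrative demand and cost numbers, not Bain's) that maximizes profit numerically; with linear demand and constant marginal cost the optimal price rises by exactly half the tariff, and swapping in other demand shapes moves that fraction around:

demand = function(p) 100 - p                       # linear demand, illustrative
best_price = function(mc)
  optimize(function(p) (p - mc) * demand(p),       # profit at price p
           interval = c(mc, 100), maximum = TRUE)$maximum
p0 = best_price(20)                 # pre-tariff optimum: 60
p1 = best_price(30)                 # after a 10-unit tariff: 65
(p1 - p0) / 10                      # pass-through = 0.5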


The demand function is nothing awkward or surprising, just a nice decreasing exponential:


On the other hand, if the marginal cost decreases with quantity, particularly if marginal cost is strongly convex, there's a chance the actual price increase from a tariff is higher than the tariff, even with linear demand:



Note that this is different from lazy markup pricing. Lazy markup pricing always raises the price by more than the tariff (a markup of $m$ over cost turns a tariff $t$ into a price increase of $(1+m)\,t$), so in places where such outdated pricing practices [cough Portugal /cough] are common, tariffs have a disproportionate negative impact on the economy and general welfare.



Late non-numerical entry: Another news item based on not understanding the life cycle of technologies

From Bloomberg (among many others) we learn that there's a new solar energy accumulator technology, and as usual the news writes it up as if product deployment at scale were right around the corner, whereas what we have here is a lab testing rig… that's a lot of steps before there's a product at scale. And many of those steps are covered with oubliettes.



Tuesday, June 18, 2019

Hidden factor correlation



Correlation is not causation; everyone learns to say that. But if there's a correlation, there's probably some sort of causal relationship hiding somewhere, unless it's a spurious correlation.

If two variables, $A$ and $B$, are correlated, the three simplest causal relationships are: $A$ causes $B$; $B$ causes $A$; or $A$ and $B$ are both caused by an unseen factor $C$. There are many more complicated causation structures, but these are the three basic ones.

The third case, where an unseen variable $C$ is the real source of the correlation, is the one we're interested in here. To illustrate it, let's say $C$ is a standard normal random variable, and $A$ and $B$ are noisy measures of $C$,

$ \qquad A = C + \epsilon_A$ and $ B = C + \epsilon_B$,

where the $\epsilon_i$ are drawn from a normal distribution with $\sigma_{\epsilon} = 0.05$.

To illustrate, we generate 10,000 draws of $C$ and create the corresponding $A$ and $B$ in R:

hidden_factor = rnorm(10000)                          # the unseen common cause C
var_A_visible = hidden_factor + 0.05 * rnorm(10000)   # A = C + noise
var_B_visible = hidden_factor + 0.05 * rnorm(10000)   # B = C + noise

Now we can plot $A$ against $B$, and the correlation is obvious:
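For example:

plot(var_B_visible, var_A_visible)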

And we can regress $A$ on $B$ to get the correlation and test statistics for the estimates using a linear model:

model_no_control = lm(var_A_visible~var_B_visible)
summary(model_no_control)

With the result:

Call:
lm(formula = var_A_visible ~ var_B_visible)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.271466 -0.047214 -0.000861  0.047400  0.302517 

Coefficients:
               Estimate Std. Error  t value Pr(>|t|)    
(Intercept)   0.0005294  0.0007025    0.754    0.451    
var_B_visible 0.9975142  0.0006913 1442.852  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.07025 on 9998 degrees of freedom
Multiple R-squared:  0.9952, Adjusted R-squared:  0.9952 
F-statistic: 2.082e+06 on 1 and 9998 DF,  p-value: < 2.2e-16

So, both the model and the graph confirm a strong correlation ($p < 0.0001$) between $A$ and $B$. And in many real-life cases, this is used to support the idea that either $A$ causes $B$ or $B$ causes $A$.

Now we proceed to show how the hidden factor is relevant. First, let us plot the residuals, $A-C$ against $B-C$:
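For example:

plot(var_B_visible - hidden_factor, var_A_visible - hidden_factor)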

The apparent correlation has now disappeared. And a linear model including the hidden factor confirms this:

model_with_control = lm(var_A_visible~var_B_visible+hidden_factor)
summary(model_with_control)

With the result:

Call:
lm(formula = var_A_visible ~ var_B_visible + hidden_factor)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.18347 -0.03382 -0.00021  0.03410  0.17780 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.0007378  0.0004986   1.480    0.139    
var_B_visible 0.0004082  0.0100560   0.041    0.968    
hidden_factor 0.9997573  0.0100707  99.274   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.04985 on 9997 degrees of freedom
Multiple R-squared:  0.9976, Adjusted R-squared:  0.9976 
F-statistic: 2.072e+06 on 2 and 9997 DF,  p-value: < 2.2e-16

Hidden factors are easy to test for, as seen here, but they are not always apparent. For example, in nutrition papers there's often a hidden factor, how health-conscious an individual is, that more often than not causes both observables (say, exercising regularly and eating salads: highly correlated, but exercising doesn't cause eating salads and eating salads doesn't cause exercising).

Correlation is not causation, but generally one can find a causal relationship behind a correlation, possibly one that involves hidden factors or more complex relationships.