Saturday, October 24, 2009

Online books on work-related subjects

A quick round-up of some recent good books, generously posted to the internets by their authors:

Noam Nisan, Tim Roughgarden, Eva Tardos, and Vijay V. Vazirani's Algorithmic Game Theory is a good introduction to algorithmic game theory, which explores the impact of computational cost on game-theoretic results. (At least that's how my game-theory-biased brain perceives AGT.)

Trevor Hastie, Robert Tibshirani, and Jerome Friedman's Elements of Statistical Learning, Second Edition. I haven't read this edition yet, but I have the first edition (on dead tree) and it is a very good book. This second edition will probably be even better.

David Easley and Jon Kleinberg's Networks, Crowds, and Markets summarizes some important results of economics, graph theory, and computer science as they relate to, unsurprisingly, networks, crowds, and markets.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze's Introduction to Information Retrieval is a book on information retrieval. I've only read it lightly, but it appears to illustrate both the issues of large-scale data retrieval and the current best practices, clearly and in good detail.

Yes, these are work-related books, which sort of goes against my desire to keep work out of this personal blog. But sometimes one does these things even as one knows they go against one's blog's mission.

Monday, October 12, 2009

You're using that number wrongly!

Sometimes, when reading a statistics paper on economic or business matters, I feel a desire, nay, a moral obligation, to track down the authors and beat some sense into them with an econometrics tome. The latest edition of Greene's introductory graduate textbook, weighing in at 4lbs, would be a good choice. (I don't act on these feelings.)

That's probably similar to what statisticians feel when they see something like this stylized version of a discussion I read online. I stylized it so that it's short and doesn't identify the culprits or the forum.

Poster X: I'm never having children because one in four paternity tests show that the husband is not the father.

[Arguments over the "one in four" statistic. Number is acceptably sourced.]

Poster X: Since paternity fraud is one in four...

I couldn't read any more. Holy innumeracy!

Paternity fraud is not one in four, even if the number of failed paternity tests is one in four (by "failed" paternity test I mean that the presumed male parent is not the father).

Thinking like a statistician, I would question the extrapolation of the sample to the population: only a small set of parents test the paternity of their children; is the selection random?

Of course not. For a test to be done there's probably some doubt there. Maybe the child looks a little too much like mom's fitness trainer, for example. You know, the one with the muscles and the hair and the free time.

Econometricians are a subset of statisticians: they too build models over a foundation of probability theory. The subset-ing is on how they build models: typically beginning with a behavioral description of the process underlying the data, said description being based on economic theory.

An econometrician would suggest that the doubt has to be substantial. The costs are high: paying for the test, recriminations (on his part if it fails, on her part if it doesn't), and the general lack of trust that testing indicates. These costs are sure if the test is taken.

What are the benefits? If the test passes, there's guaranteed genetic continuity; but what happens if the test fails? Divorce? Strained relationship? These benefits are probabilistic, with the probability being an a-priori function of the lack of trust.

For a test to happen, the expected benefits have to outweigh the sure cost. So, either the probability of test failure is high or the father places a high differential value on genealogy vs. divorce/relationship strain.

My estimation: /* = guess */ the value of genetic continuity must be high enough to cover the sure costs in the "test passes" case, otherwise there wouldn't be any tests. How much more than the costs, I can't say. For the "test fails" case there's a secondary decision point. Value of divorce may be positive, since there was clear infidelity and attempted deception, but it may also be a net negative. Value of continued relationship given the failed test, positive or negative? I leave these questions to the philosophers and assume that, integrating over the space of possibilities, the value will be small in the "test fails" case.

So a high a-priori probability of test failure will be the main reason for testing.

Using this inference, the rate of failed tests must be much higher than the actual rate of paternity fraud. The difference decreases with the value fathers place on accurate genealogy and with the availability of muscular long-haired fitness trainers.

(Predicting comparative statics is one of the advantages of always thinking in behavioral models.)
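For the numerically inclined, here is a minimal simulation sketch of the selection effect. Everything in it is an assumption made for illustration, not data: each father's doubt is drawn from a Beta(0.5, 12) distribution (population mean around 4%), the doubt is taken to be calibrated (it equals the true chance he is not the father), and the whole cost-benefit comparison is collapsed into a single hypothetical `doubt_threshold` above which he pays for a test.

```python
import random


def simulate(n_fathers=100_000, doubt_threshold=0.2, seed=0):
    """Compare the population non-paternity rate with the failed-test rate.

    All parameters are illustrative guesses, not estimates from real data.
    """
    rng = random.Random(seed)
    n_not_father = 0
    n_tested = 0
    n_failed = 0
    for _ in range(n_fathers):
        # Assumed: doubt is calibrated, i.e. it is the true probability
        # that this man is not the father. Beta(0.5, 12) has mean 0.04.
        doubt = rng.betavariate(0.5, 12.0)
        not_father = rng.random() < doubt
        n_not_father += not_father
        # Assumed decision rule: test only when doubt is high enough
        # to justify the sure costs of testing.
        if doubt > doubt_threshold:
            n_tested += 1
            n_failed += not_father
    population_rate = n_not_father / n_fathers
    failed_test_rate = n_failed / n_tested
    share_tested = n_tested / n_fathers
    return population_rate, failed_test_rate, share_tested


if __name__ == "__main__":
    for threshold in (0.2, 0.05):
        pop, failed, share = simulate(doubt_threshold=threshold)
        print(f"threshold={threshold}: population rate={pop:.3f}, "
              f"failed-test rate={failed:.3f}, share tested={share:.3f}")
```

Under these assumptions the failed-test rate comes out several times the population rate, because only the doubtful test. Lowering the threshold (a stand-in for fathers who value accurate genealogy more) brings more fathers into the tested group and pulls the failed-test rate back toward the population rate, which is the comparative static described above.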

I see some potential here for an Exec-Ed exercise, but to pass the Gauleiters it must be gender-neutral. Huh.

Yes, Wrongly. Adverb, not adjective. Modifies "using," the verb, not "number," the noun. It's not the wrong number, it is the right number used for the wrong purpose. Used wrongly.