Saturday, December 10, 2011

Why analytics practitioners need to worry about "abstruse" statistical/econometric issues

Because these issues are plentiful in data!

Let's say a computer manufacturer has a web site where its customers can configure laptops. The manufacturer notes that a lot of people don't complete the process – some just look at the page, some configure bits and pieces, some go all the way to the end but don't buy, and some configure and buy a laptop – but it gets a lot of data nonetheless.

Analysts salivate over this big data, its multiple measures, and its fine level of disaggregation, and they use machine-learning tools to find patterns. But sometimes they fail to check for basic data problems first. Here are some customer-behavior-related sources of such problems:

1. Customers configuring laptops are likely to each have a budget, even if it's only a mental one. This makes their choices of variable values in their laptop configurations, say X memory and Y processor speed, interdependent via an unobserved variable. (When they configure the laptop, their choices along these dimensions are driven partially by the budget, but that budget is not observed by the analysts.) This will create collinearities in the right-hand-side variables of the data that would be detected by traditional statistical tools (like factor or principal component analysis, or more simply the non-significant coefficients in a choice model estimation) but are obscured by some machine learning algorithms.
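A minimal simulation makes the point concrete. Everything here is hypothetical: the budget range, the mapping from budget to memory and processor speed, and the noise levels are made-up numbers chosen only to illustrate how an unobserved common driver leaves a collinearity fingerprint in the observed variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Unobserved budget (hypothetical dollar range) drives both choices.
budget = rng.uniform(500, 2500, n)

# Hypothetical mapping: bigger budgets buy more memory AND faster
# processors, plus some individual taste noise.
memory_gb = 2 + 0.004 * budget + rng.normal(0, 0.5, n)
cpu_ghz = 1.5 + 0.0008 * budget + rng.normal(0, 0.1, n)

# The analyst never sees `budget`, but its fingerprint is a strong
# correlation between the observed right-hand-side variables.
r = np.corrcoef(memory_gb, cpu_ghz)[0, 1]
print(f"corr(memory, cpu) = {r:.2f}")
```

With these (made-up) parameters the observed correlation comes out above 0.9, even though neither variable causes the other: the budget does all the work from off-stage.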

2. Many of the dependent measures used, other than choice-related ones – customer time-on-page, number of options clicked, number of pages seen, number of menus browsed, number of re-entries into the same page, Facebook likes, page tweets, etc. – are highly collinear as well. Often these measures are presented as independent and corroborating measures of interest. This is misleading: they are measures of the same thing using different proxies. (This can be identified with factor or principal component analysis; if two variables really were independent measures of interest – which would be necessary for them to be corroborating – then PCA or FA would separate them as such, given enough data.)
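Here is a sketch of that PCA check on simulated data, assuming (purely for illustration) that a single latent "engagement" factor drives four behavioral proxies with made-up loadings:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# One latent "engagement" factor drives all the behavioral proxies.
engagement = rng.normal(0, 1, n)
proxies = np.column_stack([
    engagement * w + rng.normal(0, 0.3, n)   # proxy = loading * factor + noise
    for w in (1.0, 0.8, 1.2, 0.9)            # hypothetical loadings for
])                                           # time-on-page, clicks, pages, menus

# PCA via SVD on the standardized matrix.
Z = (proxies - proxies.mean(0)) / proxies.std(0)
s = np.linalg.svd(Z, compute_uv=False)
explained = s**2 / (s**2).sum()
print("variance explained by PC1:", round(explained[0], 2))
```

One component soaks up roughly 90% of the variance: four "corroborating" measures, one underlying quantity. Genuinely independent measures of interest would instead load on separate components.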

3. Different customers will have different degrees of expertise about laptops, so their choices are likely to carry different degrees of idiosyncratic error, giving each individual's stochastic disturbances their own variance (different from other individuals'). In other words, the data is probably plagued with heteroskedasticity. That's not a big problem per se, since it's easily corrected in estimation, but it becomes a problem when, on the rare occasion that standard errors of estimates are shown, the analysts fail to use robust standard error estimation.
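For concreteness, here is a sketch of what "robust standard error estimation" means in the simplest case: a White (HC0) sandwich estimator computed by hand for OLS on simulated data where error variance depends on a made-up expertise variable. The numbers and the variance model are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3000

# Expertise varies by customer; low-expertise customers choose noisily,
# so the disturbance variance differs across individuals.
expertise = rng.uniform(0.2, 2.0, n)
x = rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])
e = rng.normal(0, 1, n) / expertise     # heteroskedastic errors
y = 1.0 + 0.5 * x + e

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)

# Classical SEs assume one common error variance.
se_classic = np.sqrt(np.diag(XtX_inv) * resid.var(ddof=2))

# White (HC0) sandwich: robust to per-individual error variances.
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
print("classical:", se_classic, " robust:", se_robust)
```

In practice one would use a library routine (e.g. a heteroskedasticity-consistent covariance option in an econometrics package) rather than hand-rolling the sandwich; the point is that the correction is a one-liner, so there is no excuse for skipping it.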

4. Often a customer will configure multiple versions of the same laptop to see the price of different feature combinations. This is likely to create serial correlation in the stochastic disturbances: if Bob comes in today after a friend told him how important memory was for a Windows computer, that idiosyncratic error will propagate across all configurations Bob creates today; if tomorrow Bob hears disk space is the key issue, that idiosyncratic error will propagate across all configurations Bob creates tomorrow.
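The Bob story can be simulated directly, with made-up sizes: one shared "what-my-friend-said-today" shock per day, stacked on top of per-configuration noise. The shared shock shows up as strong autocorrelation in the stacked disturbances.

```python
import numpy as np

rng = np.random.default_rng(3)
days, per_day = 50, 20   # hypothetical: 50 days, 20 configurations/day

# One day-level shock ("memory matters!", "no, disk space!") shared by
# every configuration created that day, plus per-configuration noise.
day_shock = np.repeat(rng.normal(0, 1, days), per_day)
noise = rng.normal(0, 0.3, days * per_day)
resid = day_shock + noise

# Lag-1 autocorrelation of the stacked disturbances: near zero if the
# errors were independent, strongly positive here.
rho = np.corrcoef(resid[:-1], resid[1:])[0, 1]
print(f"lag-1 autocorrelation = {rho:.2f}")
```

Any estimator that treats these observations as independent draws will understate its standard errors, because most consecutive rows share the same day-level shock.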

Added Dec 12: Serial correlation is an even more likely problem when analyzing clickstreams, as web links are unidirectional (the reverse motion via the "back" button is in many cases unobserved by the clickstream collection system, and is not used very often in the middle of long clickstreams anyway), and idiosyncratic factors on one page may drive a long sequence of browsing down one branch of the page tree rather than another. (End of addition.)

5. If Nina configures a laptop from her home computer and another from her work computer, without signing in first, the two configurations will share all of Nina's individual factors but will count as two separate individuals for estimation purposes, giving her individual factors twice the weight in the final estimation. (Also giving the variance of her idiosyncratic stochastic disturbances twice the importance.)
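A toy version of the Nina problem, with made-up numbers: duplicate every observation (home + work), treat the result as fresh individuals, and watch the standard error of a simple mean shrink by a factor of √2 with no new information behind it.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400

# Genuine sample of n distinct customers (hypothetical scores).
scores = rng.normal(0, 1, n)
se_true = scores.std(ddof=1) / np.sqrt(n)

# Nina-style duplication: every customer appears twice (home + work)
# but is treated as 2n independent individuals.
doubled = np.concatenate([scores, scores])
se_naive = doubled.std(ddof=1) / np.sqrt(2 * n)

print(f"honest SE = {se_true:.4f}, naive SE = {se_naive:.4f}")
```

The naive standard error is about 1/√2 of the honest one, so confidence intervals look tighter than the data warrants – a spurious gain in precision bought entirely by double-counting.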

I've been told more than once by people who work in analytics that these are "minor" or "abstruse" statistical points; the people in question "learned" them in a statistics or econometrics class long ago, but proceeded to forget them, at least operationally, in their careers. Of course, these "minor," "abstruse" points are the difference between results being informative and being little more than random noise.

I'm pro-analytics, but I want them done correctly.