Wednesday, May 18, 2011

Averages and trends

Some lessons from data processing for corporate planning (with, possibly, other applications).

In the 14th Century, when I sat on the student side of an MBA classroom, there was this discipline called "Strategic Planning And Corporate Policy," in which we spent several sessions learning the twin dark arts of forecasting and scenario planning. Fast forward to today and add some minor statistical sophistication, and here we are.*

Suppose there's a variable of interest that is an input to major strategic decisions and for which we have historical data with varying accuracy; say, population density. And to make decisions we need to get some sense of how it's changing, say an average trend.

The problem with density is that in some places it will make sense to measure it over areas and in other places, say city centers, over volume. So, how does one average two different metrics of the variable of interest ($\text{people}/\text{km}^2$ and $\text{people}/\text{km}^3$)?

For basic logistics planning one could flatten the volume and project it onto the surface; but that doesn't work for other applications, like trends in the design of living space, where the relevant average space in Manhattan really is a volume. So, how to solve the general problem of finding an average that is a metric for trend computation?
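To make the flattening workaround concrete, here's a minimal Python sketch; the habitable-height figure is an invented assumption, not data, and the function name is mine.

```python
# Minimal sketch of the "flatten the volume" workaround: convert a volumetric
# density (people/km^3) into an areal one (people/km^2) by assuming an
# effective habitable height for the built-up area. The numbers are made up
# for illustration.

def volumetric_to_areal(density_per_km3: float, habitable_height_km: float) -> float:
    """Project people/km^3 onto the surface as people/km^2."""
    return density_per_km3 * habitable_height_km

# A hypothetical city core measured at 200,000 people/km^3, with an assumed
# 0.15 km (~150 m) of usable vertical space:
print(f"{volumetric_to_areal(200_000, 0.15):,.0f} people/km^2")  # 30,000
```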

Data reduction methods like Principal Component Analysis and Factor Analysis can take vectors of heterogeneous measures and find the common elements in them, thereby apparently solving this problem. Apparently -- if you don't know how these techniques work.

The trouble is that FA and PCA both select for variance, meaning that the places with the most variance in population density will be overweighted in the final metric. And this will lead to a trend estimate that is more volatile than the actual trend. (It will also most likely over-estimate any underlying trend, depending on how the trend is modeled as a function of the metric.)
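A tiny simulation (made-up numbers, plain numpy) shows the mechanism: two regions share the same underlying trend, but the noisier one dominates the first principal component, and the "trend" read off that component jumps around much more than the plain average does.

```python
# PCA selects for variance: the first component loads most heavily on the
# noisiest series, so a trend read off that component is more volatile than
# the simple average trend. All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(30)

# Two regions with the same underlying trend (+1 unit/year), but region B is
# measured far more noisily than region A.
region_a = 100 + 1.0 * years + rng.normal(0, 1, 30)
region_b = 100 + 1.0 * years + rng.normal(0, 15, 30)
X = np.column_stack([region_a, region_b])

# PCA on the centered data: the first principal component's loadings.
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
loadings = vt[0]
print("PC1 loadings:", loadings)            # dominated by the noisy region B

# Compare year-over-year volatility of the PC1 score vs. the plain average.
pc1_scores = Xc @ loadings
print("std of PC1 yearly changes:    ", np.diff(pc1_scores).std())
print("std of average yearly changes:", np.diff(X.mean(axis=1)).std())
```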

So, if we want to use averages for trend estimation, we end up having to do some sort of "manual data cleaning and weighting" in the model, at the judgment of the strategic planners' analysts. It's either that or a correction for volatility in the estimated trend, but almost no one knows how to do that correctly and in many cases it's not possible.

A second problem with this trend estimation is that the locations where the data are collected depend on the very variable being measured: areas with high density will have more census workers than empty areas, simply as a result of standard sampling approaches. And if the population trends include migration, that migration changes the demographers' sampling strategy, so it too creates additional volatility in the measurement of the trend. This can be controlled for with appropriate statistical techniques, but pretty much never is.
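Here's a small, entirely invented simulation of that endogeneity: measurement precision depends on how many census workers get sent, and the workers are allocated using last period's measured density, so out-migration makes the estimates noisier just as the trend becomes interesting.

```python
# Endogenous sampling: workers are allocated by last period's *measured*
# density, and fewer workers means a noisier estimate, so migration feeds
# back into measurement volatility. All parameters are illustrative.
import numpy as np

rng = np.random.default_rng(1)
T, total_workers = 40, 1_000
true_density = np.linspace(200, 80, T)    # area losing population (out-migration)
other_density = np.linspace(200, 320, T)  # the area people are moving to

measured = np.empty(T)
last_estimate = true_density[0]
for t in range(T):
    # Workers assigned proportionally to last period's measured density.
    share = last_estimate / (last_estimate + other_density[t])
    workers = max(1.0, total_workers * share)
    noise_sd = true_density[t] / np.sqrt(workers)   # fewer workers -> noisier
    measured[t] = true_density[t] + rng.normal(0, noise_sd)
    last_estimate = measured[t]

print("std of true yearly changes:    ", np.diff(true_density).std())  # ~0
print("std of measured yearly changes:", np.diff(measured).std())      # > 0
```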

A third problem is the historical data. We trust our demographic data now, but to get trends our strategic analysts may have had to use historical records, and therefore indirect proxy variables. So they choose proxy variables that are well-correlated with the target variable and extrapolate into the past. But to extrapolate into the past the analysts need a strategy to cope with extrapolation error, which on average increases with extrapolation distance. Typically this will include some bootstrap method that uses the "good data" estimate of the trend as a starting point for the "bad data" part of the trend.

By using bootstrap methods, the analysts smooth out any proxy-variable effects that move away from the trend; in other words, if the historical data contradict the trend, they will have only a small effect on the final trend estimate (depending on the smoothing technique and the bootstrapping strategy), but if they are neutral or support the trend, they will add to the trend's volatility.
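A minimal sketch of what that backcasting step might look like, assuming an ordinary least-squares calibration of the proxy and a single shrinkage weight standing in for whatever smoothing/bootstrapping strategy is actually used; all names and numbers are illustrative.

```python
# Backcasting with a proxy: calibrate the proxy against the target on the
# trusted overlap years, predict the target in the historical years, then
# shrink those predictions toward the extrapolated modern trend. The shrink
# weight is a stand-in for the analysts' smoothing strategy.
import numpy as np

def backcast(target_recent, proxy_recent, proxy_hist, years_recent, years_hist,
             shrink=0.7):
    # 1. Trend estimated only on the trusted, recent data.
    slope, intercept = np.polyfit(years_recent, target_recent, 1)
    trend_hist = intercept + slope * years_hist

    # 2. Proxy calibrated on the overlap period, then applied to history.
    a, b = np.polyfit(proxy_recent, target_recent, 1)
    proxy_pred = a * proxy_hist + b

    # 3. Historical deviations from the trend survive only with weight (1 - shrink).
    return shrink * trend_hist + (1 - shrink) * proxy_pred

# Tiny illustrative use with made-up numbers: the proxy wiggles in the past,
# but most of the wiggle is smoothed away toward the modern trend line.
yrs_r, yrs_h = np.arange(2000, 2011), np.arange(1950, 2000)
target_r = 100 + 0.8 * (yrs_r - 2000)
proxy_r = 50 + 0.4 * (yrs_r - 2000)
proxy_h = 50 + 0.4 * (yrs_h - 2000) + 3 * np.sin(yrs_h / 5)
print(backcast(target_r, proxy_r, proxy_h, yrs_r, yrs_h)[:3])
```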

Of course, we marketers at this point interject that this is all pre-1900 stuff. After all, the average is much less important than the marginal effects on important segments. For example, the effect of changes in Manhattan or Montana would be very unimportant, since for corporate purposes one is already highly dense and the other is empty. What matters is what happens in marginal places like Topeka, KS (marginal in the sense that they are highly sensitive to the choice of business strategy, no offense intended to the Topekans).

Marketers would want access to the disaggregate data, with separate data sets for the proxy variables. Instead of looking for a big number that summarizes some trend and then applying it blindly everywhere, we'd build two sets of models: local trends (as in how the population of Topeka evolves over time) and trends over the space of travel matrices (as in where the people moving into Topeka come from and where the Topekans who leave go to). Then we could find policy implications that mattered.
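A toy version of that two-model view, with invented city names and numbers: per-city series for the local trends, plus an origin-destination matrix for the flows.

```python
# Disaggregate view: local trend per city, plus an origin-destination flow
# matrix. All populations and flows are invented for illustration.
import numpy as np

cities = ["Manhattan", "Topeka", "Montana"]
pop = {                       # hypothetical populations over four periods
    "Manhattan": [1600, 1610, 1625, 1640],
    "Topeka":    [127, 129, 134, 141],
    "Montana":   [1050, 1040, 1025, 1008],
}

# Local trend: a simple per-city slope (units per period).
local_trend = {c: np.polyfit(range(4), p, 1)[0] for c, p in pop.items()}
print(local_trend)

# Flow matrix: flows[i, j] = people moving from city i to city j last period.
flows = np.array([
    [0, 2, 1],    # out of Manhattan
    [1, 0, 1],    # out of Topeka
    [4, 6, 0],    # out of Montana
])
net_into = flows.sum(axis=0) - flows.sum(axis=1)
print(dict(zip(cities, net_into)))       # net migration into each city
print(dict(zip(cities, flows[:, 1])))    # where Topeka's new arrivals come from
```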

Imagine that there was a very strong trend towards increasing density. If that trend was all in high or low density places (meaning people from Montana moving to Manhattan), this would not affect our strategy. But if cities at the cusp of a phase-change were either increasing or decreasing in density, that would have major policy implications.

(That is the marketing secret: whenever possible, unpack and disaggregate. Applies to many important things in life.)

At this point the corporate governance types in the classroom would interrupt to say that this is not how things actually work in the real world. (In the class where I was a student they'd be in trouble, as the teacher sat on the boards of several large companies; but let's sidestep that point.) The real corporate world, they'd say, is about power.

According to these corporate governance types, in the real world the analysts and the planners would put out a report, filled with the analysts' footnotes and appendices (which no one would read) and summarized by the planners in short sentences of small words (but no nuance and sometimes contradicting the analysts).

Then the executives would either cherry-pick the parts that supported their pre-determined plans or simply ignore the actual report and say that it supported their pre-determined plans. Then they'd sell that idea to the board of directors (who never read the report) and to the shareholders (who have no access to the report, let alone the data).

Everyone (except the more ethical among the analysts, but those could be conveniently fired or ostracized) would be happy: the planners would get bonuses for keeping their mouths shut; the executives would continue their unchecked rule; the board would continue to avoid serious outside challenges; and the shareholders would keep being fleeced.

Perhaps the governance guys were onto something. Having data is great, but the link between reality and action is not as obvious as one might think.

(I miss being an MBA student. It was fun.)

-- -- -- --
* The class had little statistics. We spent most of the time building probabilistic decision trees and doing conditional NPV calculations; then we discussed corporate governance and the implementation of these plans via incentive systems. Not much 7S going on in that class. The professor had some funny stories of board-of-directors-versus-executives shenanigans, though.
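For flavor, a minimal sketch of the kind of exercise that footnote describes: a single chance node with two scenarios, each with its own cash-flow stream, discounted and probability-weighted into an expected NPV. Every number here is invented.

```python
# A two-branch probabilistic decision tree: discount each scenario's cash
# flows to an NPV and weight by its probability. All cash flows, probabilities,
# and the discount rate are made up for illustration.

def npv(cash_flows, rate):
    """Discount a list of yearly cash flows (year 0 first) at a flat rate."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

rate = 0.10
branches = [
    # (probability, cash flows if that scenario happens)
    (0.6, [-100, 60, 60, 60]),   # "demand grows" scenario
    (0.4, [-100, 20, 20, 20]),   # "demand stalls" scenario
]

expected_npv = sum(p * npv(cfs, rate) for p, cfs in branches)
print(f"Expected NPV: {expected_npv:.1f}")   # positive -> invest, on these numbers
```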