Thursday, April 9, 2009

Three Easy Pieces

1. DATA VERSUS INFORMATION

Many people use "information" and "data" interchangeably. This lack of precision creates problems for knowledge workers: those who extract information from data and act on it (perhaps by communicating it to a decision-maker).

To illustrate the difference between data and information, consider the case of Lucky, a degenerate gambler. Lucky wants to know whether the Giants beat the spread; this information drives Lucky's choice of whether to search for his bookie or to hide from the bookie's goons. That's it: One bit of information, yes or no. A play-by-play narration of the Giants game has a lot of data, but for Lucky there's only one bit of information in it; the rest is noise.
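Lucky's case can be put in numbers. The following is a minimal sketch, not a rigorous treatment: it assumes Lucky regards "beat the spread" as a coin flip before the game, and it uses a made-up narration string as a stand-in for the play-by-play. The names `entropy_bits`, `narration`, `data_bits`, and `information_bits` are mine, chosen for illustration.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical stand-in for a play-by-play narration: the data Lucky receives.
narration = "Giants take the kickoff at their own 20. " * 500
data_bits = len(narration.encode("utf-8")) * 8

# Assumption: before the game, Lucky sees "beat the spread" as a 50/50 outcome.
information_bits = entropy_bits([0.5, 0.5])

print("data received (bits):    ", data_bits)
print("information needed (bits):", information_bits)
```

The size of the narration depends only on how much the announcer talks; the information Lucky needs stays at one bit however long the broadcast runs.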


(How we get information from data is the question that divides data-miners from model builders: two professions separated by a common problem.)

To take an extreme case, consider Jorge Luis Borges's Library of Babel: It contains all possible books, each with exactly as much data as the next. But most of these books are entirely noise, having no information. Vast data stores without the knowledge to extract information from the data are as unusable as the Library of Babel.

Information is what we care about; data is only an input. It is like oxygen to the brain: brain function is what life is about and oxygen is only one input, necessary but in no way sufficient.



2. BANDWIDTH OF REAL LIFE

Bandwidth measures flow of data. But I'm going to abuse the term to measure information flow for real life. Otherwise we would have to consider the astronomical numbers involved in the massively parallel data acquisition and processing system that is the human brain and its attached senses.

Bandwidth came to my mind as I purchased a couple of audiobooks and paperbacks.

Books (and the written word in general, including blog posts) have a high potential bandwidth, and the pace of information acquisition is controlled by the reader. Audiobooks (and podcasts) have a much lower bandwidth and listeners have much less control over the pace. So why listen to these?

My personal reason is that audiobooks (and podcasts) can be "read" while doing things for which a printed book (or on-screen ebook or blog) is not as convenient: running on a treadmill, traveling by mass transit or by air, or shopping in brick-and-mortar stores, for example.

The high bandwidth of books has a counterpart requirement: attention. That is why so many people who read technical books fast -- while "multitasking" -- learn almost nothing from them. Then they blame the book.



3. KNOWING MEANS DIFFERENT THINGS TO DIFFERENT PEOPLE

What does it mean when someone says they understand Gibbs sampling, or customer service level analysis with option pricing, or -- to use a simpler example -- the importance of validating models after calibration?

At the most basic level it means that they have heard of the concept. This is the level at which I found many students coming from a marketing modeling class into my product management class: They didn't understand the need for validation at all and therefore they mostly didn't do it unless an external force (the professors' arbitrary demands) imposed it. These students had no knowledge, even though they thought they did.

Then there are those who were able to follow the explanation, and understood it as their marketing models professor presented it, but couldn't repeat it later. At the time of learning, they could see how each step developed into the next and how the whole made a coherent picture. These students did validate models after calibration; they knew that it was important to, but didn't really know why. This is a reasonably operational level of knowledge, but it is not very flexible.

A few students were able to redo the explanation they had been taught (more likely the one they learned from self-study, which is when most learning in MBA programs happens). They really understood the reason for validation, not just its importance. They could concoct examples of how things could go wrong in the absence of validation. This is the level of knowledge that makes for proficient modelers and clear-thinking product managers in the age of analytics.
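One such concocted example, as a minimal sketch: the data are simulated, not from any real class or client, and the "memorizer" model is a deliberately overfitted stand-in of my own choosing. It assumes a true response of 2 times a single driver plus noise, calibrates a straight line and the memorizer on one sample, then checks both on a holdout sample that played no part in calibration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def make_sample(n):
    """Simulated (hypothetical) data: response = 2 * driver + noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + random.gauss(0, 1)) for x in xs]

calibration = make_sample(400)
holdout = make_sample(400)  # never used for calibration

def fit_line(sample):
    """Ordinary least squares for y = a + b * x."""
    n = len(sample)
    mean_x = sum(x for x, _ in sample) / n
    mean_y = sum(y for _, y in sample) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in sample)
         / sum((x - mean_x) ** 2 for x, _ in sample))
    a = mean_y - b * mean_x
    return lambda x: a + b * x

def fit_memorizer(sample):
    """Overfitted 'model': answer with the response of the closest calibration point."""
    return lambda x: min(sample, key=lambda point: abs(point[0] - x))[1]

def mse(model, sample):
    """Mean squared prediction error of a model on a sample."""
    return sum((model(x) - y) ** 2 for x, y in sample) / len(sample)

line = fit_line(calibration)
memorizer = fit_memorizer(calibration)

for name, model in [("line", line), ("memorizer", memorizer)]:
    print(name,
          "calibration MSE:", round(mse(model, calibration), 3),
          "holdout MSE:", round(mse(model, holdout), 3))
```

Judged on the calibration sample alone, the memorizer looks perfect, since it reproduces every point it has already seen. Only the holdout sample shows that it has learned the noise along with the signal and predicts worse than the plain line.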

Finally, there's a higher level of understanding, the one where the student (or more likely a professional, a consultant, or a researcher) figures out the explanation from first principles and, in effect, discovers the concept. This is a level of subject matter knowledge that goes beyond the proficient and becomes the basis for professional creativity.