The First Law of Data Science: Do Umbrellas Cause Rain?


This was something I saw in

It is from Dr. Michael L. Brodie (CSAIL, MIT) and was originally published in June 2014.

Dr. Brodie discusses the first law of data science, the role of data curation in Big Data analysis, and Thomas Piketty's economic theories.

This excerpt is from the forthcoming article by Michael Brodie, “Data Curation: Tools and Techniques for the Emerging Discipline of Data Science”.



The First Law of Data Science

Since there is a strong, direct correlation between rain and umbrellas being unfurled, one might conclude that unfurling umbrellas causes rain. Or does rain cause umbrellas to be unfurled? How could data analysis tell which is true? Unfortunately, correlations derived by data-driven models do not imply causation. At best, data analysis could suggest a higher probability of rain causing umbrellas to be unfurled.

It all depends on the analytical model, on the data, and on data curation – what data was used and how it was prepared for analysis. A more detailed data analysis revealed that rain typically precedes umbrellas being unfurled, leading to the obvious conclusion that it is not umbrellas that cause rain; it is a previously unknown unfurling force emitted by the umbrella-unfurling crowd that causes rain.

As Richard Feynman said in 1974: Confirmation bias is one of the easiest ways for a scientist to fool himself.

The best outcome of Big Data analysis, or of any (computational data-driven) model, is a number of correlations each with a level of confidence that the correlation holds true in the real world – at least the world represented by the data.
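To make that concrete, here is a minimal Python sketch of what such an output might look like for the umbrella example. Everything in it is an assumption made for illustration: the data is simulated (the names `rain_mm` and `umbrellas`, the 200 days, and the noise levels are all hypothetical), and the percentile bootstrap is just one common way to attach a level of confidence to a correlation, not a method taken from the article.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def bootstrap_interval(xs, ys, rng, rounds=1000, level=0.95):
    """Percentile bootstrap interval for the correlation (resamples day pairs)."""
    n = len(xs)
    estimates = []
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * rounds)]
    upper = estimates[int((1 + level) / 2 * rounds) - 1]
    return lower, upper


# Hypothetical data: 200 simulated days. In this simulation rain drives
# umbrella counts, but the analysis below is never told that.
rng = random.Random(0)
rain_mm = [max(0.0, rng.gauss(2.0, 3.0)) for _ in range(200)]
umbrellas = [5.0 * r + rng.gauss(0.0, 4.0) for r in rain_mm]

r_rain_umbrella = pearson(rain_mm, umbrellas)
r_umbrella_rain = pearson(umbrellas, rain_mm)
low, high = bootstrap_interval(rain_mm, umbrellas, rng)

print(f"r(rain, umbrellas) = {r_rain_umbrella:.3f}")
print(f"r(umbrellas, rain) = {r_umbrella_rain:.3f}")
print(f"95% bootstrap interval: [{low:.3f}, {high:.3f}]")
```

The two coefficients come out identical because correlation is symmetric in its arguments. The interval says how stable the association is in this data set, and nothing about which variable drives the other.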

Correlation does not imply causation.

To determine whether a correlation holds in the real world, it must be verified empirically. This First Law of Data Science applies to all data-driven modelling and analysis. It is a fundamental law of science, of the scientific method, of political science, and of journalism.
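One way to picture empirical verification is a randomized experiment: instead of only watching when umbrellas open, someone opens them on a coin flip and checks whether the rain changes. The sketch below simulates both situations. It is only an illustration under stated assumptions: the base rate `P_RAIN`, the umbrella probabilities, and the helper `rain_rate` are hypothetical, and the simulated weather ignores umbrellas by construction, so the code shows the shape of the procedure rather than proving anything about real rain.

```python
import random

rng = random.Random(42)
DAYS = 10_000
P_RAIN = 0.3  # assumed base rate of rainy days (hypothetical)


def rain_rate(days, umbrella_open):
    """Share of rainy days among days with the given umbrella state."""
    matching = [rain for rain, umbrella in days if umbrella == umbrella_open]
    return sum(matching) / len(matching)


# Observational data: people mostly open umbrellas when it rains.
observed = []
for _ in range(DAYS):
    rain = rng.random() < P_RAIN
    umbrella = rng.random() < (0.9 if rain else 0.05)
    observed.append((rain, umbrella))

# Experiment: a coin flip decides whether umbrellas are opened,
# regardless of the weather.
experiment = []
for _ in range(DAYS):
    rain = rng.random() < P_RAIN
    umbrella = rng.random() < 0.5
    experiment.append((rain, umbrella))

# Difference in rain rate between umbrella-open and umbrella-closed days.
obs_gap = rain_rate(observed, True) - rain_rate(observed, False)
exp_gap = rain_rate(experiment, True) - rain_rate(experiment, False)

print(f"observational gap in rain rate: {obs_gap:.3f}")
print(f"experimental gap in rain rate:  {exp_gap:.3f}")
```

In the observational data, rain is far more common on umbrella days. Once umbrellas are opened at random, the gap shrinks to roughly zero, which is the kind of evidence a correlation alone cannot provide.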

On a more serious note, one might consider Piketty’s highly controversial economic theories [1], based on the analysis of historical economic data, from a data science perspective. Not only might one question the data science and data curation used [2]; more fundamentally, one might question the extent to which the theories are true.

[1]  Piketty, Thomas, and Arthur Goldhammer. Capital in the Twenty-first Century, Harvard University Press, March 2014

[2]  Picking holes in Piketty: The latest controversy around Thomas Piketty’s blockbuster book concerns its statistics, The Economist, May 31, 2014

