Category Archives: Correlation

The First Law of Data Science: Do Umbrellas Cause Rain?


This was something I saw in

It is from Dr. Michael L. Brodie (CSAIL, MIT) and was originally published in June 2014.

Dr. Brodie discusses the first law of data science, the role of data curation in Big Data analysis, and Thomas Piketty’s economic theories.

This excerpt is from the forthcoming article by Michael Brodie, “Data Curation: Tools and Techniques for the Emerging Discipline of Data Science”.



The First Law of Data Science

Since there is a strong, direct correlation between rain and umbrellas being unfurled, one might conclude that unfurling umbrellas causes rain. Or does rain cause umbrellas to be unfurled? How could data analysis tell which is true? Unfortunately, correlations derived by data-driven models do not imply causation. At best, data analysis could suggest a higher probability of rain causing umbrellas to be unfurled.

It all depends on the analytical model, on the data, and on data curation – what data was used and how it was prepared for analysis. A more detailed data analysis revealed that rain typically precedes umbrellas being unfurled, leading to the obvious conclusion that it is not umbrellas that cause rain; it is a previously unknown unfurling force emitted by the umbrella unfurling crowd that causes rain.
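The idea that one series "precedes" another can be checked with a lagged correlation. The sketch below is illustrative only: the data is synthetic, and the helper names `pearson` and `lagged_correlation` are hypothetical and do not come from Dr. Brodie's article. It assumes umbrellas open exactly one time step after rain, which is far tidier than any real data set.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def lagged_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag], for lag >= 0."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])

# Made-up data: 1 = raining in that minute, 0 = dry.
rng = random.Random(0)
rain = [1 if rng.random() < 0.3 else 0 for _ in range(500)]

# Assumption for illustration: umbrellas at minute t mirror rain at minute t - 1.
umbrellas = [0] + rain[:-1]

# Does rain now line up with umbrellas one minute later?
rain_leads = lagged_correlation(rain, umbrellas, lag=1)

# Do umbrellas now line up with rain one minute later?
umbrellas_lead = lagged_correlation(umbrellas, rain, lag=1)

print(f"rain -> umbrellas (lag 1): {rain_leads:.2f}")
print(f"umbrellas -> rain (lag 1): {umbrellas_lead:.2f}")
```

In this toy data the first correlation is close to 1 and the second is close to 0, which matches the claim that rain comes first. Even so, temporal order is only evidence: a hidden common cause could produce the same pattern, so the result still has to be verified empirically.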

As Richard Feynman said in 1974: Confirmation bias is one of the easiest ways for a scientist to fool himself.

The best outcome of Big Data analysis, or of any (computational data-driven) model, is a number of correlations each with a level of confidence that the correlation holds true in the real world – at least the world represented by the data.
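One common way to attach a level of confidence to a correlation is a permutation test. This is a minimal sketch under stated assumptions, not the method used in the article: the data is synthetic, and the names `pearson` and `permutation_p_value` are hypothetical. A small p-value says the correlation is unlikely to be a fluke of this data set; it says nothing about which variable causes which.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def permutation_p_value(xs, ys, trials=1000, seed=0):
    """Share of random shufflings of ys whose |correlation| with xs
    is at least as large as the observed one (two-sided test)."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break any real pairing between xs and ys
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)

# Synthetic data for illustration only.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(60)]
related = [x + rng.gauss(0, 1) for x in xs]       # constructed to correlate with xs
unrelated = [rng.gauss(0, 1) for _ in range(60)]  # generated independently of xs

p_related = permutation_p_value(xs, related)
p_unrelated = permutation_p_value(xs, unrelated)

print(f"p-value, related series:   {p_related:.4f}")
print(f"p-value, unrelated series: {p_unrelated:.4f}")
```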

Correlation does not imply causation.

To determine if a correlation is true in the real world, it must be verified empirically. This First Law of Data Science applies to all data-driven modelling and analysis. It is a fundamental law of science, of the Scientific Method, of political science, and of Journalism.

On a more serious note, one might consider Piketty’s highly controversial economic theories [1], based on the analysis of historical economic data, from a data science perspective. Not only might one question the data science and data curation used [2]; more fundamentally, one might question the extent to which the theories are true.

[1]  Piketty, Thomas, and Arthur Goldhammer. Capital in the Twenty-first Century, Harvard University Press, March 2014

[2]  Picking holes in Piketty: The latest controversy around Thomas Piketty’s blockbuster book concerns its statistics, The Economist, May 31, 2014


DataViz History: Charles Minard’s Flow Map of Napoleon’s Russian Campaign of 1812 – Retreat from Moscow


Since Minard’s map is in French, I have provided an English language version for us to use as we discuss the flow of Napoleon’s retreat in detail. [10]

Minard Map- English Translation

Retreat from Moscow, October 18, 1812 [11]

After Napoleon’s victory at Borodino led to the French capture of Moscow, Prince Mikhail Kutuzov’s Russian army retreated to Tarutino, south and slightly to the west of Moscow. Adam Zamoyski describes this as ‘a good position.’ [1] It was a sufficient distance from Moscow to be safe from a major French attack, threatened the French lines of communication and protected the routes to the south.

The French cavalry, commanded by Marshal Joachim Murat, and Marshal Josef Poniatowski’s V Corps were near Tarutino. Some Russian generals, notably Count Levin Bennigsen, wanted to attack them, but Kutuzov realised that his army needed time to rest, recuperate and receive reinforcements.

The rest of the French army was around Moscow. Much of the city was destroyed by a fire that started on 15 September and lasted for three days. City Governor Count Fyodor Rostopchin had made preparations to burn any stores useful to the French and had ordered Police Superintendent Voronenko to set fire not only to the stores but to anything that would burn. Rostopchin had also withdrawn all the firefighting pumps and their crews from the city.

Zamoyski suggests that the fires started by Voronenko and his men were further spread by local criminals and French soldiers engaged in looting, and by the wind. He contends that the fire left many French troops without shelter. Other historians who believe that the fires were started deliberately by the Russians include David Bell and Charles Esdaile. [2]

David Chandler agrees that Rostopchin ordered the fires, but says that most supplies and enough shelter for the 95,000 French troops remained intact. He argues that a complete destruction of the city would have actually been better for the French, as it would have forced them to retreat earlier. Instead, Napoleon stayed in the hope that he could persuade Tsar Alexander to come to terms. [3]

On the other hand, Leo Tolstoy claims in his novel War and Peace, the most famous book on the 1812 Campaign, that the fire was an inevitable result of an empty and wooden city being occupied by soldiers who were bound to smoke pipes, light camp fires and cook themselves two meals a day. [4]

On 5 October Napoleon sent delegations to attempt to negotiate a temporary armistice with Kutuzov and a permanent peace with Alexander. Kutuzov, who wanted to gain time to strengthen his forces, received the French delegates politely and gave them the impression that Russian soldiers wanted peace.

However, Kutuzov refused to allow the delegation to proceed to St Petersburg to meet the Tsar. He sent their letters on to the Tsar, with a recommendation that Alexander refuse to negotiate, which the Tsar accepted. According to Chandler, Napoleon refused to believe that the Tsar would not negotiate until a second French delegation also failed. [5]

The balance of power was moving against Napoleon as time passed. Chandler says that by 4 October Kutuzov had 110,000 men facing 95,000 French at Moscow and another 5,000 at Borodino. The Russians had an even greater advantage on the flanks. [6]

Napoleon had been sure that Alexander would negotiate once Moscow fell and had not planned what to do if the Tsar refused to make peace. According to Zamoyski, Napoleon had studied weather patterns and believed that it would not get really cold until December, but did not realise how quickly the temperature would drop when it changed. [7]

Chandler argues that he had six options:

  1. He could remain at Moscow. His staff thought that there were sufficient resources to supply his army for another six months. However, he would be a long way from Paris, in a position that was hard to defend and facing an opponent who was growing stronger. His flank forces would have greater supply problems than the troops in Moscow.
  2. He could withdraw towards the fertile region around Kiev. However, he would have to fight Kutuzov and would move away from the politically most important parts of Russia.
  3. He could retreat to Smolensk by a south-westerly route, thus avoiding the ravaged countryside that he had advanced through. This would also mean a battle with Kutuzov.
  4. He could advance on St Petersburg in the hope of winning victory, but it was late in the year, his army was tired and weakened and he lacked good maps of the region.
  5. He could move north-west to Velikye-Luki, reducing his lines of communication and threatening St Petersburg. This would worsen his supply position.
  6. He could retreat to Smolensk and, if necessary, Poland, the way that he had come. This would be admitting defeat and would mean withdrawing through countryside already ravaged by war.

There were major objections to each option, so Napoleon prevaricated, hoping that Alexander would negotiate. On 18 October Napoleon decided on the third option, a retreat to Smolensk via the southerly route, which would entail a battle with Kutuzov. He ordered that the withdrawal should begin two days later. [8]

Also on 18 October, however, Kutuzov decided to attack Murat’s cavalry at Vinkovo. An unofficial truce had been in operation, so the French were taken by surprise. Murat was able to fight his way out, and Kutuzov did not follow up his limited success.

However, the Battle of Vinkovo, also known as the Battle of Tarutino, persuaded Napoleon to bring the retreat forward a day. Around 95,000 men and 500 cannon left Moscow after 35 days, accompanied by 15,000-40,000 wagons loaded with loot, supplies, wounded and sick soldiers and camp followers. [9]

In an attempt to distract Kutuzov, Napoleon sent another offer of an armistice and told his men that he intended to attack the Russian left flank, expecting this false intelligence to reach Kutuzov.

Next: Retreat from Moscow to Smolensk


[1] A. Zamoyski, 1812: Napoleon’s Fatal March on Moscow (London: HarperCollins, 2004), p. 333.

[2] D. A. Bell, The First Total War: Napoleon’s Europe and the Birth of Modern Warfare (London: Bloomsbury, 2007), p. 259; C. J. Esdaile, Napoleon’s Wars: An International History, 1803-1815 (London: Allen Lane, 2007), p. 478; Zamoyski, 1812, pp. 300-4.

[3] D. Chandler, The Campaigns of Napoleon (London: Weidenfeld & Nicolson, 1966), pp. 814-15.

[4] L. Tolstoy, War and Peace, trans., A. Maude, Maude, L. (Chicago IL: Encyclopaedia Britannica Inc., 1952). Book 11, p. 513.

[5] Chandler, Campaigns, p. 814.

[6] Ibid., pp. 815-16.

[7] Zamoyski, 1812, p. 351.

[8] Chandler, Campaigns, pp. 817-19.

[9] Ibid., pp. 819-20; Zamoyski, 1812, pp. 367-68.

[10] Mike Stucka, English translation of Minard’s classic chart of Napoleon’s March, November 4, 2006.

[11] Martin Gibson, Napoleon Retreats from Moscow, 18 October 1812, War and Security Blog, October 17, 2012.

DataViz History: Charles Minard’s Flow Map of Napoleon’s Russian Campaign of 1812

DataViz History: Edward Tufte, Charles Minard, Napoleon and The Russian Campaign of 1812 – Part 5

Charles Minard’s Flow Map of Napoleon’s Russian Campaign of 1812


The chart above also tells the story of a war: Napoleon’s Russian campaign of 1812. It was drawn half a century afterwards by Charles Joseph Minard, a French civil engineer who worked on dams, canals and bridges. He was 80 years old and long retired when, in 1861, he called on the innovative techniques he had invented for the purpose of displaying flows of people, in order to tell the tragic tale in a single image. Edward Tufte, whose book, “The Visual Display of Quantitative Information”, is a bible to statisticians, calls it “the best statistical graphic ever drawn”. [SOURCE]

Minard’s chart shows six types of information: geography, time, temperature, the course and direction of the army’s movement, and the number of troops remaining. The widths of the gold (outward) and black (returning) paths represent the size of the force, one millimetre to 10,000 men. Geographical features and major battles are marked and named, and plummeting temperatures on the return journey are shown along the bottom.

The chart tells the dreadful story with painful clarity: in 1812, the Grand Army set out from Poland with a force of 422,000; only 100,000 reached Moscow; and only 10,000 returned. The detail and understatement with which such horrifying loss is represented combine to bring a lump to the throat. As men tried, and mostly failed, to cross the Bérézina river under heavy attack, the width of the black line halves: another 20,000 or so gone. The French now use the expression “C’est la Bérézina” to describe a total disaster.
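The width encoding is simple arithmetic, and a short sketch makes the contrast concrete. It uses only the scale and troop counts quoted above; the names `MEN_PER_MM` and `band_width_mm` are hypothetical, chosen for illustration.

```python
# Scale quoted in the text: one millimetre of band width per 10,000 men.
MEN_PER_MM = 10_000

def band_width_mm(troops):
    """Width of the flow band, in millimetres, for a given troop count."""
    return troops / MEN_PER_MM

# Troop counts as quoted in the text above.
stages = {
    "leaving Poland": 422_000,
    "reaching Moscow": 100_000,
    "returning": 10_000,
}

widths = {stage: band_width_mm(troops) for stage, troops in stages.items()}

for stage, width in widths.items():
    print(f"{stage}: {width:.1f} mm")
```

On these figures the band starts roughly 42 mm wide and ends 1 mm wide, which is why the loss is visible at a glance.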

In 1871, the year after Minard died, his obituarist cited particularly his graphical innovations: “For the dry and complicated columns of statistical data, of which the analysis and the discussion always require a great sustained mental effort, he had substituted images mathematically proportioned, that the first glance takes in and knows without fatigue, and which manifest immediately the natural consequences or the comparisons unforeseen.” The chart shown here is singled out for special mention: it “inspires bitter reflections on the cost to humanity of the madnesses of conquerors and the merciless thirst of military glory”.

What does the map show us [1]

  • Forces visual comparisons (the upper lighter band showing the large army going to Moscow vs. the narrow dark band showing the small army returning).
  • Shows causality (the temperature chart at the bottom).
  • Captures multivariate complexity (size of army, location, direction, temperature, and time).
  • Integrates text and graphic into a coherent whole.
  • Illustrates high-quality content (complete and accurate data, presented to support Minard’s argument against war).
  • Places comparisons adjacent to each other, not sequentially (people forget if they have to go from page to page).
  • Uses the smallest effective differences (i.e., avoids bold colors, heavy lines, distracting labels and scales).

Let’s look at the map in detail

Since Minard’s map is in French, I have provided an English language version for us to use as we discuss the flow of Napoleon’s march in detail. [2]


Crossing the Niemen River – So It Begins

As Napoleon concentrated his enormous coalition army in preparation for the invasion of Russia, three Russian armies were positioned to guard the western frontier: the 1st Western Army, under Mikhail Barclay de Tolly, the 2nd Western Army, under Prince Pyotr Bagration, and the 3rd Western Army, under Alexander Tormasov. In June 1812, the 1st Western Army was stationed along the frontier with East Prussia and the Duchy of Warsaw. The 2nd was placed further south in modern Belarus. The 3rd stood yet further south, but still in Belarus. The overall commander of these three armies was Alexander himself, who was installed in Barclay de Tolly’s headquarters near Vilna.

On 23 June, the Prussian major (and later military theorist) Karl von Clausewitz, who had recently entered Alexander’s service, reached the Drissa camp (northwest of Polotsk on the Dvina, near modern Verkhniadzvinsk in Belarus) to inspect the site and report on the progress being made on its defensive works and fortifications. He remained unconvinced of its defensive qualities and said so to Alexander on 28 June. Despite the fact that the camp had appeared central to Russian strategy pre-invasion, it would prove of little worth once the Russian forces had withdrawn from the western frontier.

News of the Grande Armée’s advance guard crossing the Niemen (24 June, 1812) reached Alexander and Barclay de Tolly that same day, late in the evening. The order to withdraw to the Drissa camp was issued shortly afterwards, and Barclay’s units fell back.

Between 26 and 27 June, the order to retreat from the borders spread to each of the Russian corps commanders. Although most of the 1st Western Army’s withdrawal was relatively untroubled, General Dokhturov’s 6th corps, stationed between Lida and Grodno, was almost cut off by the Grande Armée’s crossing of the Niemen and Davout’s troops making for Minsk. Only by force marching did the 6th corps avoid the advancing French troops and reach Drissa unmolested. It was also on 26 June that Alexander dispatched a letter proposing talks with Napoleon, provided that the French emperor retired over the border. The messenger was held up by Davout and only succeeded in reaching Berthier and Napoleon at the end of the month. The evacuation of Vilna began late on 26 June: by the time Napoleon received Alexander’s messenger and letter, Vilna had been occupied by the Grande Armée. Barclay de Tolly left the city early on 28 June, having destroyed the remaining depots as well as the bridge across the Dvina. Napoleon’s advance troops arrived about an hour later.

Next: The March Continues


[1] Dr. Daniel Churchill, MITE6323 – Interactivity, Visualization, Emerging Technologies and Paradigms, The University of Hong Kong, February, 2007.

[2] Mike Stucka, English translation of Minard’s classic chart of Napoleon’s March, November 4, 2006.

[3] Napoleon’s Russian campaign: From the Niemen to Moscow.

DataViz History: The Ghost Map: Index Case at 40 Broad Street

Did the index (or first) case of the Broad Street Pump outbreak live at 40 Broad Street, close to the pump? Reverend Henry Whitehead thought so after a detailed investigation of cholera cases in 1854 following the outbreak. [SOURCE]

The woman living at 40 Broad Street (Sarah Lewis, wife of police constable Thomas Lewis) lost both her five-month-old child, Frances, and her husband to cholera. In the four to five-day interval between her child’s onset of diarrhea on August 28-29, 1854 and subsequent death on September 2, 1854, Mrs. Lewis had soaked the diarrhea-soiled diapers in pails of water. Thereafter she emptied the pails in the cesspool opening in front of her house.

Baby Lewis likely had Vibrio cholerae, which contaminated the napkin used to absorb diarrhea. Reverend Whitehead conveyed his suspicion concerning the possible index case to the Medical Committee of the Board of Guardians responsible for the public health of the area. The Board sent a surveyor to assess the situation. He created a diagram of the home and cesspool and reported that decayed brickwork in the cesspool resulted in seepage of fecal debris to the Broad Street pump, which was about three feet away (see picture).

The death certificate for baby Frances was filled out by Dr. William Rogers, a local physician. Dr. Rogers, who had attended baby Frances at 40 Broad Street, opined in a detailed letter to Reverend Whitehead that the cause of death was acute diarrhea, not cholera, an opinion that he repeated at a meeting of the London Epidemiological Society. Since Vibrio cholerae was not discovered until 1884, it is doubtful that Dr. Rogers could have accurately distinguished, by signs and symptoms alone, non-cholera acute diarrhea from cholera diarrhea. Thus Whitehead’s theory that the infant at 40 Broad Street was the index case is certainly plausible.

Thomas Lewis, the baby’s father, came down with a fatal attack of cholera on September 8, 1854, the same day that the Board of Guardians had the Broad Street pump handle removed. Assuming his wife, Sarah Lewis, poured water from his soiled garments into the household cesspool, it is likely that the water of the Broad Street pump would have remained a source of further infection if the handle had not been removed.


Why was the cesspool at 40 Broad Street not maintained? Such neglect was increasingly common in London, due in part to economic circumstances. At the time of the Broad Street pump outbreak, London had about two hundred thousand cesspools. For many years, the contents of the cesspools were sold as agricultural manure to be used as fertilizer in the many farms that surrounded London. The money earned from manure sales would then be used to maintain the cesspools. Yet during the nineteenth century, as London’s population grew ever more rapidly, farms were forced to move further from the central city. Transportation costs increased, adding to the expense of acquiring cesspool-based manure. Starting in 1847, another change took place that undercut the sale of cesspool manure. Solidified bird droppings (or guano) were brought in as fertilizer from South America at a price far below that of cesspool manure.

With no economic incentive to sell their feces, poor people would empty human wastes into the streets, or directly into the London waterways. Most lacked public health understanding of how disease was spread, as did many medical and health officials of the times. In the absence of manure sales, cesspools became expensive to clean. As a result, they were poorly maintained and infrequently emptied. Over time this neglect led to cracks and crevices, which offered opportunities for the spread of enteric pathogens. Such spread of Vibrio cholerae probably occurred at 40 Broad Street.

Given the diarrhea symptoms of the young infant and the assessment of the cesspool by the surveyor, Reverend Whitehead likely determined the index case that started the infamous Broad Street pump outbreak.


Boylan, D. Personal Communication, 2009.

Brody H et al. The Pharos 62(1), 2-8, 1999.

Chave, SPW. Medical History 11(2), 92-109, 1958.

Halliday, S. The Great Stink of London: Sir Joseph Bazalgette and the Cleansing of the Victorian Metropolis, 1999.

Paneth N. et al. American Journal of Public Health 88(10), 1545-1553, October 1998.

Vinten-Johansen, P et al. Cholera, Chloroform, and the Science of Medicine. A life of John Snow, 2003.

Internet Explorer versus Murder Rates

From the Twitter hashtag #dataviz, I saw this in my Twitter feed yesterday.

An interesting example where correlation does not really equate to causation. Or does it? In either case the world is a better place.


