Big Data: Microchip or Printing Press?

Rise and fall of Google Flu Trends

Jun 29, 2017
Image Credit: “Big data on the big screen, the center of the Milky Way galaxy imaged by NASA’s Spitzer Space Telescope”

“The original revolution in information technology came not with the microchip, but with the printing press”

In 2009, engineers at Google published a paper in the scientific journal Nature [1] claiming they could predict the spread of the winter flu in the United States, not just at the national level but down to specific regions and states [2]. They could do this by looking at what people search for on the internet. They had plenty of data to draw inferences from: Google receives millions of search queries a day and saves all of them. This endeavor, as they put it, served as one of the early examples of nowcasting based on search trends [3].

Naturally, this attracted a great deal of attention among computer scientists and medical officials, but it was largely overlooked elsewhere until the H1N1 outbreak of 2009 [2]. Google had compared 50 million common search terms with CDC (Centers for Disease Control and Prevention) data on the spread of seasonal flu between 2003 and 2008. The idea was to locate the areas infected by the virus from what people searched for on the internet.

The point was not to catch searches directly about the flu, like “medicine for fever”, but to look at the correlations between the frequencies of certain queries and the spread of flu over time and space. After testing millions of different models, they arrived at 45 search terms that, when used together in a single model, closely tracked the CDC’s national statistics [2].
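
To make that idea concrete, here is a toy sketch of the term-selection step. It is not Google’s actual pipeline: the query names, the synthetic weekly data, and the cut-off of two terms are all invented for illustration, and the real system evaluated vastly more candidate queries and models against regional CDC data.

```python
# Toy illustration of term selection: score each hypothetical query by how
# strongly its weekly frequency correlates with a CDC-style flu signal,
# then keep the best-correlated terms. Synthetic data throughout.
import numpy as np

rng = np.random.default_rng(0)
n_weeks = 260                                     # five made-up flu seasons
week = np.arange(n_weeks)
# Stand-in for CDC influenza-like-illness rates: a winter bump plus noise.
flu_rate = np.clip(np.sin(2 * np.pi * week / 52 - np.pi / 2), 0, None) \
           + 0.1 * rng.standard_normal(n_weeks)

# Hypothetical query-share series; some track the flu signal, some do not.
queries = {
    "flu symptoms":   0.8 * flu_rate + 0.2 * rng.standard_normal(n_weeks),
    "fever remedies": 0.6 * flu_rate + 0.4 * rng.standard_normal(n_weeks),
    "ski resorts":    np.cos(2 * np.pi * week / 52) + 0.3 * rng.standard_normal(n_weeks),
    "cat videos":     rng.standard_normal(n_weeks),
}

# Pearson correlation of each query with the flu signal; Google kept the
# 45 best terms, here we keep 2.
scores = {q: np.corrcoef(series, flu_rate)[0, 1] for q, series in queries.items()}
top_terms = sorted(scores, key=scores.get, reverse=True)[:2]
print({q: round(scores[q], 2) for q in top_terms})
```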

What was wrong with CDC anyway?

There was no vaccine against H1N1 (roughly a combination of the viruses that cause bird flu and swine flu) when it was identified in 2009. Within a couple of weeks, health agencies around the world were alarmed about a pandemic; all they could hope for was to slow its spread wherever it appeared [2]. In the USA, the CDC asked doctors to report new flu cases, but that meant the information lagged by at least two weeks: people may feel sick for days before consulting a physician, and once the physician passes along the patient’s information, the CDC tabulates the numbers only once a week [2]. By using Google Flu Trends, however, public health agencies would be able to see the spread in near real time (at least until the service was discontinued [3]).

We’ve Got Big Data Hubris!

“Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. [4]

Google Flu Trends was rolled out in 2008 and eventually covered 29 countries, and computer scientists and researchers considered it quite successful. In the 2009 H1N1 outbreak, however, Google badly underestimated the spread of the flu [5]. One likely cause was a change in people’s search behavior: thanks to heavy media coverage, people who were not ill at all were searching for the flu on the internet, creating a flood of misleading signal that Google’s algorithm couldn’t handle.

On top of this, Google Flu Trends seemed to overlook how little people actually know about the flu. If people really didn’t know anything about it (and most of them didn’t), how could their searches produce useful, meaningful information in the first place? As Steven Salzberg put it, like it or not, a big pile of dreck can only produce more dreck [6].

Another problem was hidden there from the very beginning. The 45 selected search terms tracked the national statistics closely: the model showed a correlation of about 0.90 with CDC flu data (correlation coefficients run from -1 to 1, where 1 means a perfect positive relationship) [1]. But correlation does not imply causation, however much it may look like it does [9], and this was another big failure of the GFT algorithm. Since flu is largely seasonal, with increased prevalence in winter, the initial version of GFT was part flu detector, part winter detector, as scientists noted [7].
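
That “winter detector” trap is easy to reproduce with made-up numbers. In the sketch below (synthetic data, not GFT’s), a query that merely tracks winter correlates almost as strongly with the flu signal as a genuinely flu-related query does, so correlation alone cannot tell them apart:

```python
# Synthetic demonstration that seasonality alone produces high correlations.
import numpy as np

rng = np.random.default_rng(1)
week = np.arange(520)                                  # ten made-up years, weekly
winter = np.cos(2 * np.pi * week / 52)                 # high in winter, low in summer
flu_rate = np.clip(winter, 0, None) + 0.1 * rng.standard_normal(week.size)

flu_query    = 0.9 * flu_rate + 0.2 * rng.standard_normal(week.size)  # causally linked
winter_query = winter + 0.2 * rng.standard_normal(week.size)          # seasonal only

for name, series in [("flu-related query", flu_query),
                     ("winter-only query", winter_query)]:
    r = np.corrcoef(series, flu_rate)[0, 1]
    print(f"{name}: r = {r:.2f}")
# Both correlations come out high; only domain knowledge or off-season data
# (like the 2009 H1N1 outbreak) exposes the difference.
```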

Storytelling or just number-crunching?

The fruits of the information society are easy to see, with a cell phone in every pocket, a computer in every backpack […] but less noticeable is the information itself. [2]

Google Flu Trends was a failed experiment [4][5][6][7]. Yet from the public-healthcare perspective it is one of the most important adventures in the history of big data, and plenty more is surely coming in this field. The GFT algorithm has many problems, widely documented by scientists, and perhaps its biggest problems are still to be uncovered. As this adventure also showed us, making inferences from data is not so much about your mathematical model as about your know-how in the field. Data do not tell their story by themselves, and in that mess of numbers it is hard to uncover the information hidden between them. “Big data” is therefore still far from being a substitute for traditional statistical methods; it is merely a supplement, because the information we can extract with statistical models does not grow as fast as the technology does. As Nate Silver puts it [8]: “The original revolution in information technology came not with the microchip, but with the printing press.”

Note: Google discontinued Google Flu Trends, yet its historical data is still available on their website.

If you enjoyed reading this, please consider tapping on the green heart below.
