Today’s post is a comment on this article, originally from Nature, titled The Parable of Google Flu: Traps in Big Data Analysis.

In rough summary, Google Flu Trends (hereafter shortened to GFT) was Google’s attempt, around fifteen years ago, at predicting the number of doctor visits for influenza-like illness. The hope was that search query data could be used for this, in contrast to the more traditional surveillance data the CDC used to forecast the same thing. In fact, GFT was built specifically to predict the CDC’s reports.
It suffered from some common “Big Data” ills, like trying to use search data to replace more traditionally collected data instead of merely supplementing it. The core issue is the algorithm: Google’s search algorithm is not stable, and is subject to change by Google’s engineers at any time. This constantly throws a spanner in the works, as each change made to improve Google Search also changes the data-generating process. On top of that, the 45 search terms used had never been documented (as of the time of the Nature article). This, to me, raises even more questions about repeatability and trust. Setting aside the question of how to reproduce research built on an ever-changing algorithm, it is hard for me to blindly accept the word of a corporation that holds back the information needed to reproduce its findings, and leaving something as serious as public health in the hands of a secretive company strikes me as unwise.
The other glaring issues are overfitting and overparameterization. The Google team originally built GFT by searching 50 million candidate search terms for the ones that best matched just 1152 data points. With that many candidates and that little data, the chances of finding terms that correlated strongly with the historical record but had no real power to predict the flu were quite high. As a result, the original GFT was strongly seasonal and missed the nonseasonal “swine flu” pandemic of 2009. Google then updated the algorithm, but GFT continued to significantly over-predict flu cases relative to the CDC for several years afterward.
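To make the overfitting point concrete, here is a minimal sketch of the screening problem, not GFT’s actual method: if you test enough pure-noise predictors against a small, pure-noise target, some of them will correlate impressively by chance alone. The counts below (100,000 candidates, 50 data points) are my own illustrative choices, scaled down from GFT’s 50 million terms and 1152 points.

```python
import numpy as np

rng = np.random.default_rng(0)

n_points = 50        # far fewer observations than candidate predictors
n_terms = 100_000    # stand-in for GFT's 50 million candidate search terms

# Target series: pure noise -- by construction, nothing genuinely predicts it.
target = rng.standard_normal(n_points)

# Candidate "search term" series: also pure noise, independent of the target.
candidates = rng.standard_normal((n_terms, n_points))

# Screen every candidate for correlation with the target.
target_c = target - target.mean()
cand_c = candidates - candidates.mean(axis=1, keepdims=True)
corrs = (cand_c @ target_c) / (
    np.linalg.norm(cand_c, axis=1) * np.linalg.norm(target_c)
)

best = float(np.abs(corrs).max())
print(f"strongest |correlation| among {n_terms} noise series: {best:.2f}")
```

Even though every series here is random noise, the best match typically shows a correlation well above 0.5, which would look like a promising predictor if you didn’t know how many candidates had been screened to find it. That is the trap: strength of fit on past data says little when the search space dwarfs the data.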
The parable here is this: just because you have an amount of data so vast that the human mind can barely comprehend it doesn’t mean you are right. You still have to think about how the variables you select, and how you select them, affect the outcome of your research. Sometimes a small-data perspective is needed.