key: cord-0001160-ctpgylnr
authors: Yuan, Qingyu; Nsoesie, Elaine O.; Lv, Benfu; Peng, Geng; Chunara, Rumi; Brownstein, John S.
title: Monitoring Influenza Epidemics in China with Search Query from Baidu
date: 2013-05-30
journal: PLoS One
DOI: 10.1371/journal.pone.0064323
sha: 20b5fc51aef8b077346c4587aad3811eebf8b11b
doc_id: 1160
cord_uid: ctpgylnr

Several approaches have been proposed for near real-time detection and prediction of the spread of influenza. These include search query data for influenza-related terms, which has been explored as a tool for augmenting traditional surveillance methods. In this paper, we present a method that uses Internet search query data from Baidu to model and monitor influenza activity in China. The objectives of the study are to present a comprehensive technique for: (i) keyword selection, (ii) keyword filtering, (iii) index composition and (iv) modeling and detection of influenza activity in China. Sequential time-series for the selected composite keyword index is significantly correlated with Chinese influenza case data. In addition, one-month ahead prediction of influenza cases for the first eight months of 2012 has a mean absolute percent error less than 11%. To our knowledge, this is the first study on the use of search query data from Baidu in conjunction with this approach for estimation of influenza activity in China.

Seasonal influenza epidemics result in an estimated three to five million cases of severe illness and 250,000 to 500,000 deaths worldwide each year [1]. In order to prepare for the next severe pandemic and better control seasonal influenza epidemics, researchers have proposed several approaches to achieve near real-time surveillance of the emergence and spread of influenza. Some novel approaches for rapid disease outbreak detection and surveillance include online surveillance systems utilizing informal sources such as news reports [2], social media data [3–16], and search query data [17–20]. The idea of using search query data for detecting outbreaks was first introduced in 2006 [17]. Ginsberg et al. [18] later discussed how monitoring search queries on Google could be used to detect influenza outbreaks in the United States. Several studies followed, which pointed to the effectiveness and limitations of detecting influenza epidemics using search query data [19], [20]. Although there are limitations, such as the lack of Internet access in some regions of the world and the noise of irrelevant information, Internet search query data is being explored as a low-cost approach to estimating disease activity in near real-time.

Besides influenza surveillance, search query data has also been widely used for research in fields such as economics and finance. In the same year as the publication by Ginsberg et al. [18], several studies investigated the usefulness of Google searches for forecasting unemployment in various countries [21–25]. Several papers also used search query data to predict consumption [26], [27], house pricing and sales [28], and travel and consumer confidence [27]. Though studies using web search query data have achieved good results in empirical practice, the field is still young and rapidly developing, with room for discussion and improvement.

We introduce a novel method for estimating influenza activity using search query data from Baidu. Data on Internet searches are available on a daily basis, while routine surveillance data from China's Ministry of Health (MOH) are typically reported with a one- to two-week lag. The objective is therefore to estimate present influenza activity, before official reports from China's MOH are released, based on previously observed laboratory surveillance data plus timely search query data. Beyond the use of search query data in a new geographic region and the use of a different search engine, this study improves on other research in this area in that the keyword selection and composition approach presented is more economical in terms of computational resources and cost than the original method by Ginsberg et al. [18]. Unlike in the United States, alternative search engines such as Baidu are more widely used than Google in China. The market share of Google in China is less than 20%, while that of Baidu is more than 80% [29]. The wide use of Baidu in China makes it a more representative search query source for this analysis.

Several methods have been proposed for detecting and predicting trends of influenza epidemics in China [30–32]. However, most of these techniques solely use influenza-like-illness (ILI) or influenza case data. In this study, we use a combination of influenza case counts and real-time search query data for modeling and detection of current influenza activity. Improving methods for surveillance, modeling, detection and prediction of influenza epidemics in China is extremely important. Two of the three pandemics of the 20th century are thought to have started in China [38], [39]. In addition, the severe acute respiratory syndrome (SARS) of 2002 had its origins in the Guangdong Province of China. Therefore, refining approaches for rapid detection of outbreaks of influenza and other respiratory illnesses in China should benefit global public health.

Given data on influenza activity from an official source, the approach in this paper can be summarized as follows: (i) search for keywords or terms which might be related to influenza; (ii) process keywords by eliminating those unrelated to influenza epidemics, those with an interrupted time series of search query volume and those not correlated with the influenza epidemic curve; (iii) define weights and a composite search index; and (iv) fit a regression model of the influenza case data on the selected keyword index. The fitted model uses both the influenza case data and the search index.

Official case counts. The counts shown in Table 1 reflect monthly aggregated influenza case counts from March 2009 to August 2012 for China. The data is publicly available on China's Ministry of Health (MOH) site (http://www.moh.gov.cn/) and is typically released 1–2 weeks after the end of each month. A network of physicians reports laboratory-confirmed cases to the MOH on a daily basis. However, the data is only released to the public at a monthly resolution. The data consists solely of laboratory-confirmed influenza cases and does not include ILI cases. Furthermore, during the 2009 H1N1 pandemic, infections resulting from the new influenza strain were reported separately from cases resulting from circulating seasonal influenza strains in China [40]. The data in this study is solely for seasonal influenza.

No ethics committee approval is required to obtain the data since it is publicly available. In addition, only count data is presented, no personal information is revealed, thereby maintaining confidentiality.

Search query data from Baidu. Baidu's database (http://index.baidu.com/) contains logs of online search query volume submitted from June 2006. However, since the influenza case count data is available from March 2009, we use Baidu's data from March 2009 to August 2012. Unlike the case data from the Ministry of Health, Baidu's search query data is available on a daily basis. The data is therefore converted to monthly counts for analysis. User confidentiality is also maintained, since only the combined term frequency data is available. In addition, Baidu releases search query volume for the entire country.
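The conversion from daily to monthly counts can be illustrated with a minimal sketch. It assumes that monthly counts are obtained by summing daily volumes, which the text does not state explicitly; the function name `monthly_totals`, the input layout and the sample values are hypothetical and are not real Baidu data.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(daily_counts):
    """Sum daily search volumes into (year, month) totals.

    daily_counts: iterable of (datetime.date, int) pairs (assumed layout).
    Returns a dict keyed by (year, month) in chronological order.
    """
    totals = defaultdict(int)
    for day, volume in daily_counts:
        totals[(day.year, day.month)] += volume
    return dict(sorted(totals.items()))


# Toy values for illustration only.
sample = [
    (date(2009, 3, 30), 120),
    (date(2009, 3, 31), 80),
    (date(2009, 4, 1), 95),
]
print(monthly_totals(sample))
```

If monthly averages were preferred over sums, only the aggregation step inside the loop would need to change.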

Different keywords have different search frequencies and can therefore produce diverse modeling outcomes. Keywords are therefore carefully selected to reflect terms most likely associated with influenza epidemics. Observations from previous studies, such as Ginsberg et al. [18], indicate that more keywords do not necessarily assure a better model fit. The marginal contribution of adding terms to a "saturated" model is limited, but costly. Ginsberg et al. [18] selected only 45 significant keywords from 50 million. The method of exhaustion employed by Ginsberg et al. [18] is computationally expensive and not easily reproducible by researchers with limited resources [27]. In some cases, researchers have relied solely on keywords recommended by Google [23], [24], [26]. Keywords recommended by search engines tend to be comprehensive, but not always relevant to the subject. Therefore, further analysis is required to extract the keywords that are most pertinent to the study.

Keywords used in this study are obtained from the following Chinese website: http://tool.chinaz.com/baidu/words.aspx (hereafter referred to as the keyword tool). Keywords suggested by the keyword tool include recommendations from Baidu, and others mined using semantic correlation analysis from portal websites, blogs, and online reports. The Chinese term for "flu" is the core keyword in this study. Upon entering it into the keyword tool, we obtain 94 related keywords (Table 2). Although recommended by the keyword tool, some of the 94 keywords are not related to influenza epidemics in China. We therefore filter the keywords as follows: (i) the selected keywords should represent factors that might influence the influenza epidemic; (ii) the search query data for each keyword should be represented as a sequential (uninterrupted) time series; and (iii) the keyword should be correlated with the influenza epidemic curve.

Keywords that remain after the filtering analysis are considered for inclusion in the composite search index. The goal of search index composition is to build the most correlative and stable indicator for the influenza case data based on the available information. The search index is composed in two steps. First, we define synthetic weights for each of the keywords. Next, we combine the weighted time series for the keywords.
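The automated part of the keyword filtering (dropping interrupted series and weakly correlated keywords) can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: missing months are represented by `None`, the cut-off `min_r` is an arbitrary illustrative value standing in for a significance test, and the judgment of whether a keyword is related to influenza is assumed to be made by hand beforehand.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def filter_keywords(series_by_keyword, cases, min_r=0.5):
    """Keep keywords with an uninterrupted series that correlates with cases.

    A missing month is marked with None (assumed convention).
    min_r is an illustrative threshold, not a value from the paper.
    Returns {keyword: correlation} for the keywords that pass.
    """
    kept = {}
    for keyword, series in series_by_keyword.items():
        if any(v is None for v in series):
            continue  # interrupted time series
        r = pearson(series, cases)
        if r >= min_r:
            kept[keyword] = r
    return kept


# Toy data: hypothetical keyword labels and made-up volumes.
cases = [10, 20, 30, 40]
candidates = {
    "kw_a": [1, 2, 3, 4],      # tracks the case curve
    "kw_b": [5, None, 7, 8],   # interrupted series
    "kw_c": [4, 3, 2, 1],      # moves against the case curve
}
kept = filter_keywords(candidates, cases)
print(sorted(kept))
```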

We consider two approaches for defining synthetic weights: the method of systematic assessment and the strength of the correlation coefficient. The method of systematic assessment [34], [35] involves rating the selected indicator according to the principle of prior evaluation and defining the ratings as weights. The method is comprehensive but highly subjective. Alternatively, the correlation coefficient between the influenza epidemic curve and the keyword frequency curve can be used to represent the weight [18], [33]. This approach is usually combined with the Analytic Hierarchy Process (AHP) [36] for better performance. However, solely using the correlation coefficient without adjustments appears to be sufficient for this study.

The search index is defined as $\mathrm{index}_j = \sum_{i=1}^{j} v_i x_i^{l}$, where $v_i$ is the weight of the $i$th keyword and $x_i^{l}$ represents the sequence after alignment. Although the definition of the composite index allows for alignment, it is not required for combining the time series in this study since maximum correlations are observed at lag 0. The final set of keywords is selected using the following model:

$$y = a_0 + a_1\,\mathrm{index}_j + e \qquad (1)$$

In (1), $\mathrm{index}_j$ represents the search index for $j$ keywords, $y$ denotes influenza case counts, and $a_0$, $a_1$ and $e$ denote the intercept, coefficient and error term respectively.
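As an illustration of the index definition and of model (1), the sketch below builds a weighted composite index and fits a simple least-squares line of case counts on it. The function names, weights and series are hypothetical toy values; in the study the weights are correlation coefficients and the fitting is done in EViews.

```python
def composite_index(weights, series):
    """Return index[t] = sum_i v_i * x_i[t] for equal-length keyword series."""
    length = len(series[0])
    return [
        sum(v * x[t] for v, x in zip(weights, series))
        for t in range(length)
    ]


def fit_simple_ols(x, y):
    """Least-squares fit of y = a0 + a1 * x; returns (a0, a1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    a1 = sxy / sxx
    a0 = my - a1 * mx
    return a0, a1


# Toy example: two keyword series with made-up weights.
weights = [1.0, 0.5]
series = [[10, 20, 30], [4, 2, 6]]
index = composite_index(weights, series)   # [12.0, 21.0, 33.0]

# Toy case counts constructed as 1 + 2 * index, so the fit is exact.
cases = [25.0, 43.0, 67.0]
a0, a1 = fit_simple_ols(index, cases)
print(index, round(a0, 6), round(a1, 6))
```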

Using a stepwise approach of the kind generally used to select variables in a multiple regression framework, keywords are selected based on their contribution to the model's goodness of fit. A partial F test is used to evaluate the goodness of fit after adding data for each keyword to the index. A significant F-statistic implies that the keyword should be added to the composite index; a non-significant one implies that it should not. The search index is defined based on the model with the best goodness-of-fit statistics. The initial model is based on the keyword with the highest correlation with the influenza case data. In this case, the keyword translated as "prevent influenza" has the highest correlation, 0.93 at lag 0. Keywords are then added sequentially in order of correlation coefficient, and the partial F test is examined for improved fit. The process is repeated until the goodness of fit can no longer be improved.
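The stepwise procedure can be sketched in simplified form. Note the substitution: instead of the partial F test used in the study, this sketch accepts a keyword only if it raises the R-squared of model (1) by more than a hypothetical threshold `min_gain`. For a regression on a single index, R-squared equals the squared correlation between the index and the case data. Keyword labels and data are toy values, and whether a rejected keyword ends the search or is merely skipped is not specified in the text; here it is skipped.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def weighted_index(keywords, series_by_keyword, weights):
    """Composite index over the given keywords, weighted by correlation."""
    length = len(series_by_keyword[keywords[0]])
    return [
        sum(weights[k] * series_by_keyword[k][t] for k in keywords)
        for t in range(length)
    ]


def forward_select(series_by_keyword, cases, min_gain=0.001):
    """Greedy forward selection of keywords for the composite index.

    Starts from the keyword most correlated with the cases, then tries the
    others in decreasing order of correlation, keeping each one only if it
    improves R-squared by more than min_gain (stand-in for the F test).
    """
    weights = {k: pearson(s, cases) for k, s in series_by_keyword.items()}
    ranked = sorted(weights, key=weights.get, reverse=True)
    selected = [ranked[0]]
    best_r2 = weights[ranked[0]] ** 2
    for keyword in ranked[1:]:
        trial = selected + [keyword]
        r = pearson(weighted_index(trial, series_by_keyword, weights), cases)
        if r ** 2 - best_r2 > min_gain:
            selected, best_r2 = trial, r ** 2
    return selected, best_r2


# Toy data with hypothetical keyword labels.
cases = [10, 30, 20, 50, 40]
candidates = {
    "a": [1, 3, 2, 4, 4],
    "b": [0, 0, 0, 1, 0],
    "c": [5, 1, 4, 2, 3],
}
selected, best_r2 = forward_select(candidates, cases)
print(selected, round(best_r2, 3))
```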

As stated, the objective of this paper is to present a method for faster detection of influenza activity in China using search query data. China's MOH typically releases monthly influenza case data 1–2 weeks into the next month. We therefore aim to provide estimates of case data before the MOH data is publicly available.

The strongest correlations between the composite index and the case data are observed at lag 0 (ρ = 0.959) and lag 1 (ρ = 0.658). Correlations at lags 2 and 3 are 0.491 and 0.227 respectively. We therefore fit the following model:

$$\mathrm{ICD}_t = b_0\,\mathrm{ICD}_{t-1} + b_1\,\mathrm{index}_t + b_2\,\mathrm{index}_{t-1} + e_t \qquad (2)$$

where ICD represents influenza case data, $b_0$, $b_1$, $b_2$ are the coefficients, index is the composite search index and $e$ is the error term. The model estimates ICD at time $t$ based on ICD at time $t-1$ and the composite search index at times $t$ and $t-1$. For example, case counts for February 2012 are estimated at the end of February based on the composite search index for February and January, and the case count for January. We also examine the residuals to evaluate the adequacy of the model.
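A minimal sketch of fitting a model of this form by ordinary least squares follows. It assumes three coefficients and no intercept, with the coefficients attached to ICD at t-1, the index at t and the index at t-1 in the order the text lists them. The helper names and the toy series are hypothetical; the toy coefficients (0.5, 2.0, -1.0) are chosen only so that the fit can be checked, and are unrelated to the reported estimates.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_lagged_model(icd, index):
    """Least-squares fit of ICD_t = b0*ICD_{t-1} + b1*index_t + b2*index_{t-1}."""
    rows = [(icd[t - 1], index[t], index[t - 1]) for t in range(1, len(icd))]
    target = icd[1:]
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(3)]
    return solve_linear(xtx, xty)


def predict_next(coeffs, icd_prev, index_now, index_prev):
    """One-step estimate of ICD_t from last month's cases and two index values."""
    b0, b1, b2 = coeffs
    return b0 * icd_prev + b1 * index_now + b2 * index_prev


# Toy series generated from known coefficients so the fit can be verified.
index = [10, 14, 9, 20, 25, 12, 18, 30]
true_b = (0.5, 2.0, -1.0)
icd = [50.0]
for t in range(1, len(index)):
    icd.append(predict_next(true_b, icd[-1], index[t], index[t - 1]))

coeffs = fit_lagged_model(icd, index)
print([round(b, 6) for b in coeffs])
```

With real data the residuals would not be zero, and a statistics package would normally be used to obtain standard errors and P-values alongside the coefficients.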

The influenza case data is divided into a fitting set and a validation set. Data from March 2009 to December 2011 is used for model fitting, while data from January 2012 to August 2012 is used for validation. We also consider models with second and third order lags. Models are evaluated based on R-squared, AIC and significance of the coefficients. Studies have suggested that solely using an extrapolation of the influenza activity curve for predictions usually results in a higher error rate [32], [33]. The analysis is performed using the EViews software.

Based on the filtering analysis, 14 out of the 94 keywords are not related to influenza epidemics, 20 keywords do not have sequential time series due to low search volume and only 40 keywords are significantly correlated to the case data (see Table 2). With the stepwise approach, only 8 of the 40 keywords are used in the composite search index (see Table 3). The estimated cross-correlation coefficient between the search index and influenza case data is 0.96 at lag 0 (Figure 1). Influenza epidemics are observed in the spring and winter as expected. Note that the search index clearly captures the peaks and troughs of the influenza time series curve, thereby making it a good indicator for influenza activity in China.
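The lagged cross-correlation between the index and the case data can be computed as in the following sketch, where lag k pairs the index at month t-k with the cases at month t. The function names and toy series are hypothetical; the toy cases are built to follow the index by exactly one month, so the lag-1 correlation is 1.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(index, cases, lag):
    """Correlation between index at time t - lag and cases at time t."""
    if lag == 0:
        return pearson(index, cases)
    return pearson(index[:-lag], cases[lag:])


# Toy data: from the second month on, cases equal 10x the previous index value.
index = [1, 4, 2, 8, 5, 7]
cases = [30, 10, 40, 20, 80, 50]
for lag in range(3):
    print(lag, round(lagged_correlation(index, cases, lag), 3))
```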

The coefficients $b_0$, $b_1$ and $b_2$ for model (2) are 0.56 (P = 0.001), 0.25 (P < 0.001) and −0.14 (P = 0.004) respectively. The model's R-squared is 0.95 and the AIC is 18.50. In addition, the Durbin-Watson test statistic is 1.89, suggesting that autocorrelation is not an issue (see Table 4). The null hypothesis of the Durbin-Watson test is that the autocorrelation parameter is zero.
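For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. It ranges from 0 to 4, and values near 2 indicate little first-order autocorrelation. A minimal sketch with made-up residuals (not those of the fitted model):

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den


# Perfectly alternating residuals give a value above 2 (negative autocorrelation);
# identical residuals give 0 (strong positive autocorrelation).
alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
constant = durbin_watson([1.0, 1.0, 1.0, 1.0])
print(alternating, constant)
```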

The model is validated by predicting influenza cases one month at a time, from January 2012 to August 2012. The results are shown in Figure 2 and Table 5. The mean absolute percent error of prediction for the eight consecutive months is 10.6% (see Table 5). We also consider models with second-order and third-order lags, but neither performs better than model (2) (see Tables S1 and S2).
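The mean absolute percent error (MAPE) used for validation averages the absolute prediction error relative to the observed value over the validation months. A minimal sketch with made-up numbers (not the values in Table 5):

```python
def mape(actual, predicted):
    """Mean absolute percent error, in percent."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Toy observed and predicted monthly counts; each prediction is off by 10%.
observed = [100.0, 200.0]
estimated = [110.0, 180.0]
print(round(mape(observed, estimated), 2))
```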

We develop a comprehensive method for pre-processing Internet search data for modeling and detecting influenza epidemics in China. The combined keyword index is significantly correlated to the case data and mean absolute percent error of predicting 2012 monthly influenza cases is less than 11% based on one-step predictions for eight months. Although the monthly search query data and influenza case data are almost synchronous, the search query data can still be used in detecting influenza cases because of the time delay of official reports.

This study contributes to the pool of novel sources of data, such as web-based data, used as early indicators for disease outbreaks. To our knowledge, this is the first study utilizing Baidu search query data in conjunction with this approach for estimating influenza activity in China. Baidu has a significantly higher market share than Google in China, thereby making it a better search query source for this study. The proposed approach is not meant to replace actual estimates of influenza cases; rather, it is an indicator of influenza activity, which is freely available in near real-time. This is especially relevant for a country such as China, which has been termed the "epicenter of influenza" [39] by some.

However, there are several limitations to using search query data. Although the selected keywords perform well at capturing the temporal trend of the epidemic curve, there is no guarantee that this performance will hold in the future. Individual behavior is constantly changing and different factors influence the keywords that individuals query. Another limitation is the unavailability of Internet access in rural regions. The China Internet Network Information Center (CNNIC) currently estimates Internet penetration in China at 39.9%. Surveillance using web-query data depends on adequate Internet access. In addition, not all searches on influenza-related terms are necessarily linked to influenza morbidity. Search queries can be a result of panic during a novel respiratory outbreak, coverage of influenza-related deaths in the media, fear or curiosity. Using several years of data in modeling should hopefully mitigate occurrences of panic-induced searches since the weight of various keywords is likely to deviate from one influenza season to another. Furthermore, correlation does not imply causation, which suggests that predictions made using such novel data sources should be carefully evaluated.

Limitations also exist in the data used in this study. Influenza-like-illness data might be a better indicator of influenza activity since influenza cases are not always confirmed and case data might underestimate the true burden of the disease. However, China's Ministry of Health only releases influenza case data for the entire country. In addition, there are likely to be major differences in timing and duration of epidemics from province to province. Analysis at the province level would therefore be more beneficial.

Unfortunately, both the case data and search query volume are only available for the entire country. However, the model could readily be extended to detect influenza activity at the province level if such data became available. Although limitations exist, having more methods and resources geared towards infectious disease surveillance provides a step towards rapid detection and control of emerging and re-emerging outbreaks. Public health scientists and epidemiologists could use observations from such approaches as an indicator for further investigations. These tools are freely available in near real-time and can be especially valuable in regions where official reports of case counts are delayed.

Conceived and designed the experiments: JSB QY. Analyzed the data: QY. Contributed reagents/materials/analysis tools: GP BL. Wrote the paper: QY EON. Developed and evaluated the model: QY EON. Edited and revised the manuscript: JSB QY EON RC.

Influenza (Seasonal), WHO website

HealthMap: global infectious disease monitoring through automated classification and visualization of Internet media reports

Pandemics in the age of twitter: content analysis of tweets during the 2009 H1N1 outbreak

The use of twitter to track levels of disease activity and public concern in the U.S. during the Influenza A H1N1 pandemic

Flu detector: tracking epidemics on twitter

Towards detecting influenza epidemics by analyzing twitter messages

Twitter catches the flu: detecting influenza epidemics using twitter

You are what you tweet: analyzing twitter for public health

Twitter Informatics: Tracking and Understanding Public Reaction during the 2009 Swine Flu Pandemic

Digital Disease Detection – Harnessing the Web for Public Health Surveillance

Using social media for disease surveillance

Influenza A (H1N1) virus, 2009 - online monitoring

Social and news media enable estimation of epidemiological patterns early in the 2010 Haitian cholera outbreak

Trending Now: Using social media to predict and track disease outbreaks

Vision: towards real time epidemic vigilance through online social networks: introducing SNEFT – social network enabled flu trends

Social network sensors for early detection of contagious outbreaks

Infodemiology: tracking flu-related searches on the web for syndromic surveillance

Detecting influenza epidemics using search engine query data

Improving the timeliness of data on influenza-like illnesses using Google search data

Google trends: a web-based tool for real-time surveillance of disease outbreaks

Google econometrics and unemployment forecasting

Predicting Initial Claims for Unemployment Benefits. Google technical report. Google user content website

Predicting unemployment in short samples with internet job search query data

'Google it!' Forecasting the US unemployment rate with a Google job search index. ISER Working Paper Series

Predicting the present with Google Trends

Forecasting Private Consumption: Survey-based Indicators vs. Google Trends. Ruhr Economic Papers 0155

Google Searches as a Means of Improving the Nowcasts of Key Macroeconomic Variables. Discussion Papers of DIW Berlin 946

The Future of Prediction: How Google Searches Foreshadow Housing Prices and Quantities

Prediction of influenza incidence by using ARIMA. China Tropical Medicine

Prediction of Influenza-like Illness Using Auto-regression Model

Surveillance of influenza in Zhejiang

A preprocessing method of Internet search data for prediction improvement: application to Chinese stock market. Knowledge Discovery and Data Mining

Indicators of Business Expansions and Contractions

The Contribution of Economic Indicator Analysis to Understanding and Forecasting Business Cycles

Decision-making with the AHP: Why is the principal eigenvector necessary

Seowhy forum website

Global epidemiology of influenza: Past and present

Is China an influenza epicentre?