key: cord-348847-53s19r16 authors: Lu, T.; Reis, B. Y. title: Internet Search Patterns Reveal Clinical Course of Disease Progression for COVID-19 and Predict Pandemic Spread in 32 Countries date: 2020-05-06 journal: nan DOI: 10.1101/2020.05.01.20087858 sha: doc_id: 348847 cord_uid: 53s19r16 Effective public health response to COVID-19 relies on accurate and timely surveillance of local pandemic spread, as well as rapid characterization of the clinical course of disease in affected individuals. De novo diagnostic testing methods developed for emergent pandemics are subject to significant development delays and capacity limitations. There is a critical need for complementary surveillance approaches that can function at population-scale to inform public health decisions in real-time. Internet search patterns provide a number of important advantages relative to laboratory testing. We conducted a detailed global study of Internet search patterns related to COVID-19 symptoms in multiple languages across 32 countries on six continents. We found that Internet search patterns reveal a robust temporal pattern of disease progression for COVID-19: Initial symptoms of fever, dry cough, sore throat and chills are followed by shortness of breath an average of 5.22 days [95% CI 3.30-7.14] after symptom onset, matching the precise clinical course reported in the medical literature. Furthermore, we found that increases in COVID-19-symptom-related searches predict increases in reported COVID-19 cases and deaths 18.53 days [95% CI 15.98-21.08] and 22.16 days [95% CI 20.33-23.99] in advance, respectively. This is the first study to show that Internet search patterns can be used to reveal the detailed clinical course of a disease. These data can be used to track and predict the local spread of COVID-19 before widespread laboratory testing becomes available in each country, helping to guide the current public health response. 
Accurate real-time surveillance of local disease spread is essential for effective pandemic response, informing key public health measures such as social distancing and closures, as well as the allocation of scarce healthcare resources such as ventilators and hospital beds. 1 2 During the current COVID-19 pandemic, surveillance has primarily relied on direct testing of individuals using a variety of de novo laboratory testing methods developed specifically for the emergent pandemic. 3 While laboratory testing serves as an important gauge of epidemic spread, it suffers from a number of important limitations:
1. During the early stages of a novel pandemic, laboratory tests specific to the pandemic do not yet exist and therefore must be developed de novo, leading to significant delays before test-based surveillance can begin.
2. Even once a test has been developed, scaling of manufacturing, distribution and test processing capacity takes a significant amount of time, resulting in limited testing capacity in many countries. 4 It is therefore difficult to achieve population-scale surveillance with laboratory testing in the crucial early stages of an emergent pandemic.
3. Even after tests are widely available, delays in test administration and processing make laboratory testing a lagging indicator relative to disease onset. 5
4. Laboratory testing often requires individuals to leave home and congregate at testing centers, increasing exposure both for those being tested and for the health workers administering the tests.
5. Laboratory tests are typically invasive, involving blood draws or deep nasal swabs.
6. Laboratory testing can be expensive, especially when testing large numbers of people. 5
Alternative surveillance approaches are needed to overcome these limitations and serve as a complement to laboratory testing, especially during the critical early stages of a pandemic.
Aggregated de-identified Internet search patterns have been used to track a wide range of health phenomena, including influenza, 6 MERS, 7 measles, 8 abortion 9 and immunization compliance, 10 and are a potential alternative source of information for surveilling pandemic spread. Previous uses of these data have yielded valuable lessons in their appropriate use, with an emphasis on avoiding non-specific search terms and avoiding complex models that tend towards overfitting. 11 6 12 When harnessed appropriately, Internet search patterns possess a number of powerful advantages relative to laboratory testing:
17 These studies relied on a limited number of patients, and were published weeks and months after the initial spread of the pandemic. It would thus be beneficial to pandemic tracking, case diagnosis and treatment if these clinical patterns could be ascertained earlier and at population scale. We conducted a detailed global study across 32 countries on six continents to determine whether Internet search patterns can provide reliable real-time indicators of local COVID-19 spread, and whether these data can reveal the clinical progression of COVID-19.
We selected 32 countries from diverse regions of the world (Table 1), in which robust search data were available for the search terms of interest. We obtained data on reported COVID-19 cases and deaths for each of these countries from a publicly available dataset maintained by the Center for Systems Science and Engineering at Johns Hopkins University. 18 We collected daily relative search volume (RSV) data on a per-country basis for the period of January 1, 2020 through April 20, 2020, from Google Trends 19 using the pytrends API. 20 Because Google has limited availability in China, we accessed search trend data there from the search engine Weibo.
21 We accessed data for the following common symptoms of COVID-19: "fever", "cough", "dry cough", "chills", "sore throat", "runny nose" and "shortness of breath", as well as the general terms "coronavirus", "coronavirus symptoms" and "coronavirus test". We also looked at less common symptoms such as loss of smell and loss of taste, but the search data on those terms were too sparse for many countries. Since daily search data are inherently noisy, all search data were smoothed with a 5-day moving average. 22 and confirmed that robust data were available for these translated search terms on Google Trends. 19
We conducted temporal correlation studies to characterize the relationships between search data and reported COVID-19 cases and deaths. For each country and search term, we calculated the Pearson correlation coefficient between the time series of search volumes for that search term and the time series of COVID-19 cases. We then shifted the search term data by a variable lag, and identified the lag that yielded the highest correlation. We computed the mean and standard deviation of these optimal lags for each search term across all countries. We then repeated these analyses, substituting reported COVID-19 deaths for reported COVID-19 cases.
We then investigated whether Internet search data can be used to characterize the clinical course of COVID-19 symptoms over time. We used "coronavirus symptoms" as the index search term, since it peaked earliest of all the search terms. For each country, the date of peak search volume for the index search term was defined as the index date, and the dates for all other search data for that country were defined in relation to this date (Day 0, Day 1, Day 2, etc.). With the data from all countries thus aligned, cross-country ensemble average curves were calculated for each search term. (e.g.
Day 1 values for "fever" searches in each of the 32 countries were averaged together to calculate the Day 1 value of the ensemble average "fever" curve. The same was done for Day 2, etc.) The ensemble average curves for all search terms were then overlaid on one plot, providing a search-data-based view of the clinical course of illness. As above, all results were calculated using search data smoothed with a 5-day moving average. For visualization purposes a 10-day moving average provided slightly clearer plots; plots based on the 5-day moving average are included in the Supplementary Materials.
This preprint version was posted May 6, 2020. The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
We begin by presenting examples from individual countries. Figure 1 shows search volumes for the terms "fever" and "dry cough", alongside reported COVID-19 cases and deaths for China, Iran, Italy, United States and India. Even though outbreaks occurred at different times in each country, the temporal relationships between the search terms and reported COVID-19 cases and deaths are similar across countries. Figure 2a shows the lags between search volumes for the term "fever" and COVID-19-related deaths across 32 countries, along with a histogram showing the distribution of these lags. Figure 2b shows the same information for the term "dry cough". The average lags between searches and reported cases are shorter than those between searches and reported deaths, since cases are usually diagnosed and reported before deaths. The cross-country variability of the average lags between searches and reported cases is greater (higher standard deviation and larger confidence intervals) than that for reported deaths, likely because case reporting is more sensitive to local testing capacity and rates, which vary greatly between countries.
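The lag analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the trailing (rather than centered) smoothing window, the 40-day search range for the lag, and the normal-approximation confidence interval are all assumptions, since the text does not specify them.

```python
import math


def moving_average(values, window=5):
    """Trailing moving average (assumed variant); early points use a shorter window."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant series
    return sxy / math.sqrt(sxx * syy)


def best_lag(search, outcome, max_lag=40):
    """Return (lag, r): the number of days by which `search` leads `outcome`
    that maximizes the Pearson correlation. Both series must cover the same days.
    max_lag is a hypothetical search range, not a value from the paper."""
    best = (0, float("-inf"))
    for lag in range(max_lag + 1):
        if len(search) - lag < 3:
            break  # too few overlapping days to correlate
        x = search[: len(search) - lag]  # search volume on day t
        y = outcome[lag:]                # cases or deaths on day t + lag
        r = pearson(x, y)
        if r > best[1]:                  # NaN never compares greater, so it is skipped
            best = (lag, r)
    return best


def summarize_lags(lags):
    """Mean, sample standard deviation, and an approximate 95% CI of the mean
    (normal approximation; the paper's exact CI method is not stated)."""
    n = len(lags)
    mean = sum(lags) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in lags) / (n - 1))
    half = 1.96 * sd / math.sqrt(n)
    return mean, sd, (mean - half, mean + half)
```

In use, each country's smoothed search series would be passed to `best_lag` together with that country's reported cases (or deaths), and the resulting per-country lags would be passed to `summarize_lags` for each search term.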
We found that the general term "coronavirus" has greater variability in its lags to reported cases and deaths (largest standard deviation and confidence intervals) than the symptom-specific terms, as would be expected for a non-specific search term.
We examined the progression of symptom-related search terms over time in order to characterize the clinical course of COVID-19. Figure 3 shows examples from individual countries. We calculated ensemble average curves for each search term across 32 countries. Figure 4a shows the ensemble average search volumes for "fever", "cough", "dry cough" and "shortness of breath", indexed by searches for "coronavirus symptoms", alongside reported COVID-19 cases and deaths. Figure 4b shows this same analysis for the additional search terms "sore throat", "runny nose", "chills", and "coronavirus test", also indexed by searches for "coronavirus symptoms".
The clinical progression that emerges from these data presents the following picture: As the pandemic begins to take hold in a country, people search for "coronavirus symptoms" and "coronavirus test", followed by initial symptoms "fever", "cough", "runny nose", "sore throat" and "chills", followed by searches for "shortness of breath" about 5 days after the search for initial symptoms. The medical literature reports the clinical progression of COVID-19 in terms of the number of days between initial symptom onset and shortness of breath (dyspnea). Therefore, we examined a range of possible search-term-based definitions for initial symptom onset, based on various combinations of the earliest-peaking search terms "fever", "cough", "coronavirus symptoms" and "coronavirus test".
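The index-date alignment and cross-country ensemble averaging described in the methods can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and restricting the average to day offsets covered by every country is an assumption, since the text does not say how non-overlapping days were handled.

```python
def align_to_index(series, index_series):
    """Re-index a country's daily series by offset from the index date,
    i.e. the day on which the index search term peaks (Day 0).
    Returns a dict mapping day offset -> value."""
    peak_day = max(range(len(index_series)), key=lambda i: index_series[i])
    return {day - peak_day: value for day, value in enumerate(series)}


def ensemble_average(aligned_by_country):
    """Average aligned curves across countries, offset by offset.
    Assumption: only offsets present in every country are averaged."""
    common = set.intersection(*(set(a) for a in aligned_by_country))
    n = len(aligned_by_country)
    return {d: sum(a[d] for a in aligned_by_country) / n for d in sorted(common)}


def peak_offset(curve):
    """Day offset at which an ensemble curve reaches its maximum."""
    return max(curve, key=curve.get)
```

For each country, `align_to_index` would be called with a symptom series such as "fever" and that country's "coronavirus symptoms" series; the aligned curves from all countries are then passed to `ensemble_average`. Comparing `peak_offset` values of two ensemble curves gives one rough reading of the delay between symptom searches, though the per-country lag statistics reported in the text may have been computed differently.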
Table 3 shows the lags between these different definitions for initial symptom onset and searches for "shortness of breath". The average lag between the searches for "fever" and "shortness of breath" was 5.22 days [95% CI 3.30-7.14]. For "cough" it was 5.16 days [95% CI 3.13-7.18]. These lags, and lags deriving from other symptom onset definitions, are all around 5 days, precisely matching the clinical course of the disease reported in the literature. 14 15 16 17
This is the first study to show that Internet search patterns can be used to reveal the detailed clinical progression of a disease. During emergent pandemics, this level of detail can provide public health officials not only with predictions of local pandemic spread, but also with a detailed understanding of the stage of illness and the manifestations of the disease in the local environment. Recent studies have indicated that the spread and severity of the disease can be affected by local conditions, 23 24 25 and search volume data can be a valuable complementary tool in studying potential local variations in disease presentation. We found that increases in symptom-related searches such as "fever" predict increases in reported COVID-19 cases and deaths.
The use of search data is subject to a number of important limitations. 11 6 12 Internet infrastructure and digital access levels differ across countries and communities. Some countries currently lack sufficient search volume to support robust search-based tracking, though in the long term, digital access rates are increasing worldwide. Search data may be subject to socio-economic, geographic, or other biases inherent in the local digital divide.
13 26 27 Even though selecting specific search terms increases the signal-to-noise ratio, changes in search volumes for symptom-related terms such as "fever" could result not only from increases in COVID-19 cases, but also from general curiosity about the pandemic, other diseases (e.g. influenza, Lassa fever 28 ), news coverage, or other factors. Due to privacy considerations, search data are provided as aggregated relative search volumes rather than raw counts, so while it is possible to predict an increase in cases, it may be difficult to infer the magnitude of such an increase. While search data have important limitations, laboratory testing is subject to the wide range of limitations listed above, in addition to reliability issues that require certain types of laboratory tests to be repeated. 29 30
Recent studies have examined Internet search data related to the current COVID-19 pandemic. Some studies looked only at the search term "coronavirus", 31 32 33 34 while others looked at additional search terms such as "handwashing", "face masks", 35 36 "quarantine", "hand disinfection", 37 "SARS", "MERS", 38 "antiseptic", and "sanitizer". 39 None of these studies examined specific symptom-related search terms. In our analysis, we found that general non-symptom-specific search terms such as "coronavirus" have a greater variability in their relation to reported cases and deaths. This is likely because individuals seeking general information on the pandemic may search for "coronavirus" even if they are not experiencing specific symptoms themselves at that time.
A few studies have examined Internet search data using terms related to COVID-19 symptoms, but these covered only one country or a small number of countries: One study looked at "coronavirus" and "pneumonia" only in China and found that Internet searches were correlated with daily incidence of COVID-19. 40 Another looked at "COVID", "COVID pneumonia", and "COVID heart" only in the US, and found that these terms were correlated with COVID-19 daily incidence and deaths. 41 Yet another study looked at "loss of smell" in 8 countries, and found these searches to be correlated with COVID-19 cases. 42 Another study developed a model incorporating a large selection of search terms, including some symptom-related terms, in 8 countries; however, this latter study focused on optimizing the performance of a predictive model, rather than studying the detailed temporal relationships between patterns of symptoms. 43 The present study is the first to conduct a detailed investigation of multiple COVID-19 symptom-related search terms across a large number of countries. It is also the first to conduct a detailed analysis of the temporal relationships between different symptom-related searches in order to characterize the clinical course of illness for COVID-19.
Future work includes training a robust predictive model with various machine learning techniques to provide more granular predictions for increases in COVID-19 cases and deaths. Such models can also take into account additional sources of data such as news reports, testing capacity, public health mitigation measures, and climatological and air quality variables, among others.
The ability of search data to not only predict future increases in cases, but also reveal the clinical course of symptoms in emergent pandemics is significant. Given the numerous limitations of laboratory testing, search data are a valuable complementary source for population-scale tracking of pandemics in real time.
These data can be used today to guide the public health response to the COVID-19 pandemic in countries worldwide.
All data used in this study are publicly available through the sources referenced in the Methods section.
B.Y.R. and T.L. conceived of the study, supervised its conduct, and oversaw data collection. B.Y.R. and T.L. designed the study, conducted statistical analysis, drafted the manuscript and formulated the implications of the results. All authors contributed substantially to the revision of the manuscript and approved the final manuscript as submitted.
Table 1. Countries included in the study, categorized by region.
Table 3. Average lag in days from search-based symptom onset to searches for "Shortness of Breath" across 32 countries. Different search-term-based definitions for symptom onset were examined by looking at different combinations of early-peaking search terms.
Figure 1. Search volumes (purple) for the terms "fever" (left) and "dry cough" (right), alongside reported COVID-19 cases (cyan) and deaths (orange) for China, Iran, Italy, US and India. Even though outbreaks occur at different times in different countries, the relationships between the search terms and reported COVID-19 cases and deaths are similar across countries. To highlight the temporal relationships between the curves, the magnitude of each curve was independently normalized to fit the vertical dimensions of the plot.
Figure 4. (a) Search volumes for the terms "fever", "cough", "dry cough" and "shortness of breath" (black), indexed by searches for "coronavirus symptoms", shown alongside COVID-19 cases (dashed cyan line) and deaths (dashed orange line). (b) Search volumes for the terms "sore throat", "runny nose", "chills", and "coronavirus test", alongside "shortness of breath" (black), indexed by searches for "coronavirus symptoms", shown alongside COVID-19 cases (dashed cyan line) and deaths (dashed orange line).
This preprint was not certified by peer review.
Pandemic preparedness and response--lessons from the H1N1 influenza of
Pandemic (H1N1) 2009 surveillance for severe illness and response
Covid-19 mass testing facilities could end the epidemic rapidly
The burning building
Fast, portable tests come online to curb coronavirus pandemic
Reappraising the utility of Google Flu Trends
High correlation of Middle East respiratory syndrome spread with Google search and Twitter trends in
Digital epidemiology: assessment of measles infection through Google Trends mechanism in Italy
Measuring the impact of health policies using Internet search patterns: the case of abortion
Internet activity as a proxy for vaccination compliance
Google Trends: Opportunities and limitations in health and health policy research
Big data. The parable of Google Flu: traps in big data analysis
Assessing Ebola-related web search behaviour: insights and implications from an analytical study of Google Trends-based query volumes
Clinical Characteristics of 138 Hospitalized Patients With 2019 Novel Coronavirus-Infected Pneumonia in Wuhan
Clinical and epidemiological characteristics of Coronavirus Disease
Clinical features of patients infected with 2019 novel coronavirus in Wuhan
Interim Clinical Guidance for Management of Patients with Confirmed Coronavirus Disease (COVID-19)
CSSEGISandData/COVID-19
High Temperature and High Humidity Reduce the Transmission of COVID-19
Effects of temperature and humidity on the spread of COVID-19: A systematic review
Exposure to air pollution and COVID-19 mortality in the United States
Profiles of a Health Information-Seeking Population and the Current Digital Divide: Cross-Sectional Analysis of the 2015-2016 California Health Interview Survey
Beyond access: barriers to internet health information seeking among the urban poor
Mass media reportage of Lassa fever in Nigeria: a viewpoint
Development and clinical application of a rapid IgM-IgG combined antibody test for SARS-CoV-2 infection diagnosis
Covid-19: testing times
The second worldwide wave of interest in coronavirus since the COVID-19 outbreaks in South Korea, Italy and Iran: A Google Trends study
Tracking COVID-19 in Europe: Infodemiology Approach
Association of the COVID-19 pandemic with Internet Search Volumes: A Google TrendsTM Analysis
Online Information Search During COVID-19
Applications of google search trends for risk communication in infectious disease management: A case study of COVID-19 outbreak in Taiwan
Google searches for the keywords of 'wash hands' predict the speed of national spread of COVID-19 outbreak among 21 countries
Perception of emergent epidemic of COVID-2019 / SARS CoV-2 on the Polish Internet
Infodemiological Study Using Google Trends on Coronavirus Epidemic in Wuhan
Predicting COVID-19 Incidence Through Analysis of Google Trends Data in Iran: Data Mining and Deep Learning Pilot Study
Retrospective analysis of the possibility of predicting the COVID-19 outbreak from Internet searches and social media data, China
Trends and prediction in daily incidence and deaths of COVID-19 in the United States: a search-interest based model
The use of google trends to investigate the loss of smell related searches during COVID-19 outbreak
Tracking COVID-19 using online search
Symptom Onset Search-Term-Based Definition Days to Fever Fever
The authors declare no competing interests.