key: cord-0852295-9fhn2sug
authors: Liu, Zhimin; Jiang, Zuodong; Kip, Geoffrey; Snigdha, Kirti; Xu, Jennings; Wu, Xiaoying; Khan, Najat; Schultz, Timothy
title: An Infodemiological Framework for Tracking the Spread of SARS-CoV-2 Using Integrated Public Data
date: 2022-04-26
journal: Pattern Recognit Lett
DOI: 10.1016/j.patrec.2022.04.030
sha: 75830ff36ae95dac97339eff5663764feba27bee
doc_id: 852295
cord_uid: 9fhn2sug

Abstract

The outbreak of the SARS-CoV-2 novel coronavirus has caused a health crisis of immeasurable magnitude. Signals from heterogeneous public data sources could serve as early predictors for infection waves of the pandemic, particularly in its early phases, when infection data were scarce. In this article, we characterize temporal pandemic indicators by leveraging an integrated set of public data and apply them to a Prophet model to predict COVID-19 trends. An effective natural language processing pipeline was first built to extract time-series signals of topic-specific articles from a news corpus. Bursts in these temporal signals were then identified with Kleinberg's burst detection algorithm. Across different US states, the correlations of Google Trends of COVID-19-related terms, COVID-19 news volume, and publicly available wastewater SARS-CoV-2 measurements with weekly COVID-19 case numbers were generally high, with lags ranging from 0 to 3 weeks, indicating that they are strong predictors of viral spread. Incorporating the time-series signals of these effective predictors significantly improved the performance of the Prophet model, which was able to predict COVID-19 case numbers one and two weeks ahead with average mean absolute error rates of 0.38 and 0.46, respectively, across different states.

Introduction

COVID-19, the disease caused by SARS-CoV-2, has been rapidly spreading across the globe and has become a substantial health threat worldwide. As of March 20, 2021, an estimated 122 million people worldwide had been infected with the virus, with an estimated 2.7 million deaths [1]. Accurately monitoring and forecasting the regional progression of COVID-19 can help (1) healthcare systems ensure a sufficient supply of equipment and personnel to reduce fatalities, (2) the pharmaceutical industry perform clinical trials for vaccines or medicines, and (3) governments make or adjust non-pharmaceutical interventions or vaccination plans.

Internet sources and data have been employed to inform public health and policies; this application is referred to as infodemiology (i.e., information epidemiology) [2]. Such data sources have in the past been used to nowcast and forecast outbreaks and epidemics of various infectious diseases [3]-[8]. During a pandemic, leveraging infodemiological data, especially in the early phase when there is not enough infection data to build accurate models, can be a practical way to monitor viral transmission and help governments take action more quickly. Of particular interest to infodemiology as applied to COVID-19 is news media, which serves as a crucial communication medium that can significantly affect individuals' behavior. News media data can be used to study public sentiment in response to COVID-19-related policies and vaccinations [9]. Media coverage of these topics and the corresponding sentiments can also potentially be useful predictive factors for COVID-19 cases.
To capture specific news in unstructured formats, natural language processing (NLP) techniques are required; however, commonly used topic modeling methods such as Latent Dirichlet Allocation (LDA) [10] perform poorly on COVID-19-related news because articles tend to repeat very similar vocabulary. This makes it difficult to parse specific subjects (e.g., COVID-19-related school reopening vs. lockdown). Thus, a new NLP method is needed.

Google Trends (GT) is another popular infodemiology data source that is actively used in health and medicine to track and forecast diseases and epidemics [11]. Several papers have used GT data to monitor, track, and forecast COVID-19 in the US [12]-[14]. These studies consistently identified a high correlation between GTs of COVID-19-related terms and new COVID-19 cases for a lag period ranging from 12 to 16 days, demonstrating the strong predictive power of GT for COVID-19 progression. However, most of these studies were conducted in the early stages of the pandemic; at this time, the pandemic has lasted for more than a year, with many regions in the US having experienced at least two waves. People's behaviors, such as online search activities, may change as the pandemic evolves. For example, familiarity with COVID-19-related information has increased since the beginning of the pandemic, and certain search terms may therefore fall out of interest. It is thus necessary to re-examine the leader-follower relationship between GT and COVID-19 case numbers with more recent and comprehensive data.

An emerging data source for tracking the spread of SARS-CoV-2 comes from wastewater monitoring [15]-[17]. Monitoring sewage for viral RNA concentrations enables effective population-level surveillance, providing a sensitive signal of viral circulation throughout communities. These data capture circulation without the biases introduced by PCR testing rates or unaccounted asymptomatic cases. It has been shown that viral concentrations in wastewater lead clinically diagnosed new COVID-19 cases by 0-10 days [15], [17], suggesting another predictor for forecasting COVID-19 cases. As the US government rolls out the National Wastewater Surveillance System in response to the COVID-19 pandemic, more data from different regions will be collected and reported.

While several models have been used to forecast COVID-19 cases, many of them cannot integrate multiple time-series signals. The Auto-Regressive Integrated Moving Average (ARIMA) model has been used by researchers around the world to forecast the spread of the pandemic and generates accurate predictions [18]-[20]. However, ARIMA works best when the data are stationary, meaning that the variance and mean remain constant over time, and it can only be applied to univariate time series. Kalman filtering is another algorithm used to forecast COVID-19 cases, but it only produces satisfactory short-term (daily) predictions [21], [22]. Prophet has been widely used and accepted for its accuracy and ease of use [23], [24]. Its automatic nature accommodates time-series data with dramatic changes, so users do not have to worry about their data being unsuited to the model [23]. More importantly, Prophet provides the option to integrate other time-series covariates and thus serves as an ideal model for combining various digital data streams.
In this study, we extracted informative signals from three available public datasets (news websites, Google Trends, and wastewater SARS-CoV-2 measurements) and, for the first time, integrated these predictive signals into a model to forecast COVID-19 trends. We first built an effective NLP pipeline using Word2Vec embeddings from a deep neural network pre-trained on Google News [25] to identify news on specific topics. We validated the pipeline by successfully distinguishing different groups of news and identifying "school reopen" and "lockdown" related COVID-19 news. We further identified the time points when specific news broke out abruptly using a burst detection model. We then aligned the various signals with new COVID-19 cases, examined their correlations and synchronies, and identified several signals as early indicators of enhanced spread. Finally, we integrated the selected signals into a Prophet model to predict future COVID-19 cases and demonstrated that these signals significantly improve the base model's performance.

Experimental

The number of daily cumulative confirmed cases in 19 US states was obtained for the period up to December 31, 2020 from the COVID Tracking Project (https://covidtracking.com/data/download). The "positive" column, which contains the total number of confirmed plus probable COVID-19 cases reported by each state, was used. In the Massachusetts data, one time point had a smaller cumulative case number than that of the previous week; that week's case number was replaced with the average weekly case number within that month.

The public news data in this study were obtained from NewsAPI.org, which allows searching public news and articles from over 30,000 news sources in 54 countries, including ABC, BBC, the Australian Financial Review, and others. COVID-related news in each state was acquired by keyword searches requiring a COVID term AND the specific state's name in each article's title. The COVID terms were "coronavirus", "COVID-19", "COVID19", "SARS-CoV-2", and "2019-nCov". In total, 33,083 relevant news articles were collected from the 19 states for the period December 1, 2019 to December 31, 2020.

Google Trends

Google Trends provides the relative search volume for each keyword, calculated by dividing the number of searches for the keyword by the total searches within a geographic region and time range. Keywords can be filtered by location, at resolutions from worldwide down to a specific city, and by time span. Time-series data are presented on a normalized scale of 0 to 100, where 0 represents no search activity and 100 represents peak search activity for a particular keyword or string. Daily Google Trends data were mined for this study from February 1, 2020 to December 31, 2020. The following keywords were searched: "COVID symptoms", "COVID testing", "covid rapid testing", "school opening", and "lockdown". Data for each keyword within each of the 19 selected states were obtained.
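As an illustration, the snippet below mines these keywords with the unofficial pytrends package. The paper does not say which tool it used, so the package choice, geo codes, and function calls here are assumptions; this is a minimal sketch rather than the authors' retrieval code.

```python
# Hypothetical sketch of pulling the Google Trends signals described above
# with the unofficial pytrends package (not named in the paper).
from pytrends.request import TrendReq

KEYWORDS = ["COVID symptoms", "COVID testing", "covid rapid testing",
            "school opening", "lockdown"]

def fetch_state_trends(state_code: str) -> dict:
    """Fetch relative search volumes (0-100) for one US state."""
    pytrends = TrendReq(hl="en-US", tz=360)
    series = {}
    for kw in KEYWORDS:
        # One keyword per request so each series keeps its own 0-100 scaling.
        # Note: timeframes this long are typically returned at weekly
        # resolution; daily values would require stitching shorter windows.
        pytrends.build_payload([kw], timeframe="2020-02-01 2020-12-31",
                               geo=f"US-{state_code}")
        df = pytrends.interest_over_time()
        if not df.empty:
            series[kw] = df[kw]
    return series

# e.g., ma_trends = fetch_state_trends("MA")
```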
Wastewater COVID-19 measurement

Wastewater COVID-19 data in Massachusetts (MA), Ohio (OH), and Arizona (AZ) were obtained through the following links:
MA: https://www.mwra.com/biobot/biobotdata.htm
OH: https://coronavirus.ohio.gov/wps/portal/gov/covid-19/dashboards/other-resources/wastewater
AZ: https://data.tempe.gov/datasets/covid-wastewater-resultspublic-view/data?selectedAttribute=Day
The Arizona data were limited to the city of Tempe, while the MA and OH data came from multiple sites across those states. MA has had the most consistent daily measurements since March 2020, while measurements in OH and AZ have been less frequent since July 2020 and April 2020, respectively. The units of all data were converted to copies per liter of wastewater. The MA and AZ data were averaged by week, while the OH data were aggregated weekly by taking the median measurement across the multiple sites and days to reduce the effect of extreme measurements.

Sentiment

The sentiment score for each news item was calculated by aggregating the polarity scores of each word in the title using the vaderSentiment package from NLTK.

The title, headline, and content of each news item were concatenated and all characters were converted to lower case. Punctuation, text in square brackets, words containing numbers, NLTK stop words, and state names were removed. The remaining words were tokenized and lemmatized with NLTK. Word2Vec embeddings from a deep neural network pre-trained on Google News (https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz) were imported and used to vectorize each news item as the mean of the word embeddings over all words in the article. Similarly, specific topics represented by sets of keywords were vectorized as the mean word embedding over those keywords (e.g., "school reopen": "school reopen reopening schools operating schools students teachers"; "lockdown": "lockdown restrict restrictions"). The similarity between each news item and a specific topic was measured by the cosine of the angle between the two vectors that represent them:

$$\mathrm{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}$$

where A and B represent the vectors of the news item and the topic, respectively. Cosine similarity scores of 1 and -1 represent two overlapping vectors and two exactly opposite vectors, respectively. The distributions of cosine similarity scores between the news items and the target topic were examined, and a threshold of mean + 2 × standard deviations was chosen to qualitatively identify news related to the target topic.

The sum of squared distances of samples to their closest cluster center and the mean Silhouette Coefficient of all samples were calculated with scikit-learn functions (KMeans with default settings, and silhouette_score with metric="cosine") as the vectorized news items were split into different numbers of clusters (1 to 50) using K-means. The number of clusters leading to a low intra-cluster distance and a high mean Silhouette Coefficient was used, and the most frequent words within each cluster were used to represent it.

Aggregate data by week

Daily COVID-19 news counts were summed by week, since the daily counts were generally small (< 10) in most states. The fraction of news related to a specific topic was calculated by dividing the number of target news items by the total number of news items in each week for each state. New weekly COVID-19 cases were derived by differencing the cumulative case numbers reported in each state. Weekly averages of the Google Trends of specific terms and of the wastewater COVID-19 measurements were used.
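To make the vectorization and similarity scoring above concrete, the following is a minimal, self-contained sketch assuming the GoogleNews Word2Vec binary has been downloaded locally. The two-item corpus and the simplified tokenizer are hypothetical stand-ins for the full preprocessing pipeline described in the text.

```python
# Minimal sketch of mean-of-Word2Vec vectorization and cosine-similarity
# thresholding; preprocessing (lemmatization, stop-word and state-name
# removal) is abbreviated relative to the pipeline described above.
import re
import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

def vectorize(text: str) -> np.ndarray:
    """Mean of Word2Vec embeddings over in-vocabulary tokens."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t in w2v]
    if not tokens:
        return np.zeros(w2v.vector_size)
    return np.mean([w2v[t] for t in tokens], axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical two-item corpus; in the study, each item is a concatenated
# title, headline, and content.
news_corpus = ["Schools prepare to reopen with new safety rules for students",
               "State officials announce new lockdown restrictions"]
topic = vectorize("school reopen reopening schools operating schools "
                  "students teachers")
scores = np.array([cosine_similarity(vectorize(doc), topic)
                   for doc in news_corpus])
# The study flags items above mean + 2 * SD of the score distribution.
threshold = scores.mean() + 2 * scores.std()
hits = np.where(scores >= threshold)[0]
```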
Burst model

The "burst_detection" package (https://github.com/nmarinsek/burst_detection), which implements Kleinberg's burst detection algorithm for batched data [26], was used in this study. In this model, there are two possible states: a baseline state (lower probability) and a bursty state (higher probability). The probability of the baseline state (p_0) is the overall proportion of target events:

$$p_0 = \frac{R}{D}$$

where R is the sum of the daily target news items (e.g., "school reopen" or "lockdown" related news, or news with negative sentiment) and D is the sum of all daily news items across the weeks. The bursty-state probability (p_1) is equal to the baseline probability multiplied by a constant s. Based on the news data, s = 2 was used to detect bursts of "school reopen" or "lockdown" related news, and s = 1.2 was used for COVID-19-related news with negative sentiment.

Two quantities determine which state the system is in at any given time. The first is the fit between the observed proportion and the expected probability of each state, denoted by the cost σ:

$$\sigma(i, r_t, d_t) = -\ln\left[\binom{d_t}{r_t} p_i^{r_t} (1-p_i)^{d_t - r_t}\right]$$

where i is the state (0: baseline; 1: bursty), and r_t and d_t are the number of target news items and the total number of news items in week t, respectively. The second is the difficulty of transitioning from the previous state to the next, denoted by the transition cost τ:

$$\tau(i_t, i_{t+1}) = \begin{cases}(i_{t+1} - i_t)\,\gamma \ln n, & i_{t+1} > i_t\\ 0, & \text{otherwise}\end{cases}$$

where n is the number of weeks and γ controls the difficulty of transitioning to higher states. Note that there is no cost associated with staying in the same state or returning to a lower state. γ is critical for excluding false bursts generated from small numbers of news items, and a specific γ was therefore chosen to ensure that only time points with enough news (more than the median total news count across all time points) were identified as bursts. The total cost of transitioning from one state to the other is the sum of the two functions above. The optimal state sequence q, which minimizes the total cost, was found with the Viterbi algorithm. The weight of a burst that begins at t_1 and ends at t_2 can be estimated as:

$$\text{weight} = \sum_{t=t_1}^{t_2}\left[\sigma(0, r_t, d_t) - \sigma(1, r_t, d_t)\right]$$

The burst weight measures how much cost is reduced when the system is in the bursty state rather than the baseline state during the burst period; the greater the weight, the stronger the burst.

Correlation analysis

Spearman's rank correlation coefficient was calculated to determine the correlation between weekly COVID-19 case numbers and the various weekly aggregated signals. Time-lagged cross-correlation (TLCC) was used to identify directionality between two time-series signals, such as a leader-follower relationship in which the leader (e.g., Google Trends or news volume) initiates a response that is repeated by the follower (COVID-19 case numbers). TLCC was measured by incrementally shifting one time-series vector and repeatedly calculating the correlation coefficient between the two signals. Specifically, it was implemented with pandas functionality (datax.corr(datay.shift(lag)), where datax and datay are the two time series and lag is the shifting window). The correlation analyses were performed for each state individually.

Prophet model

Because the data were aggregated by week, the daily/weekly/yearly seasonality and holiday components of the Prophet model were not used [23]. Only the time-series signals shown to be early predictors of COVID-19 trends (Google Trends of "COVID testing", "covid rapid testing", "COVID symptoms", and "lockdown", and the COVID-19-related news volume, "news count") were shifted and used as extra regressors in the Prophet model via the "add_regressor" function. An extra regressor must be known for both historical and future dates; it must therefore either have known future values or be forecast separately. Here, the time series of these early predictors were shifted by one or two weeks to generate their future values, since the correlation analyses showed that they led COVID-19 case numbers by 1 to 3 weeks. Columns with these extra regressor values were included in both the fitting and prediction data frames. The default settings of the Prophet model were used, since performance was not improved by tuning parameters such as "mode" and "prior_scale". The Prophet model was applied to the data from each state, with the data from the last five weeks used as test data to measure performance. The mean absolute percentage error (MAPE) was used:

$$\mathrm{MAPE} = \frac{1}{N}\sum_{k=1}^{N}\left|\frac{y_k - \hat{y}_k}{y_k}\right|$$

where y_k denotes the actual value and ŷ_k the predicted value for the k-th week, and N is the total number of weeks in the test data; the mean absolute error (MAE) was computed analogously, without the normalization by y_k.
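To make this setup concrete, here is a minimal sketch that fits Prophet with the shifted regressors and computes the MAPE on a five-week holdout. The data frame and column names are hypothetical; this illustrates the approach described above rather than reproducing the authors' exact code.

```python
# Sketch of the Prophet setup described above: weekly case counts with
# shifted early-indicator signals as extra regressors and the last five
# weeks held out as test data. Column names are hypothetical.
import pandas as pd
from prophet import Prophet

def fit_and_score(df: pd.DataFrame, regressors, shift_weeks=1, test_weeks=5):
    """df columns: ds (week start), y (new cases), one column per signal."""
    data = df.copy()
    for col in regressors:
        # Shift each leading signal forward in time so its "future" values
        # are known at prediction time (the signals lead cases by 1-3 weeks).
        data[col] = data[col].shift(shift_weeks)
    data = data.dropna().reset_index(drop=True)

    train, test = data.iloc[:-test_weeks], data.iloc[-test_weeks:]
    # Weekly aggregation, so seasonality components are disabled.
    m = Prophet(daily_seasonality=False, weekly_seasonality=False,
                yearly_seasonality=False)
    for col in regressors:
        m.add_regressor(col)
    m.fit(train)

    forecast = m.predict(test.drop(columns="y"))
    mape = (abs(test["y"].values - forecast["yhat"].values)
            / test["y"].values).mean()
    return mape

# e.g., mape = fit_and_score(state_df, ["covid_testing_gt", "news_count"])
```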
Results

The NLP pipeline to extract news on a specific topic such as "school reopen" is shown in Figure 1A. Each news item and each target topic were first preprocessed and then vectorized by averaging the word embeddings of the article or topic, respectively. A cosine similarity score was calculated between these two vectors, and thresholding the cosine similarity scores identified news related to the target topic. The identified news items, as well as clusters of news based on the vectors, were manually examined to validate the pipeline. The 33,083 vectorized COVID-19 news items from 19 states were divided into 13 clusters with the K-means clustering algorithm (Figure S1), and each cluster was represented by its most frequently occurring words (Figure 1B and Table S1). Different groups of news, such as "election" (cluster 2), "school" (cluster 3), "prison" (cluster 5), "sports" (cluster 7), and "vaccine" (cluster 11), were successfully characterized by clustering their vectors. Furthermore, the times of occurrence of these clusters were consistent with real events, as shown in Figure 1B. For example, "school"-related news (cluster 3) occurred in June 2020, when the second wave of COVID-19 arrived in the US and led to numerous discussions of school opening/closing. The occurrence of "sports" news (cluster 7) was centered in September, when major sports leagues started or resumed their seasons (NFL: 09/10/2020; NBA: 07/30/2020; MLB: 07/23/2020). Most "vaccine"-related news was reported at the end of the year, when the first two COVID-19 vaccines were approved for emergency use (Pfizer-BioNTech: 12/11/2020; Moderna: 12/18/2020). These results demonstrate that the vectors generated from Word2Vec embeddings accurately captured the information in the news.

To identify news on a specific topic, a cosine similarity score was calculated between the vectors representing the topic (e.g., "school reopen") and each news item. As shown in Figure 1C, news in cluster 3, representing "school", had significantly higher cosine similarity scores with "school reopen" than the other clusters. The distributions of cosine similarity scores of all news articles with "school reopen" and "lockdown" are shown in Figures S2A and S2B, respectively. An arbitrary threshold of mean + 2 × SD, shown by the red dashed lines, was used to identify the target news items. Titles of some examples related to "school reopen" and "lockdown" are shown in Figure 1D and Figure S2C, respectively. While news items with higher similarity scores were more related to the topic, most of the selected items that passed the threshold were associated with the chosen topic. This pipeline is therefore an efficient way to sort news based on search terms.
It is noteworthy that the search terms can be tweaked to obtain a more accurate vector representing a specific topic. Besides "school reopen"- and "lockdown"-related news, many other signals were extracted from the news, GT data, and viral measurements in wastewater, as shown in Figure 2A. Since the signals extracted from news were generally related to government policies, pivotal events, or public opinion, which could influence COVID-19 trends, Kleinberg's burst detection model [26] was used to detect "bursts of activity" when these signals increased sharply (see Burst model in Experimental), aiding in the monitoring of epidemic spread (Figure S3). All signals were then aligned with COVID-19 case numbers, and their correlations were examined in each state (Figures 2B and S4). Across states, GT of COVID-19-related terms (e.g., "COVID testing", "covid rapid testing", "COVID symptoms") and the wastewater COVID-19 measurements correlated well with COVID-19 cases (Figure 2B and Figure S4), while the signals obtained from news and their bursts had variable correlations with COVID-19 cases (Figures S3 and S4). In some states, such as Massachusetts and Arizona, the correlation of news volume with COVID-19 cases was high (Figures 2B and S4). In addition, many redundant signals correlated well with each other, as shown in Figure S5. Notably, the counts of various specific news topics generally correlated well with the total news count, indicating that topic-level count signals were biased by the total number of news items; ratio signals normalized by the total news count were therefore used in the subsequent analyses.

To further identify signals that could be predictors of COVID-19 case numbers, the time-lagged cross-correlation (TLCC) between each signal and the COVID-19 case numbers was calculated (see Correlation analysis in Experimental). Essentially, the correlation between two time-series signals was repeatedly computed as one signal was incrementally shifted. If the peak correlation is at the center (offset = 0), the two signals are most synchronized with no lag; the peak correlation occurs at a nonzero offset if one signal leads the other. While there were large variations in the offsets between news-derived signals and COVID-19 cases, the offsets of GT of "COVID testing", "covid rapid testing", "COVID symptoms", and "lockdown", and of the COVID-19-related news volume ("news count"), with COVID-19 case numbers were generally consistent, with median values of -3 to -1 weeks across states (Figures 3A-G and S6), indicating a leader-follower relationship between these signals and COVID-19 cases. Despite small sample sizes and inconsistent sampling, the wastewater COVID-19 measurements ("mean_copies_per_liter") in three states were synchronized with the COVID-19 case numbers, consistent with a previous study [15]. These analyses indicated that GT of "COVID testing", "covid rapid testing", "COVID symptoms", and "lockdown", and the news volume ("news count"), could potentially serve as early predictors of COVID-19 cases.

Figure 4A shows the mean absolute percentage errors (MAPEs) of one-week-ahead predictions of COVID-19 cases in each state using Prophet models incorporating different signals. As shown in Figure 4B, although including some signals as a single extra regressor did not improve the model's prediction accuracy, the error rate was reduced when all chosen signals were integrated into the model.
When only the two signals (GT of "COVID testing" and "COVID symptoms") that yielded smaller MAPEs as single extra regressors were used, the mean error rate was almost the same as that of the model using all signals. However, in states with large mean absolute errors (MAEs), such as California, Ohio, Tennessee, Illinois, Massachusetts, and Arizona, the model using the two selected signals performed much worse than the model including all signals (Figure S7). Including the wastewater measurements significantly improved the model's accuracy, even though these data were limited to three states (Figure 4C). The MAPEs of predictions at different horizons are shown in Figure 4D and Figure S8. Although the mean prediction error generally grew as the prediction horizon increased, the median MAPE at two weeks was the same as at one week. More importantly, in many states with large numbers of COVID-19 cases, such as California, Texas, and Florida, the two-week MAPEs were much smaller than the one-week MAPEs. These results demonstrate that these signals can be used to predict COVID-19 cases two weeks in advance, consistent with the results of the synchrony analysis.

Discussion

In this article, we aimed to extract more granular signals from available public data to forecast future SARS-CoV-2 spread. We built a simple NLP pipeline that can accurately extract news on a specific topic. With the pretrained Word2Vec embeddings, this pipeline can identify news on any specific topic, including public sentiment on various containment policies, and build a time-series dataset for that topic, an approach that is novel and extensible to a variety of scenarios. On top of these time-series signals, we applied a burst model to mark the bursty time points when special events broke out. From the signal of news with negative sentiment, the burst model successfully captured the time point when the first confirmed COVID-19 death was reported in several states (data not shown). We did not find good correlations between the signals obtained from news data and COVID-19 cases, indicating that there were many other confounding factors underlying these signals. For example, the amount of news was small even for large states, owing to the strict filtering step (i.e., the title must contain the state name and a COVID-19-related term) used to obtain regional COVID-19 news. In the future, we may use alternative approaches, such as a classification model [27], to obtain more news data. In addition, this pipeline can be applied to other related but more abundant data sources, such as Twitter and Facebook posts, to extract informative signals.

In contrast to the news signals, Google Trends (GT) correlated well with COVID-19 case numbers and, more importantly, led the COVID-19 case numbers by 1-2 weeks in almost every state examined, consistent with previous studies [13], [14]. Interestingly, the lag between GT and COVID-19 case numbers has a range very similar to the virus incubation period [28], suggesting that people may search for COVID-19 because they suspect they have been exposed to the virus, even while asymptomatic. We incorporated several GT signals as extra regressors in the Prophet model to predict future COVID-19 case numbers and found that they significantly improved the model's performance. Based on these results, it would be worthwhile to integrate these signals with other COVID-19 forecasting models [29], [30].
With limited data, we also demonstrated that the signal from the wastewater COVID-19 measurements aligned with COVID-19 case numbers and could further reduce the Prophet model's prediction errors, especially in Massachusetts, which provided an adequate and consistent number of measurements. As more states and counties apply this technology [17], more data will be generated and reported, providing another valuable dataset that can be incorporated into COVID-19 forecasting models.

As this framework is scaled up to cover more regions or longer time spans, one issue we might confront is sparse time-series data, especially for the news signals; for example, some regions might be under-reported in the news. One potential solution is to make the data less granular by deriving the time sequence by month or week instead of the actual date, though the resulting predictions will also be less granular and hence less useful. Another solution worth exploring is predicting the missing data with latent factor models, which are commonly used in recommender systems to quantify user-item preferences [31], [32]. This would enable many further analyses, though the predictions of these missing values must be carefully validated. Here, we integrated the various signals into an additive Prophet forecasting model, which might not make the best use of these data. For example, burst signals, being binary, did not correlate well with the continuous COVID-19 case numbers and thus were not included as regressors in the model. A model designed specifically to integrate such signals, like that of [33], may therefore generate more accurate predictions.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] An interactive web-based dashboard to track COVID-19 in real time.
[2] Infodemiology and Infoveillance: Framework for an Emerging Set of Public Health Informatics Methods to Analyze Search, Communication and Publication Behavior on the Internet.
[3] Associations of Topics of Discussion on Twitter With Survey Measures of Attitudes, Knowledge, and Behaviors Related to Zika: Probabilistic Study in the United States.
[4] Comparing Social media and Google to detect and predict severe epidemics.
[5] Risk of MERS importation and onward transmission: a systematic review and analysis of cases reported to WHO.
[6] SARS and Population Health Technology.
[7] The internet and the anti-vaccine movement: Tracking the 2017 EU measles outbreak.
[8] Too Far to Care? Measuring Public Attention and Fear for Ebola Using Twitter.
[9] CoronaTracker: World-wide COVID-19 Outbreak Data Analysis and Prediction.
[10] Online Learning for Latent Dirichlet Allocation.
[11] Assessing the Methods, Tools, and Statistical Approaches in Google Trends Research: Systematic Review.
[12] COVID-19 predictability in the United States using Google Trends time series.
[13] Correlations Between COVID-19 Cases and Google Trends Data in the United States: A State-by-State Analysis.
[14] Association of the COVID-19 pandemic with Internet Search Volumes: A Google Trends™ Analysis.
[15] Measurement of SARS-CoV-2 RNA in wastewater tracks community infection dynamics.
[16] Temporal dynamics in viral shedding and transmissibility of COVID-19.
[17] SARS-CoV-2 titers in wastewater foreshadow dynamics and clinical presentation of new COVID-19 cases.
[18] Application of the ARIMA model on the COVID-2019 epidemic dataset.
[19] ARIMA and NAR based prediction model for time series analysis of COVID-19 cases in India.
[20] ARIMA modelling & forecasting of COVID-19 in top five affected countries.
[21] Short-term forecasts and long-term mitigation evaluations for the COVID-19 epidemic in Hubei Province, China.
[22] Kalman filter based short term prediction model for COVID-19 spread.
[23] Forecasting at scale.
[24] Time series analysis and forecasting of coronavirus disease in Indonesia using ARIMA model and PROPHET.
[25] Efficient Estimation of Word Representations in Vector Space.
[26] Bursty and Hierarchical Structure in Streams.
[27] Using Reports of Symptoms and Diagnoses on Social Media to Predict COVID-19 Case Counts in Mainland China: Observational Infoveillance Study.
[28] The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application.
[29] Forecasting COVID-19 and Analyzing the Effect of Government Interventions.
[30] Interpretable Sequence Learning for COVID-19 Forecasting.
[31] A Multilayered-and-Randomized Latent Factor Model for High-Dimensional and Sparse Matrices.
[32] An α-β-Divergence-Generalized Recommender for Highly Accurate Predictions of Missing User Preferences.
[33] A machine learning methodology for real-time forecasting of the 2019-2020 COVID-19 outbreak using Internet searches, news alerts, and estimates from mechanistic models.

Acknowledgments

We thank Jose Zamalloa and Peter Z. Shen for discussions on the methods and results, and Alexandra Jacunski for carefully revising the manuscript. This research was supported by Janssen Research & Development.