key: cord-322337-4xhwm3k4 authors: Desai, P. S. title: Sentiment Informed Timeseries Analyzing AI (SITALA) to curb the spread of COVID-19 in Houston date: 2020-07-24 journal: nan DOI: 10.1101/2020.07.22.20159863 sha: doc_id: 322337 cord_uid: 4xhwm3k4 Coronavirus disease (COVID-19) has evolved into a pandemic with many unknowns. Houston, located in Harris County, Texas, is becoming the next hotspot of this pandemic. With a severe decline in international and inter-state travel, a model at the county level, as opposed to the state or country level, is needed. Existing approaches have a few drawbacks. Firstly, the data used is the number of COVID-19 positive cases instead of positivity. The former is a function of the number of tests carried out, while the latter is normalized by the number of tests. Positivity gives a better picture of the spread of this pandemic because more tests are being administered over time. Positivity under 5% has been desired for the reopening of businesses to almost 100% capacity. Secondly, the data used by models like SEIRD lacks information about the sentiment of people with respect to coronavirus. Thirdly, models that make use of social media posts might have too much noise. News sentiment, on the other hand, can capture long-term effects of hidden variables like public policy, opinions of local doctors, and disobedience of state-wide mandates. The present study introduces a new AI model, viz., Sentiment Informed Timeseries Analyzing AI (SITALA), that has been trained on COVID-19 test positivity data and news sentiment from over 2750 news articles for Harris County. The news sentiment was obtained using IBM Watson Discovery News. SITALA is inspired by Google's WaveNet architecture and makes use of TensorFlow. The mean absolute error for the training dataset of 66 consecutive days is 2.76 and that for the test dataset of 22 consecutive days is 9.6. 
The model forecasts that in order to curb the spread of coronavirus in Houston, a sustained negative news sentiment will be desirable. Public policymakers may use SITALA to set the tone of the local policies and mandates.
The data used by models like SEIRD lacks information about the sentiment of people with respect to coronavirus. Moreover, models which make use of social media posts might have too much noise. This study attempts to develop a multivariate artificial intelligence (AI) model to analyze timeseries of COVID-19 positivity and news sentiment. The AI model is inspired by Google's Wavenet (11) architecture and uses IBM Watson Discovery News (12) to mine COVID-19 sentiment in the news articles.
The COVID-19 test positivity data for Harris County was obtained from the website of the Texas Department of State Health Services (https://dshs.texas.gov/coronavirus/additionaldata.aspx). A couple of instances of bad or missing data were filled using linear interpolation. IBM Watson Discovery was used to mine the news sentiment in 2867 news articles over a period of 3 months. The query used in this study is provided in the supplementary material. The entire dataset is also provided in the supplementary material.
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license. This version posted July 24, 2020. https://doi.org/10.1101/2020.07.22.20159863 (medRxiv preprint)
Model ( ... squares) can be +1.0 for maximum positive sentiment and -1.0 for maximum negative sentiment. Overall, on most days the news sentiment about the spread of coronavirus in Houston, Harris County, Texas has been negative. A big positive spike in news sentiment is seen around the time of social unrest in Houston (05/30 to 06/02). The focus of the news might have shifted away from COVID-19 during this timeframe. 
An upward trend is visible in the COVID-19 positivity data (connected red dots). Around 75% of the data (green window) was used for training SITALA, of which 10% was reserved for validation. SITALA was tested on the remaining 25% of the data (blue window), for which the mean absolute error (MAE) was 9.6. The SITALA forecast (gray window) shows how maintaining a negative sentiment in the news about the spread of COVID-19 can be beneficial to control and eventually decrease test positivity.
The data was divided into training and test datasets comprising 75% and 25% of the entire data (i.e., 04/21 to 07/17), respectively. 10% of the training data was used for validation. The continuous black line with shadow shows the predictions of the trained SITALA over the entire dataset. SITALA is able to capture the response of the COVID-19 positivity data with a mean absolute error (MAE) of 2.76 for the training dataset and 9.6 for the test dataset. SITALA is unable to capture the highest spikes encountered in both datasets of COVID-19 positivity. This may be due to the small number of observations in the total dataset (88 days' worth of observations falls well short of big-data scale) and the smoothing effect introduced by the 16-day time window.
A maximum positive sentiment of 0.7 and a maximum negative sentiment of -0.7 were used as the bounds for future news sentiment. With these as the inputs to SITALA, two extreme forecasts were obtained through 08/07. These are also shown in Fig. 2 with dotted red and dotted green lines with shadow, respectively. SITALA forecasts the positivity of COVID-19 in Houston to lie within this uncertainty cone. 
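The data handling described above (linear interpolation of missing values, a 16-day time window, and a 75%/25% train/test split with 10% of the training data held out for validation) can be sketched as follows. This is a minimal illustration in plain Python, not the author's code. All function names are hypothetical. Two choices are assumptions, because the text does not specify them: each 16-day history is paired with the next day's positivity, and the validation days are taken from the end of the training period.

```python
from typing import List, Optional, Sequence, Tuple


def fill_linear(values: Sequence[Optional[float]]) -> List[float]:
    """Fill None gaps by linear interpolation between the nearest known values."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    if not known:
        raise ValueError("need at least one known value")
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    # Assumption: gaps at either end are filled with the nearest known value.
    for i in range(known[0]):
        out[i] = out[known[0]]
    for i in range(known[-1] + 1, len(out)):
        out[i] = out[known[-1]]
    return out


def make_windows(
    positivity: Sequence[float], sentiment: Sequence[float], window: int = 16
) -> List[Tuple[List[Tuple[float, float]], float]]:
    """Pair each `window`-day history of (positivity, sentiment) with the
    positivity of the following day (assumed one-step-ahead target)."""
    samples = []
    for end in range(window, len(positivity)):
        history = list(zip(positivity[end - window:end], sentiment[end - window:end]))
        samples.append((history, positivity[end]))
    return samples


def chronological_split(
    n_days: int, train_frac: float = 0.75, val_frac: float = 0.10
) -> Tuple[int, int, int]:
    """Return (train, validation, test) day counts, kept in time order."""
    n_train_total = round(n_days * train_frac)
    n_val = round(n_train_total * val_frac)
    return n_train_total - n_val, n_val, n_days - n_train_total


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Under these assumptions, `chronological_split(88)` gives 59 training, 7 validation, and 22 test days; 59 + 7 equals the 66-day training period reported in the text.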
A sustained positive sentiment, e.g., "masks are optional", may prove disastrous for the spread of coronavirus in Houston. On the contrary, a sustained negative sentiment, e.g., "death count for COVID-19 is growing at an alarming rate in Houston", may help to discourage social gatherings and to keep the COVID-19 positivity in check.
Discussion: This study highlights the multivariate nature of COVID-19 positivity. The unknowns about the disease have not yet been thoroughly understood. However, public policymakers can benefit from models like SITALA, which add the dimension of news sentiment to the positivity data to make forecasts. The long-term effect of sentiment due to the virus incubation period of 14 days can be captured using an AI that makes use of dilated causal convolutions. News publishers also have a bigger role to play in curbing the spread of coronavirus in Houston. SITALA is a constantly evolving AI and should be enhanced with newer data, as and when available, using transfer learning. SITALA may be deployed in other similarly crisis-stricken counties in New York, Florida, and California.
Restricting the query to articles having 'houston' in the URL may have caused the omission of a few relevant articles that did not have Houston in the URL. During the initial few days of the training dataset, there were hardly any articles relevant to the IBM Watson query, and thus the sentiment during this period was assumed to be neutral, i.e., a value of 0.
Ethics ( ...
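The dilated causal convolutions mentioned in the Discussion can be illustrated with a small sketch. It is written in plain Python rather than TensorFlow so that the mechanics are visible, and it is not the SITALA implementation. The function names are hypothetical, and the kernel size of 2 with dilation rates 1, 2, 4, 8 is an assumed configuration chosen for illustration; the text does not state the layer settings. Gating, residual connections, and nonlinearities used in WaveNet-style models are omitted.

```python
from typing import List, Sequence


def causal_dilated_conv(
    x: Sequence[float], kernel: Sequence[float], dilation: int
) -> List[float]:
    """y[t] = sum_k kernel[k] * x[t - k * dilation].

    Only current and past inputs are used (causal); positions before the
    start of the series are treated as zero.
    """
    y = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                total += w * x[idx]
        y.append(total)
    return y


def stack(
    x: Sequence[float], kernels: Sequence[Sequence[float]], dilations: Sequence[int]
) -> List[float]:
    """Apply several causal dilated convolutions one after another."""
    h = list(x)
    for kernel, dilation in zip(kernels, dilations):
        h = causal_dilated_conv(h, kernel, dilation)
    return h


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of past time steps (including the current one) seen by the stack."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Assumed configuration: four layers, kernel size 2, dilations doubling each layer.
DILATIONS = [1, 2, 4, 8]
KERNELS = [[1.0, 1.0]] * len(DILATIONS)

# A single impulse on day 0 influences exactly the next 16 outputs.
impulse = [1.0] + [0.0] * 31
response = stack(impulse, KERNELS, DILATIONS)
```

With this assumed configuration, `receptive_field(2, DILATIONS)` is 16 days, which is long enough to span the 14-day incubation period cited above. Doubling the dilation at each layer lets the receptive field grow exponentially with depth while the number of weights grows only linearly.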
News sentiment data
The data for news sentiment was obtained by querying IBM's Watson Discovery News tool. The tool provides 200 free queries per user per month. More information can be found here: https://www.ibm.com/watson/services/discovery-news/. Discovery News employs natural language processing to return answers to the queries. It also analyzes the sentiment of the news articles. The query used in this study was:
Filter which documents you query
publication_date::"2020-05-20 29",url:"houston",(enriched_text.keywords.text:"coronavirus"|enriched_text.keywords.text:"COVID-19"|enriched_text.keywords.text:"2019-nCoV")
A sample query along with the output from Watson Discovery News is shown in Fig. S1.
Fig. S1. Sample query output from IBM Watson Discovery News. The present study used the exact same query to determine the sentiment of news (-1 implies maximum negative sentiment and +1 implies maximum positive sentiment) over varying publication dates. Location-specific (viz., Houston) articles were filtered using url:"houston".
The entire dataset is reproduced in Table S1.
References:
1. Modelling and predicting the spatio-temporal spread of Coronavirus disease 2019 (COVID-19) in Italy
2. Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions
3. Application of the ARIMA model on the COVID-2019 epidemic dataset
4. Neural Network aided quarantine control model estimation of global Covid-19 spread
5. United States Department of Transportation
6. The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak
7. Pandemic politics: Timing state-level social distancing responses to COVID-19
8. Forecasting Models for Coronavirus Disease (COVID-19): A Survey of the State-of-the-Art
9. Mining Coronavirus (COVID-19) Posts in Social Media
10. Predicting COVID-19 incidence through analysis of google trends data in Iran: data mining and deep learning pilot study
11. Wavenet: A generative model for raw audio
12. The era of cognitive systems: An inside look at IBM Watson and how it works
13. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems
14. Digital technology and COVID-19
15. Multivariate temporal convolutional network: A deep neural networks approach for multivariate time series forecasting
16. Conditional time series forecasting with convolutional neural networks
17. Everyday ethics for artificial intelligence
18. On the responsible use of digital data to tackle the COVID-19 pandemic
Preprint. Desai, P. S.