EDNA-Covid: A Large-Scale Covid-19 Tweets Dataset Collected with the EDNA Streaming Toolkit
Abhijit Suprem, Calton Pu
October 6, 2020

Abstract. The Covid-19 pandemic has fundamentally altered many facets of our lives. With nationwide lockdowns and stay-at-home advisories, conversations about the pandemic have naturally moved to social networks such as Twitter. The high-volume, high-velocity, high-noise Covid-19 Twitter feed thus affords unprecedented insight into the evolution of social discourse in the presence of a long-running destabilizing factor such as a pandemic. However, real-time information extraction from such a data stream requires a fault-tolerant streaming infrastructure to perform the non-trivial integration of heterogeneous data sources from news organizations, social feeds, and authoritative medical organizations like the CDC. To address this, we present (i) the EDNA streaming toolkit for consuming and processing streaming data, and (ii) EDNA-Covid, a multilingual, large-scale dataset of coronavirus-related tweets collected with EDNA since January 25, 2020. EDNA-Covid includes, at the time of this publication, over 600M tweets from around the world in over 10 languages. We release both the EDNA toolkit and the EDNA-Covid dataset to the public so that they can be used to extract valuable insights on this extraordinary social event.

Covid-19, also known as the 2019 Novel Coronavirus disease, is caused by the SARS-CoV-2 virus. It is a rapidly spreading disease that was first reported in Wuhan, China, in December 2019, and has since spread to every continent. It was initially identified as a viral pneumonia in the Wuhan region on December 31, 2019. WHO signaled the agency's highest level of alarm on January 30, 2020, by labeling the outbreak a Public Health Emergency of International Concern, and subsequently declared it a pandemic on March 11, 2020, after reports of 118K cases in 114 countries and a 13x increase in cases outside China within 2 weeks. Since then, over 33M people have contracted the disease, with over 1M fatalities and 24M recoveries as of September 30, 2020.

The response to the pandemic has run the gamut: from complete nationwide lockdowns with strict enforcement, as in China and several Western European nations like Italy, France, and Spain; to stay-at-home advisories with decentralized enforcement in the United States; to no lockdown combined with extensive testing and contact tracing in South Korea; to no lockdown and no federal contact tracing. In conjunction, WHO and national CDCs have recommended a slew of guidelines to slow the spread and flatten the curve, so that the healthcare industry is not strained by a surge in infections. These guidelines, which include public mask mandates, social distancing, work-from-home, cancellation of public events, and shutdown of schools, have naturally led to increasing online participation as people turn to social media to carry out the conversation [9]. This sustained increase in online engagement around a single event provides unprecedented insight into a slew of areas in natural language processing, such as social communication modeling, credibility analysis, topic modeling, and fake news detection.
Our EDNA-COVID dataset, which contains over 600M tweets in over 10 languages, would be an excellent source for research into the social and language dynamics of the pandemic. Our dataset demonstrates concept drift (see Figure 5 in subsection 3.3), making it ideal for testing streaming analytics models. Data exhibits concept drift when its underlying distribution changes over time, usually over several years. Under concept drift, machine learning models and conventional offline analytics degrade as their prediction data desynchronizes from their training data. Concept drift is a natural part of real data; examples abound in nature, from changing seasons [15], which can degrade the performance of computer vision systems, to lexical drift [8], which can degrade the performance of NLP models across different geographical regions. An important requirement in concept drift research is data that exhibits such drift, to enable the development and testing of drift detection and adaptation mechanisms. With EDNA-COVID, we present a dataset that exhibits concept drift.

The online discourse on the Covid-19 pandemic has taken root in a dizzying array of online communities, such as sports [12], academia [18], and politics [1]. This gives us a firsthand look at a real-world example of concept drift as online conversations change over time to accommodate new actors, knowledge, and communities. The result is a high-volume, high-velocity data stream with noise and drift, as the underlying conversations about the pandemic transition from confusion to information, to misinformation [5], and today, with the US election nearing, to disinformation [11]. We will first present EDNA, our toolkit for consuming and processing streaming data. Then we present the EDNA-COVID dataset, the streaming methods we employed, and some salient statistics about the dataset.

EDNA is an end-to-end streaming toolkit for ingesting, processing, and emitting streaming data. EDNA is based on our prior work with LITMUS [14] and ASSED [16], and incorporates a slew of improvements for faster deployment, fault tolerance, and end-to-end management. EDNA's initial use was as a test-bed for studying concept drift detection and recovery. Over time, it has grown into a toolkit for stream analytics, and we are continuing to mature it for production clusters. We have released an alpha version at https://github.com/asuprem/edna. In this section, we describe some high-level details of the toolkit.

EDNA Job. The central abstraction in EDNA is the ingest-process-emit loop, implemented in an EDNA Job (Figure 1). Each component of the loop is an abstract primitive that is extended to create powerful operators:
1. Ingest primitives consume streaming records.
2. Process primitives implement common streaming transformations such as map and filter [2]. Multiple process primitives can be chained in the same job.
3. Emit primitives generate an output stream that can be sent to a storage sink, such as a SQL table, or to another EDNA Job.
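To make the abstraction concrete, the following minimal Python sketch shows how the three primitives could compose into a single job's loop. The class and method names (Ingest, Process, Emit, EdnaJob) are our illustrative assumptions rather than the actual alpha API; see the repository above for the real interfaces.

```python
# Illustrative sketch of the ingest-process-emit loop; the actual EDNA
# primitives at https://github.com/asuprem/edna may differ.
from abc import ABC, abstractmethod
from typing import Any, Iterable, List


class Ingest(ABC):
    """Consumes records from an upstream source (e.g., a Kafka topic)."""
    @abstractmethod
    def records(self) -> Iterable[Any]: ...


class Process(ABC):
    """A streaming transformation such as map or filter."""
    @abstractmethod
    def apply(self, record: Any) -> List[Any]: ...  # [] drops the record


class Emit(ABC):
    """Writes records to a sink (SQL table, S3, or another job's topic)."""
    @abstractmethod
    def write(self, record: Any) -> None: ...


class EdnaJob:
    """One ingest-process-emit loop; process primitives can be chained."""

    def __init__(self, ingest: Ingest, processes: List[Process], emit: Emit):
        self.ingest, self.processes, self.emit = ingest, processes, emit

    def run(self) -> None:
        for record in self.ingest.records():
            outputs = [record]
            for proc in self.processes:  # chained transformations
                outputs = [out for r in outputs for out in proc.apply(r)]
            for out in outputs:
                self.emit.write(out)
```

Under this reading, a filter is simply a Process whose apply returns an empty list for non-matching records, and a map returns a single-element list.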
EDNA Application. An EDNA Application consists of several jobs in a DAG, as in Figure 2. We apply the EDNA Job abstraction to the application as well, so an EDNA Application consists of:
1. At least one Ingest-Job to ingest a stream from external sources, such as the Twitter Streaming API. The Ingest-Job should not do any processing, to reduce backpressure and ensure the highest throughput for consuming an external stream.
2. Process-Jobs that process the stream. Each Process-Job runs its own ingest-process-emit loop to transform the stream.
3. At least one Emit-Job to emit a stream to external sinks, such as a SQL table, an S3 bucket, or a distributed file system such as Hadoop.

The EDNA stack (Figure 3) includes Apache Kafka [19], a durable message broker with built-in stream playback, to connect jobs, and Redis [3] to share information between jobs. The EDNA runtime manages and executes jobs on the underlying deployment. EDNA Jobs use the ingest, process, and emit APIs to implement the ingest-process-emit loop, with the appropriate plugins to complete the job. The next section describes our EDNA-COVID dataset and the EDNA Application we use to generate it.

We have collected the EDNA-COVID dataset since January 25, 2020 using Twitter's streaming API. Over time, we have also enriched our dataset with other similar datasets, such as [4] and [13]. EDNA-COVID is similar in scale to [4]; in addition, we provide our data download method by releasing EDNA, and we perform additional data cleaning to remove irrelevant and deleted messages. Furthermore, our dataset is an order of magnitude larger than [13], with over 60M tweets between January and March 2020 compared to 6M for [13].

We show the EDNA Application that streams tweets for the EDNA-COVID dataset in Figure 4. It consists of the following jobs:
• Twitter Ingest: This job connects to the Twitter v2 sampled stream endpoint and consumes records for the application. We use the Twitter Sampled Stream API, available at [17], which provides a real-time stream of 1% of all tweets.
• Archive: We immediately archive the raw objects to disk.
• Metadata extractor: This job extracts the tweet object from the streaming record and performs some data cleaning by discarding malformed, empty, or irrelevant tweets. Tweets are kept if they contain coronavirus-related keywords: coronavirus, covid-19, ncov-19, pandemic, mask, wuhan, and virus (a minimal sketch of this filtering step follows the job list). To capture Chinese social data, we also include these keywords in Mandarin. We initially included the keyword china during data collection in January and February, but decided to omit it because it introduced significant noise, and any tweets with that keyword that were relevant to the coronavirus already include the keywords above.
• Sentiment analysis: We use an off-the-shelf tweet sentiment analysis model from [10] to record text sentiment. We plan to replace this with an EDNA application that automatically generates and retrains a sentiment analysis model with data from Twitter's own streaming sentiment operators.
• SQL Upsert: This job inserts the tweet object into our database, or updates it if the tweet already exists. We record the original fields provided by Twitter, plus sentiment and misinformation keywords.
• Windows Group: We group one minute's worth of tweets for faster misinformation keyword checking and to record misinformation keyword statistics on a per-minute window.
• Misinformation: We check whether the grouped tweet objects contain any of the misinformation keywords extracted by the Extract Misinformation job (see below). The job regularly updates its cache of keywords from Redis.
• Misinformation Keywords Ingest: We obtain a collection of misinformation keywords from Wikipedia [20] and from [6]; the keywords from the former are obtained with a Wikipedia ingest plugin that reads the misinformation article each day. The keywords from the latter are provided directly to the job, since they are not updated and do not need to be retrieved repeatedly.
• Extract Misinformation: This job parses the misinformation article from Misinformation Keywords Ingest and extracts keywords from headlines in the Conspiracy section. All keywords are updated in the Redis cache.
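To illustrate the Metadata extractor's filtering step, here is a minimal sketch that keeps a tweet if its text, or the text of the tweet it retweets, contains one of the keywords above. The helper names and the tweet-dict fields ("text", "retweeted_status") are assumptions in the style of Twitter's v1.1 JSON; the v2 sampled stream nests its payload differently, and the Mandarin and Japanese keywords are elided here.

```python
# Hypothetical keyword filter in the spirit of the Metadata extractor;
# not the exact EDNA plugin code.
KEYWORDS = ("coronavirus", "covid-19", "ncov-19", "pandemic",
            "mask", "wuhan", "virus")


def matches_keywords(text: str) -> bool:
    """True if the lowercased text contains any tracked keyword."""
    lowered = text.lower()
    return any(kw in lowered for kw in KEYWORDS)


def keep_tweet(tweet: dict) -> bool:
    """Keep a tweet if it, or the tweet it retweets, matches a keyword."""
    if not isinstance(tweet, dict) or not tweet.get("text"):
        return False  # discard malformed or empty tweets
    if matches_keywords(tweet["text"]):
        return True
    retweeted = tweet.get("retweeted_status") or {}  # v1.1-style field
    return matches_keywords(retweeted.get("text", ""))
```

In EDNA terms, this predicate would back a filter Process primitive inside the Metadata extractor's ingest-process-emit loop.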
Even with 1% of the Twitter stream, we are able to collect a large-scale dataset of tweets. We show in Table 1 the tweets collected since January. We converted to parallelized Metadata Extractor jobs near the end of June to improve our data collection and reduce instances of dropped tweets. We also updated our keyword filtering approach to keep tweets that are retweets of matching tweets.

Our data is skewed towards English-language tweets, as we show in Table 2 with the top 5 language categories. We also included Chinese and Japanese tweets with keywords in the corresponding languages; including Chinese keywords nets us ∼25K tweets per month, which is less than 0.1% of the collected tweets, and including Japanese keywords adds ∼50K tweets per month. This includes enrichment with tweets from [4, 13].

Since EDNA-COVID is a real-time, multilingual stream of a current event, it is ideal for studying concept drift. We show an example of drift in Figure 5, which displays the fraction of the stream that matches each of our keywords. Initially, wuhan is a strong keyword for the stream, since the origin of the outbreak was an important topic of discussion; conversations about wuhan peak in the first few weeks and then decline, signifying the shift in conversations towards the disease itself. More recently, conversations about the pandemic have taken precedence over conversations about the disease, as pandemic and mask see increased weight in the stream. As masks have become a more contentious issue, there has been a resurgence in conversations about them, necessitating adjustments to data collection, cleaning, and misinformation detection starting in July 2020.

Due to the Twitter TOS regarding the release of tweets, we are releasing only the Tweet IDs of the dataset to the public, through a registration process. We have provided a form at https://forms.gle/dFYhuMzyPMunY17H9 for dataset requests; we will then provide the Tweet IDs of our collected tweets. Tweets must first be hydrated using tools like twarc [7]. We have also provided a sample of Tweet IDs for the first few months at https://github.com/asuprem/EDNA-Covid-Tweets.
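For completeness, the snippet below shows one way to hydrate the released Tweet IDs with twarc's Python API (the 1.x interface); the credential strings and file names are placeholders to replace with your own. Tweets that have since been deleted are simply absent from the output.

```python
import json

from twarc import Twarc  # pip install twarc (1.x API)

# Placeholder credentials: substitute your own Twitter API keys.
t = Twarc(consumer_key="...", consumer_secret="...",
          access_token="...", access_token_secret="...")

# ids.txt holds one Tweet ID per line, as released for EDNA-COVID.
with open("ids.txt") as ids, open("tweets.jsonl", "w") as out:
    for tweet in t.hydrate(ids):  # yields full JSON for still-live tweets
        out.write(json.dumps(tweet) + "\n")
```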
References
[1] Risk perception through the lens of politics in the time of the Covid-19 pandemic.
[2] Apache Flink: Stream and batch processing in a single engine.
[3] Redis in Action.
[4] Covid-19: The first public coronavirus Twitter dataset.
[5] The Covid-19 social media infodemic.
[6] CoAID: COVID-19 healthcare misinformation dataset.
[7] twarc.
[8] Diffusion of lexical change in social media.
[9] Social media use spikes during pandemic.
[10] Like it or not: A survey of Twitter sentiment analysis methods.
[11] Going viral: How a single tweet spawned a COVID-19 conspiracy theory on Twitter.
[12] COVID-19 spit tests used by NBA are now authorized by
[13] Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset.
[14] LITMUS: Landslide detection by integrating multiple sources.
[15] Beyond artificial reality: Finding and monitoring live events from social sensors.
[16] ASSED: A framework for identifying physical events through adaptive social sensor data filtering.
[17] Twitter Sampled Stream.
[18] School closures caused by Coronavirus (Covid-19).
[19] Building a replicated logging system with Apache Kafka.
[20] Wikipedia. 2020. Misinformation related to the COVID-19 pandemic - Wikipedia, The Free Encyclopedia.

Acknowledgments. This research has been partially funded by the National Science Foundation through the CISE/CNS (1550379, 2026945, 2039653), SaTC (1564097), and SBE/HNDS (2024320) programs, and by gifts, grants, or contracts from Fujitsu, HP, and the Georgia Tech Foundation through the John P. Imlay, Jr. Chair endowment. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the other funding agencies and companies mentioned above.