key: cord-0495844-8zcexgsj
authors: DeVerna, Matthew; Pierri, Francesco; Truong, Bao; Bollenbacher, John; Axelrod, David; Loynes, Niklas; Torres-Lugo, Cristopher; Yang, Kai-Cheng; Menczer, Fil; Bryden, John
title: CoVaxxy: A global collection of English Twitter posts about COVID-19 vaccines
date: 2021-01-19
sha: 59c6fa4144e6015e56989809340890a0248fcc54
doc_id: 495844
cord_uid: 8zcexgsj

With a large proportion of the population currently hesitant to take the COVID-19 vaccine, it is important that people have access to accurate information. However, there is a large amount of low-credibility information about the vaccines spreading on social media. In this paper, we present a dataset of English-language Twitter posts about COVID-19 vaccines. We show statistics for our dataset regarding the numbers of tweets over time, the hashtags used, and the websites shared. We also demonstrate analyses of the prevalence over time of high- and low-credibility sources, topic groups of hashtags, and geographical distributions. We have developed a live dashboard to allow people to track hashtag changes over time. The dataset can be used in studies about the impact of online information on COVID-19 vaccine uptake and health outcomes.

The COVID-19 pandemic has killed two million people and infected 93 million around the world as of mid-January 2021 (Dong, Du, and Gardner 2020). Vaccines will be critical in our fight to end the COVID-19 pandemic (Orenstein and Ahmed 2017). It is estimated that around 60-70% of the population will need to be vaccinated against COVID-19 to achieve herd immunity so that virus spread can be effectively suppressed (Aguas et al. 2020). However, recent surveys have found that only 40-60% of American adults reported that they would take a COVID-19 vaccine (Funk and Tyson 2020; Hamel, Kirzinger, and Brodie 2020).
With these currently predicted levels of vaccine hesitancy, it is unlikely we will reach herd immunity; COVID-19 will remain endemic in our population.

A possible driver for vaccine hesitancy is the antivaccination movement. This movement has been on the rise in the U.S. for two decades, beginning with unfounded fears over a Measles, Mumps and Rubella (MMR) vaccine (Hussain et al. 2018). The vocal online presence of the antivaccination movement has undermined confidence in vaccines. Worse, resistance to the COVID-19 vaccines is currently much more prevalent than resistance to the MMR vaccine. Since COVID-19 vaccine hesitancy and its drivers remain understudied, a goal of our project is to help address this gap.

* These authors contributed equally to this work. † These authors contributed equally to this work.

There is a growing body of evidence linking social media and the antivaccination movement to vaccine hesitancy (Broniatowski et al. 2018; Burki 2019; Johnson et al. 2020). Studies show that vaccine hesitancy in one's peer group is associated with future vaccine refusal (Brunson 2013), and that misinformation spread on social networks is linked to poor compliance with public health guidance about COVID-19 (Roozenbeek et al. 2020). Based on these findings, the core hypothesis behind this project is that the social spread of vaccine misinformation and vaccine hesitancy will impact public health outcomes such as vaccine uptake and COVID mortality rates.

Here we present a collection of English posts related to the COVID-19 vaccines on Twitter. The collection is exempt from IRB review as it only includes tweet IDs of public messages. This allows us to comply with the Twitter Terms of Service while making the data available to both researchers and the general public. Although there has been previous work presenting COVID-19 Twitter datasets (Chen, Lerman, and Ferrara 2020; Huang et al.
2020; Lamsal 2020), our work focuses specifically on discussion of COVID-19 vaccines and related public health outcomes.

The CoVaxxy dataset will enable researchers to study vaccine misinformation and hesitancy, and their relationship to public health outcomes. We will use established techniques to track vaccine misinformation within the data, along with misinformation superspreaders, coordinated campaigns, and automated accounts (Yang, Hui, and Menczer 2019; Yang et al. 2020; Pierri, Piccardi, and Ceri 2020a,b; Pacheco et al. 2020). We will also relate this social media data to geographic public health data (such as COVID-19 mortality and vaccine uptake rates) by using geolocation data within the dataset.

This paper describes the relevant aspects of the CoVaxxy dataset: data collection, descriptive analyses of the data and its potential usage, and a live dashboard intended for the public to track key insights drawn from the data. Opportunities and limitations of the dataset are discussed as we draw conclusions.

arXiv:2101.07694v1 [cs.SI] 19 Jan 2021

Our key data collection goal is to download a complete set of Twitter posts related to COVID-19 vaccines. In this section we describe our methodology for selecting appropriate keywords to achieve such coverage. We then describe our architecture, which uses server redundancy to maintain an unbroken stream of Twitter data containing these keywords.

To create as complete a set of Twitter posts related to COVID-19 vaccines as possible, we carefully select a list of keywords through a snowball sampling technique (Conover et al. 2012; Yang, Hui, and Menczer 2019). We start with the two most relevant keywords, i.e., covid and vaccine, as our initial seeds. Note that keywords also match hashtags, URLs, and substrings. For example, covid matches "cnn.com/covid" and "#covid19." Next, we gather tweets utilizing the filtered stream endpoint of the Twitter API for three hours.
From these gathered tweets, we then identify potential keywords that frequently co-occur with the seeds, adding them to our seed list only after manually ensuring they are closely related to our topic. This process was repeated six times between Dec. 15, 2020 and Jan. 2, 2021, with each iteration's data collection taking place at a different time of day to capture tweets from different geographic areas and demographics.

The seed list serves as our initial keyword list. We further refine the keyword list by manually combining certain keywords into composites, leveraging the query syntax of Twitter's filtered stream API. For example, using covid19 pfizer as a composite matching phrase will capture tweets that contain both "covid19" and "pfizer." On the other hand, including covid19 and pfizer as separate keywords will capture tweets that contain "covid19" or "pfizer." Constructing various composites of relevant keywords in this way ensures the dataset is broad enough to include most relevant (English) conversations while excluding tweets that are not related to the vaccine discussion.

To demonstrate the effectiveness of the snowball sampling technique introduced above, we calculate the popularity of each (single or composite) keyword by the number of unique tweets and unique users associated with it. Figure 1 shows the effect of adding new keywords to the list of streaming filters. The keywords are ranked by popularity. The diminishing growth in the number of tweets and users covered suggests that additional keywords would be largely redundant. The diminishing returns are due to the co-occurrence of multiple keywords and hashtags in a single tweet, especially for the most popular terms. Thus, we believe that our set of keywords provides reasonable coverage and is representative of tweets about COVID-19 vaccines.

As the collection of tweets is intended to persist over time, new keywords will emerge.
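
The keyword selection and matching steps described above can be illustrated with a minimal sketch. It assumes, following the description in this section, that a keyword matches as a case-insensitive substring, that all terms of a composite must be present, and that separate keywords are alternatives. The function names and the sample tweets are hypothetical, and the matching actually performed by Twitter's filtered stream may differ in detail.

```python
from collections import Counter


def matches(text, rule):
    """True if every space-separated term of the rule occurs in the text.

    A single keyword is a rule with one term; a composite such as
    "covid19 pfizer" requires both terms (logical AND).
    Matching is by case-insensitive substring (an assumption).
    """
    lowered = text.lower()
    return all(term in lowered for term in rule.lower().split())


def matches_any(text, rules):
    """Separate rules are alternatives (logical OR)."""
    return any(matches(text, rule) for rule in rules)


def candidate_keywords(tweets, seeds, top_n=5):
    """Count tokens that co-occur with any seed in the same tweet.

    The result is only a list of candidates: the procedure in the text
    adds a term to the seed list after manual review.
    """
    seeds = [s.lower() for s in seeds]
    counts = Counter()
    for text in tweets:
        if not matches_any(text, seeds):
            continue
        tokens = set(text.lower().split())
        # Skip tokens that already contain a seed (e.g. "#covid19").
        counts.update(t for t in tokens if not any(s in t for s in seeds))
    return counts.most_common(top_n)


def cumulative_coverage(tweets, ranked_rules):
    """Number of unique tweets covered after adding each rule in order."""
    covered = set()
    coverage = []
    for rule in ranked_rules:
        covered |= {i for i, text in enumerate(tweets) if matches(text, rule)}
        coverage.append(len(covered))
    return coverage


# Hypothetical sample tweets, for illustration only.
tweets = [
    "Got my covid19 shot from pfizer today",
    "pfizer earnings call this week",
    "New #covid19 vaccine trial results",
    "moderna vaccine approved",
]

print(candidate_keywords(tweets, ["covid", "vaccine"]))
print(cumulative_coverage(tweets, ["vaccine", "covid19", "covid19 pfizer"]))
```

On this sample, the coverage list is [2, 3, 3]: the composite rule adds no new tweets because the tweet it matches is already covered by an earlier keyword, which is the kind of diminishing return discussed above.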
To ensure that the keyword list remains comprehensive throughout the data collection period, our team will continue to monitor the ongoing public discussion related to COVID-19 vaccinations and update the list with important emerging keywords when necessary.

System architecture

Our server architecture (Figure 2) is designed to collect and process large quantities of data. The infrastructure is hosted on Jetstream virtual machines (VMs) (Towns et al. 2014; Stewart et al. 2015). To maintain the integrity of our tweet streaming pipeline, we have incorporated redundancy. We maintain two streamer (stream collection) VMs in different U.S. states so that if one suffers a fault we can use data from the other. These servers connect to Twitter's filtered stream API to collect tweets that match any of the keywords in real time. We use the language metadata to filter out non-English tweets. The data from the two streamers is collated on a general purpose server VM where we run data analysis. The server VM is also linked to Anonymous University's high performance computing infrastructure for running advanced analyses.

We will upload data files to a public data repository on a regular basis. In compliance with Twitter's Terms, we are only able to share tweet IDs with the public. One can rehydrate the dataset by querying the Twitter API or using tools like Hydrator or twarc. Finally, a web server provides access to the data on the server VM through applications. An example is the interactive dashboard, described next.

Existing COVID-19 visualization tools include those by Johns Hopkins University (Dong, Du, and Gardner 2020) and The Atlantic. These trackers address hospitalization and mortality. Another dashboard from the Fondazione Bruno Kessler covers the infodemic, reporting on the proportions of misinformation and epidemic-related stats (confirmed and death cases) per country.
Finally, the Our World in Data COVID-19 vaccination dataset publishes vaccine uptake information by country. A tool to concurrently explore the relationships between COVID-19 vaccine conversations, vaccine uptake, and epidemic trends is missing. We plan a web-based visualization to fill this void.

The CoVaxxy dashboard will track and quantify credible information and misinformation narratives over time, as well as their sources and related popular keywords. The dashboard will focus on the U.S. at the state level. It will be updated daily. Figure 3 illustrates a simple prototype that displays hashtag sharing behavior at the hourly level. This data will be displayed alongside COVID-19 pandemic and vaccine trends. By highlighting the connection between misinformation and public health actions and outcomes, we hope to encourage the public to be more vigilant about the information they consume on their daily social media feeds in the fight against COVID-19.

Table 1: Breakdown of the data collected between January 3rd and January 10th in terms of unique users, tweets, hashtags and URLs.

Users       Tweets      Hashtags    URLs
1,847,067   4,768,204   39,857      983,158

Our system started to gather tweets on Jan. 4, 2021. Table 1 provides a breakdown of the dataset (as of January 11) in terms of the number of unique users, number of tweets they shared, and numbers of unique hashtags and URLs contained in these tweets.

We show in Figure 4 a time series for the number of tweets collected in our dataset, on an hourly basis. The number of tweets decreases after January 6, which might be driven by the increased media attention surrounding the storming of the U.S. Capitol. In fact, the mean daily number of tweets decreases from 900k in the period of Jan 4-6 to 400k in the period of Jan 7-11.

In Figure 5 we show the distribution of the tweets geolocated in the contiguous United States. We use a naive approach to match tweets to U.S.
states: we first extract the user location from the profile (if present) and then match it against a dictionary of U.S. states. Finally, we compute the number of tweets for each state based on the activity of users geo-located in that state. Over 1M users in our dataset have location metadata in their profile; we were able to match approximately 40k users, resulting in 600k geo-located tweets. Providing an accurate methodology to geo-locate users is outside the scope of this paper; the reader should consider these results only as an illustration of the insights that can be gained from the CoVaxxy data.

Figure 6 lists the most tweeted hashtags in our dataset. We can see that they are largely related to the SARS-CoV-2 vaccine, with one ("#covidiots") referring to COVID-19 deniers.

Many different conversations can occur concurrently on Twitter, using different hashtags for different topics. To cluster related hashtags, we have grouped them together using a network algorithm. We form a co-occurrence network with hashtags as nodes and edges weighted according to how often the linked hashtags co-occur within tweets. Nodes are clustered using the Louvain method (Blondel et al. 2008). The groups whose hashtags are used the most are plotted in Figure 7. We observe groups of hashtags associated with vaccine conspiracy theories ("#greatreset," "#billgates") as well as positive messages ("#stayhome").

In Figure 8 we show the top-10 most shared websites. We exclude "twitter.com," which accounts for over 3M tweets. These sites are mostly high-credibility information sources. However, one low-credibility source, "zerohedge.com," also makes this list (see below for details on the classification). We also observe a large number of links to YouTube, which suggests further investigation will be needed to assess the nature of this shared content.

Figure 9 provides a time series for the prevalence of low- and high-credibility information.
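
The construction of the hashtag co-occurrence network described earlier in this section can be sketched as follows. This is a minimal illustration rather than the pipeline used for the dataset: the regular expression, the function names, and the sample tweets are assumptions, and the clustering step itself (the Louvain method) is not implemented here; it would be applied to the resulting weighted edge list with a community detection routine from a graph library.

```python
import re
from collections import Counter
from itertools import combinations

# Simple hashtag pattern (an assumption; Twitter's own entity
# extraction handles more cases than this expression does).
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the set of lowercased hashtags found in a tweet."""
    return {tag.lower() for tag in HASHTAG_PATTERN.findall(text)}


def cooccurrence_edges(tweets):
    """Weight each pair of hashtags by the number of tweets containing both.

    Keys are alphabetically ordered pairs, so each undirected edge
    appears exactly once.
    """
    edges = Counter()
    for text in tweets:
        tags = sorted(extract_hashtags(text))
        for pair in combinations(tags, 2):
            edges[pair] += 1
    return edges


# Hypothetical sample tweets, for illustration only.
sample_tweets = [
    "#covid19 #vaccine rollout begins",
    "Second dose done #Vaccine #COVID19 #pfizer",
    "#greatreset #billgates",
    "no hashtags here",
]

edges = cooccurrence_edges(sample_tweets)
for (tag_a, tag_b), weight in sorted(edges.items()):
    print(tag_a, tag_b, weight)
```

Counting each pair once per tweet, rather than once per occurrence, keeps a tweet that repeats a hashtag from inflating an edge weight; whether the original analysis makes the same choice is not stated in the text.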
We follow an approach widely adopted in the literature (Lazer et al. 2018; Shao et al. 2018; Bovet and Makse 2019; Grinberg et al. 2019; Yang et al. 2020) to label links to news articles based on source reliability. In particular, we use a third-party list of 675 low-credibility sources and 26 hand-selected mainstream sources. Overall, links to low-credibility sources account for 24,841 tweets, compared to 72,680 tweets linking to our sample of mainstream sources. Readers should note that these numbers do not fully capture the news circulating on Twitter, as the lists we employ cannot be exhaustive. We further list in Figure 10 the 20 most shared news websites, including both source classes. We notice several unreliable sources (cf. "zerohedge.com" and "bitchute.com") that exhibit prevalence comparable to more reliable websites.

In this paper we present a new public dataset tracking discourse about COVID-19 vaccines on Twitter. We characterize the data in several ways, including prominent keywords, geographic distribution of tweets, and clusters of related hashtags. We also present a prototype data dashboard that will visualize statistics and insights from this data.

In future work, we intend to explore the relationship between online discussion of COVID-19 vaccines and public health outcomes, like COVID-19 mortality and vaccine uptake. We will also leverage existing social media analysis tools to track emerging narratives and suspicious accounts, such as bots, coordinated campaigns, and troll farms (Yang, Hui, and Menczer 2019; Yang et al. 2020; Pierri, Piccardi, and Ceri 2020a,b; Pacheco et al. 2020). Finally, we plan to explore models to better understand how vaccine misinformation and anti-vaccine sentiment spread on social media.

[Figure 8 axis labels (No. tweets): theguardian.com, dailymail.co.uk, reuters.com, youtube.com, nytimes.com, washingtonpost.com, nypost.com, bylinetimes.com, independent.co.uk, zerohedge.com]

This dataset has a few key limitations.
First and critically, Twitter users are not a representative sample of the population, nor are their posts a representative sample of public opinions (Wojcik and Hughes 2020). The Twitter filtered stream API also imposes a rate limit of 1% of all public tweets, which could in future limit our ability to capture all the relevant content. Another potential source of bias is [...]. We are also unable to fully exclude irrelevant material using only keyword-based content filtering.

The long-term aim of this project is to tackle the ambitious challenge of linking social media observations directly to public health. We hope that researchers will be able to leverage the CoVaxxy dataset to obtain a clearer picture of how vaccine hesitancy and misinformation affect health outcomes. In turn, such insight might enable public health officials to design better strategies for confronting vaccine hesitancy and refusal.

Along with this submission we will include a data file titled covaxxy.tgz. Within that file there is a collection of tweet IDs, the list of keywords, and the lists of low- and high-credibility websites. This content will be shared on websites in the future, but these will not be anonymous, so we cannot list them here.
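
As an illustration of how the lists of low- and high-credibility websites might be used, the following minimal sketch labels a shared link by the domain of its source. The two domain sets are small placeholders, not the actual lists (only "zerohedge.com" and "bitchute.com" are named as unreliable in the text; the mainstream entries are assumed for the example), and the function names are hypothetical. Shortened links would need to be expanded before this step.

```python
from collections import Counter
from urllib.parse import urlparse

# Placeholder lists for illustration; the real lists are far longer.
LOW_CREDIBILITY = {"zerohedge.com", "bitchute.com"}
MAINSTREAM = {"reuters.com", "nytimes.com"}  # assumed examples


def domain_of(url):
    """Return the lowercased host of a URL without a leading "www."."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host


def label_url(url, low=LOW_CREDIBILITY, high=MAINSTREAM):
    """Label a link as "low", "high", or "unlabeled" by source domain."""
    domain = domain_of(url)
    if domain in low:
        return "low"
    if domain in high:
        return "high"
    return "unlabeled"


def count_labels(urls):
    """Tally labels over a collection of shared links."""
    return Counter(label_url(url) for url in urls)


# Hypothetical shared links, for illustration only.
shared_links = [
    "https://www.zerohedge.com/some-article",
    "https://reuters.com/world/some-story",
    "https://www.nytimes.com/2021/01/05/health/story.html",
    "https://example.org/blog",
]
print(count_labels(shared_links))
```

Labeling at the level of the source rather than the individual article follows the approach described in the text; as noted there, links to sources on neither list remain unlabeled, so such counts understate the news circulating on Twitter.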
Herd immunity thresholds for SARS-CoV-2 estimated from unfolding epidemics
Fast Unfolding of Communities in Large Networks
Influence of fake news in Twitter during the 2016 US presidential election
Weaponized health communication: Twitter bots and Russian trolls amplify the vaccine debate
The Impact of Social Networks on Parents' Vaccination Decisions
Vaccine Misinformation and Social Media
Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set
Partisan asymmetries in online political activity
An interactive web-based dashboard to track COVID-19 in real time
Intent to get a COVID-19 vaccine rises to 60% as confidence in research and development process increases
KFF COVID-19 Vaccine Monitor
Coronavirus Twitter Data: A collection of COVID-19 tweets with automated annotations
The Anti-Vaccination Movement: A Regression in Modern Medicine
The Online Competition between Pro- and Anti-Vaccination Views
Coronavirus (COVID-19) Tweets Dataset
The science of fake news
Simply put: Vaccination saves lives
A multi-layer approach to disinformation detection in US and Italian news spreading on Twitter
Topology comparison of Twitter diffusion networks effectively reveals misleading news
Susceptibility to misinformation about COVID-19 around the world
The spread of low-credibility content by social bots
Jetstream: A Self-Provisioned, Scalable Science and Engineering Cloud Environment
Sizing Up Twitter Users
Bot electioneering volume: Visualizing social bot activity during elections
The COVID-19 Infodemic: Twitter versus Facebook