key: cord-133143-ws708tsc authors: Xu, Wentao; Sasahara, Kazutoshi title: Characterizing the roles of bots during the COVID-19 infodemic on Twitter date: 2020-11-12 journal: nan DOI: nan sha: doc_id: 133143 cord_uid: ws708tsc An infodemic is an emerging phenomenon caused by an overabundance of information online. This proliferation of information makes it difficult for the public to distinguish trustworthy news and credible information from untrustworthy sites and non-credible sources. The perils of an infodemic became apparent with the outbreak of COVID-19, and bots (i.e., automated accounts controlled by a set of algorithms) are suspected of being involved in it. Although previous research has revealed that bots played a central role in spreading misinformation during major political events, it is unclear how bots behaved during the infodemic. In this paper, we examined the roles of bots in the case of the COVID-19 infodemic and the diffusion of non-credible information, such as the "5G" and "Bill Gates" conspiracy theories and "Trump"- and "WHO"-related content, by analyzing retweet networks and retweeted items. We show the bipartite topology of their retweet networks, which indicates that right-wing self-media accounts and conspiracy theorists may drive this opinion cleavage, while malicious bots might amplify the diffusion of non-credible information. Although the basic influence on information diffusion could be larger for human users than for bots, the effects of bots are non-negligible under an infodemic situation. (WHO) Director-General on 15 February 2020 [1]. Prior to this comment, a large amount of misinformation about the new coronavirus emerged on popular social networking sites (SNSs), and SNSs began to play a major role in the diffusion of misinformation. According to [2], in terms of information sources, top-down misinformation from politicians, celebrities, and other prominent public figures accounted for 69% of total social media engagement. 
Additionally, approximately 60% of the COVID-19-related information was reconfigured, twisted, recontextualized, and reworked on Twitter, and 38% of the misinformation was completely fabricated. The fact-checking organization PolitiFact also pointed out that true and mostly true news about the coronavirus comprised only up to 10% of the total information [3]. SNS users tend to connect to like-minded users, which is known as the "birds-of-a-feather phenomenon" or homophily [4, 5]. Meanwhile, users also tend to follow influencers and celebrities on SNSs, who behave like information hubs. Thus, when an influencer posts misinformation, the followers, often without any doubts about the content, tend to believe and share the post in a homophilic social network. In addition to human users, previous studies have shown that bots (i.e., automated accounts controlled by a set of algorithms) also play a key role in propagating misinformation. [6] discussed how bots engaged with humans to increase their influence. A well-known case in politics is the 2016 U.S. presidential election, during which bots were used to widely diffuse misinformation [7, 8]. It was estimated that of all tweeted links to popular websites, 66% were shared by bots [9]. [10] analyzed 43.3 million English tweets and, by examining n-grams and hashtags, found that bots and humans differ in their content preferences. Furthermore, it was suggested that bots may play a critical role in driving the viral spread of content from low-credibility sources and may be able to amplify misinformation [11]. [12] examined the prevalence and activities of social bots among the Twitter followers of seven German parties and found that their share increased from 7.1% to 9.9% before the 2017 electoral campaigns. Posts by bots were likely to be more polarized than those by humans [13], and therefore the potential engagement between humans and bots could further amplify misinformation diffusion. 
Recent research about the COVID-19 infodemic has discovered the information flow pathways between humans and bots [14]. There are several methods for bot characterization and identification. Botometer is a well-known tool for automatically detecting bots based on supervised machine learning models [15, 16]. Botometer examines six classes of features (profile, friends, social network, temporal activity patterns, language, and sentiment), which are further broken down into approximately 1,200 features for a Twitter account. This tool computes a "bot score" for each user that ranges within [0, 1]. The higher the score, the higher the probability that the user is a bot. Botometer is a state-of-the-art tool for identifying bots, and a series of research studies have used it to quantify the online behaviors of bots [11, 17]. Therefore, in our study we used Botometer to discriminate between bots and humans. Given this context, an important research question is how bots behaved in the spread of misinformation during the COVID-19 infodemic. To study this, we focused on Twitter retweets. Retweeting is an information-spreading behavior by which any user can share messages immediately with their followers. A retweet can be both a productive communicative tool and a selfish act of attention seekers [18]. [19] found that an interesting tweet either has interesting content or is produced (retweeted) by an influencer. [20] pointed out that a user's high popularity does not necessarily imply high influence and vice versa, indicating that an influencer's popularity and influence are weakly correlated. However, [21, 22, 23] considered that a user's contextual information (e.g., social network topology, tweet content, URLs) affects retweet behavior. In this paper, we used retweets to address information-sharing behavior, shedding light on how COVID-19 misinformation is shared in an information ecosystem where bots live. 
Misinformation is classified into several types, and conspiracy theories are one of them [24]. A negative effect of a conspiracy theory is that it elicits emotions including avoidance, fear, anger, and aggression, which can further result in irrational behaviors [24]. As an example, the 5G conspiracy theory was reported on January 22 by a local Belgian newspaper, which said that a local doctor claimed 5G might be linked to the coronavirus [25]. In the UK, 5G cell phone masts came under arson attacks due to this conspiracy theory [26]. Another version of this conspiracy theory claims that 5G alters people's immune systems and changes DNA structures, thus making people more susceptible to contracting the coronavirus [27, 28]. In addition, another popular conspiracy theory targeted Bill Gates, the co-founder of Microsoft Corporation. The claim was that Bill Gates supported implanting tracking chips under the pretext of a mandatory coronavirus vaccination [29, 30]. U.S. political groups were reported as showing a significant partisan bias regarding this conspiracy [31]; compared with the left wing, the right wing was more inclined to believe in it. To investigate the spreading of misinformation during the COVID-19 infodemic, we focused on the conspiracy theories related to 5G and Bill Gates mentioned above. For comparison, we also focused on other topics, namely "WHO" and "Trump" (the 45th U.S. president). These keywords were selected because health and political misinformation flourished during the COVID-19 infodemic. Recent research found that Trump was the largest driver of COVID-19 misinformation [32]. In this paper, we first characterized the credible and non-credible humans and bots around the four topics in the retweet networks. We then compared the retweet activities as well as other features across the four topics. Our results may help us understand the role bots played during the COVID-19 infodemic, providing insights into a mitigation strategy. 
We used Twitter as a data source to characterize the COVID-19 infodemic. We collected 279,538,960 English tweets from Feb 20 to May 31 by querying COVID-19-related keywords: "corona virus", "coronavirus", "covid19", "2019-nCoV", "SARS-CoV-2", and "wuhanpneumonia" using the Twitter Search API. As mentioned above, we focused on four topics in our analyses: "WHO", "Trump", "Bill Gates", and "5G". Based on a list of non-credible websites released on MisinfoMe 1 and a list of non-credible news website domains released in [33], we collected 893 responsive websites from a total of 1,143 domains and used them as the non-credible domain list. We also examined a list of rated trustworthy media released by [34] and obtained 30 (all responsive) credible media domains whose NewsGuard score equals 100 (the highest score). A NewsGuard score is assigned by a group of journalists who rate news websites according to nine standards. In addition, we added the major science journals "nature.com" and "sciencemag.org" as credible domains. Thus, in total, we obtained 32 credible domains and used these as the credible domain list. Based on the credible and non-credible domain lists, each tweet was labelled as "credible" if it included a URL from the credible domain list, as "non-credible" if it included a URL from the non-credible domain list, and otherwise as "other". Then, given a topic, each user was labelled as "credible" if the user retweeted credible tweets exclusively, and as "non-credible" if the user retweeted non-credible tweets exclusively. In other words, non-credible users are those who posted URLs from the non-credible domain list at least once but never posted URLs from the credible domain list. Credible users were defined analogously. Note that a user's label can change from topic to topic. 
For instance, a user is labeled "credible" in the WHO topic if the user retweets credible domains exclusively within that topic, even if the user retweets non-credible domains in other topics. In this manner, we classified users into five types: credible humans, non-credible humans, credible bots, non-credible bots, and others. After extracting tweets regarding the four topics, we obtained a total of 37,219,979 tweets, of which 23,1515,441 (82.8%) were retweets. The breakdown of this dataset is shown in Table 1. We used the Botometer API to compute user bot scores. Following [8, 35], we set the threshold to 0.43 for the human/bot classification. This means that a user was considered to be a bot if the bot score was larger than 0.43, and a human user otherwise. To examine the patterns of misinformation flows, we constructed a retweet network for each topic, in which nodes represent users and a directed edge is drawn from a source user to a target user if the target is retweeted by the source. The retweet network was visualized with the network analysis tool Gephi [36], using the graph layout algorithm ForceAtlas2 [37]. We used different colors to represent credible and non-credible bots: red nodes are non-credible bots, green nodes are credible bots, and purple nodes are others, which can be humans or unlabeled bots. Edge colors are the same as the target node colors. We highlighted users with a large indegree, including important politicians, well-known mainstream media, right-wing media, and so on. Some users in our dataset were deemed malicious and suspended under Twitter's spam policy during the gap between the date we collected the tweets and the date we computed the corresponding bot scores using the Botometer API. Such users were therefore not included in our analyses. Moreover, we compared temporal patterns of retweet activities among four types of users: credible humans and bots, and non-credible humans and bots. 
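As a sketch (not the authors' code), the tweet- and user-labelling rules and the 0.43 bot-score threshold described above could be implemented as follows; the domain lists, function names, and the precedence rule for mixed tweets are illustrative assumptions:

```python
from urllib.parse import urlparse

# Illustrative stand-ins for the MisinfoMe/NewsGuard-derived lists used in the paper.
CREDIBLE_DOMAINS = {"nature.com", "sciencemag.org"}
NON_CREDIBLE_DOMAINS = {"untrustworthy.example"}

def domain(url: str) -> str:
    """Extract the host of a URL, dropping a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def label_tweet(urls) -> str:
    """Label a tweet by the domains of the URLs it contains.

    Precedence when a tweet mixes both kinds of URL is not specified in
    the paper; here non-credible wins (an assumption).
    """
    doms = {domain(u) for u in urls}
    if doms & NON_CREDIBLE_DOMAINS:
        return "non-credible"
    if doms & CREDIBLE_DOMAINS:
        return "credible"
    return "other"

def label_user(tweet_labels, bot_score, threshold=0.43) -> str:
    """Combine per-topic exclusive credibility with the bot-score threshold."""
    kind = "bot" if bot_score > threshold else "human"
    has_credible = "credible" in tweet_labels
    has_non_credible = "non-credible" in tweet_labels
    if has_credible and not has_non_credible:
        return f"credible {kind}"
    if has_non_credible and not has_credible:
        return f"non-credible {kind}"
    return "other"
```

Note that "exclusive" here matches the paper's definition: a user may also retweet "other" tweets and still be labeled credible or non-credible, as long as they never mix the two labeled kinds.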
We also looked at differences in the contents of the retweeted URLs in each topic. Retweets in our COVID-19 dataset did not contain a sufficient amount of text; instead, they typically included hyperlinks or URLs to online articles. Thus, we focused on tweets including URLs to online articles and collected the articles retweeted by credible/non-credible humans and bots, separately. We characterized the articles based on terms (nouns), with their importance measured by the TF-IDF score. For this analysis, we limited our study to the top 30 most popular terms in the collected articles. TF-IDF stands for Term Frequency-Inverse Document Frequency and is commonly used in Natural Language Processing (NLP). TF-IDF is calculated as TF-IDF = TF × IDF, where TF is the frequency of a given term (noun) in a document. We used the following formula for IDF: IDF = log(N / d), where N represents the total number of documents and d represents the number of documents that include the term. To compare important terms used in articles retweeted by credible and non-credible users, we summarized TF-IDF values using the Laterality Index (LI) [38], defined as LI = (C − NC) / (C + NC), where C is the TF-IDF score of a term used in articles retweeted by credible users and NC is that of a term used in articles retweeted by non-credible users. LI compares the importance of a term between credible sites and non-credible sites. A negative LI indicates that the term is characteristic of non-credible sites; a positive LI indicates that the term is characteristic of credible sites; LI = 0 indicates that the term is equally important in both. Using the preprocessed COVID-19 tweets, we looked at the retweet interactions between humans and bots for each topic. The resulting retweet networks are shown in Fig. 1. It is notable that bipartite structures emerged in all the topics considered, with dense connections inside each group and sparse connections in between. In the "WHO" network (n = 88,719), 64 non-credible bots and 790 credible bots were identified. 
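The TF-IDF and LI computations defined above can be sketched as follows; this is a minimal illustration assuming documents are already tokenized into nouns, and the function names are ours:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Sum per-document TF-IDF scores for each term; docs is a list of token lists."""
    n = len(docs)
    df = Counter()                     # d: number of documents containing each term
    for doc in docs:
        df.update(set(doc))
    scores = Counter()
    for doc in docs:
        for term, tf in Counter(doc).items():
            scores[term] += tf * math.log(n / df[term])   # TF * IDF, IDF = log(N / d)
    return scores

def laterality_index(c, nc):
    """LI = (C - NC) / (C + NC): +1 means credible-only, -1 means non-credible-only."""
    return (c - nc) / (c + nc) if (c + nc) else 0.0
```

A term appearing in every document gets IDF = log(1) = 0, so ubiquitous words are discounted automatically.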
The credible group contained official media accounts such as "@washingtonpost", "@ABC", "@BBCWorld", and "@Reuters", which were separated from the non-credible group containing "@DailyCaller", "@gatewaypundit", and "@KimStrassel" (Fig. 1a). We found that non-credible bots appeared around the US conservative columnist "@KimStrassel" (Kimberley Strassel), a verified user with more than 448 thousand followers, as well as "@DailyCaller" (an American right-wing misinformation website) and "@gatewaypundit" (an American far-right website publishing misleading news). This result implies that the non-credible bots might be trying to interact with politically right-leaning users to increase these users' exposure to negative information. Although WHO itself is a neutral topic, partisan asymmetry was visible during the COVID-19 infodemic. Previous research found that the retweet network of the 2010 US midterm election showed typical "left" and "right" segregated groups [39]. We thus examined whether the "Trump" retweet network shares similar features. Fig. 1b shows the Trump network (n = 1,125,366) with 694 non-credible bots and 5,400 credible bots. Here "@HillaryClinton" (Hillary Clinton) and "@JoeBiden" (Joe Biden), representing the progressive side, clustered together and were distant from the conservative cluster around "@realDonaldTrump" (Donald Trump). The political echo chamber was thus re-observed in 2020 in the context of the COVID-19 infodemic. A notable finding is that "@realDonaldTrump" was mostly retweeted by non-credible bots (shown in red), whereas "@HillaryClinton" and "@JoeBiden" were less so. As far as "5G" is concerned, two separated groups were observed again in the retweet network, with 26 non-credible bots and 171 credible bots (Fig. 1c). One side of the network includes "@davidicke" (David Icke) and "@davidkurten" (David Kurten). 
The former is reported to be a conspiracy theorist, and the latter has been a member of the UK Independence Party (a right-wing populist party) since 2016 [40, 41, 42]. They were the two most retweeted users in the 5G conspiracy topic. By contrast, mainstream British media accounts and @WHO were located on the other side of the network in Fig. 1c. More non-credible bots were involved on the side of "@davidicke", while there were more credible bots on the other side. Although "5G" was considered a popular conspiracy theory in the early COVID-19 pandemic, we did not observe a larger number of non-credible bots in comparison with the other topics. "Bill Gates" is another conspiracy theory topic, as mentioned earlier. The resulting retweet network contains 166 non-credible bots and 467 credible bots (Fig. 1d). A similar segregated network was observed, and the non-credible bots mainly gathered on the side of "@davidicke". "@EyesOnQ" was surrounded by non-credible bots. According to our data, this is the top user retweeted by non-credible bots in both the "5G" and "Bill Gates" conspiracy topics. This account was suspended by Twitter and is no longer accessible. We then quantified indegrees (the number of times a user was retweeted by different users, used as a measure of engagement) as a function of the bot score. The resulting scatter plots are shown in Fig. 2, in which the majority of users are credible humans and most of them fall in the bot score range [0, 0.2]. It turns out that indegrees tend to be inversely related to the bot score, and on average, indegrees for humans are larger than those for bots in all the topics. Compared with humans, bots thus attracted fewer retweets in general. 
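The indegree measure used above can be computed directly from the retweet edge list; a minimal sketch with toy data (the edges and bot scores below are hypothetical, and the 0.43 threshold is the one used in the paper):

```python
from collections import Counter

# Directed edges (source retweeter -> target retweeted user), as in the
# retweet networks described earlier; toy data for illustration.
edges = [("u1", "media"), ("u2", "media"), ("u3", "u2"), ("u4", "media")]

# Indegree = number of times a user was retweeted.
indegree = Counter(target for _, target in edges)

# Hypothetical Botometer scores; users absent from the map default to 0.0.
bot_score = {"media": 0.05, "u2": 0.60}

groups = {"human": [], "bot": []}
for user, deg in indegree.items():
    kind = "bot" if bot_score.get(user, 0.0) > 0.43 else "human"
    groups[kind].append(deg)

mean_indegree = {k: sum(v) / len(v) for k, v in groups.items() if v}
```

Plotting indegree against bot score per user yields scatter plots like those in Fig. 2.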
However, average indegrees for non-credible bots are higher than those for credible bots (t-test, p-value = 0.032). These results indicate that the basic influence of retweets by non-credible humans could be larger than that by non-credible bots, but the effects of the latter are still non-negligible. There are several exceptional cases (outliers) in Fig. 2. For example, "@JoyaMia00" is a MAGA (Make America Great Again) user and "@AIIAmericanGirI" posts news for conservatives in the WHO topic (Fig. 2(a)). Interestingly, "@steve_Beno3210" and "@Prosperous1776" in the Trump topic, "@shinethelight17" in the 5G topic, and "@zsixkillerk" in the Bill Gates topic are all MAGA users and Trump supporters. Furthermore, we looked at several outliers among non-credible bots. "@badluck_jones" favors pet and political posts (Fig. 2(a)); in the Trump topic, "@bgood12345" is a "REAL ESTATE BROKER", "@Navy_Lady_45" is a "100% Trump Supporter", "@CristyFairy67" is a Trump supporter favoring posts supporting Republicans, and "@Gees_DevilsTail" is an advertiser of "Gab" with just over 1k followers (Fig. 2(b)); we picked up only "@Dianelong22" in the "5G" topic, and this user is a Trump supporter as well (Fig. 2(c)). Finally, we considered "@taxfreeok" and "@Duckyv72" as outliers in the Bill Gates topic; both are Trump supporters (Fig. 2(d)). We assumed that non-credible bots were following non-credible humans rather than credible humans, because the intention of non-credible bots would be to amplify the spread of misinformation, including conspiracy theories. Thus, we quantified temporal patterns of retweet behaviors in humans and bots. For comparison among credible/non-credible humans and bots, we scaled daily retweet counts between 0 and 1, respectively. Fig. 3 shows the daily retweet series by humans and bots for each topic, in which the patterns of retweet increases follow similar trends. 
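The min-max scaling of the daily retweet series, and the Pearson correlation used to compare them, can be sketched as follows; the daily counts here are hypothetical:

```python
import math

def minmax_scale(xs):
    """Scale a series to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical daily retweet counts for non-credible bots and non-credible humans.
bots = [3, 10, 4, 8]
humans = [30, 95, 41, 80]
r = pearson(minmax_scale(bots), minmax_scale(humans))
```

Pearson correlation is invariant to min-max scaling, so the scaling matters only for plotting the series on a common axis, as in Fig. 3.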
To confirm this observation, we measured the correlation coefficients between the temporal oscillations of retweets generated by these users. The results are summarized in Table 2. This reveals that, in all topics, the retweet series of non-credible bots correlated much more strongly with those of non-credible humans than with those of credible humans. The above assumption is therefore partially supported. We further consider this assumption in the next section by looking at commonality in retweets generated by humans and bots. Fig. 3: Retweet series generated by humans and bots in the "WHO", "Trump", "5G", and "Bill Gates" topics. Daily retweet counts are scaled between 0 and 1, respectively. Here 1 represents the maximum retweet count and 0 represents the minimum retweet count in each topic. Finally, we examined terms (nouns), domains (URLs), and users that commonly appeared in retweets generated by humans and bots. Fig. 4 shows an example comparing term importance (measured by TF-IDF) for 5G-related articles retweeted by humans and bots. In the 5G topic, "china" was a characteristic term used in the articles retweeted by non-credible humans as well as non-credible bots. 
In the Trump topic, "china" is also a high-importance term used in the articles retweeted by non-credible users (Supplementary Information 1); many of these articles were related to the news that President Trump called the coronavirus the "Chinese Virus". Similarly, "china" was a key term in the two other topics (Supplementary Information 1). Overall, the non-credible bots and non-credible humans shared 71%, 50%, 80%, and 50% of the terms (nouns) used in the retweeted articles related to the "WHO", "Trump", "5G", and "Bill Gates" topics, respectively. Another observation in Fig. 4 is that "electroporation", "new", and "world" were identified as characteristic terms of non-credible humans in the 5G topic. This suggests that non-credible humans might be diffusing a conspiracy theory saying that key people are involved in the "New World Order One World Government under the banner of Agenda 2030 Global Governance" plan to alter people's DNA using 5G, because "5G can do on a large scale what electroporation does on a small scale" [43]. We also found that both non-credible humans and bots exhibit high commonality in retweeted domains (URLs) and users; the same is true for credible humans and bots. 

Table 3 (user portion): top 15 retweeted users in the "5G" topic, by user type.
Rank  Credible humans   Credible bots     Non-credible humans  Non-credible bots
1     @BBCWorld         @BBCWorld         @WorldTruthTV        @WorldTruthTV
2     @guardian         @BBCNews          @TheOnion            @shinethelight17
3     @BBCNews          @guidaautonoma    @BANNEDdotVIDEO      @DailyPostNGR
4     @rooshv           @Reuters          @davidicke           @EyesOnQ
5     @verge            @verge            @shinethelight17     @TheOnion
6     @thehill          @guardiannews     @SputnikInt          @davidicke
7     @Reuters          @thehill          @newsthump           @Laurel700
8     @guardiannews     @rooshv           @TLAVagabond         @ctmaga20201
9     @Omojuwa          @guardian         @DailyPostNGR        @freezerohedge
10    @davidicke        @Exchange5g       @buttscornershop     @erlhel
11    @BBCTech          @nuskitconsultan  @freezerohedge       @JeanineDeal
12    @davidkurten      @davidicke        @davidkurten         @MagickAscension
13    @Exchange5g       @BBCTech          @DailyCaller         @SpecialBureau
14    @MsMelChen        @PCMag            @Ian56789            @Mindfullee
15    @SkyNews          @RichLowry        @ChristophGottel     @owhy3
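The commonality percentages reported above (e.g., 71% shared terms in the "WHO" topic) amount to a top-k set overlap. A minimal sketch, with hypothetical ranked lists drawn from handles that appear in the 5G user rankings:

```python
def topk_overlap(a, b, k):
    """Fraction of the top-k items shared by two ranked lists."""
    return len(set(a[:k]) & set(b[:k])) / k

# Hypothetical ranked lists of top retweeted users for non-credible bots and humans.
bots_top = ["@WorldTruthTV", "@shinethelight17", "@DailyPostNGR", "@EyesOnQ"]
humans_top = ["@WorldTruthTV", "@TheOnion", "@BANNEDdotVIDEO", "@DailyPostNGR"]

share = topk_overlap(bots_top, humans_top, k=4)  # 2 of the 4 items coincide
```

The same function applies unchanged to ranked lists of terms or of domains.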
Taking the 5G topic as an example, the top 15 domains and users that appeared in retweets generated by humans and bots are listed in Table 3. Here, both non-credible humans and bots tended to share the same domains and users. We confirmed similar tendencies in the other three topics (Supplementary Information 2). Table 4 summarizes the domains and users that commonly appeared in the retweets. The non-credible humans and bots share many users and domains in the four topics considered. This indicates that both humans and bots tended to follow common influential users. Taken together, non-credible bots and non-credible humans had much in common with respect to the top 15 retweeted domains and the top 15 retweeted users. These findings further support the assumption that non-credible bots were following non-credible humans rather than credible humans. In addition, we found that credible media such as "BBC" and "The Guardian" were favored news sources about "WHO" and "5G" for credible humans and bots, while US outlets, such as "CNN", "The Washington Post", and "The New York Times", were the preferred sources in the Trump and Bill Gates topics. By contrast, both non-credible humans and bots retweeted non-credible domains as well as unknown domains that are on neither the credible nor the non-credible domain list. For example, "dailycaller.com" accounted for the most non-credible-domain articles related to "WHO" and "Trump". Although "The Daily Caller", partnering with Facebook, launched a controversial fact-checking program, "CheckYourFact.com", to debunk potential misinformation and stop its propagation on Facebook [44], "The Daily Caller" itself is considered to be a right-wing misinformation spreader and has been reported to "employ anti-immigrant narratives that echoed sentiments from the alternative right and white nationalists but without explicitly racist and pro-segregation language" [45]. 
The Guardian also pointed out that "The Gateway Pundit" has gained a "White House press credential" [46]. In this paper, we investigated the roles of bots by analyzing retweet networks, temporal patterns of retweets, and retweeted contents and users during the COVID-19 infodemic. For this analysis, we focused on misinformation- and conspiracy-theory-related topics: "WHO", "Trump", "5G", and "Bill Gates". We found that the retweet networks exhibited a bipartite topology in all four topics, suggesting two types of voices in each topic: one represented by mainstream media news and the other by non-credible or partisan self-media sources and right-wing media. Although Twitter suspended many malicious accounts during the COVID-19 infodemic, a non-negligible number of non-credible bots were still active and selectively parasitic on partisan clusters. In our cases, there were 85, 2,880, 47, and 337 non-credible bots in the "WHO", "Trump", "5G", and "Bill Gates" topics, respectively. According to the indegrees, the basic influence of retweets by non-credible humans can be much larger than that by non-credible bots. Thus, bots did not play as important a role during the COVID-19 infodemic as they did in previous political events, including the 2016 US presidential election. However, we cannot simply draw this as a definitive conclusion. Rather, the clustering of non-credible bots may reflect partisan asymmetry, and the finding that non-credible bots follow non-credible humans calls for continuous monitoring of the bot information ecosystem. This is especially important for detecting coordinated acts; although we did not find evidence of such events in the current setting, they could still become a future threat with a negative societal impact. As WHO mentioned, an infodemic is a "second disease" that emerged along with COVID-19, and it is important to take immediate action to address it. 
As done in this study, social media analysis is important for gaining an overview of the infodemic and for obtaining insights into a mitigation strategy. Our study has several limitations, which need to be resolved in the future. Since Twitter suspends any accounts that it considers "malicious", we were unable to obtain a comprehensive picture of users' interactive behaviors in this study. We also had limited information about the sources of credible and non-credible domains (URLs), which need frequent updates; thus, not all of the URLs could be labelled in our analyses. The availability of credible/non-credible domain lists is a problem that requires a collective effort to solve. Despite these limitations, this study furthers our understanding of the roles of bots in misinformation propagation during an infodemic in the midst of a world-wide healthcare crisis, and re-emphasizes the need to develop an efficient method to address malicious bot behavior.

References
- WHO. Munich Security Conference
- Types, Sources, and Claims of COVID-19 Misinformation
- You followed my bot! Transforming robots into influential users in Twitter
- Social bots distort the 2016 U.S. Presidential election online discussion
- Proceedings of the 11th International Conference on Web and Social Media
- What types of COVID-19 conspiracies are populated by Twitter bots?
- The spread of low-credibility content by social bots
- Social bots in election campaigns: Theoretical, empirical, and methodological implications
- Weaponized Health Communication: Twitter Bots and Russian Trolls Amplify the Vaccine Debate
- Assessing the risks of 'infodemics' in response to COVID-19 epidemics
- BotOrNot: A system to evaluate social bots
- Detection of novel social bots by ensembles of specialized classifiers
- The spread of true and false news online
- Tweet, tweet, retweet: Conversational aspects of retweeting on Twitter
- Measuring user influence in Twitter: The million follower fallacy
- Influence and passivity in social media
- Predicting popular messages in Twitter
- Want to be retweeted? Large scale analytics on factors impacting retweet in Twitter network
- What is Twitter
- How the 5G coronavirus conspiracy theory tore through the internet - WIRED
- 77 phone masts attacked in UK due to coronavirus 5G conspiracy theory - Business Insider
- COVID-19 and 5G: A case study of platforms' content moderation of conspiracy theories
- Coronavirus conspiracy theories are dangerous - here's how to stop them spreading
- Coronavirus: Bill Gates 'microchip' conspiracy theory and other vaccine claims fact-checked - BBC News
- Bill Gates denies conspiracy theories that say he wants to use coronavirus vaccines to implant tracking devices
- New Yahoo News/YouGov poll shows coronavirus conspiracy theories spreading on the right may hamper vaccine efforts
- Quantifying sources and themes in the COVID-19 'infodemic'
- NELA-GT-2018: A large multi-labelled news dataset for the study of misinformation in news articles
- Gephi: An open source software for exploring and manipulating networks
- ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software
- You are what you eat: A social media study of food identity
- BBC. UKIP aiming to be 'radical, populist' party - Gerard Batten - BBC News
- Who is David Icke? The conspiracy theorist who claims he is the son of God
- There's A Connection Between Coronavirus and
- Facebook teams with rightwing daily caller in factchecking program
- Partisanship, Propaganda, and Disinformation: Online Media and the 2016 U.S. Presidential Election
- Even rightwing sites call out Trump administration over 'alternative facts'