Title: A Survey of Social Web Mining Applications for Disease Outbreak Detection
Authors: Bello-Orgaz, Gema; Hernandez-Castro, Julio; Camacho, David
Journal: Intelligent Distributed Computing VIII

Abstract: Social web media is one of the most important sources of big data for extracting and acquiring new knowledge. Social networks have become an important environment in which users provide information about their preferences and relationships. This information can be used to measure the influence of ideas and society's opinions in real time, which is very useful in several fields and research areas, such as marketing campaigns, financial prediction, or public healthcare, among others. Recently, research on artificial intelligence techniques applied to developing technologies for monitoring web data sources to detect public health events has emerged as a new discipline called epidemic intelligence. Epidemic intelligence systems are nowadays widely used by public health organizations as monitoring mechanisms for the early detection of disease outbreaks, reducing the impact of epidemics. This paper presents a survey of current data mining applications and web systems based on web data for public healthcare over recent years. It pays special attention to machine learning and data mining techniques and how they have been applied to these web data to extract collective knowledge from Twitter.

The web is one of the most important sources of data in the world, producing vast amounts of public information. The exponential increase in websites and online web services in recent years has created new interdisciplinary challenges for several fields of computer science, such as marketing campaigns [ ] [ ], financial prediction [ ], or public healthcare [ ] [ ] [ ], among others. Recently, research on artificial intelligence techniques applied to developing technologies for monitoring web data sources to detect public health events has emerged as a new discipline called epidemic intelligence (EI). EI can be defined as the early identification, assessment, and verification of potential public health risks [ ], and the timely dissemination of the appropriate alerts. This discipline includes surveillance techniques such as the automated and continuous analysis of unstructured free-text information available on the web from social networks, blogs, digital news media, or official sources. Surveillance systems are nowadays widely used by public health organizations such as the World Health Organization (WHO) or the European Centre for Disease Prevention and Control (ECDC) [ ]. Tracking and monitoring mechanisms for early detection are critical in reducing the impact of epidemics by enabling a rapid response. For instance, several of these systems were able to discover early events of the disease outbreak during the A(H1N1) influenza pandemic in 2009 [ ]. Traditional epidemic surveillance systems are built from virology and clinical data, which are manually collected, and these traditional systems often report emerging diseases with a delay. But in situations like epidemic outbreaks, real-time feedback and a rapid response are critical. Social web media is a profitable medium for extracting society's opinion in real time. Blogs, micro-blogs (Twitter), and social networks (Facebook) enable people to publish their personal opinions in real time, including geo-information about their current locations.
These big data, with situation- and context-aware information about users, provide a useful source for public healthcare. However, extracting information from the web is a difficult task due to its unstructured nature, high heterogeneity, and dynamically changing content. Because of this diversity in data formats, several computational methods are required for its processing and analysis [ ] (data mining, natural language processing (NLP), knowledge extraction, context awareness, etc.).

This paper presents a survey of current data mining applications and web systems based on web data for public healthcare over recent years. It pays special attention to machine learning and data mining techniques and how they have been applied to these web data to extract collective knowledge from social networks like Twitter. The rest of the paper is structured as follows: Section 2 reviews the state of the art of existing epidemic intelligence systems. Section 3 describes the different web mining techniques used to detect disease outbreaks. Section 4 provides an overview of Twitter applications for monitoring and predicting epidemics and their experimental results. Finally, the last section presents a discussion of the main findings of this survey.

Nowadays, large amounts of emergency and health data increasingly come from a wide range of web and social media sources. This information can be very useful for disease surveillance and early outbreak detection, and several public web surveillance projects in this field have emerged in recent years. One of the earliest surveillance systems is the Global Public Health Intelligence Network (GPHIN) [ ], developed by the Public Health Agency of Canada in collaboration with WHO. It is a secure web-based multilingual warning tool that continuously monitors and analyses global media data sources to identify information about disease outbreaks and other events related to public healthcare. The information is filtered for relevancy by an automated process and categorized according to a specific taxonomy; it is then analysed by Public Health Agency of Canada GPHIN officials. This surveillance system was able to detect the 2002-2003 outbreak of severe acute respiratory syndrome (SARS).

BioCaster [ ] is an operational ontology-based system for monitoring online media data. The system is based on text mining techniques for detecting and tracking infectious disease outbreaks through the search for linguistic signals. It continuously analyses documents reported from numerous RSS feeds, Google News, WHO, ProMED-mail, and the European Media Monitor, among other providers. The extracted texts are classified for topical relevance and plotted onto a Google Map using geo-information. The system consists of four main stages: topic classification, named entity recognition (NER), disease/location detection, and event recognition. In the first stage, texts are classified into relevant or non-relevant categories using a trained naive Bayes classifier. Then, the relevant document corpus is searched for entities of interest from concept types based on the BioCaster ontology [ ], covering diseases, viruses, bacteria, locations, and symptoms.
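As an illustration of the first stage, the following is a minimal sketch of topical relevance filtering with a naive Bayes classifier, assuming scikit-learn is available. The training snippets and labels are invented placeholders, not the actual BioCaster corpus or implementation.

```python
# Minimal sketch of stage-1 topical relevance filtering with naive Bayes.
# The tiny labeled corpus below is an invented placeholder.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "hospital reports cluster of avian influenza cases among poultry workers",
    "officials confirm cholera outbreak after flooding in the region",
    "local team wins championship in dramatic overtime finish",
    "stock markets rally as quarterly earnings beat expectations",
]
train_labels = ["relevant", "relevant", "non-relevant", "non-relevant"]

# Bag-of-words (unigrams and bigrams) feeding a multinomial naive Bayes model
relevance_filter = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
relevance_filter.fit(train_texts, train_labels)

new_doc = "ministry investigates new cholera cases in coastal villages"
print(relevance_filter.predict([new_doc])[0])  # "relevant": shares outbreak vocabulary
```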
The HealthMap project [ ] is a global disease alert map that uses data from different sources, such as Google News, expert-curated discussion such as ProMED-mail, and official organization reports such as those of the World Health Organization (WHO) or Eurosurveillance. It is an automated real-time system that monitors, organizes, integrates, filters, visualizes, and disseminates online information about emerging diseases. Another system that collects news from the web related to human and animal health and plots the data on a Google Maps mashup is EpiSPIDER [ ]. This tool automatically extracts infectious disease outbreak information from several sources, including ProMED-mail and medical websites, and it is used as a surveillance system by public healthcare organizations, several universities, and health research organizations. Additionally, the system automatically converts the topic and location information of the reports into RSS feeds. Another public health surveillance system used by a public health organization (the European Centre for Disease Prevention and Control) is MedISys [ ], which monitors human and animal infectious diseases, as well as chemical, biological, radiological, and nuclear (CBRN) threats, in open-source media. MedISys automatically collects articles concerning public health in various languages from news sources, which are classified according to pre-defined categories. Users can display world maps in which event locations are highlighted, as well as statistics on the reporting about diseases, countries, and combinations of them, and can also apply filters for language, disease, or location.

A specific and extensive application of predictive analytic techniques to public health is the monitoring of influenza through web and social media. Google Flu Trends [ ] uses Google search data to estimate flu activity up to two weeks earlier than traditional reporting, giving early detection of disease activity. This web service correlates search term frequency with influenza statistics reported by the Centers for Disease Control and Prevention (CDC), and it enables a quicker response in a potential influenza pandemic, thus reducing its impact. Internet users perform search queries [ ] and post entries in blogs using terms related to influenza illness, such as its diagnosis and symptoms. An increase or decrease in the number of illness-related searches and blog posts reflects a higher or lower potential outbreak focus for influenza illness and can therefore be used to monitor it. Finally, all the systems mentioned, together with their main characteristics, are listed in Table 1.

The problem of detecting and tracking epidemic outbreaks through social media can be defined as the task of extracting relevant knowledge about real-world epidemics from a stream of textual or multimedia data from social media. Web mining is the application of data mining techniques to discover and retrieve useful knowledge from web documents and services. The application of these techniques to knowledge extraction therefore provides a better use and understanding of the data space in the biomedical and healthcare domain [ ]. There are several health data sources that are very useful for detecting and preventing new outbreaks of different diseases. Social web media and websites provide a large amount of useful data for this purpose. Other important data sources are search engines such as Google and Yahoo! [ ] [ ].
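To make the underlying idea concrete, the sketch below correlates weekly flu-related query volume with CDC influenza-like-illness (ILI) rates. The numbers are invented placeholders, not Google or CDC data.

```python
# Sketch: correlate weekly flu-query volume with CDC ILI rates.
# Both series below are invented placeholder values.
import numpy as np

query_volume = np.array([120, 150, 210, 340, 520, 610, 480, 300])  # weekly query counts
cdc_ili_rate = np.array([1.1, 1.4, 2.0, 3.2, 4.9, 5.6, 4.4, 2.8])  # % of ILI visits

r = np.corrcoef(query_volume, cdc_ili_rate)[0, 1]
print(f"Pearson correlation: {r:.2f}")  # a high r suggests queries track ILI activity
```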
In this case, the objective is to detect specific searches involving terms that indicate influenza-like illness (ILI) through the keywords of the queries performed. The complexity lies in interpreting the search context of the query, as the user may query about a particular drug, symptom, or illness for a variety of reasons. Finally, ProMED-mail [ ] is also a widely used data source for disease outbreak detection. It is a human network of expert volunteers operating 24/7 as an official program of the International Society for Infectious Diseases. Its volunteers monitor global media reports and in many cases report outbreak disease alerts faster than WHO reports.

Text mining techniques have been applied to biomedical text corpora for named entity recognition, text classification, terminology extraction, and relationship extraction [ ]. These methods are human language processing algorithms that aim to convert unstructured textual data from large-scale collections into a specific format, filtering it according to need. Once the data have been extracted from social media sites (RSS feeds, the web, social networks, ProMED-mail, search engines, etc.), the next stage is to apply text analysis methods for trend detection, identifying potential sources of disease outbreaks. These methods can be used to detect words related to diseases or their symptoms in published texts [ ]. But this goal can be difficult because the same word can refer to different things depending on context. Furthermore, a specific disease can have multiple associated names and symptoms, which increases the complexity of the problem. Ontologies can help automate human understanding of key concepts and the relations between them, allowing a higher level of filtering accuracy to be achieved. Biomedical ontologies contain lists of terms and their human definitions, which are given unique identifiers and classified into classes with common properties according to the specific domain treated. In the EI domain it is necessary to identify and link term classes such as disease, symptom, and species in order to detect potential disease foci. Currently, various ontologies are available that contain all the necessary biomedical terms. For example, the BioCaster Ontology (BCO) [ ] is written in the OWL semantic web language to support automated reasoning across terms in multiple languages.

A new unsupervised machine learning approach to detecting public health events is proposed in the work of Fisichella et al. [ ], which can complement existing systems since it allows public health events (PHE) to be identified even when no matching keywords or linguistic patterns can be found. This approach defines a generative model for predictive event detection from documents by modeling features based on trajectory distributions. Discovering the time and location of a text is the value added by high-quality EI systems. In practice, location names are often highly ambiguous because geo-temporal disambiguation is difficult and because of the variety of ways in which cases are described across different texts. The work of Keller et al. [ ] reviews these issues for epidemic surveillance and presents a new method for identifying a disease outbreak location based on neural networks trained on surface feature patterns in a window around geo-entity expressions.
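The following is a minimal sketch of ontology-assisted term matching, assuming a toy synonym dictionary; it is not the BCO itself, and a real system would use proper tokenization and disambiguation rather than the naive substring test shown here.

```python
# Sketch: map surface terms (including synonyms) to canonical disease concepts.
# The toy ontology is an invented placeholder, not the BioCaster Ontology (BCO).
ontology = {
    "influenza": {"flu", "influenza", "grippe"},
    "dengue": {"dengue", "breakbone fever"},
}

def detect_concepts(text: str) -> set[str]:
    lowered = text.lower()
    # Naive substring matching; real systems tokenize and resolve ambiguity
    return {concept for concept, synonyms in ontology.items()
            if any(term in lowered for term in synonyms)}

doc = "Clinics report a spike in breakbone fever alongside seasonal flu cases."
print(detect_concepts(doc))  # {'dengue', 'influenza'} (set order may vary)
```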
Finally, a different solution for outbreak detection is presented in the paper by Leskovec et al. [ ], where the problem is modelled as a network in order to detect the spread of a virus or disease as quickly as possible. They present a new methodology for selecting nodes to detect outbreaks of dynamic processes spreading over a graph. This work shows that many objective functions for detecting outbreaks in networks, such as detection time, likelihood, and population affected, are submodular. This means that, for instance, reading the first few blogs provides more new information than reading an additional blog after many have already been read. They use this property to develop an efficient approximation algorithm (CELF), which achieves near-optimal solutions and is about 700 times faster than a simple greedy algorithm.

The increasing popularity and use of micro-blogging services such as Twitter have recently created a valuable new data source for web-based surveillance because of their message volume and frequency. Twitter users may post about an illness, and their relationships in the network can provide information about whom they could have been in contact with. Furthermore, user posts retrieved from the public Twitter API can come with GPS-based location tags, which can be used to detect potential disease outbreaks in a health surveillance system. Several works have recently appeared showing the potential of Twitter messages to track and predict disease outbreaks. The work of Ritterman et al. [ ] focuses on using a prediction market to model public belief about the possibility that the H1N1 virus would become a pandemic. In order to forecast the future prices of the prediction market, they used the support vector machine algorithm to carry out regression. A document classifier to identify relevant messages is presented in the paper by Culotta et al. [ ]. In this work, Twitter messages related to flu were collected over several weeks using keywords such as flu, cough, sore throat, or headache. Several classification systems based on different regression models for correlating these messages with CDC statistics were then compared, finding that the best model (a simple regression model) achieved the highest correlation. Aramaki et al. [ ] present a comparative study of various machine-learning methods for classifying tweets related to influenza into two categories, positive or negative. Their experimental results show that an SVM model using a polynomial kernel achieves the highest accuracy (by F-measure) and the lowest training time. A novel real-time surveillance system to detect cancer and flu is described in [ ]. The proposed system continuously extracts text related to the two specific diseases from Twitter using the Twitter streaming API and applies spatial, temporal, and text mining to discover disease-related activities. The output of the three models is summarized as pie charts, time-series graphs, and US disease activity maps on the project website. This system can be useful not only for early prediction of disease outbreaks, but also for monitoring the distribution of different cancer types and the effectiveness of the treatments used. Well-known regression models are evaluated on their ability to assess disease outbreaks from tweets in Bodnar et al. [ ]. Regression methods such as linear, multivariable, and SVM regression are applied to the raw count of tweets that contain at least one of the keywords related to a specific disease, in this case "flu".
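The sketch below illustrates the general approach shared by several of these studies: regress weekly counts of flu-keyword tweets against official ILI statistics. The counts, rates, and model choices are illustrative placeholders, not the data or exact models of the cited works.

```python
# Sketch: fit simple and SVM regressions of weekly flu-tweet counts on ILI rates.
# All values below are invented placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

weekly_tweet_counts = np.array([[80], [95], [160], [240], [400], [460], [350], [210]])
ili_rate = np.array([1.0, 1.3, 1.9, 3.0, 4.7, 5.4, 4.2, 2.6])  # % of ILI visits

linear = LinearRegression().fit(weekly_tweet_counts, ili_rate)
svm = SVR(kernel="rbf", C=10.0).fit(weekly_tweet_counts, ili_rate)

print("linear R^2:", round(linear.score(weekly_tweet_counts, ili_rate), 2))
print("SVR R^2:", round(svm.score(weekly_tweet_counts, ili_rate), 2))
```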
The results confirmed that even when using irrelevant tweets and randomly generated datasets, regression methods were able to assess disease levels comparatively well. Finally, a summary of all the systems mentioned and the machine learning techniques used is listed in Table 2. It can be noticed that most of the works use regression models and are usually focused on detecting influenza outbreaks.

All the systems and solutions presented have demonstrated the successful and beneficial use of artificial intelligence techniques applied to extracting and acquiring new knowledge for public healthcare purposes. The main challenge for these systems is to interpret the search context of a particular query or document, because a user can query about a particular drug, symptom, or illness for a variety of reasons. This goal can be difficult because the same word can refer to different things depending on context. Furthermore, a specific disease can have multiple names and symptoms related to it, which increases the complexity of the problem. Therefore, developing strategies for reducing false alarms and decreasing the percentage of irrelevant events detected by epidemic systems is an important issue for future work and research in the field. Additionally, identifying the time and location of messages adds value by increasing the quality of the detection of possible new disease outbreaks. But in practice, location names are often highly ambiguous because geo-temporal disambiguation is difficult and because of the variety of ways in which cases are described across different texts. Several recent works show the potential of Twitter to track and detect disease outbreaks. These works demonstrate that there is health evidence in social media that can be detected. However, there can be complications regarding possible incorrect predictions because of the huge amount of existing social data compared with the small amount of relevant data related to potential disease outbreaks. Therefore, it is necessary to carefully test and validate all the models and methods used.
References:
- Twitter catches the flu: detecting influenza epidemics using Twitter
- Predicting the future with social media
- Extracting collective trends from Twitter using social-based data mining
- Validating models for disease detection using Twitter
- Surveillance sans frontieres: internet-based emerging infectious disease intelligence and the HealthMap project
- Google Trends: a web-based tool for real-time surveillance of disease outbreaks
- AI for global disease surveillance
- Scalable influence maximization for prevalent viral marketing in large-scale social networks
- A survey of current work in biomedical text mining
- Uncovering text mining: a survey of current work on web-based epidemic intelligence
- BioCaster: detecting public health rumors with a web-based text mining system
- An ontology-driven system for detecting global health events
- Towards detecting influenza epidemics by analyzing Twitter messages
- Detecting health events on the social web to enable epidemic intelligence
- Detecting influenza epidemics using search engine query data
- The landscape of international event-based biosurveillance
- Social web mining and exploitation for serious applications: technosocial predictive analytics and related technologies for public health, environmental and national security surveillance
- Use of unstructured event-based reports for global infectious disease surveillance
- Automated vocabulary discovery for geo-parsing online epidemic intelligence
- Nowcasting events from the social web with statistical learning
- Real-time disease surveillance using Twitter data: demonstration on flu and cancer
- Cost-effective outbreak detection in networks
- MedISys: medical information system
- The Global Public Health Intelligence Network and early warning outbreak detection
- Epidemic intelligence: a new framework for strengthening disease surveillance in Europe
- Using internet searches for influenza surveillance
- Using prediction markets and Twitter to predict a swine flu pandemic
- ProMED-mail: an early warning system for emerging diseases
- Detecting and tracking disease outbreaks by mining social media data

Title: Exploring the Components of Brand Equity amid Declining Ticket Sales in Major League Baseball
Authors: Merkle, Adam C.; Hessick, Catherine; Leggett, Britton R.; Goehrig, Larry; O'Connor, Kenneth
Journal: Journal of Marketing Analytics

Abstract: Ticket sales for Major League Baseball (MLB) games are decreasing annually, yet baseball fans have increased team interest and following in other ways. Instead of following from the stands or on television, fans are choosing to follow, for example, via social media. The emerging unified theory of brand equity offers a framework to examine the mediating role of attendance and local television and the moderating role of Twitter followers on the relationships between MLB marketing assets (MA) and team financial performance. Publicly available secondary data are analyzed with PLS-SEM. The results indicate that attendance and local TV partially mediate the relationship between non-seasonal MA and team financial performance, whereas attendance and local TV fully mediate the relationship between in-season MA and team financial performance. Furthermore, the number of Twitter followers for each MLB team moderates various relationships within the MLB brand equity research model.
Findings suggest MLB sales and marketing professionals should design ticket sales initiatives that not only promote attendance in the short term but, more importantly, build upon non-seasonal sources of team brand equity for the long term. Electronic supplementary material: the online version of this article contains supplementary material, which is available to authorized users.

Picture yourself sitting down in a favorite chair with a cold beverage on a beautiful, sunny day without a cloud in sight. Yet when turning on the television to watch your favorite team, you notice something odd. As the cameras pan across the seats in the ballpark, there is no roaring crowd, and nearly half the stadium is empty. If not at the game, where is everyone? Have they, like you, chosen to watch the game in the relative comfort of their residence? Unfortunately, many sports teams, particularly in Major League Baseball (MLB), are asking the same question. Where have all the fans gone? Despite the efforts of sales and marketing professionals, ticket sales and game attendance numbers fell for the fourth consecutive year, leading sports pundits and long-term baseball loyalists to wonder if we are entering an early-stage decline of the sport (DeMause; Kelly; Love). One journalist recently suggested the attendance decrease may be the result of a culmination of factors, including ticket prices, a sophisticated secondary ticket market, and the poor performances of teams (Brown).

This study seeks to uncover the practical, marketing-related antecedents of game attendance and local television (TV) viewership within the unified theory of brand equity (Davcik et al.). Specifically, we examine marketing assets for MLB teams, their relationship with stakeholders, and the relative influence of those marketing assets, game attendance, and local TV viewership on team financial performance. These relationships may have key implications for the sales and marketing efforts of MLB teams. For example, if declining ticket sales are more than offset by other sources of rising revenue, then sales and marketing teams need not be concerned with small ticket sales decreases. However, if declining ticket sales have a large impact on team revenue or are the result of an eroding fan base, more effort should be placed on driving game attendance. A marketing analytics approach may be helpful in taking the first step to explore these relationships. Organizations are increasingly making decisions based on in-depth statistical analysis of data (Erevelles et al.). This journal defines marketing analytics as "the study of data and modeling tools used to address marketing resource and customer-related business decisions" (Iacobucci et al.). Marketing analytics offers hope that, in the future, we can better understand and perhaps predict trends through the power of data analysis (Wedel and Kannan). In this study, we take a marketing analytics approach using only publicly available secondary data to organize and evaluate the variables of interest.

The findings contribute to the marketing literature in three ways. First, we show that a data-driven marketing analytics approach can be useful in the ongoing development of the unified theory of brand equity. Second, we present and test an empirical model of MLB brand equity and establish the relative influence of non-seasonal marketing assets (i.e., franchise age, number of Hall of Famers, metro population, etc.)
and in-season marketing assets (i.e., at-bats, runs, hits, etc.) on attendance, local TV viewership, and team financial performance. We find that non-seasonal marketing assets are more influential in every relationship. Third, we find evidence that alternative marketing assets, such as Twitter, are associated with changes in multiple relationships within the brand equity model. One change involves the weakening of the relationship between non-seasonal MA and attendance, effectively resulting in a lower rate of increase in game attendance for those teams with higher numbers of Twitter followers.

Academics and practitioners alike understand the value of a well-known brand to the bottom line. Given the longevity of MLB teams, such as the Chicago Cubs, New York Yankees, and Boston Red Sox, brand equity plays a vital role as an intangible asset of the organization. Brand equity has traditionally been examined from one of three perspectives: company-based (i.e., marketing assets), customer-based (i.e., stakeholder value), or financially based (i.e., financial performance) (Keller and Lehmann). First, marketing assets, both tangible and intangible, help drive customer value and deliver advantages for the firm, thus growing brand equity (Srivastava et al.). Yoo et al. noted that brand equity can be affected by any marketing endeavor, such as advertising, price promotions, or distribution, since each new marketing action adds to the organization's previous investments in the brand. Therefore, it is important to use all available marketing assets wisely so as not to erode brand equity. Second, the customer-based perspective of brand equity, first introduced by Aaker, focuses on value creation for both the firm and the customer. Recent research has broadened the customer-based perspective to include value creation for all parties, rather than just customers at the expense of shareholders (Freeman; Jones; Laplume et al.; Hult et al.). Labeled stakeholder value, this approach may help lay the foundation for long-term growth. One avenue to increase stakeholder value is through the strength of the brand. Research has shown the multitude of benefits of having a strong brand, such as brand awareness and recall (Johnson and Russo; Kent and Allen; Shamsollahi et al.), inclusion in consideration sets (Lane and Jacobson; Terech et al.), and brand performance (Chaudhuri and Holbrook; De Vries and Carlson; Hoeffler and Keller). Consequently, brand strength can help increase stakeholder value, thus increasing brand equity. Lastly, financial-based brand equity focuses on the worth of the brand. Like any asset, brand equity's value influences decision making and subsequently affects the bottom line. Unfortunately, research on how brand equity is measured lacks consistency. Simon and Sullivan developed both a micro and a macro approach for brand equity valuation using stock market values. The micro approach examines brand equity at the individual brand level to assess the impact of specific marketing decisions, whereas the macro approach evaluates at the firm level using an objective, mathematical formula. Other measures include price premium (Aaker), customer lifetime value (Gupta and Lehmann), and momentum accounting (Farquhar and Ijiri). However, with each measure having its own advantages and drawbacks, relying on one measurement tool could result in a partial picture of the full value of the brand.
Therefore, as Ambler and Barwise concluded, companies would be wise to use multiple measurements in conjunction with other brand equity measures, such as marketing assets, to obtain a truer value. In the literature, brand equity is viewed differently depending on the chosen perspective. Recognizing this gap, Davcik et al. proposed the emergence of a unified theory in which all three perspectives, company-based, customer-based, and financial-based, are collectively considered in evaluating brand equity. With each perspective potentially interacting with the others, the authors believe that viewing brand equity through one perspective loses the full extent of that interaction. Thus, brand equity should ideally be evaluated using all three perspectives. Davcik et al. are not alone in their beliefs, as additional research has supported the need for an expanded measure of brand equity (Chatzipanagiotou et al.).

In this paper, we analyze MLB data involving measures of marketing assets, stakeholder value, and financial performance in the manner described by the emerging unified theory of brand equity. We distinguish between two marketing assets (MA), non-seasonal MA (e.g., city population, playoff history, Hall of Fame record) and in-season MA (e.g., team performance), and suggest that each underlying asset influences stakeholder value, represented by game attendance and local TV viewership. It is known that MLB game attendance has declined in recent years (DeMause; Kelly; Love); however, there is a lack of research investigating why. Furthermore, the interaction between marketing assets, game attendance, and the effects on financial performance (e.g., revenue and valuation) has yet to be explored. Simultaneously analyzing the data across all three areas represented in the unified theory of brand equity captures the interaction between each area. Based on our findings, we offer support for the emerging unified theory of brand equity, discuss implications associated with low attendance, and suggest how salespeople could further the goal of building team brand equity.

The unified theory of brand equity calls for a blended view of the brand equity mix involving the three primary relationships of marketing assets, stakeholder value, and firm value. Recent sports marketing studies have investigated company-based, customer-based, or financially based brand equity across numerous professional sports, including the National Basketball Association (NBA), Major League Soccer (MLS), and the National Football League (NFL). However, these studies tend to focus on only one or two aspects of brand equity. For example, a study across six teams in the NBA provided evidence that a team's company-based marketing assets, called marketplace characteristics, influence measures of fan social identification, leading to changes in customer-based brand equity (Watkins). In MLS, a customer-based brand equity measure of brand association by fans included measures of teams' marketing assets such as on-field performance, team history, coaching, and management, among others (Biscaia et al.). Lastly, a study investigating brand equity in the NFL offers evidence that teams' ability to construct new stadiums is an influential marketing asset related to revenue, which is a component of financial-based brand equity (Abreu and Spradley). This research, however, investigates a brand equity model in MLB according to the unified theory of brand equity.
Below, we briefly discuss firm value, marketing assets, and stakeholder value as they relate to MLB. Firm value is reflected by team financial performance. Two primary marketing assets are identified: non-seasonal factors and in-season team performance. Two measures for assessing stakeholder value are game attendance and local TV viewership. Lastly, we evaluate the impact of Twitter followers as a key social media development within the brand equity model for MLB.

Team financial performance is considered a subjective measure because it can be approached from various perspectives, including accounting measures or marketing measures. An accounting approach would involve measures of how well a team uses its assets, otherwise known as return on assets. Marketing and sales professionals may focus instead on forward-looking metrics such as ticket sales revenue or financial valuations. These are marketing measures. Marketing measures are considered by some to be a more accurate picture of firm value because they are often less susceptible to accounting changes and manipulations (McGuire et al.). Financial performance comprises numerous sources beyond ticket sales from game attendance. For example, television contracts, negotiated by each team, contribute a significant amount of revenue. According to Forbes, the Yankees receive an annual rights fee and hold an equity stake in their regional network, together yielding substantial annual television revenue, while the Los Angeles Dodgers, with their own equity stake, bring in a comparable sum per year. Smaller markets, such as San Diego, also partake in revenue-generating television contracts, as the Padres received a sizeable rights fee (Forbes). A second source of income is revenue sharing, stemming from the collective bargaining agreement between the players' union and the owners. Revenue sharing resembles a redistribution tax that shifts money from larger markets to smaller markets among Major League Baseball teams (Rockerbie and Easton). Under the current collective bargaining agreement, each team contributes a fixed percentage of its revenue into a pool, which MLB divides equally, with 1/30th going back to each organization (Rockerbie and Easton). In one recent year alone, each club received a distribution greater than some teams' entire payroll. Thus, revenue sharing is a cost for large-market teams and a revenue source for smaller markets. The significance of these revenue streams is such that Forbes regularly accounts for them at the team level in its reporting and analysis of valuations (Forbes).

In the present study, team financial performance is a blended set of marketing measures and is the result of the overall effectiveness of marketing assets and measures of stakeholder value. This view aligns with other sports-related brand equity frameworks in which ticket sales and TV contracts are specifically conceptualized as a consequence of brand equity (Kerr and Gladden). However, ticket sales and TV contracts simultaneously reflect stakeholder value, wherein greater numbers of spectators indicate increased value delivered to customers. Thus, team financial performance is the result of (1) the effectiveness of a team's use of marketing assets and (2) the level of stakeholder value delivered to its customers. Some sports team brand elements extend beyond the regular season in their ability to create excitement and increase brand equity.
In this study, we consider several non-seasonal variables related to attendance, local TV viewership, and team financial performance. These variables include population, franchise (club) age, the frequency and quantity of playoff appearances, and the number of Hall of Fame members per team. First, population, derived from each team's metropolitan statistical area (MSA), plays an essential role in baseball attendance since larger populations increase the number of potential customers (Bradbury). To explain baseball attendance, Government and the Sports Business offered a multi-variable equation in which factors such as ticket prices, number of star players, and competition from other sports are multiplied by population (Noll). When analyzing historical MLB attendance, Baade and Tiehen concurred that population can significantly influence many factors, including attendance, with this statement: "though Noll's process of multiplication potentially creates a high level of multicollinearity, the fact remains that population merits inclusion while researching attendance." Other variables, such as the number of playoff appearances, cumulative World Series wins, and club age, reflect the history and prestige of a baseball club. We selected these variables as they can influence brand loyalty, which can, in turn, affect in-person viewing, at-home viewing, and financial performance. For example, the Chicago Cubs are known for their loyal fans, who continuously support the team through merchandise purchases and game attendance regardless of its performance (Bristow and Sebastian). Zimmer examined World Series wins and found that the average benefit afforded to MLB teams by a playoff appearance is a substantial increase in attendance (Zimmer). We hypothesize that these various elements combine in the formation of a non-seasonal marketing asset influencing attendance, local TV viewership, and team financial performance. Higher levels of population, playoff success, and Hall of Fame caliber players will result in larger numbers of spectators at the ballpark and watching on television. Increased numbers of spectators will generate larger ticket sales. Higher sustained TV viewership levels will allow teams to increase the amount of revenue generated from their television contracts. We therefore hypothesize the following relationships:

H1a: Non-seasonal marketing assets are positively related to attendance.
H1b: Non-seasonal marketing assets are positively related to local TV viewership.
H1c: Non-seasonal marketing assets are positively related to team financial performance.

On-the-field team performance can be broken into two categories, offense and defense. Which category of performance is more influential in driving stadium attendance and fan viewership? An econometric model recently validated certain measures of offense, including home-team slugging percentage and runs per game, as statistically significant determinants of attendance (Lee). While a team's defense is vital to winning games, home runs, fly balls, and RBIs generate excitement on the ballfield. Baseball teams that recognize the benefits of a productive offense likely view it as a beneficial marketing asset for ticket sales. The implementation of the designated hitter rule offers an example of how an exciting team offense can improve the game-viewing experience.
In 1973, the American League adopted the designated hitter (DH) rule on the belief that increased offensive output would increase attendance (Domazlicky and Kerr). Designated hitters allow a skilled batter to hit in the line-up instead of the pitcher, while the pitcher remains in the game during the defensive (pitching) cycle. The DH rule had two major implications. First, DHs are typically better hitters than National League pitchers, who were required to bat, thus providing American League teams an advantage. Second, the DH may affect the opposing team's pitching strategy, because the DH represents the potential for increased numbers of intentional walks, base hits, and home runs. In the seven years following the introduction of the DH rule, the American League experienced a marked increase in runs per game relative to the seven years before the DH (Domazlicky and Kerr). The National League, which did not introduce the DH, saw only a modest rise over the same years (Domazlicky and Kerr). A productive offense has been linked to positive attendance when measured over the course of an entire season (Tainsky). During the steroid era, from the mid-1990s to the early 2000s, team offense became a central draw for MLB attendance, as reflected by substantial increases in both ticket prices and attendance (Koslosky). The exciting atmosphere created by extra-base hits and home-run titles resulted in higher levels of ballpark attendance and likely increased television viewership as well. Runs per game also rose markedly over this period (Koslosky). Moreover, others find that team offensive output has been a significant determinant of attendance since the inception of Major League Baseball (Ahn and Lee). When a team is having a successful season, that success is often attributed to increases in seasonal offensive output. An increase in offensive output (i.e., in-season marketing assets) can be measured in many ways, such as singles, doubles, triples, sacrifice flies, home runs, total at-bats, and runs batted in, or through combined ratios like batting average, on-base percentage, and slugging percentage (Ahn and Lee; Lee); a sketch of these derived ratios appears after the hypotheses below. Thus, increases in seasonal offensive output are associated with increased fan attendance and television viewership. Increases in attendance and television viewership will likely generate increases in team financial performance via higher ticket sales and the potential for increased advertising demand. For these reasons, we suggest that seasonal offensive output represents a distinct in-season marketing asset that is positively related to other elements of MLB brand equity, as hypothesized below:

H2a: In-season marketing assets are positively related to attendance.
H2b: In-season marketing assets are positively related to local TV viewership.
H2c: In-season marketing assets are positively related to team financial performance.
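As referenced above, here is a small sketch of the derived offensive ratios computed from season totals; the sample figures are illustrative, not actual MLB statistics.

```python
# Sketch: standard derived offensive ratios computed from season totals.
# The sample inputs are invented placeholders, not real MLB data.
def batting_average(hits: int, at_bats: int) -> float:
    return hits / at_bats

def on_base_pct(hits: int, walks: int, hbp: int, at_bats: int, sac_flies: int) -> float:
    return (hits + walks + hbp) / (at_bats + walks + hbp + sac_flies)

def slugging_pct(singles: int, doubles: int, triples: int, home_runs: int, at_bats: int) -> float:
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats

hits, at_bats = 1400, 5500  # 900 singles + 280 doubles + 20 triples + 200 home runs
print(f"AVG: {batting_average(hits, at_bats):.3f}")            # 0.255
print(f"OBP: {on_base_pct(hits, 550, 60, at_bats, 45):.3f}")   # 0.327
print(f"SLG: {slugging_pct(900, 280, 20, 200, at_bats):.3f}")  # 0.422
```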
In this study, we assess stakeholder value according to various measures of game attendance and local TV viewership. Attendance is an appropriate way to evaluate stakeholder value because it captures the willingness of fans to pay and travel, an atmosphere of excitement for players to enjoy while competing, and a financial return to ownership in the form of sales. Though MLB attendance has continually declined in recent years, detailed information about the timing, universality, and proportion of the trend is scarce. Likewise, predictive research about the potential consequences of the attendance decline is difficult to find. Furthermore, from the perspective of club revenue and financials, it is unclear whether the downward trend in attendance is a problem. Another complication is that the proportion of attendance increase, or decline, is not regularly reported in studies related to the effects of MLB attendance (Lemke et al.). Trends show the current attendance decline began after a period in which MLB attendance was reportedly rising (Beckman et al.). However, from the perspective of the business of baseball, we need to begin to understand whether and how strongly attendance increases (decreases) are related to revenue and valuation changes. For example, if changes in club revenue are greater (lesser) than an attendance increase (decrease), this could indicate that attendance is becoming a less critical issue for the sport.

Attendance is primarily driven by key non-seasonal and in-season marketing assets. Regular baseball enthusiasts may respond to non-seasonal marketing assets out of brand loyalty or social habit and choose to visit the ballpark. Newer fans, however, may instead respond to in-season marketing assets, for example, when their local team is having a successful year competing for a division title or heading toward playoff competition. Whether responding to non-seasonal MA or in-season MA, fans choose to attend a game and receive some level of stakeholder value. When fans spend money on tickets, they generate firm value for the ball club. Ticket sales are not the only source of revenue generated by attendance. Fans also purchase concessions and memorabilia. Some portion of gate sales, concessions, and memorabilia is retained by local authorities, and the remaining amounts are sent to the team. For these reasons, a portion of team revenue generated by the effectiveness of a team's marketing assets is indirectly paid to the team via ticket sales, whereas other portions of revenue from attendance are directly related to additional sales that occur during game attendance. Therefore, as marketing assets effectively drive attendance increases, we expect to observe a corresponding increase in team revenues and hypothesize:

H3a: Attendance is positively related to team financials.
H3b: Attendance partially mediates the positive relationship between non-seasonal marketing assets and team financials.
H3c: Attendance partially mediates the positive relationship between in-season marketing assets and team financials.

Within the theoretical framework of the unified theory of brand equity, local TV viewership is another measure of stakeholder value. Local TV viewership represents an alternative to game attendance by overcoming fan barriers such as limited stadium capacity and geographic boundaries. Regional sports networks (RSNs) obtain exclusive contract rights to show local MLB games on their networks (Kunz). These contracts allow RSNs to earn additional revenue through exclusive advertising sales, with some portion of that revenue offsetting the cost of the contract paid by RSNs to the local MLB team. Overall TV viewership in MLB and other professional sports was still growing through the period examined by Chung et al. However, MLB has seen a continued decline in total local TV viewership in recent years, according to Nielsen ratings data (Forbes). RSN contracts generally include a fixed and a variable amount and may last for many years (Kunz).
As such, there is a floor on any revenue decrease to the local MLB team if local TV viewership declines. However, a team may have some ability to increase local TV revenue based on various factors of team success if it chooses to structure an agreement with that benefit. RSN contracts vary in size but may be generally related to a team's non-seasonal marketing assets. That is, teams with a large and successful franchise tend to enjoy larger RSN contracts in raw dollar amounts. Devoted fans do not always choose to attend games at the stadium; instead, they may choose to watch the game on television. Moreover, research about customer demand for MLB game attendance found that fans' desire to watch a game may change during the game itself (Chung et al.). This change happens as a result of the uncertainty of outcome hypothesis (UOH), whereby fan interest in the game increases as the game draws closer to an end with an unexpected winner or loser (Neale). Thus, fans with no initial interest in a ballgame may still monitor game progress for an interesting twist at the end. It may not be feasible to get to the ballpark during an exciting game, so fans may instead choose to watch the end of the game on television. This scenario helps illustrate why a team's in-season marketing assets are related to local TV viewership. Seasonal success could generally result in higher levels of TV viewership. The effectiveness of a team's marketing assets is likely indirectly related to team financial performance through local TV viewership for different reasons. Non-seasonal MA should influence team financial performance via local TV because large metropolitan areas like New York and LA host the Yankees and the Dodgers, who possess some of the largest RSN contracts. In-season MA might generate other portions of revenue through local TV, for example, if it results in higher-than-expected local TV viewership and the contract rewards the MLB club financially for this success. For these reasons, we expect positive direct and indirect relationships to exist between MLB marketing assets, local TV viewership, and team financial performance:

H4a: Local TV viewership is positively related to club financials.
H4b: Local TV viewership partially mediates the positive relationship between non-seasonal marketing assets and team financials.
H4c: Local TV viewership partially mediates the positive relationship between in-season marketing assets and team financials.

Social media tools are an extension of a marketing program for sports teams (Williams et al.). Others write that fan engagement with social media works to fulfill certain fan motivations, including passion, hope, esteem, and camaraderie (Stavros et al.). If social media is a marketing program, then it likely classifies as a marketing asset under the unified theory of brand equity. However, if social media works to fulfill fan motivations, it could instead be classified as a form of stakeholder value. We adopt the view that Twitter acts as a bridge between marketing assets and stakeholder value but classify this social media forum primarily as a marketing asset. This classification is appropriate because Twitter is a communication tool through which information can be quickly shared among trusted networks of people who hold common interests, such as brand communities (McKee). Twitter following and usage for MLB rose dramatically over the past decade.
Research about social media usage by baseball enthusiasts broadly categorized two different types of fans, called lurkers and posters (Williams et al.). Lurkers follow their team but do not post very often, yet they are more likely to attend games. Posters regularly comment on their social media feeds but exhibit fewer tendencies toward game attendance. The study concluded that posters exhibit increased team-related social media usage, yet they are also less likely to attend games than lurkers, who post on social media more infrequently. These findings could indicate that devoted fans express their devotion over social media, but not necessarily with their wallets in the form of ticket purchases. Perhaps, instead, they choose to watch on television. However, other explanations for the differential game attendance habits of various types of social media followers are possible. One study compared the underlying structure and social network distances between Twitter fans of the New York Mets and fans of the New York Yankees. Yankees fans have many more followers on Twitter, but a far greater dispersion and social distance between one another than Mets fans (Watanabe et al.). One possible explanation for this finding is that greater numbers of Yankees fans may be found outside the state of New York relative to Mets fans, who are more closely clustered. If true, it is possible that many Yankees fans who live geographically far from their team are simply unable to support the team at the ballpark and instead do so by following on Twitter or watching on television.

Two different explanations, based on the prior research above, indicate that increased levels of social media usage and larger followings on Twitter could be associated with lower likelihoods of game attendance by fans. The first is that Twitter posters fulfill their fan motivations on social media, which would otherwise require a trip to the ballpark. The second is that teams with very large followings may also face significant geographic barriers to game attendance, with fans instead rooting for their team on Twitter and perhaps watching the game on television. As Twitter fans become instantly aware of an interesting game from their feed, the uncertainty of outcome hypothesis holds that fans are more likely to want to watch and may look for a television in order to view the game. Lastly, Twitter likely functions as a traditional marketing asset, which can strengthen the existing relationships between fans and the financial performance of the team regardless of geographic location. This can happen, for example, by inducing the purchase of memorabilia or other MLB-related souvenirs. We hypothesize that increases in the number of Twitter followers change some of the relationships within the MLB brand equity model:

H5a: Higher levels of Twitter followers will weaken the positive relationship between non-seasonal marketing assets and attendance.
H5b: Higher levels of Twitter followers will strengthen the positive relationship between in-season marketing assets and local TV viewership.
H5c: Higher levels of Twitter followers will strengthen the positive relationship between non-seasonal marketing assets and team financials.

Publicly accessible secondary data from multiple sources were used in the analysis. Baseball attendance rose through the start of the period we examine (Beckman et al.). For this reason, we chose to examine the eight most recent completed seasons; at the time of this study, data for the following season were not yet available.
The data set consists of the 30 MLB clubs over eight years, resulting in n = 240. A few cases of missing data were imputed. PLS-SEM, as opposed to CB-SEM, is the appropriate structural equation modeling method for this study because the study involves secondary data, is exploratory by design, and evaluates the potential for causal relationships between constructs (Hair et al.). Annual population estimates according to metropolitan statistical area, annual team offense, and baseball game attendance data are publicly available (Baseball Almanac n.d.; United States Census n.d.). We obtained baseball club financial information, including both revenue and valuation, from reports available through Statista. Revenue figures are modeled as season-ending data. Club valuation is measured at the beginning of each year and thus represents the valuation as of the prior year-end. Other variables, such as personal income, World Series wins by team, and club age, came from multiple sources (Baseball Reference n.d.; Major League Baseball n.d.; Statista n.d.). TV viewership data originally reported by Nielsen was gathered from two baseball data outlets and is reported each year by both forbes.com and sportsbusinessdaily.com (n.d.).

Variable selection and construct alignment were guided by the unified theory of brand equity and the literature review. We identify two major marketing assets: non-seasonal assets and in-season assets. Stakeholder value is measured in two ways: game attendance and local TV viewership. Additionally, we include a financial measure of firm value. This measure is critical to include because we seek to predict how marketing assets, stakeholder value, and firm value work together in the formation of brand equity within the business of baseball. After collecting numerous potential items to represent these constructs, we conducted separate principal component analyses for each construct with varimax rotation using SPSS. A measure of sampling adequacy was used in the variable retention decision. Variables with an anti-image correlation below 0.5 were eliminated one by one, beginning with the lowest value, until all retained variables were above 0.5 and the KMO for the factor solution was above 0.5 (Hair et al.). Total variance extracted exceeds the conventional threshold for each construct. Eigenvalues are above 1 and above a parallel analysis of randomly generated eigenvalues from datasets with matching characteristics (Howard). The results are shown in the accompanying table, and the full PCA results for every variable are listed in a web appendix.
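A sketch of this retention and parallel-analysis procedure appears below. It assumes the factor_analyzer package for per-item sampling adequacy (MSA, which is derived from the anti-image correlation matrix) and takes a hypothetical pandas DataFrame of candidate indicators for one construct.

```python
# Sketch: drop items with low sampling adequacy, then run a parallel analysis.
# `items` is a hypothetical DataFrame of candidate indicators for one construct.
import numpy as np
import pandas as pd
from factor_analyzer import calculate_kmo  # per-item MSA and overall KMO

def retain_items(items: pd.DataFrame, msa_cutoff: float = 0.5) -> pd.DataFrame:
    while items.shape[1] > 2:
        msa_per_item, kmo_total = calculate_kmo(items)
        worst = int(np.argmin(msa_per_item))
        if msa_per_item[worst] >= msa_cutoff:
            break
        items = items.drop(columns=items.columns[worst])  # drop one item at a time
    return items

def parallel_analysis(items: pd.DataFrame, n_iter: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    n, p = items.shape
    random_eigs = np.array([
        np.linalg.eigvalsh(np.corrcoef(rng.standard_normal((n, p)), rowvar=False))[::-1]
        for _ in range(n_iter)
    ])
    real_eigs = np.linalg.eigvalsh(items.corr().to_numpy())[::-1]
    # retain components whose real eigenvalue exceeds 1 and the random-data mean
    return real_eigs, random_eigs.mean(axis=0)
```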
this construct is named non seasonal. the seven items representing the latent construct include annual cumulative measures of the number of playoff appearances, pennant titles, division titles, world series wins, hall of fame players, the age of the baseball club in years, and the population of the msa. the data sources are shown in the appendix, and the cronbach's alpha for this measure is α = . . this construct is named in season. the five items included in this latent construct represent team offense. they are the total annual number of hits, runs, doubles, at-bats, and sacrifice-fly outs. the data sources are shown in the appendix, and the cronbach's alpha for this measure is α = . . this construct is labeled attend. we model attendance as a latent variable using a blended set of four items, including total annual attendance, raw season average, percentage of filled seat capacity within the stadium, and percentage of league mean average attendance. the data sources are shown in the appendix, and the cronbach's alpha for this measure is α = . . local tv viewership is captured as a latent construct named local tv and includes three items measuring the annual average local rank against other programs at the same time, the nielsen rating points/share, and the average number of viewing households. this data is collected by nielsen and reported through third-party outlets, as listed in the appendix. the cronbach's alpha for this measure is α = . . we name this latent variable financials and use three items. they are gross revenue, gate revenue, and club valuation. the data sources are shown in the appendix, and the cronbach's alpha for this measure is α = . . we include a single-item measure to capture the growing influence of social media, measured by twitter followers. the measure is modeled as a moderator of the relationships between marketing assets, stakeholder value, and firm performance. this data, published by statista.com, measures the annual number of twitter followers for each team in october.

pls-sem results are evaluated according to a two-step approach (sarstedt and cheah ). each step involves a series of assessments resulting in a confirmatory composite analysis (hair et al. ). we assessed internal consistency reliability according to cronbach's alpha, and the values in table were above the thresholds of . and . (hair et al. a). evaluation of convergent validity for each construct is according to item loadings and ave as reflected in table . eighteen of twenty-two loadings exceeded the recommendation of . , and all were significant at p < . (hair et al. ). loadings between . and . should be retained if the deletion of the item will negatively affect the content validity of the construct (hair et al. ). additionally, the ave for each construct exceeded the standard threshold of . (hair et al. a). we assessed discriminant validity with the heterotrait-monotrait method (htmt). table shows that the htmt for each construct was well below the recommended level of . (henseler et al. ).

figure reflects the structural model with bootstrapped path coefficients, p values, and r square for the endogenous constructs. the structural model was assessed for vif, and all measures were below the recommended level of . (hair et al. a). also reflected in table are the r square measures for the endogenous constructs. attend has an r square of . , and local tv has an r square of . . both results explain a moderate amount of variance. team financials has an r square of . , which approaches a strong amount of variance explained in the dependent variable (hair et al. b). the results of the path analysis reveal that each hypothesized path was statistically significant (p < . , unless otherwise noted) except for the path between in season and financials. the path between non seasonal and attend was . in support of h a. the path between non seasonal and local tv was . in support of h b. the path between non seasonal and financials was . in support of h c. the path between in season and attend was . in support of h a. the path between in season and local tv was . in support of h b. the path between in season and financials was . and not statistically significant; thus, h c was not supported.
this finding was unexpected but becomes clearer in conjunction with the results for h b and h b. the path between attend and financials was . in support of h a. the mediation results of attend on the relationship between in season and financials reflect a path of . in support of h b. the mediation results of attend on the relationship between non seasonal and financials reflect a path of . in support of h c. the path between local tv and financials was . in support of h a. the mediation results of local tv on the relationship between in season and financials reflect a path of . (p < . ) in support of h b. we can conclude that the relationship between in-season team performance and team financials is fully mediated by attend and local tv. finally, the mediation results of local tv on the relationship between non seasonal and financials reflect a path of . (p < . ) in support of h c. these results are summarized in table .

the hypotheses related to the moderating effects of twitter were tested using an orthogonal moderation approach, which effectively eliminates multicollinearity between the independent variables and the moderating variable (hair et al. ). each moderating effect is evaluated according to simple slope analysis, whereby the mean relationships are evaluated at one standard deviation above and below the mean. the moderating effect of twitter followers on the relationship between non seasonal and attend is - . (p < . ) in support of h a. depicted graphically in fig. , this moderating effect shows that higher levels of twitter followers are associated with a weaker relationship between non-seasonal ma and game attendance, whereas lower levels of twitter followers reflect a stronger relationship between non-seasonal ma and game attendance. the moderating effect of twitter followers on the relationship between in season and local tv is . (p < . ) in support of h b. depicted graphically in fig. , this moderating effect shows that higher levels of twitter followers are associated with a stronger relationship between in-season ma and local tv viewership, whereas lower levels of twitter followers reflect a weaker relationship between in-season ma and local tv viewership. the moderating effect of twitter followers on the relationship between non seasonal and financials is . (p < . ) in support of h c. depicted graphically in fig. , this moderating effect shows that higher levels of twitter followers are associated with a stronger relationship between non-seasonal ma and team financials, whereas lower levels of twitter followers reflect a weaker relationship between non-seasonal ma and team financials.

the time-series nature of the data set allows for some trend analysis of the two measures associated with stakeholder value, which are attendance and local tv viewership. we additionally evaluate the trend of twitter followers over the same time period. as shown in fig. , for the years through , attendance was flat and then declining, while local tv viewership enjoyed a brief increase but has since begun to match the attendance decline. meanwhile, twitter followers steadily increased over the same time period. these basic longitudinal trends capture the shift in brand equity occurring in mlb and, together with the moderating effects, may begin to explain how sources of stakeholder value are changing. the focus of this research is the analysis of the decline in mlb ticket sales and game attendance within the larger framework of mlb brand equity.
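before turning to the discussion, note that the simple slope logic reported above is easy to reproduce once the standardized estimates are known. the coefficients below are hypothetical placeholders (the actual path values are elided in this text); the sketch only illustrates how a negative interaction term flattens the slope of attend on non seasonal as follower counts rise.

```python
# hypothetical standardized estimates -- substitute the reported values
b_main = 0.40          # non seasonal -> attend
b_interaction = -0.15  # orthogonalized (non seasonal x twitter) -> attend

# simple slope of attend on non seasonal at -1 sd, the mean, and +1 sd of
# the standardized twitter-follower moderator: slope = b_main + b_interaction * z
for z, label in [(-1.0, "-1 sd"), (0.0, "mean"), (1.0, "+1 sd")]:
    print(f"twitter followers at {label}: slope = {b_main + b_interaction * z:+.2f}")
```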
the research model empirically tests portions of a prior conceptualization of professional sports brand equity (kerr and gladden ). moreover, we extend that conceptualization with the inclusion of twitter followers and establish its moderating effect on various relationships. the brand equity model consists of two major categories of mlb marketing assets, their relationships with stakeholder value measured by attendance and local tv viewership, and firm value represented by team financials. three findings discussed below include the replication and extension of prior research establishing two distinct marketing assets in mlb, the absence of any direct relationship between in-season marketing assets and team financial performance, and the adverse effect of rising numbers of twitter followers weakening the relationship between non-seasonal ma and game attendance.

research about mlb brand equity established the importance of two distinct marketing factors, one related to performance and one unrelated to performance (feng and yoon ). those findings give some pause, however, because the sample in the study involved data from and . nevertheless, the findings in our model, based on more recent data, advance the assertion that mlb teams should invest in and grow the effectiveness of both non-seasonal and in-season marketing assets because higher levels of ma are related to higher levels of stakeholder and firm value. non-seasonal ma has a stronger relationship than in-season ma with attendance, local tv viewership, and firm value. non-seasonal ma includes measures of long-term winning records such as division and pennant championships, along with world series victories and the number of hall of fame players. thus, we can infer that a focus on consistent long-term success is more valuable to team brand equity than short-term strategies aimed toward winning now.

the absence of a direct relationship between in-season marketing assets and team financial performance could indicate a decoupling of the relationship between on-field success and firm value. taken to the extreme, these results could indicate that a single winning season may have no direct impact on the club's financial performance. instead, the impact of that winning season is fully mediated by an increase in stakeholder value reflected in higher attendance and local tv viewership. likewise, however, a single losing season may also have no direct impact on team financial performance. this finding could become a concern for mlb fans if, for example, team ownership prioritizes firm value over stakeholder value by choosing not to address long-term on-field performance problems related to the team.

the third finding begins to address why ticket sales may be in decline as a result of a larger market shift in mlb brand equity. the rise in twitter followers is associated with the adverse effect of a weaker relationship between non-seasonal ma and game attendance. stated another way, teams with higher numbers of twitter followers versus those with lower numbers reflect lower levels of attendance increases relative to the effectiveness of their non-seasonal marketing assets. simultaneously, this same trend of higher twitter followers is associated with a stronger relationship between in-season ma and local tv viewership, and a stronger direct relationship between non-seasonal ma and team financial performance.
otherwise stated, teams with higher levels of twitter followers reflect higher levels of local tv viewership given similar levels of effectiveness from in-season marketing assets. likewise, teams with higher numbers of twitter followers also reflect higher levels of team financial performance given similar levels of effectiveness in non-seasonal ma. in summary, teams with higher levels of twitter followers have relatively lower levels of increases in game attendance, higher levels of local tv viewership, and higher levels of financial performance directly related to increases in their non-seasonal ma.

the findings from this research offer two theoretical contributions. first, with research questioning the role theory plays in the face of increasing availability and analysis of data (chandler ), we demonstrate that analysis of sports marketing data can build and support the development of the unified theory of brand equity (davcik et al. ). we distinguish between the effects of two distinct marketing assets and their relative effects on financial performance, as well as the mediating roles of attendance and local tv. as a result, we can begin to see the importance of the brand equity "mix" in the brand equity theory discussion, as evidenced by the analysis of sports marketing data. second, our study begins to answer the call for additional research about the interaction of the relationships at the intersection of marketing assets, stakeholder value, and financial performance (davcik et al. ). while more work is needed to determine the relative strength of interactions between marketing assets, stakeholder value, and financial performance, our study may help explain how brand equity forms or changes over time. the unified theory of brand equity is just beginning to emerge, and every contribution brings the field one step closer to its validation.

the sales and marketing implications are best summarized in two ways. first, the effect sizes from in-season ma on attendance, local tv viewership, and team financial performance are all smaller relative to non-seasonal ma. thus, non-seasonal ma are a more influential part of team brand equity than in-season ma. for this reason, mlb sales and marketing personnel should perhaps begin to look for ways to maintain fan interest in their teams through the emphasis of non-seasonal ma, rather than a focus on in-game team performance. however, some non-seasonal factors (i.e., # of division titles, pennants, etc.) are based on cumulative, multi-season success. as such, team owners and managers can support sales and marketing efforts that promote brand equity growth through a balanced plan, which includes an emphasis on a long-term culture of winning (not necessarily a power offense), while still maintaining some level of in-season offensive excitement. second, while it may be important for marketing and sales teams to continue efforts toward activities that increase game attendance, such as giveaways and memorabilia, the results in this study demonstrate that some fans are making a choice to follow their favorite mlb teams in a different way, involving more twitter and less in-person game attendance. likewise, and according to uoh, when games get interesting, these same twitter followers seem to be drawn more strongly toward their local rsn broadcast to watch their team.
while not specifically researched in this study, it is plausible that many of the twitter followers represent those already in the brand community of their local team, and do not necessarily represent potential new fans who could become part of the brand community in the future. ticket sales initiatives, therefore, could be designed in ways that not only promote attendance but, more importantly, build upon non-seasonal sources of team brand equity. for example, promotions that develop a nostalgic habit of coming to the ballpark many times in a season for new customers may be more beneficial than sponsoring the traditional one-off bat-days or hat-days. ballpark development partnerships focused on excellent fan experiences regardless of the actual results of any one game, such as the new park in atlanta, ga, are another potential path to increase long-term brand equity.

clearly, mlb will continue to increase revenue through future media contracts based on the recent agreement signed and set to begin in (ozanian ). thus, sales and marketing personnel should begin to think bigger. technology is shrinking the world. as such, teams should not see themselves as geographically bound in the marketplace. instead, it may be time for mlb sales and marketing managers to more fully shift toward additional sources of revenue by redirecting marketing asset management toward new technologies such as web-enabled interfaces and the coming opportunities available through augmented and virtual reality. in order to do so, mlb teams may need to focus on developing new competencies, for example, the development of a social crm capability (kim and wang ). this study presents evidence that alternative channels of fan growth are changing the brand equity picture for mlb. this change is associated with a weaker relationship between a team's most important marketing asset and game attendance and could be a contributing factor in decreased ticket sales. lastly, the concept of satellite fans in professional sports was once just an idea (kerr and gladden ), but now may become an important area of new growth for mlb teams. in , mlb released an early version of a virtual reality fan viewing experience called at-bat vr (petriello ). it is conceivable that fan viewership in the future will be increasingly virtual. when that time comes, teams should be ready to compete for fans anywhere in the world.

we note two limitations of this study. first, the use of secondary data is a limitation due to mlb being privately held and restricting access to more detailed information. we fully recognize that mlb has access to other variables of interest that could help deconstruct and provide understanding about additional sources of non-seasonal ma. from a sales perspective, one useful source of data from mlb would be information about the structure of team incentives for ticket sales. recent research highlighted the differential preferences of varying sales incentives among salespeople (said ). in this view, are sales incentive structures appropriately aligned and related to critical marketing assets and attendance results? within the realm of secondary data, social media channels other than twitter, along with content analysis of mlb social media, constitute a readily available data-rich environment for additional insight. a second limitation of this study is the annual analysis approach, which required some variables based on means and averages.
nevertheless, this model of mlb brand equity accounts for the decline in ticket sales and attendance based on secondary sports marketing data and sheds light on how the increase of an alternative marketing asset such as twitter is related to a decline in game attendance.

future research could look at individual teams' brand equity building activities as well as the business patterns of the owner. do owners have a shareholder or stakeholder approach to the sport? do they prioritize a long-term stakeholder approach to consistent winning or instead reflect a win-now-at-all-costs approach? with the substantial amounts of revenue reportedly coming from revenue sharing agreements and television contracts, to what degree does winning contribute to value creation? for example, if payroll and operating expenses are not a concern, ownership may be disincentivized, have a low-risk profile, or lack the ambition to make marketing investments that could promote long-term brand equity. additional future research could also involve an analysis of the game attendance effects within certain divisions or rivalries. in these situations, according to uoh, game attendance may not be the result of the home team's seasonal success but instead attributed to the intrigue associated with the visiting team. lastly, a future research project could consider the question of whether a consistent short-term focus on seasonal offensive power has any positive long-term effect on brand equity.

in closing, we acknowledge the impact of covid- on major league baseball. at the time of this writing, the pandemic has delayed the start of the season with an opening date yet to be determined. even when the season starts, it is unclear whether fans will be permitted to attend games. until the season begins, the ramifications on the sport from the pandemic are unknown. future research will likely explore the impact on brand equity from this unprecedented event, and as such, the research model presented here is an ideal place to begin. we look forward to the time when america's favorite pastime will return, and we can once again "play ball!"

references:
- managing brand equity
- measuring brand equity across products and markets
- the national football league's brand and stadium opportunities
- major league baseball attendance: long-term analysis using factor models
- the trouble with brand valuation
- an analysis of major league baseball attendance
- major league baseball team history
- explaining game-to-game ticket sales for major league baseball games over time
- spectator-based brand equity in professional soccer
- an empirical assessment: determinants of revenue in sports leagues
- holy cow! wait 'til next year! a closer look at the brand loyalty of chicago cubs baseball fans
- from terrible teams to rising costs: why mlb attendance is down over % since . forbes, october
- a world without causation: big data and the coming of age of posthumanism
- decoding the complexity of the consumer-based brand equity process
- the chain of effects from brand trust and brand affect to brand performance: the role of brand loyalty
- ex ante and ex post expectations of outcome uncertainty and baseball television viewership
- towards a unified theory of brand equity: conceptualizations, taxonomy and avenues for future research
- what's the matter with baseball? deadspin
- examining the drivers and brand performance implications of customer engagement with brands in the social media environment
- baseball attendance and the designated hitter
- big data consumer analytics and the transformation of marketing
- the steroids era. december
- a dialogue on momentum accounting for brand management
- dynamic brand evolution mechanism of professional sports teams: empirical analysis using comprehensive major league baseball data
- mlb's most valuable television deals
- the politics of stakeholder theory: some future directions
- customer lifetime value and firm valuation
- multivariate data analysis
- assessing measurement model quality in pls-sem using confirmatory composite analysis
- a primer on partial least squares structural equation modeling (pls-sem)
- when to use and how to report the results of pls-sem
- a new criterion for assessing discriminant validity in variance-based structural equation modeling
- the marketing advantages of strong brands
- a review of exploratory factor analysis decisions and overview of current practices: what we are doing and how can we improve?
- stakeholder marketing: a definition and conceptual framework
- the state of marketing analytics in research and practice
- product familiarity and learning new information
- finding sources of brand value: developing a stakeholder model of brand equity
- brands and branding: research findings and future priorities
- strike three: baseball is dead. february, https://howtheyplay.com/team-sports/baseball-a-changing-landscape
- competitive interference effects in consumer memory for advertising: the role of brand familiarity
- extending the understanding of professional team brand equity to the global marketplace
- defining and measuring social customer-relationship management (crm) capabilities
- how the steroid era saved baseball. the motley fool
- the rise of the local: the power of regional sports networks in the television marketplace
- stock market reactions to brand extension announcements: the effects of brand attitude and familiarity
- stakeholder theory: reviewing a theory that moves us
- common factors in major league baseball game attendance
- estimating attendance at major league baseball games for the season
- how popular is baseball
- corporate social responsibility and firm financial performance. academy of management
- the peculiar economics of professional sports attendance and price setting
- fox's mlb tv deal is much richer for team owners than you probably think. forbes
- it's arrived: mlb at bat vr makes debut. major league baseball
- revenue sharing in major league baseball: the moments that meant so much
- salespeople's reward preference methodological analysis
- partial least squares structural equation modeling using smartpls: a software review
- brand name recall: a study of the effects of word types, processing, and involvement levels
- the measurement and determinants of brand equity: a financial approach
- the resource-based view and marketing: the role of market-based assets in gaining competitive advantage
- major league baseball (dossier no. did- - )
- understanding fan motivation for interacting on social media
- factors influencing demand in major league baseball: steroid policy, discrimination, and uncertainty of outcome
- consumer interest in major league baseball: an analytical modeling of twitter
- revisiting the social identity-brand equity model: an application to professional sports
- marketing analytics for data-rich environments
- social media posters and lurkers: the impact on team identification and game attendance in minor league baseball
- an examination of selected marketing mix elements and brand equity

college and an adjunct faculty member at the university of mobile. he is a phd candidate at the university of south alabama with research interests in sales, ethics, marketing, analytics, and emerging technologies.

catherine hessick is a lecturer of marketing at james madison university, where she teaches undergraduate courses on business decision making, integrated marketing communications, and marketing foundations. in addition, she is a ph.d. candidate at the university of south alabama, currently researching the ethical perspectives, moral disengagement, and unethical consumption behaviors of generation z. her other research interests include advertising, humor, and branding.

britton r. leggett is a mathematics teacher at neville high school (monroe, la) and a part-time adjunct at the university of louisiana, monroe. he is a ph.d. candidate at the university of south alabama with research interests in social media marketing, influencer marketing, and marketing analytics.

university's huizenga college of business. he is also an adjunct professor at florida memorial university for finance courses. he is a ph.d. candidate at the university of south alabama with research interests in strategic management and finance. he was in real estate lending for twenty-one years and owned a lender for nine.

kenneth o'connor is a marketing ph.d. student at the university of south alabama and an adjunct instructor of business at the university of west florida. he gained much of his experience through managing his family's small kitchen and bath remodeling company, in addition to managing other small and medium size construction companies. his research interests are loyalty, branding, entrepreneurship and family business.

key: cord- -sjsju qp authors: ewing, lee-ann; vu, huy quan title: navigating 'home schooling' during covid- : australian public response on twitter date: - - journal: nan doi: . / x sha: doc_id: cord_uid: sjsju qp

covid- has wreaked havoc worldwide. schools have escaped neither the pandemic nor its consequences. indeed, by april , schools had been suspended in countries, affecting % of learners globally. while the australian government has implemented variously effective health and economic policies in response to covid- , their inability to agree with states on education policy during the pandemic caused considerable confusion and anxiety. accordingly, this study analyses weeks of tweets during april, leading up to the beginning of term , during the height of government policy incongruity. findings confirm a wide and rapidly changing range of public responses on twitter. nine themes were identified in the quantitative analysis, and six of these (positive, negative, humorous, appreciation for teachers, comments aimed at government/politicians and definitions) are expanded upon qualitatively.
over the course of weeks, the public began to lose its sense of humour and negative tweets almost doubled.

on december , the first reporting of unusual health activity came out of wuhan, in hubei province, china. eight days later, the activity was identified as a 'novel coronavirus' (virus strain sars-cov- ), named -ncov or covid- by the world health organization (who, cnn editorial research, ). over the next several weeks, cases began to grow in asia, particularly china, as the world watched government actions and news updates out of china, thailand and japan. term plans announced across australian jurisdictions varied considerably:

- school will return in the act on tuesday ( / ), and will initially be entirely delivered via remote learning.
- staggered return to face-to-face classroom teaching through phases: phase sees all students encouraged to utilise remote learning from home wherever possible, though 'no student will be turned away' from supervised school learning; phase begins may and requires students to return to the classroom for day/week; phase = days/week; phase = days/week with social distancing; phase would have school back running as normal.
- schools open for any student to receive face-to-face teaching. the wa government claims its term two plan is 'cautious, careful and considered to protect teachers, parents and children'. the choice to send children to school lies with families, and distance education packages and resources or online remote learning will be provided to any student who is kept home.
- year and students 'strongly encouraged' to attend school for face-to-face classes. some catholic, independent and anglican schools have gone against this advice, adopting remote learning for students up to year .
- all students in the nt are expected to physically attend school, unless they are unwell. parents can choose not to send their children to school, but are then 'responsible for the student's learning, safety and wellbeing at home or elsewhere'.

the data were collected from twitter, one of the most popular social networking platforms worldwide. twitter allows users to post and interact with messages known as tweets. users often access the twitter platform through its web interface or its mobile-device application software. twitter is selected for this study as it is a popular platform for sharing ideas and catching up with news and trends around the world (overbey et al., ). we developed and deployed data extraction software to automatically extract tweet data from twitter. the program was developed based on the twitter application programming interface (api), which allows users to search for tweets based on specific keywords and geographical location. full documentation about the twitter api can be retrieved from https://developer.twitter.com/en/docs. this study focuses on analysing public opinions about home schooling in australia; therefore, we provided a search query (homeschooling or 'home schooling') to the twitter search function of the api. in addition, an extra set of parameters, (- . , - . , . , . ) for minimum latitude, maximum latitude, minimum longitude and maximum longitude, is provided to specify a bounding box to focus the search within the geographical area of australia. there is a quota limit for access to the twitter api, which only allows a proportion of all available tweets in the latest days to be retrieved. although not all available tweets are included, the collected tweets are randomly sampled from all available ones, and are thus reliable for capturing common patterns and trends among the public.
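a minimal sketch of such a collector is shown below, assuming tweepy 3.x and valid app credentials (in tweepy 4.x the search method is api.search_tweets). the bounding-box numbers are rough placeholders for australia, since the exact coordinates used in the study are elided above; and because the standard search endpoint has no rectangular bounding-box parameter, one way to apply the box is to filter geotagged results locally, as done here.

```python
import tweepy

# placeholder credentials -- supply your own app keys
auth = tweepy.OAuthHandler("API_KEY", "API_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# rough bounding box for australia (min_lat, max_lat, min_lon, max_lon);
# the exact values used in the study are elided in the text above
MIN_LAT, MAX_LAT, MIN_LON, MAX_LON = -44.0, -10.0, 112.0, 154.0

def in_bbox(status) -> bool:
    """keep only geotagged tweets whose point falls inside the box."""
    if status.coordinates:  # geojson point: [longitude, latitude]
        lon, lat = status.coordinates["coordinates"]
        return MIN_LAT <= lat <= MAX_LAT and MIN_LON <= lon <= MAX_LON
    return False

# the standard search endpoint only reaches back about a week, which is
# consistent with re-running the collection program for each week of interest
query = 'homeschooling OR "home schooling"'
tweets = [
    s for s in tweepy.Cursor(api.search, q=query, count=100,
                             tweet_mode="extended").items(5000)
    if in_bbox(s)
]
print(len(tweets), "geolocated tweets kept")
```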
we ran the data collection program three times, for the weeks commencing , and april, to collect the tweets posted during this -week period. in total, , tweets relevant to homeschooling posted in australia were collected. the number of tweets for weeks , and are , , and , respectively.

we adopt both quantitative (descriptive) and qualitative approaches to analysing the contents of the collected tweets to identify their major themes and the concerns of the australian public in relation to home schooling during the pandemic. the first author read all , + tweets twice before developing the broad codes detailed in column of table . after the second author agreed with the codes, a coding protocol was followed whereby the first author ('coder ') and an independent and qualified non-author ('coder ') independently coded all tweets in the three csv files by assigning one number/code to each tweet. inter-coder reliability was computed using holsti's ( ) method, based on the percentage of agreement between the two coders. inter-coder reliability scores were all above % ( %, % and % for weeks - , respectively), thereby confirming that both coders were interpreting the material in a consistent manner. discrepancies were discussed and codes were revised, yielding a final coding result.

what table suggests is that over the course of weeks, leading up to and including the commencement of term , the 'novelty factor' wore off surprisingly quickly ( % reduction in positive tweets) and the public began to lose its sense of humour (humorous tweets dropped % off a high base and negative tweets almost doubled to %). appreciation for teachers more than doubled, off a low base (to %), while tweets aimed at the government/politicians halved (to %). surprisingly, the most dominant theme in week is frustration at calling remote learning 'homeschooling', which has skyrocketed (from % to %), further confirming that the novelty has indeed worn off and that the public is becoming increasingly frustrated. z-tests were performed and verified that, except for general and/or neutral tweets and tips/advice/sharing of resources, the proportional differences between the two periods for the other themes are statistically significant at p-value ⩽ . .

the second phase of data analysis was interpretive in nature. the first author went through the coded csv files again and shaded the more illuminating, illustrative and unique tweets in six of the nine themes. these were then extracted and placed into a ms word document for further analysis.
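both the holsti agreement score and the week-on-week z-tests described above are straightforward to compute; a minimal sketch follows, using placeholder counts because the exact figures are elided in this text. proportions_ztest is the standard two-proportion test from statsmodels.

```python
from statsmodels.stats.proportion import proportions_ztest

def holsti_agreement(codes_a, codes_b) -> float:
    """percentage agreement between two coders over the same tweets."""
    assert len(codes_a) == len(codes_b)
    hits = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * hits / len(codes_a)

print(holsti_agreement([1, 2, 2, 9, 5], [1, 2, 3, 9, 5]))  # 80.0

# two-proportion z-test for the change in one theme's share between weeks;
# the counts are placeholders, not the study's (elided) figures
count = [120, 230]   # tweets coded into the theme in week 1 vs week 3
nobs = [1000, 1100]  # total coded tweets in each week
z_stat, p_value = proportions_ztest(count, nobs)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```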
positive. many people commented on the positive effects of being in isolation and homeschooling. they appreciated the perception of 'slowing down', being less busy with fewer distractions. many parents commented on how homeschooling has presented them with the time and opportunity to get to know their children better and appreciate them as individuals: 'find the silver lining. with difficulty there is blessing. i can be with my girls - . precious', and another, 'i'm reminded the days are long the years are short'. the exposure to a new learning approach has gained positive reactions, and many respondents commented on the positive results of homeschooling: 'i know one kid who has a problematical classroom record but is blossoming under home schooling to the extent that his parents are exploring options after the #lockdown ends. and they're not home-schooling nutters'. many parents and children are engaged in their learning, having fun and even thriving: 'home schooling positive: teenage son whose reports invariably add "too easily distracted" is getting his work done'. this could be a result of the homeschooling model being an antithesis of all the negative factors of face-to-face schooling: 'winning! plus, no bags/lunches to pack or school drop off/pick up! #homeschooling suits me!'

many parents used the twitter platform to reach out to others. there is a need to connect as adults experiencing a new challenge and to share experiences and humour: 'parents are homeschooling -especially i suspect in lower grades. we don't need to get bent out of shape about it it's just a necessity right now'. '. . . getting a bit nostalgic about our office door . . . offices-remember those?' 'send coffee please . . .'. twitter also allowed parents to share resources, and a collective empathy among parents is evident. the sharing of tips, links, websites and so on is a common thread through the twitter feeds: 'some great tips if you're a parent home schooling your kids'.

humour. twitter is also a platform for sharing humour about homeschooling. parents embraced humour as a coping mechanism: '. . . not sure i'm managing the work/life/homeschooling balance quite right. there's going to be a very cranky "teacher" in our house tomorrow morning!!'; 'it turns out i really suck at grade math's. #homeschooling'. it acts as an adult forum for venting frustration and a feeling of being overwhelmed: ' hour into day of home schooling and i've already decided it's time for a fire drill. #homeschooling #getthemout'. humour on twitter is allowing parents to express themselves safely without fear of recrimination. they have an audience who are experiencing similar challenges. it is also a platform for people in isolation, as it is instant communication without having to leave home. many twitter comments mentioned alcohol consumption: 'well that went well. bwahahaha . . . sob . . . i need a drink. #homeschooling #workfromhome #completelybloodyincompatible'.

negatives. many negative aspects of homeschooling were highlighted in the tweets. the platform was used by many to express their anxiety and frustration: '. . . i talk to frustrated parents every day. i'm bloody frustrated and exhausted and angry too'. there is a strong suggestion of not coping and relationship deterioration: 'i honestly could not do homeschooling for a term. my son would suffer academically and our relationship would suffer'. the situation has also highlighted the divide between public and private schools: 'if kids are behind in literacy & numeracy it's because this govt fails to invest enough in our public schools'. '. . . covid coronavirus rich people problems'. 'why is every home schooling case study on abc in a relatively well-off household? what about the poor kids @abcnews?' twitter users express frustration with technology, or the lack thereof: 'no internet connection this morning . . . home schooling is canceled for today'. 'so, we had electricity issues on thursday in my area so we couldn't log in and partake in the online learning . . . i'm so deflated'. the platform was also used to express concerns regarding gender inequality: 'mostly it falls to the women in the household making it more difficult for keeping their job let alone career progression'. 'the #genderpaygap will have a shocking impact on our critical #covid workforce' 'who will do the lion's share of home schooling and child caring during #iso when both parents are normally at work whose job will get priority? his/he earns more'.
'apparently mums struggling with juggling looking after a house working being a mum / and home schooling during isolation means they hate their kids'. twitter users also picked up on how 'homeschooling' has exposed the vulnerable in society: 'some children are exposed to crap parental behavior including actual abuse some don't have multiple bedrooms/ laptops/ backyards to make the home schooling thing a pleasant and productive experience. i feel for kids the most'. 'dear parents of autistic/adhd kids who are homeschooling during lockdown. how are you coping? how are your kids coping with . . .?' 'if kids had online home schooling there would be no bullying in schools which is an epidemic'. mental health challenges are also exposed: 'i'm not helping with my daughter's home schooling. it's very hard. depression doesn't help in the mix either'. twitter also points to challenges in australian society as a whole: 'people have lost homes in the bushfires. people have lost jobs due to the lockdown. parents struggle with homeschooling . . .'.

the negative influence of technology is argued on twitter: 'it's like a child being raised by robots'. 'this home schooling thing is a great way to get kids addicted to screens isn't it'. '. . . it's not the work it's the endless portals passwords and it administrative confusion'. for parents, the reliance on technology tests their parenting values and approach: 'home schooling is a #bigtech wet dream. we have a year old now with a google account something we would not have waved through until he was but to deny means he has no interaction with his education and school mates.'

twitter is also a platform for raising concerns about the demands made on parents' time and balancing the demands of their children's learning and their own work commitments: 'deliberating what the appropriate time to start making work calls in this covid- time when the person i'm calling is isolating and home schooling?' 'i'm getting family help with my kids as home schooling with this job is a struggle'. 'i'm not saying u can't have parents work but children's needs have come st'.

appreciation for teachers. many twitter users expressed their appreciation and respect for teachers, particularly by week , suggesting that the role of teaching has until now been undervalued: 'i am so sorry, i and many other parents have seen st hand how tough the job is for teachers'. 'when covid- restrictions are lifted i demand all good teachers get paid double . . .'. 'it's fair to say that since this crisis began most parents have discovered a newfound respect for teachers'. having to adopt the role on a 'temporary' basis has revealed the demands on teachers, and many did not realize the effort involved in teaching: '. . . teaching is a demanding occupation homeschooling might be revealing to many parents just how difficult it is'. '. . . homeschooling has taught parents anything it's that it takes a specific kind of person to have the patience and dedication to teach'. some twitter users refer to teachers as 'heroes': 'teachers are unsung heroes until everyone has tried home-schooling. we owe them thanks along with delivery workers'. for many, teachers play an important part in their children's lives, and twitter allows users to express their emotions about the relationship: '. . . i got a bit teary. they are so supportive and realistic and i love them and now i'm sad all over again that my boys are out of their care for this term'.
government and politicians. while twitter users express respect and affection for teachers, the sentiment directed at the country's leaders is largely negative. many twitter comments concerned the government's response and communication regarding education and the pandemic. it is a platform for venting frustration and anger towards politicians, at times through sarcasm. the primary concern is the miscommunication and confusion around the messages presented to parents by the governments: 'welcome to covid- oz style . . . so we have the federal government telling us that we should be sending kids to school yet state govts are saying only send them if absolutely necessary . . . would you please clarify'. 'what a fuckstick. it's like getting a lecture from a drunk uncle. incoherent ramblings and repeating himself. speech writers from the ipa?' the issue is also around poor leadership: '@dan tehanwannon showed some pretty poor leadership with his outburst. expect more from elected officials. political squabbling is pathetic at this time'. the headline 'pm urges teachers not to force parents into choice between home schooling and food on the table' suggests fear of the unknown. parents are concerned about the health risk to their children of returning to school and hold politicians responsible: 'if you force schools to open i will unenroll my child from school and register them for home schooling . . .'.

definitions. the definition of homeschooling/remote learning elicited many emotional responses. many tweets correctly noted that it is not homeschooling, but rather learning from home or remote learning: 'every time people refer to it as home schooling i want to scream. it's not bloody homeschooling you morons' and 'parents are merely fulfilling a temporary, supervisory role, and are not expected to teach'. comments acknowledged that teachers are the professionals who are still teaching: 'hey again for the people in the back it's not #homeschooling it's learning from home using stuff professional teachers have prepared'. another user summarized the role: 'australian parents are supporting students as they learn at home. teachers have designed the curriculum, planned lessons, decided on assessment and will mark student work. parents are definitely working hard to support learning, but they are not homeschooling their children.' the language around learning is confusing and varied: remote learning, distance learning, supporting students learning at home, online learning. 'it's not outrage at the wrong term it's frustration and misrepresentation. distance education is the right term. home schooling it is not'. arguably, the frustration over the definition of the learning highlights how unsure parents feel in this new role: 'i came across some online twitter debate on whether it was technically home schooling when in reality you aren't setting the work. after hours of helping my year old navigate math's questions i will call it whatever the fuck i like . . . exhausting mostly lol.' another user stated, 'we know it's not home schooling. if it's easier for parents to call it that and it makes them feel better then so be it'. finally, a humorous take on the definition: 'you can call it home schooling if it comes from the homeschoole region of france otherwise it is just sparkling domestic education.'

covid- is the biggest health, social and economic emergency the world has faced since the second world war, and its consequences will endure for years to come (khan, ).
in response, australia's national cabinet process has worked effectively by building confidence and trust between jurisdictions and cutting through narrow partisan politics in the name of the 'public interest'. different levels of government have had to come together to negotiate, largely as equals, resulting in better policy than if one level had simply dictated terms (smith, ). for example, the actions of the nsw and victorian governments on the second weekend in march, where they pushed a stage lockdown ahead of a reluctant commonwealth, have been shown to be correct. likewise, the actions of queensland in imposing quarantines on domestic travellers were followed by the nt, wa, sa and tasmania and have effectively stopped the spread of the virus between states. interestingly, the only major split in the initial covid- response has been over schools, an area that is funded dually and where the commonwealth has tried to use funding to exert control over the states' responsibilities. this school split has frustrated and confused an already anxious public, who, quite justifiably, turned to social media to voice their frustrations.

a single recommendation emanating from this study is that the australian federal and state governments should agree on a school attendance/remote learning policy before the next pandemic or other crisis strikes. in other words, they should have a national/state policy 'on the shelf'. it is also highly likely that a hybrid approach to primary and secondary education will emerge post-covid; in other words, parts of the curricula could transition online over time. this, coupled with the very high likelihood of future pandemics and crises, will no doubt need to be reflected in teacher recruitment and training going forward and in school policies and practices pertaining to education during crises. it will be interesting to see how many of the aforementioned issues play out in real time in the united states this fall. schools there are about to return from their long summer break and tensions are escalating between education secretary betsy devos and numerous state governors. early indications are that the us model for the fall term and / academic year is likely to be partisan, heterogeneous, probably hybrid and possibly quite politically divisive.

like all similar studies, this one is not without limitations. the analysis was carried out on a relatively small data set ( , tweets) and for a short time period ( weeks). the data were also collected from a single social media platform (twitter). nevertheless, the findings shed light on the emerging issues and challenges in the context of remote learning which are confronting policy makers, politicians and principals alike. in addition to the textual content, other meta-data, such as the profile of the twitter user (e.g. location, description, number of followers, number of following), hashtags, mentioned users, number of likes, number of re-tweets and links to other websites, are also available. however, as a preliminary study, our analysis mainly focused on the textual content of the tweets to study the topics being discussed by the australian public at a point in time. due to the unrestricted boundary of social media, the analysis can be extended to other countries, such as new zealand and beyond, for comprehensive insights. while this study focused on twitter, analyses of facebook and instagram would be most worthwhile too.
future studies may also consider employing quantitative approaches with text mining, sentiment analysis and topic modelling (silge and robinson, ) to effectively process and analyse large-scale social media data sets. future research might employ a choice modelling technique, such as best-worst scaling, to determine what learning approaches parents deem most and least appealing, for it is also possible that traditional (i.e. voluntary, non-forced) home schooling may increase in popularity following some parents'/pupils' positive experiences during the pandemic.

the author(s) received no financial support for the research, authorship and/or publication of this article. lee-ann ewing https://orcid.org/ - - -

references:
- australian dies from coronavirus covid- in perth after infection on diamond princess ship
- australia closes borders to stop coronavirus. news. available at
- principals reject cash lure to resume classes, despite financial plan. the age
- content analysis for the social sciences and humanities
- we have a responsibility to confront covid- . the guardian
- linking twitter sentiment and event data to monitor public opinion of geopolitical developments and trends. sbp-brims
- opinion: this is our chance! we have, right now, a once-in-a-generation opportunity. the mandarin
- who warns coronavirus, now dubbed covid- , is 'public enemy number ' and potentially more powerful than terrorism. abc news
- dan tehan admits he 'overstepped the mark' in attack on daniel andrews over coronavirus schools closure. the age

key: cord- - w cam authors: fang, zhichao; costas, rodrigo; tian, wencan; wang, xianwen; wouters, paul title: an extensive analysis of the presence of altmetric data for web of science publications across subject fields and research topics date: - - journal: scientometrics doi: . /s - - - sha: doc_id: cord_uid: w cam

sufficient data presence is one of the key preconditions for applying metrics in practice. based on both altmetric.com data and mendeley data collected up to , this paper presents a state-of-the-art analysis of the presence of kinds of altmetric events for nearly . million web of science publications published between and . results show that even though an upward trend of data presence can be observed over time, except for mendeley readers and twitter mentions, the overall presence of most altmetric data is still low. the majority of altmetric events go to publications in the fields of biomedical and health sciences, social sciences and humanities, and life and earth sciences. as to research topics, the level of attention received by research topics varies across altmetric data, and specific altmetric data show different preferences for research topics, on the basis of which a framework for identifying hot research topics is proposed and applied to detect research topics with higher levels of attention garnered on certain altmetric data sources. twitter mentions and policy document citations were selected as two examples to identify hot research topics of interest to twitter users and policy-makers, respectively, shedding light on the potential of altmetric data in monitoring research trends of specific social attention.

ever since the term "altmetrics" was coined in jason priem's tweet in , a range of theoretical and practical investigations have been taking place in this emerging area (sugimoto et al. ).
given that many types of altmetric data outperform traditional citation counts with regard to accumulation speed after publication, altmetrics were initially expected to serve as faster and more fine-grained alternatives for measuring the scholarly impact of research outputs (priem et al. , ). nevertheless, except for mendeley readership, which was found to be moderately correlated with citations (zahedi and haustein ), a series of studies have confirmed the negligible or weak correlations between citations and most altmetric indicators at the publication level (bornmann b; costas et al. ; de winter ; zahedi et al. ), indicating that altmetrics might capture diverse forms of impact of scholarship which are different from citation impact (wouters and costas ).

the diversity of impact beyond science reflected by altmetrics, which bornmann ( ) summarizes as "broadness", one of the important characteristics of altmetrics, relies on diverse kinds of altmetric data sources. altmetrics not only include events on social and mainstream media platforms related to scholarly content or scholars, but also incorporate data sources outside the social and mainstream media ecosystem, such as policy documents and peer review platforms. the expansive landscape of altmetrics and their fundamental differences highlight the importance of keeping them as separate entities without mixing, and of selecting datasets carefully when making generalizable claims about altmetrics (alperin ; wouters et al. ). in this sense, data presence, as one of the significant preconditions for applying metrics in research evaluation, also needs to be analyzed separately for the various altmetric data sources.

bornmann ( ) regarded altmetrics as one of the hot topics in the field of scientometrics for several reasons, one of them being that there are large altmetric data sets available to be empirically analyzed for studying the impact of publications. however, according to existing studies, there are important differences in data coverage across diverse altmetric data. in one of the first such studies, thelwall et al. ( ) conducted a comparison of the correlations between citations and categories of altmetric indicators, finding that, except for twitter mentions, the coverage of all selected altmetric data for pubmed articles was substantially low. this observation was reinforced by other following studies, which provided more evidence about the exact coverage for web of science (wos) publications. based on altmetric data retrieved from impactstory (is), zahedi et al. ( ) reported the coverage of four types of altmetric data for a sample of wos publications: mendeley readers ( . %), twitter mentions ( . %), wikipedia citations ( . %), and delicious bookmarks ( . %). in a follow-up study using altmetric data from altmetric.com, costas et al. ( ) studied the coverage of five altmetric data for wos publications: twitter mentions ( . %), facebook mentions ( . %), blogs citations ( . %), google+ mentions ( . %), and news mentions ( . %). they also found that research outputs in the fields of biomedical and health sciences and social sciences and humanities showed the highest altmetric data coverage in terms of these five altmetric data. similarly, it was reported by haustein et al. ( ) that the coverage of five social and mainstream media data for wos papers varied as follows: twitter mentions ( . %), facebook mentions ( . %), blogs citations ( . %), google+ mentions ( . %), and news mentions ( . %).
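the publication-level correlation results summarized here are typically computed as rank correlations, since both citation and altmetric counts are highly skewed with many zeros. a minimal sketch with scipy, using made-up counts:

```python
import numpy as np
from scipy.stats import spearmanr

# toy per-publication counts -- real analyses pair wos citations with one
# altmetric source at a time over very large publication sets
citations = np.array([0, 2, 5, 1, 30, 0, 8, 12])
tweet_mentions = np.array([1, 0, 12, 0, 3, 0, 2, 4])

rho, p = spearmanr(citations, tweet_mentions)
print(f"spearman rho = {rho:.2f} (p = {p:.3f})")
```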
in addition to the aforementioned large-scale research on wos publications, there have also been studies focusing on the coverage of altmetric data for research outputs from a certain subject field or publisher. for example, on the basis of selected journal articles in the field of humanities, hammarfelt ( ) investigated the coverage of five kinds of altmetric data, including mendeley readers ( . %), twitter mentions ( . %), citeulike readers ( . %), facebook mentions ( . %), and blogs citations ( . %). waltman and costas ( ) found that just about % of the publications in the biomedical literature received at least one f prime recommendation. for papers published in the public library of science (plos) journals, bornmann ( a) reported the coverage of a group of altmetric data sources tracked by plos's article-level metrics (alm). since data coverage is a value usually computed in most altmetric studies, similar coverage levels are found scattered across many other studies as well (alperin ; fenner ; robinson-garcía et al. ). by summing up the total number of publications and those covered by altmetric data in related studies, erdt et al. ( ) calculated the aggregated percentage of coverage for altmetric data. their aggregated results showed that mendeley readers cover the highest share of publications ( . %), followed by twitter mentions ( . %) and citeulike readers ( . %), while other altmetric data show relatively low coverage in general (below %).

the distributions of publications and article-level metrics across research topics are often uneven, which has been observed through the lens of text-based (gan and wang ), citation-based (shibata et al. ), usage-based (wang et al. ), and altmetric-based (noyons ) approaches, making it possible to identify research topics of interest in different contexts, namely, the identification of hot research topics. building on the concept proposed by tseng et al. ( ), hot research topics are defined as topics that are of particular interest to certain communities such as researchers, twitter users, wikipedia editors, policy-makers, etc. thus, 'hot' describes a relatively high level of attention that research topics have received on different altmetric data sources. attention here is understood as the amount of interaction that different communities have generated around research topics; therefore, topics with high levels of attention can be identified and characterized as hot research topics from an altmetric point of view.

traditionally, several text-based and citation-based methodologies have been widely developed and employed to detect research topics of particular interest to researchers, like co-word analysis (ding and chen ; lee ), direct citation and co-citation analysis (chen ; small ; small et al. ), and the "core documents" approach based on bibliographic coupling (glänzel and czerwon ; glänzel and thijs ), etc. besides, usage metrics, which are generated by broader sets of users through various behaviors such as viewing, downloading, or clicking, have also been used to track and identify hot research topics. for example, based on the usage count data provided by web of science, one study detected hot research topics in the field of computational neuroscience, listed as the keywords of the most frequently used publications. by monitoring the downloads of publications in scientometrics, wang et al.
identified hot research topics in the field of scientometrics, operationalized as the most downloaded publications in the field. from the point of view that altmetrics can capture the attention around scholarly objects from broader publics (crotty; sugimoto), some altmetric data have also been used to characterize research topics based on the interest exhibited by different altmetric and social media users. for example, robinson-garcia et al. studied the field of microbiology to map research topics which are highly mentioned within news media outlets, policy briefs, and tweets over time. zahedi and van eck presented an overview of specific topics of interest of different types of mendeley users, like professors, students, and librarians, and found that they show different preferences in reading publications from different topics. other work identified research topics of publications that are faster to be mentioned by twitter users or cited by wikipedia page editors, respectively. by comparing the term network based on author keywords of climate change research papers, the term network of author keywords of those tweeted papers, and the network of "hashtags" attached to related tweets, haunschild et al. concluded that twitter users are more interested in topics about the consequences of climate change for humans, especially those papers forecasting the effects of a changing climate on the environment. although there are multiple previous studies discussing the coverage of different altmetric data, after years of altmetric research we find that a renewed large-scale empirical analysis of the up-to-date presence of altmetric data for wos publications is highly relevant, particularly since, amongst previous studies, there still exist several types of altmetric data sources that have not been quantitatively analyzed. moreover, although the correlations between citations and altmetric indicators have been widely analyzed at the publication level in the past, the correlations of their presence at the research topic level are still unknown. to fill these research gaps, this paper presents a renovated analysis of the presence of various altmetric data for scientific publications, together with a more focused discussion about the presence of altmetric data across broad subject fields and smaller research topics. the main objective of this study is two-fold: (1) to reveal the development and current situation of the presence of altmetric data across publications and subject fields, and (2) to explore the potential application of altmetric data in identifying and tracking research trends that are of interest to certain communities such as twitter users and policy-makers. the following specific research questions are put forward: rq1. compared to previous studies, how has the presence of different altmetric data for wos publications developed until now? what is the difference in altmetric data presence across wos publications published in different years? rq2. how is the presence of different altmetric data distributed across subject fields of science? for each type of altmetric data, which subject fields show higher levels of data prevalence? rq3. how are the various altmetric and citation data related in covering different research topics? based on specific altmetric data, which research topics in each subject field received higher levels of altmetric attention? a total of , , wos papers published between and were retrieved from the cwts in-house database.
since identifiers are necessary for matching papers with their altmetric data, only publications with a digital object identifier (doi) or a pubmed identifier (pubmed id) recorded in wos were considered. using the two identifiers, the wos papers were matched with the types of altmetric data from altmetric.com and with mendeley readership, as listed in table . the data from altmetric.com were extracted from a research snapshot file with data collected up to october . mendeley readership data were separately collected through the mendeley api in july . altmetric.com provides two counting methods of altmetric performance for publications: the number of each altmetric event that mentioned the publication, and the number of unique users who mentioned the publication. to keep a parallelism with mendeley readership, which is counted at the user level, the number of unique users was selected as the indicator for counting altmetric events in this study. for the selected publications, the total number of events they accumulated on each altmetric data source is provided in table as well. besides, we collected the wos citation counts in october for the selected publications. citations serve as a benchmark for a better discussion and understanding of the presence and distribution of altmetric data. to keep consistency with the altmetric data, a variable citation time window from the year of publication to was utilized, and self-citations were considered for our dataset of publications. to study subject fields and research topics, we employed the cwts classification system (also known as the leiden ranking classification). waltman and van eck developed this publication-level classification system mainly for citable wos publications (article, review, letter) based on their citation relations. in its version, publications are clustered into micro-level fields of science with similar research topics (hereafter known as micro-topics), as shown in fig. with vosviewer. for each micro-topic, the top five most characteristic terms are extracted from the titles of its publications in order to label the different micro-topics. furthermore, these micro-topics are algorithmically assigned to five main subject fields of science: social sciences and humanities (ssh), biomedical and health sciences (bhs), physical sciences and engineering (pse), life and earth sciences (les), and mathematics and computer science (mcs). the cwts classification system has been applied not only in the leiden ranking (https://www.leidenranking.com/), but also in many different previous studies related to subject field analyses (didegah and thelwall; zahedi and van eck). a total of , , of the initially selected publications (accounting for . %) have cwts classification information. this set of publications was drawn as a subset for the comparison of altmetric data presence across subject fields and research topics. table presents the number of selected publications in each main subject field. in order to measure the presence of different kinds of altmetric data or citation data across different sets of publications, we employed the three indicators proposed by haustein et al.: coverage (c) indicates the percentage of publications with at least one altmetric event (or one citation) recorded in the set of publications. therefore, the value of coverage ranges from 0 to 100%. the higher the coverage, the higher the share of publications with altmetric event data (or citation counts).
density (d) is the average number of altmetric events (or citations) of the set of publications. both publications with altmetric events (or citations) and those without any are considered in the calculation of density, so it is heavily influenced by the coverage and by zero values. the higher the value of density, the more altmetric events (or citations) received by the set of publications on average. intensity (i) is defined as the average number of altmetric events (or citations) of publications with at least one altmetric event (or citation) recorded. different from d, the calculation of i only takes publications with non-zero values of each altmetric event (or citation event) into consideration, so the value must be greater than or equal to one. only for groups of publications without any altmetric events (or citations) is the intensity set to zero by default. the higher the value of intensity, the more altmetric events (or citations) have occurred around the publications with altmetric/citation data on average. in order to reveal the relationships among these three indicators at the research topic level, as well as the relationships of preferences for research topics among different data, spearman correlation analyses were performed with ibm spss statistics.
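to make the three indicators concrete, the following is a minimal python sketch (not the authors' code); the function name and toy data are illustrative only, assuming that per-publication event counts, including zeros, are available as a list.

```python
# a minimal sketch of the three presence indicators defined above,
# for one altmetric data source; zeros are publications without events
def presence_indicators(event_counts):
    """event_counts: per-publication event (or citation) counts, zeros included."""
    n = len(event_counts)
    nonzero = [c for c in event_counts if c > 0]
    coverage = 100.0 * len(nonzero) / n                # % of publications with >= 1 event
    density = sum(event_counts) / n                    # average over all publications
    intensity = sum(nonzero) / len(nonzero) if nonzero else 0.0  # average over covered ones
    return coverage, density, intensity

# toy example: five publications, two of them with events
print(presence_indicators([0, 3, 0, 0, 1]))  # (40.0, 0.8, 2.0)
```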
this section consists of four parts: the first one presents the overall presence of altmetric data for the whole set of wos publications (in contrast with previous studies) and the evolution of altmetric data presence over the publication years. the second part compares the altmetric data presence of publications across the five main subject fields of science. the third part focuses on the differences in the preferences of altmetric data for research topics. in the fourth part, twitter mentions and policy document citations are selected as two examples for identifying hot research topics with higher levels of altmetric attention received. coverage, density, and intensity of the sources of altmetric data and citations were calculated for nearly . million sample wos publications to reveal their overall presence. table presents not only the results based on our dataset but also, for comparability purposes, the findings of data coverage (c_ref) reported by previous altmetric empirical studies that also used altmetric.com (and the mendeley api for mendeley readership) as the altmetric data source and wos as the database for scientific publications, without applying restrictions to certain disciplines, countries, or publishers. as these previous studies analyzed datasets with sizes, publication years (py), and data collection years (dy) different from ours, we present them as references for discussing the retrospective historical development of altmetric data prevalence. according to the results, the presence of different altmetric data varies greatly. mendeley readership provides the largest values of coverage ( . %), density ( . ), and intensity ( . ), even higher than citations. as to other altmetric data, their presence is much lower than mendeley readers and citations. twitter mentions hold the second largest values among all other altmetric data, with . % of publications mentioned by twitter users and the mentioned publications accruing about . twitter mentions on average. they are followed by several social and mainstream media data, like facebook mentions, news mentions, and blogs citations. about . % of publications have been mentioned on facebook, . % have been mentioned by news outlets, and . % have been cited by blog posts. but among these three data sources, publications mentioned by news outlets accumulated more intensive attention, in consideration of their higher value of intensity ( . ), which means that mentioned publications got more news mentions on average. in contrast, even though more publications were mentioned on facebook, they received fewer mentions at the individual publication level (with an intensity value of . ). for the remaining altmetric data, the data coverage values are extremely low. wikipedia citations and policy document citations only covered . % and . % of the sample publications, respectively, while the coverage of reddit mentions, f1000prime recommendations, video comments, peer review comments, and q&a mentions is lower than %. in terms of these data, the altmetric data of publications are seriously zero-inflated. compared to the coverage reported by previous studies, an increasing trend of altmetric data presence can be observed as time goes by. mendeley, twitter, facebook, news, and blogs are the most studied altmetric data sources. on the whole, the more recent the studies, the higher the values of coverage they report. our results show some of the highest data presence values for most altmetric data. although the coverage of twitter mentions, news mentions, and reddit mentions reported by meschede and siebenlist is slightly higher than ours, it should be noted that they used a random sample of wos papers published in , and as shown in fig. , there exist biases toward publication years when investigating data presence for altmetrics. after calculating the three indicators for research outputs in each publication year, fig. shows the trends of the presence of altmetric data. overall, there are two types of tendencies for all altmetric data, which correspond with previously identified accumulation velocity patterns. thus, for altmetric data with a higher speed of data accumulation, such as twitter mentions, facebook mentions, news mentions, blogs citations, and reddit mentions, newly published publications have higher coverage levels. in contrast, altmetric data that take a longer time to accumulate (i.e., the "slow" sources) tend to accumulate more prominently for older publications. wikipedia citations, policy document citations, f1000prime recommendations, video comments, peer review comments, and q&a mentions fall into this "slower" category. as a matter of fact, their temporal distribution patterns more closely resemble that of citation counts. regarding mendeley readers, although they keep quite high coverage in every publication year, they show a downward trend similar to citations, indicating a kind of readership delay, by which newly published papers take time to accumulate mendeley readers (haustein et al.; thelwall; zahedi et al.). in general, publications in the fields of natural sciences and medical and health sciences receive more citations (marx and bornmann), but for altmetric data the distribution across subject fields shows another picture. as shown in fig. , on the basis of our dataset, it is confirmed that publications in the subject fields of bhs, pse, and les hold the highest presence of citation data, while publications in the fields of ssh and mcs accumulated markedly fewer citations. however, as observed by costas et al. for twitter mentions, facebook mentions, news mentions, blogs citations, and google+ mentions,
most altmetric data in fig. are more likely to concentrate on publications from the fields of bhs, ssh, and les, while pse publications lose the advantage in attracting attention that they show in terms of citations, thereby performing as weakly in altmetric data presence as mcs publications do. (fig. : the presence of altmetric data and citations of scientific publications across the five subject fields.) the peer review comments tracked by altmetric.com are an aggregation of two platforms: publons and pubpeer. in our dataset, there are , distinct publications with altmetric peer review data for the analysis of data presence across subject fields, of them (accounting for . %) having peer review comments from publons and , of them (accounting for . %) having peer review comments from pubpeer ( publications have been commented on by both). if we only consider the publications with publons data, bhs publications and les publications contribute the most (accounting for . % and . %, respectively), which is in line with ortega's results about publons on the whole. nevertheless, pubpeer data, which covers more of the publications recorded by altmetric.com, is biased towards ssh publications. ssh publications make up as much as . % of all publications with pubpeer data, followed by bhs publications (accounting for . %); this, together with the relatively small number of wos publications in the field of ssh, leads to the overall high coverage of peer review comments for ssh publications. moreover, given the fact that the distributions of altmetric data are highly skewed, with the majority of publications receiving only very few altmetric events (see appendix ), the density and intensity are very close across subject fields, particularly for altmetric data with relatively small data volumes. but in terms of intensity, there exist some remarkable subject field differences for some altmetric data. for example, on reddit, ssh publications received more intensive attention than other subject fields, in consideration of their higher value of intensity. by comparison, those les and pse publications cited by wikipedia pages accumulated more intensive attention; even though the coverage of wikipedia citations of pse publications is rather low, they are more repeatedly cited. due to the influence of the highly skewed distribution of altmetric data (see appendix ) on the calculation of coverage and density, these two indicators are strongly correlated at the micro-topic level for all kinds of altmetric data (see appendix ). in comparison, the correlation between coverage and intensity is rather weaker. moreover, in an explicit way, coverage tells how many publications around a micro-topic have been mentioned or cited at least once, and intensity describes how frequently those publications with altmetric data or citation data have been mentioned or cited. consequently, for a specific micro-topic, these two indicators can reflect the degree of broadness (coverage) and the degree of deepness (intensity) of its received attention. therefore, we employed coverage and intensity to investigate the presence of altmetric data at the micro-topic level and to identify research topics with higher levels of attention received on different data sources.
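as a concrete illustration of the correlation analysis described next, here is a small python sketch (assuming scipy is available; the per-micro-topic values are made up) that computes spearman correlations between two data sources, both over all micro-topics and after excluding the mutual zero-value micro-topics discussed below.

```python
import numpy as np
from scipy.stats import spearmanr

# hypothetical per-micro-topic coverage values for two data sources
twitter_cov = np.array([12.5, 0.0, 3.1, 40.2, 0.0, 7.8])
news_cov = np.array([4.0, 0.0, 1.2, 15.5, 0.6, 2.3])

rho, p = spearmanr(twitter_cov, news_cov)  # over all micro-topics

# complementary analysis: drop micro-topics where both sources are zero,
# since mutual zeros can inflate the coefficient
mask = ~((twitter_cov == 0) & (news_cov == 0))
rho_nz, p_nz = spearmanr(twitter_cov[mask], news_cov[mask])
print(round(rho, 2), round(rho_nz, 2))
```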
coverage and intensity values were calculated for the micro-topics based on the different types of altmetric and citation data, and then spearman correlation analyses were performed at the micro-topic level between each pair of data. figure illustrates the spearman correlations of coverage amongst citations and the types of altmetric data at the micro-topic level, as well as those of intensity. the higher the correlation coefficient, the more similar the presence patterns across micro-topics between two types of data. discrepancies in the correlations can be understood as differences in the relevance of every pair of data for micro-topics; some pairs of data with stronger correlations may have a more similar preference for the same micro-topics, while those with relatively weaker correlations focus on more dissimilar micro-topics. through the lens of data coverage, mendeley readers is the only altmetric indicator that is moderately correlated with citations at the micro-topic level, in line with previous conclusions about the moderate correlation between mendeley readership counts and citations at the publication level. in contrast, because of the different distribution patterns between citations and most altmetric data across subject fields that we found in fig. , it is not surprising that the correlations of coverage between citations and other altmetric data are relatively weak, suggesting that most altmetric data cover research topics different from those covered by citations. among altmetric data, twitter mentions, facebook mentions, news mentions, and blogs citations are strongly correlated with each other, indicating that these social media data cover similar research topics. most remaining altmetric data also present moderate correlations with the above social media data; however, q&a mentions, the only altmetric data source whose coverage is highest for publications in the field of mcs, is weakly correlated with the other altmetric data at the micro-topic level. nevertheless, from the perspective of intensity, most altmetric data show different attention levels towards research topics, because the intensity values of different data are generally only weakly or moderately correlated. twitter mentions and facebook mentions, and news mentions and blogs citations, are the two pairs of altmetric data showing the strongest correlations from both the coverage and the intensity perspectives, supporting the idea that these two pairs of altmetric data not only cover very similar research topics but also concentrate their attention on similar research topics. for a certain share of micro-topics, the publications have not been mentioned at all by some specific altmetric data source. in order to test the effect of these mutual zero-value micro-topics between each pair of data, the correlations were also computed excluding them (see appendix ). it is observed that, particularly for those pairs of altmetric data with low overall data presence across publications (e.g., q&a mentions and peer review comments, q&a mentions and policy document citations), the correlation coefficients are even lower when mutual zero-value micro-topics are excluded, although the overall correlation patterns across different data types at the micro-topic level are consistent with what we observed in fig. . on the basis of coverage and intensity, it is possible to compare the altmetric data presence across research topics and to further identify topics that received higher levels of attention. as shown in fig.
, groups of publications with similar research topics (micro-topics) can be classified into four categories according to the levels of coverage and intensity of the attention received. in this framework, hot research topics are those with a high coverage level of their publications which have, at the same time, also accumulated relatively high intensive average attention (i.e., their publications exhibit high coverage and high intensity values). differently, those research topics in which only a few publications have received relatively high intensive attention can be regarded as star-papers topics (i.e., low coverage and high intensity values), since the attention they attracted has not expanded to a large number of publications within the same research topic. thus, in star-papers topics the attention is mostly concentrated around a relatively reduced set of publications, namely, the star-papers with lots of attention accrued, while most of the other publications in the same research topic do not receive attention. following this line of reasoning, there are also research topics with a relatively large share of publications covered by a specific altmetric data source whose covered publications do not show a high average intensity of attention (i.e., high coverage and low intensity values); these are defined as popular research topics, with mile-wide and inch-deep attention accrued. finally, unpopular research topics are those with few publications covered by a specific altmetric data source and with a relatively small average amount of data accumulated by the covered publications (i.e., low coverage and low intensity values); these research topics have not attracted much attention, thereby arguably remaining in an altmetric unpopular status. it should be noted that, as time goes on and new altmetric activity is generated, the status of a research topic might switch across the above four categories. following the framework proposed in fig. , we took twitter mention data as an example to empirically identify hot research topics in different subject fields. a total of micro-topics with at least one twitter mention in fig. were plotted into a two-dimensional system according to the levels of coverage and intensity they achieved (fig. a). micro-topics were first ranked by their coverage and by their intensity, respectively. the higher the ranking a micro-topic achieves, the higher the level of its coverage or intensity. the size of the micro-topics is determined by their total number of publications. in order to identify representative hot research topics on twitter, we selected the top % as the criterion for both the coverage and the intensity levels (the two dashed lines in fig. a) to partition the micro-topics into four parts, in correspondence with fig. . as a result, micro-topics with higher levels of coverage and intensity are classified as hot research topics that received broader and more intensive attention from twitter users (located at the upper right corner of fig. a). because publications in the fields of ssh, bhs, and les have much higher coverage and intensity of twitter data, micro-topics from these three subject fields are more likely to be distributed at the upper right part. in contrast, micro-topics in pse and mcs concentrate at the lower left part. in consideration of the biased presence of twitter data across the five main subject fields, we plotted the micro-topics in each subject field by the same method as in fig.
a, respectively, and then zoomed in and only presented the hot-research-topics part for each subject field in fig. b-f. for clear visualization, one of the terms extracted by the cwts classification system was used as the label for each micro-topic. in the field of ssh, there are micro-topics considered, and ( %) of them rank in the top % from both the coverage and the intensity perspectives (fig. b). in this subject field, hot research topics tend to be about social issues, including topics related to gender and sex (e.g., "sexual orientation", "gender role conflict", "sexual harassment", etc.), education (e.g., "teacher quality", "education", "undergraduate research experience", etc.), climate ("global warming"), as well as psychological problems (e.g., "stereotype threat", "internet addiction", "stress reduction", etc.). bhs is the biggest field, with both the most research outputs and the most twitter mentions, so there are micro-topics considered, and ( %) of them were detected as hot research topics in fig. c. research topics about day-to-day health maintenance (e.g., "injury prevention", "low carbohydrate diet", "longevity", etc.), worldwide infectious diseases (e.g., "zika virus infection", "ebola virus", "influenza", etc.), lifestyle diseases (e.g., "obesity", "chronic neck pain", etc.), and emerging biomedical technologies (e.g., "genome editing", "telemedicine", "mobile health", etc.) received more attention on twitter. moreover, problems and revolutions in the medical system caused by social developments such as "brexit" and "public involvement" are also brought into focus. in the field of pse, ( %) out of micro-topics were identified as hot research topics in fig. d. in this field with fewer accumulated twitter mentions, although most research topics are overlooked by twitter users, those about the universe and astronomy (e.g., "gravitational wave", "exoplanet", "sunspot", etc.) and quantum physics (e.g., "quantum walk", "quantum game", "quantum gravity", etc.) received relatively higher levels of attention. in addition, there are also some hot research topics standing out from the complexity sciences, such as "scale free network", "complex system", and "fluctuation theorem". in the field of les, there are micro-topics in total, and fig. e shows the ( %) hot research topics in this field. these hot research topics are mainly about animals (e.g., "dinosauria", "shark", "dolphin", etc.) and natural environment problems (e.g., "extinction risk", "wildlife trade", "marine debris", etc.). finally, as the smallest subject field, mcs has ( %) out of micro-topics identified as hot research topics (fig. f), which are mainly about emerging information technologies (e.g., "big data", "virtual reality", "carsharing") and robotics (e.g., "biped robot", "uncanny valley", etc.). to reflect the differences in hot research topics through the lens of different altmetric data sources, policy document citation data was selected as another example. figure shows the overall distribution of micro-topics with at least one policy document citation and the identified hot research topics in the five main subject fields, following the same methodology as the twitter-based analysis in fig. . however, due to the smaller data volume of policy document citations, there are micro-topics sharing the same intensity value. in this case, the total number of policy document citations of each micro-topic was introduced as a benchmark to make distinctions, as elaborated right after the following sketch.
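the selection rule just described, ranking in the top share of both coverage and intensity (operationalized later in the paper as the first decile) with intensity ties broken by total event counts, can be sketched in python with pandas; the table contents and the loose threshold used in the demo call are purely illustrative.

```python
import pandas as pd

# hypothetical micro-topic table; names and values are made up
topics = pd.DataFrame({
    "topic":        ["t1", "t2", "t3", "t4", "t5"],
    "coverage":     [62.0, 55.0, 48.0, 9.0, 55.0],
    "intensity":    [5.1, 2.0, 6.3, 1.2, 2.0],
    "total_events": [900, 300, 750, 20, 280],
})

def flag_hot(topics, top_share=0.10):
    """flag micro-topics ranking in the top share of both coverage and intensity."""
    t = topics.copy()
    t["cov_rank"] = t["coverage"].rank(ascending=False, method="min")
    # rank intensity with ties broken by the total number of events
    order = t.sort_values(["intensity", "total_events"], ascending=False)
    t.loc[order.index, "int_rank"] = list(range(1, len(t) + 1))
    cutoff = top_share * len(t)
    t["hot"] = (t["cov_rank"] <= cutoff) & (t["int_rank"] <= cutoff)
    return t

# a loose 40% threshold just so the five-row toy table yields a hit
print(flag_hot(topics, top_share=0.4))
```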
for micro-topics with the same intensity, the higher the total number of policy document citations accrued, the higher the level of attention in the dimension of intensity. after this, if micro-topics still share the same ranking, they are tied for the same place, with the next equivalent rankings skipped. in general, these parallel rankings of micro-topics with relatively low levels of attention do not affect the identification of hot research topics. through the lens of policy document citations, the identified hot research topics differ to some extent from those in the eyes of twitter users. in the field of ssh, ( %) out of micro-topics were classified as hot research topics (fig. b). these research topics mainly focus on industry and finance (e.g., "microfinance", "tax compliance", "intra industry trade", etc.), as well as children and education (e.g., "child care", "child labor", "teacher quality", etc.). besides, "gender wage gap" is also a remarkable research topic appearing in policy documents. in the field of bhs, there are micro-topics that have been cited by policy documents at least once, and ( %) of them were classified as hot research topics (fig. c). worldwide infectious diseases are a typical concern of policy-makers; consequently, there is no doubt that they were identified as hot research topics, such as "sars", "ebola virus", "zika virus infection", and "hepatitis c virus genotype". in addition, healthcare (e.g., "health insurance", "nursing home resident", "newborn care", etc.), social issues (e.g., "suicide", "teenage pregnancy", "food insecurity", "adolescent smoking", etc.), and potentially health-threatening environmental problems (e.g., "ambient air pollution", "environmental tobacco smoke", "climate change", etc.) drew high levels of attention from policy-makers too. different from the focus of twitter users on astronomy, in the field of pse (fig. d), the ( %) hot research topics out of micro-topics that concern policy-makers are mainly around energy and resources, like "energy saving", "wind energy", "hydrogen production", "shale gas reservoir", "mineral oil", and "recycled aggregate". in the field of les, fig. e shows the ( %) hot research topics identified out of micro-topics. from the perspective of policy documents, environmental protection (e.g., "marine debris", "forest management", "sanitation", etc.) and sustainable development (e.g., "selective logging", "human activity", "agrobiodiversity", etc.) are hot research topics. finally, in the field of mcs (fig. f), publications are hardly cited by policy documents, so there are only ( %) topics out of micro-topics identified as hot research topics. in this field, policy-makers paid more attention to information security ("differential privacy", "sensitive question") and traffic economics ("road pricing", "carsharing"). data presence is essential for the application of altmetrics in research evaluation and other potential areas. the heterogeneity of altmetrics makes it difficult to establish a common conceptual framework and to draw unified conclusions (haustein), so in most cases it is necessary to analyze the performance of each altmetric data source separately. this paper investigated types of altmetric data based on a large-scale and up-to-date dataset; the results show that the various altmetric data vary greatly in their presence for wos publications. the data presence of several altmetric data sources has been widely discussed and explored in previous studies.
there are also some reviews summarizing the previous observations of the coverage of altmetric data (erdt et al.; ortega). generally speaking, our results confirmed the overall picture of data presence reported in those studies. for instance, mendeley readership keeps showing a very high data coverage across scientific publications and provides the most metrics among all altmetric data, followed by twitter mentions and facebook mentions. however, there exist huge gaps among these altmetric data. regarding data coverage, . % of the sample publications have attracted at least one mendeley reader, while for twitter mentions and facebook mentions the value is only . % and . %, respectively. moreover, for those altmetric data which have hardly been surveyed before on the same kind of wos dataset, like reddit mentions, f1000prime recommendations, video comments, peer review comments, and q&a mentions, the data coverage is substantially lower than %, showing an extremely weak data presence across research outputs. compared with the observations of altmetric data coverage reported in earlier altmetric studies, it can be concluded that the presence of altmetric data is clearly increasing, and our results are generally higher than those of previous studies using the same types of datasets. there are two possible reasons for the increasing presence of altmetric data across publications. one is the progress made by altmetric data aggregators (particularly altmetric.com) in improving their publication detection techniques and enlarging the set of tracked data sources. for example, altmetric.com redeveloped their news tracking system in december , which partially explains the rise of news coverage in (see fig. ). the second reason for the increasing presence of some altmetric data is the rising uptake of social media by the public, researchers, and scholarly journals (nugroho et al.; van noorden; zheng et al.). against this background, scientific publications are more likely to be disseminated on social media, thereby stimulating the accumulation of altmetric data. the fact that more publications accrue detectable altmetric data helps consolidate the data foundation, thus promoting the development and possible application of altmetrics. at the same time, we emphasized the biases of altmetric data towards different publication years. costas et al. highlighted the "recent bias" they found in overall altmetric scores, which refers to the dominance of the most recently published papers in garnering altmetric data. nevertheless, we found that the "recent bias" is not exhibited by all types of altmetric data. for altmetric data with a relatively high speed of data accumulation after publication, like twitter mentions, facebook mentions, news mentions, blogs citations, and reddit mentions, it is demonstrated that their temporal distribution conforms to a "recent bias". however, a "past bias" is found for altmetric data that take a relatively longer time to accumulate, such as wikipedia citations, policy document citations, f1000prime recommendations, video comments, peer review comments, and q&a mentions. due to the slower pace of these altmetric events, they are more concentrated on relatively old publications. even for mendeley readers, the data presence across recent publications is clearly lower.
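the recent-bias versus past-bias contrast can be checked with a simple per-year coverage breakdown; below is an illustrative python sketch with a made-up publication-level table (column names and values are hypothetical).

```python
import pandas as pd

# hypothetical rows: one publication each, with event counts per source
pubs = pd.DataFrame({
    "year":   [2012, 2012, 2015, 2018, 2018, 2018],
    "tweets": [0, 1, 3, 5, 2, 2],   # fast-accumulating source
    "wiki":   [1, 2, 0, 0, 0, 1],   # slow-accumulating source
})

# coverage (% of papers with >= 1 event) by publication year and source
by_year = pubs.groupby("year").agg(
    tweet_cov=("tweets", lambda s: 100 * (s > 0).mean()),
    wiki_cov=("wiki", lambda s: 100 * (s > 0).mean()),
)
# a curve rising toward recent years suggests a "recent bias";
# a falling curve suggests a "past bias"
print(by_year)
```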
overall, although an upward tendency of data presence has been observed over time, most altmetric data still have an extremely low data presence, with the only exceptions being mendeley readers and twitter mentions. as suggested by thelwall et al., until now these altmetric data may only be applicable to identifying the occasional exceptional or above-average articles, rather than serving as universal sources of impact evidence. in addition, the distinct presence levels of the different altmetric data reinforce the necessity of keeping altmetrics separate in future analyses or research assessments. with the information on subject fields and micro-topics assigned by the cwts publication-level classification system, we further compared the presence of the types of altmetric data across subject fields of science and their inclinations towards different research topics. most altmetric data have a stronger focus on publications in the fields of ssh, bhs, and les. in contrast, altmetric data presence in the fields of pse and mcs is generally lower. this kind of data distribution differs from what has been observed based on citations, in which ssh is underrepresented while pse stands out as a subject field with higher levels of citations. this finding supports the idea that altmetrics might have more added value for the social sciences and humanities when citations are absent. in this study, it is demonstrated that even within the same subject field, altmetric data show different levels of data presence across research topics. amongst altmetric data, the correlations at the research topic level are similar to the correlations at the publication level (zahedi et al.), with mendeley readers being the only altmetric data moderately correlated with citations, and twitter mentions and facebook mentions, and news mentions and blogs citations, being the two pairs showing the strongest correlations. there might exist some underlying connections within these two pairs of strongly correlated altmetric data, such as possible synchronous updating by users who utilize multiple platforms to share science information, which can be further investigated in future research. for the remaining altmetric data, although many of them achieved moderate to strong correlations with each other in terms of coverage, because they have similar patterns of data coverage across subject fields, the correlations of data intensity are weaker, implying that research topics garnered different levels of attention across altmetric data (robinson-garcia et al.). in view of the uneven distribution of specific altmetric data across research topics, it is possible to identify hot research topics which received higher levels of attention from certain communities such as twitter users and policy-makers. based on the two indicators for measuring data presence, coverage and intensity, we developed a framework for identifying hot research topics, operationalized as micro-topics that fall in the first decile of the ranking distributions of both coverage and intensity. this means that hot research topics are those in which large shares of the publications receive intensive average attention. we have demonstrated the application of this approach in detecting hot research topics mentioned on twitter and cited in policy documents. since the subject field differences are so pronounced that they might hamper generalization (mund and neuhäusler), the identification of hot research topics was conducted for each subject field separately.
hot research topics on twitter reflect the interest shown by twitter users, while those in policy documents mirror policy-makers' focus on science, and these two groups of identified hot research topics are diverse and hardly overlap. this result shows that different communities keep an eye on different scholarly topics, driven by dissimilar motivations. the methodology for identifying hot research topics sheds light on an innovative application of altmetric data in tracking research trends with particular levels of social attention. by taking advantage of the clustered publication sets (i.e., micro-topics) algorithmically generated by the cwts classification system, the proposed methodology measures how wide and how intense the altmetric attention to the research outputs of specific research topics is. this approach provides a new option for monitoring the focus of attention on science, thus representing an important difference with prior studies on the application of altmetric data in identifying topics of interest, which were mostly based on co-occurrence networks of topics with specific altmetric data accrued (haunschild et al.; robinson-garcia et al.). the proposed methodology employs a two-dimensional framework to classify research topics into four main categories according to the levels of the specific altmetric attention they received. as such, the framework represents a simplified approach to studying and characterizing the different types of attention received by individual research topics. in our proposal for the identification of hot research topics, the influence of individual publications with extremely intensive attention is to some extent diminished, basing the assessment of the whole topic on the overall attention to the publications around the topic, although topics characterized by singular publications with high levels of attention are of course also considered, as "star-papers topics". it should be acknowledged that the results of this approach give an overview of the attention situation of generalized research topics; however, to get a more detailed picture of specific micro-level research fields, other complementary methods based on the detailed text information of the publications should be employed to go deeper into micro-topics. moreover, in this study the identification of hot research topics is based on the whole dataset; in future studies, by introducing the publication time of research outputs and the release time of altmetric events as factors, it would be possible to monitor hot research topics in real time in order to reflect the dynamics of social attention on science. there are some limitations to this study. first, the dataset of publications is restricted to publications with dois or pubmed ids. the strong reliance on these identifiers is also seen as one of the challenges of altmetrics (haustein). second, although all types of documents are included in the overall analysis of data presence, only articles, reviews, and letters are assigned main subject fields of science and micro-topics by the cwts publication-level classification system, so only these three document types are considered in the analysis of data presence across subject fields and research topics. but since these three types account for . % of the sample publications (see appendix ), they can be used to reveal relatively general patterns.
lastly, the cwts classification system is a coarse-grained system of disciplines, in the sense that some distinct fields are clustered into an integral whole, like social sciences and humanities, making it difficult to present more fine-grained results. but the advantages of this system lie in that it solves the problem caused by multi-disciplinary journals, and that individual publications with similar research topics are clustered into micro-level fields, namely micro-topics, providing us with the possibility of comparing the distribution of altmetric data at the research topic level and of identifying hot research topics based on data presence. this study investigated the state-of-the-art presence of types of altmetric data for nearly . million web of science publications across subject fields and research topics. except for mendeley readers and twitter mentions, the presence of most altmetric data is still very low, even though it is increasing over time. altmetric data with a high speed of data accumulation are biased towards newly published papers, while those with a lower speed are biased towards relatively old publications. the majority of altmetric data concentrate on publications from the fields of biomedical and health sciences, social sciences and humanities, and life and earth sciences. these findings underline the importance of applying different altmetric data with suitable time windows and fields of science considered. within a specific subject field, altmetric data show different preferences for research topics; thus, research topics attracted different levels of attention across altmetric data sources, making it possible to identify hot research topics with higher levels of attention received in different altmetric contexts. based on the data presence at the research topic level, a framework for identifying hot research topics with specific altmetric data was developed and applied, shedding light onto the potential of altmetric data in tracking research trends with a particular social attention focus.
references
geographic variation in social media metrics: an analysis of latin american journal articles
do altmetrics point to the broader impact of research? an overview of benefits and disadvantages of altmetrics
usefulness of altmetrics for measuring the broader impact of research: a case study using data from plos and f1000prime
alternative metrics in scientometrics: a meta-analysis of research into three altmetrics
what do altmetrics counts mean? a plea for content analyses
measuring field-normalized impact of papers on specific societal groups: an altmetrics study based on mendeley data
citespace ii: detecting and visualizing emerging trends and transient patterns in scientific literature
do "altmetrics" correlate with citations? extensive comparison of altmetric indicators with citations from a multidisciplinary perspective
altmetrics: finding meaningful needles in the data haystack
testing for universality of mendeley readership distributions
the relationship between tweets, citations, and article views for plos one articles
co-saved, co-tweeted, and co-cited networks
dynamic topic detection and tracking: a comparison of hdp, c-word, and cocitation methods
altmetrics: an analysis of the state-of-the-art in measuring research impact on social media
studying the accumulation velocity of altmetric data tracked by altmetric.com
the stability of twitter metrics: a study on unavailable twitter mentions of scientific publications
what can article-level metrics do for you?
research characteristics and status on social media in china: a bibliometric and co-word analysis
a new methodological approach to bibliographic coupling and its application to the national, regional and institutional level
using 'core documents' for detecting and labelling new emerging topics
using altmetrics for assessing research impact in the humanities
how many scientific papers are mentioned in policy-related documents? an empirical investigation using web of science and altmetric data
does the public discuss other topics on climate change than researchers? a comparison of explorative networks based on author keywords and hashtags
grand challenges in altmetrics: heterogeneity, data quality and dependencies
interpreting 'altmetrics': viewing acts on social media through the lens of citation and social theories
characterizing social media metrics of scholarly papers: the effect of document properties and collaboration patterns
tweets vs. mendeley readers: how do these two social media metrics differ? it - information technology
how to identify emerging research fields using scientometrics: an example in the field of information security
on the causes of subject-specific citation rates in web of science
cross-metric compatibility and inconsistencies of altmetrics
who reads research articles? an altmetrics analysis of mendeley user categories
towards an early-stage identification of emerging topics in science - the usability of bibliometric characteristics
measuring societal impact is as complex as abc
a survey of recent methods on deriving topics from twitter: algorithm to evaluation. knowledge and information systems
exploratory analysis of publons metrics and their relationship with bibliometric and altmetric impact
altmetrics data providers: a meta-analysis review of the coverage of metrics and publication
the altmetrics collection
altmetrics: a manifesto
mapping social media attention in microbiology: identifying main topics and actors
new data, new possibilities: exploring the insides of altmetric
the skewness of science
detecting emerging research fronts based on topological measures in citation networks of scientific publications
tracking and predicting growth areas in science
identifying emerging topics in science and technology
"attention is not impact" and other challenges for altmetrics
scholarly use of social media and altmetrics: a review of the literature
are mendeley reader counts high enough for research evaluations when articles are published?
do altmetrics work? twitter and ten other social web services
a comparison of methods for detecting hot topics
online collaboration: scientists and the social network
f1000 recommendations as a potential new data source for research evaluation: a comparison with citations
a new methodology for constructing a publication-level classification system of science
detecting and tracking the real-time hot topics: a study on computational neuroscience
usage patterns of scholarly articles on web of science: a study on web of science usage count
tracing scientist's research trends realtimely
users, narcissism and control - tracking the impact of scholarly publications in the 21st century
social media metrics for new research evaluation
how well developed are altmetrics? a cross-disciplinary analysis of the presence of 'alternative metrics' in scientific publications
mendeley readership as a filtering tool to identify highly cited publications
on the relationships between bibliographic characteristics of scientific documents and citation and mendeley readership counts: a large-scale analysis of web of science publications
exploring topics of interest of mendeley users
social media presence of scholarly journals
acknowledgements zhichao fang is financially supported by the china scholarship council ( ). rodrigo costas is partially funded by the south african dst-nrf centre of excellence in scientometrics and science, technology and innovation policy (scistip). xianwen wang is supported by the national natural science foundation of china ( and ). the authors thank altmetric.com for providing the altmetric data of publications, and also thank the two anonymous reviewers for their valuable comments. the , , sample wos publications were matched with their document types through the cwts in-house database. table presents the number of publications and the coverage of altmetric data for each type. the types article, review, and letter, which are included in the cwts classification system, account for about . % in total. the altmetric data coverage varies across document types, as observed by zahedi et al. for most altmetric data, reviews show the highest altmetric data coverage, followed by articles, editorial material, and letters. it is reported that the distributions of citation counts (seglen), usage counts, and twitter mentions are highly skewed. results in fig. show that the same situation happens to the other altmetric data as well. spearman correlation analyses among the coverage, density, and intensity of micro-topics were conducted for each altmetric data source and for citations, and the results are shown in fig. . because of the highly skewed distribution of all kinds of altmetric data, the calculations of coverage and density are prone to give similar results, especially for altmetric data with smaller data volumes. therefore, the correlation between coverage and density is quite strong for every altmetric data source. for most altmetric data, density and intensity are moderately or strongly correlated, and their correlations are always slightly stronger than the one between coverage and intensity. in consideration of the influence of the zero values of some micro-topics on inflating the spearman correlation coefficients, we did a complementary analysis by calculating the spearman correlations for each pair of data after excluding those mutual micro-topics with zero values (fig. ). compared to the results shown in fig. , the values in fig. are clearly lower, especially for those pairs of altmetric data with relatively low data presence. however, the overall patterns are still consistent with what we observed in fig. .
key: cord- -cw ls authors: ji, xiang; chun, soon ae; geller, james title: knowledge-based tweet classification for disease sentiment monitoring date: - - journal: sentiment analysis and ontology engineering doi: . / - - - - _ sha: doc_id: cord_uid: cw ls
disease monitoring and tracking is of tremendous value, not only for containing the spread of contagious diseases but also for avoiding unnecessary public concerns and even panic. in this chapter, we present a near real-time sentiment analysis service of public health-related tweets.
traditionally, it is impossible for humans to effectively measure the degree of public health concern due to limited resources and significant time delays. to solve this problem, we have developed a computational intelligence approach, the epidemic sentiment monitoring system (esmos), to automatically analyze disease sentiments and gauge the measure of concern (moc) expressed by twitter users. more specifically, we present a knowledge-based approach that employs a disease ontology to detect the outbreak of diseases, together with a sentiment classifier to analyze the linguistic expressions that convey subjective expressions and the sentiment polarity of emotions, feelings, opinions, personal attitudes, etc. the two-step sentiment classification method utilizes the subjective vocabulary corpus (mpqa), the sentiment strength corpus (afinn), as well as the emoticons and profanity words that are often used in social media postings. it first automatically classifies the tweets into personal and non-personal classes, eliminating many tweets, such as non-personal "retweets" of news articles, from further consideration. in the second stage, the personal tweets are classified into negative and non-negative sentiments. in addition, we present a model to quantify the public's measure of concern (moc) about a disease, based on the sentiment classification results. the trends of the public moc are visualized on a timeline. correlation analyses between the moc timeline and disease-related sentiment category timelines show that the peaks of the moc are weakly correlated with the peaks of the news timeline, without any appreciable time delay or lead. our sentiment analysis method and the moc trend analyses can be generalized to other topical domains, such as mental health monitoring and crisis management. we present the esmos prototype for public health-related disease monitoring, for public concern trending, and for mapping analyses. ginsberg et al. [ ] used search engine logs, in which users submitted queries in reference to issues that they were concerned about, to approach the disease monitoring problem. their thread of research led to the realization that an aggregation of large numbers of queries might show patterns that are useful for the early detection of epidemics. however, comprehensive access to search engine logs is limited to the search engine providers. twitter, a popular social network site, has more than million users, out of which more than million are active users [ ]. this shows twitter's potential to address the limitations of traditional public health surveillance methods and of search keyword logs. a percentage of twitter messages is publicly available, and researchers are able to retrieve the tweets as well as related information through the twitter api [ ]. we have developed a method to gauge the measure of concern (moc) expressed by twitter users for public health specialists and government decision makers [ ]. more specifically, we developed a two-step sentiment classification approach. firstly, personal tweets are distinguished from news tweets. news tweets are considered non-personal, as opposed to the personal tweets posted by individual twitter users. in the second stage, the personal tweets are further classified into personal negative tweets or personal non-negative tweets.
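a minimal python sketch of this two-step scheme is shown below; the function and classifier names are hypothetical placeholders, not the authors' actual implementation, and the classifiers are assumed to follow a scikit-learn-style predict interface and to be trained elsewhere.

```python
# illustrative two-step tweet classification; classifier objects are assumed
# to expose a scikit-learn-style predict() and to be trained elsewhere
def classify_tweet(text, personal_clf, sentiment_clf):
    # step 1: personal vs. non-personal (e.g., retweeted news articles)
    if personal_clf.predict([text])[0] != "personal":
        return "non-personal"
    # step 2: only personal tweets are judged for sentiment
    if sentiment_clf.predict([text])[0] == "negative":
        return "personal-negative"   # these counts feed the measure of concern
    return "personal-non-negative"
```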
the two-step sentiment classification problem addressed in this chapter is different from the traditional twitter sentiment classification problem, which classified tweets into positive/negative or positive/neutral/negative tweets [ ] without first distinguishing personal from non-personal tweets. although news tweets may also express concerns about a certain disease, they tend not to reflect the direct emotional impact of that disease on people. the sentiment classification method presented in this chapter is able to identify personal tweets and news (non-personal) tweets in the first place. more importantly, using the sentiment classification results, we quantified the measure of concern (moc) as the number of personal negative tweets per day. the moc increases with the relative growth of personal negative tweets and with the absolute growth of personal negative tweets. previous research [ , ] visually noticed that sentiment surges co-occurred with health events on a timeline. different from the previous work, which was based on visual observation, we correlated the peaks of the moc timeline (i.e., its change over time) with the peaks of the news timeline, and also the peaks of the non-negative timeline with the peaks of the news timeline, using the jaccard coefficient [ ]. government officials can use the moc to track public health concerns on the timeline to help make timely decisions, disprove rumors, intervene, and prevent unnecessary social crises at the earliest possible stage. more importantly, public health concern monitoring using social network data is faster and cheaper than the traditional method.
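as a sketch of this peak correlation, the jaccard coefficient of two peak-day sets can be computed as follows; the peak dates shown are made up for illustration.

```python
# jaccard coefficient between the peak days of two timelines:
# |intersection| / |union| of the two sets of peak dates
def jaccard(peaks_a, peaks_b):
    a, b = set(peaks_a), set(peaks_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

moc_peaks = {"2014-08-01", "2014-08-05", "2014-08-12"}   # hypothetical moc peak days
news_peaks = {"2014-08-01", "2014-08-06", "2014-08-12"}  # hypothetical news peak days
print(jaccard(moc_peaks, news_peaks))  # 0.5
```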
unlike many expert systems, cognitive systems not only match and fire pre-defined rules with anticipated possible actions, but can also be trained using ai and machine learning methods to sense, integrate, analyze, predict, and reason as human brains do for various tasks. esmos utilizes both a linguistic knowledge base and automated machine learning methods to exhibit a degree of computational intelligence. it also utilizes a disease ontology to identify the disease names used to collect and monitor tweet data. in addition, the machine learning algorithm used for sentiment classification relies on automated labeling of the training data set using the linguistic knowledge base, instead of having human-labeled, supervised training data. it distinguishes personal tweets from non-personal tweets such as news articles. in the second step, it automatically labels the positive or negative sentiment tweets to generate a training dataset, avoiding another human labeling step. the training data sets are used to build the sentiment analysis model. the system thus learns classifiers from lexical knowledge without manual labeling. esmos contains , inherited, developmental, and acquired human disease names from the disease ontology (do), an open-source medical ontology developed by schriml et al. [ ]. a partial structure of the ontology used by esmos is shown in fig. . the extracted disease vocabulary is used to supply the keywords to monitor disease tweets for signs of epidemics. the current prototype system monitors a handful of infectious diseases; however, all do disease terms may be used, given sufficient computational resources. the core component uses the twitter streaming api for collecting epidemics-related real-time tweets. the advantage of using the disease ontology is that it is linked to medical terminology codes in other ontologies with a set of synonyms. for instance, the ebola virus is known under different labels: it has synonyms and cross-references in three of the most important medical terminology collections, nci, snomed ct, and umls. the use of other collections of medical terms (e.g., wikipedia's list of icd- codes: infectious and parasitic diseases [ ]) may further support this purpose. the major issue with the use of formal ontologies of medical terms is that laymen's disease terms (e.g., ebola) may have to be matched with the scientific terms used in medical terminologies (e.g., ebola hemorrhagic fever) [ ]. this requires fuzzy matching and similarity matching, which are not addressed in this chapter. understanding human emotions (also called sentiment analysis) is a difficult task for a machine, for which the computational intelligence approach may provide better results. often, linguistic expressions as well as paralinguistic features in spoken languages (e.g., pitch, loudness, tempo, etc.) reveal the sentiments or emotional states of individuals. prior research studies have developed sentiment lexicons using a dictionary approach and a corpus approach [ ]. the mpqa (multi-perspective question answering) lexicon [ ] was constructed through human annotations of a corpus of , sentences in documents that contain english-based news from different sources in a variety of countries, dated from june to may . the annotated lexicon represents opinions and other private states, such as beliefs, emotions, sentiments, speculations, etc.
the subjective and objective expressions were also annotated with values of intensity and polarity to indicate the degree of subjectivity and a negative, positive, or neutral sentiment. afinn is another affective lexicon, with a list of english words with sentiment valence ratings between minus five (negative) and plus five (positive). these words were manually labeled and evaluated in - [ ]. most of the positive words were labeled with + and most of the negative words with − , and strongly obscene words were rated with either − or − . afinn includes words, with negative words ( %), positive words, and one neutral word. another lexical resource is sentiwordnet [ ], which associates each synset of wordnet with three numerical scores, obj(s), pos(s), and neg(s). the numerical scores describe the polarity of the terms in the synset, i.e., how objective, positive, or negative the terms contained in the synset are. in this chapter, we applied the mpqa lexicon to distinguish personal from non-personal tweets (such as news article tweets), since the subjectivity of the lexicon helps to distinguish personal expressions from non-personal ones. we used the valence ratings in the afinn lexicon to further distinguish personal negative from personal non-negative tweets. our goal is to develop an automated sentiment analysis system for the social health tweets generated by the general public. this system can provide public health officials with the capability to monitor the viral effects in social media communication, so that they can take early actions to prevent unnecessary panic regarding public health-related diseases. since the s, with the tremendous amount of user-generated data from various data sources, such as blogs, review sites, news articles, and micro-blogs, becoming available, researchers have become interested in mining high-level sentiments from the user data. pang and lee [ ] reviewed the sentiment analysis work. this thread of research, depending on the analysis target, can be classified into one of these levels: document-level [ ], blog-level [ ], sentence-level [ ], tweet-level [ , , ] with the sub-category non-english tweet level [ ], and tweet-entity-level [ ]. since , extensive research has been carried out on the topic of twitter sentiment classification [ , , - ]. most of this thread of research used machine learning-based approaches, such as naïve bayes, multinomial naïve bayes, and support vector machines. the naïve bayes classifier is a derivative of the bayes decision rule [ ], and it assumes that all features are independent of each other. good performance of naïve bayes (nb) was reported in several sentiment analysis papers [ , , ]. multinomial naïve bayes (mnb) is a model that works well on sentiment classification [ , , ]; mnb takes into account the number of occurrences and the relative frequency of each word. the support vector machine [ ] is also a popular ml-based classification method that works well on tweets [ , ]; in natural language processing, svm with a polynomial kernel is more popular [ ]. there are two drawbacks of the previous sentiment classification work. firstly, twitter messages were classified into either positive/negative or positive/negative/neutral classes under the assumption that all twitter messages express one's opinion. however, this assumption does not hold in many situations, especially when the tweets are about epidemics or, more broadly, about crises.
in these situations, as we found when we randomly sampled tweets, many of the sampled tweets (up to %) are repetitions of the news without any personal opinion. since they are not explicitly labeled with re-tweet symbols, it is not easy for a stop-word-based pre-processing filter to detect them. we attempt to solve a different problem: how to classify tweets into three categories, namely personal negative tweets, personal non-negative tweets, and news tweets (tweets that are non-personal). recently, some researchers identified irrelevant tweets. brynielsson, johansson, jonsson, and westling [ ] used manual labeling to classify tweets into "angry," "fear," "positive," or "other" (irrelevant). salathe and khandelwal [ ] also identified irrelevant tweets together with sentiment classifications. without considering irrelevant tweets, they calculated the h n vaccine sentiment score from the relative difference of positive and negative messages. in our two-step classification method, we can automatically extract news tweets and perform the sentiment analysis. the results of sentiment classification are used for computing the correlation between sentiments and news trends. in this way, the goals of sentiment classification and measuring the public concern can be achieved in an integrated framework. secondly, although sophisticated models were developed by the above research, the results of the sentiment classification were not utilized to provide insights. we provide the measure of concern timeline trends as a useful sentiment monitoring tool for public health specialists and government decision makers. sentiment quantification is a method to process unstructured text and generate a numerical value, or a timeline of numerical values, to gain insights into the sentiment trends. zhuang et al. [ ] generated a quantification of sentiments about movie elements, such as special effects, plot, dialogue, etc. their quantification contains a positive score and a negative score towards a specific movie element. for tweet-level sentiment quantification on a timeline, chew and eysenbach [ ] used a statistical approach to computing the relative proportion of all tweets expressing concerns about h n and visualized the temporal trend of positive/negative sentiments based on their proportion. similar research was done by o'connor et al. [ ], who calculated a daily (positive and negative) sentiment score by counting the positive and negative words of each tweet appearing in the subjectivity lexicon of opinionfinder [ ]. sha et al. [ ] found that the sentiment fluctuations on sina weibo were associated with either new policy announcements or government actions. the drawbacks of the existing twitter sentiment quantification research are twofold. firstly, the lexicon-based sentiment extraction models have limited coverage of words. as pointed out by wiebe and riloff [ ], identifying positive or negative tweets by counting words in a dictionary or lexicon usually has high precision but low recall. in the case of twitter sentiment analysis, the performance suffers more, since the lexicon or dictionary does not contain the slang words that are common in social media. for example, lmao (laughing my a** off) is a positive "word" in twitter, but it does not match any word in mpqa [ ], which is a popular sentiment dictionary. in this study, we consider these profanity and slang words as well as the emoticons.
secondly, the existing sentiment quantification work [ , ] has shown the correlation between sentiments and real-world events (e.g., news) through observing their co-occurrence on a timeline, but has not provided a comprehensive, quantitative correlation between the sentiment timeline trend and the news timeline trend. to the best of our knowledge, there is no prior work that both quantitatively and qualitatively studies these correlations between twitter sentiment and the news in twitter to identify concerns caused by diseases and crises. in the next two sections, we first describe the machine learning approach to automatically labeling the training datasets to generate the sentiment classifiers, and then the quantitative model for computing the measure of concern and its trend line analysis to detect any correlations with news events in the traditional broadcast media. in this section, we present our two-step sentiment classification method. as discussed earlier, our approach to sentiment classification is different from the classic sentiment classification of tweets. the first step of our method involves training a twitter sentiment classifier for distinguishing personal tweets from news (non-personal) tweets. the second step builds the sentiment classifier using only the personal tweets to identify the negative versus non-negative tweets. the formal definitions of personal tweet, news tweet, personal negative tweet, and personal non-negative tweet are given below.

definition (personal tweet): a personal tweet is a tweet that conveys its author's private states [ , ]. a private state can be a sentiment, opinion, speculation, emotion, or evaluation, and it cannot be verified by objective observation. in addition, if a tweet talks about a fact observed by the twitter user, it is also defined as a personal tweet. the goal of this definition is to distinguish the tweets written by twitter users from scratch from the news tweets that are re-tweeted in the twittersphere. example (personal tweet): "since when does a commercial aircraft accident become a matter of national security interests? #diegogarcia#mh "

definition (news tweet): a news tweet (denoted as nt) is a tweet that is not a personal tweet. a news tweet states an objective fact. example (news tweet): "#update cyanide levels x standard limits detected in water close to the site of explosions in china's tianjin http://u.afp.com/z ab"

definition (personal negative and personal non-negative tweets): a tweet is a personal negative tweet (denoted as pn) if it conveys negative emotions or attitude and it is a personal tweet; otherwise, it is a personal non-negative tweet (denoted as pnn). personal non-negative tweets include personal neutral and personal positive tweets. a personal tweet is either a pn or a pnn. a personal negative tweet expresses a user's negative sentiment, such as panic, depression, anxiety, etc. note that this definition is focused on the user's negative emotional state, as opposed to expressing the absence of an illness, e.g., getting a negative test result. two examples are as follows. example (personal negative tweet): "apparently #ebola doesn't read our textbooks -it keeps changing the rules as it goes along. frightening news :(" example (personal non-negative tweet): "to ensure eliminating the #measles disease in the country, it is important to vaccinate people who are at risk."
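the label taxonomy implied by these definitions can be written down directly; a minimal python sketch (all names hypothetical) that encodes the constraint that every personal tweet is either pn or pnn:

```python
from enum import Enum

class TweetLabel(Enum):
    PN = "personal negative"       # personal tweet conveying negative emotion or attitude
    PNN = "personal non-negative"  # personal neutral or personal positive tweet
    NT = "news tweet"              # non-personal tweet stating an objective fact

def is_personal(label: TweetLabel) -> bool:
    # by definition, a personal tweet is either pn or pnn; everything else is nt
    return label in (TweetLabel.PN, TweetLabel.PNN)

assert is_personal(TweetLabel.PN) and is_personal(TweetLabel.PNN)
assert not is_personal(TweetLabel.NT)
```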
as many news tweets are re-tweeted in twitter, classifying the tweets into personal and news tweets in the first step helps to consider only personal tweets in the sentiment analysis of the next step (negative versus non-negative classification). an overview of the sentiment classification method is shown in fig. . in this chapter, we only consider english tweets, which were automatically detected during the data collection phase. as shown in fig. , the sentiment classification problem is approached in two steps. first, for all english tweets, we separated personal from news (non-personal) tweets. second, after the personal tweets were extracted by the most successful of the personal/news machine learning classifiers, these personal tweets were used as input to another machine learning classifier to identify personal negative tweets and personal non-negative tweets. after the personal negative tweets, personal non-negative tweets, and news tweets were all identified, they were utilized to compute the measure of concern and the quantitative correlation between the sentiment timeline trend and the news timeline trend. we developed a real-time twitter data collector with the twitter api version . and the twitter4j library [ ]. it collects real-time tweets containing the pre-defined public health-related keywords (e.g., listeria). we can describe the overall data collection process as an "etl" (extract-transform-load) pipeline, which is a frequently used term in the area of data warehousing. in the first step, the data collector collected tweets in json format with the twitter streaming api (extract step). in the second step, the data in json format was parsed into relational data, such as tweets, tweet_mentions, tweet_place, tweet_tags, tweet_urls, and users (transform step). in the last step, the relational data was stored in our mysql relational database (load step). we monitored several diseases, including infectious diseases (listeria, influenza, swine flu, measles, meningitis, and tuberculosis); four mental health problems (major depression, generalized anxiety disorder, obsessive-compulsive disorder, and bipolar disorder); one crisis (air disaster); and one clinical science issue (melanoma experimental drug). the preprocessing step filters out re-tweets and converts special characters into corresponding unigrams. more specifically, all tweets starting with "rt" were deleted, because "rt" indicates that they are re-tweets without comments, to avoid duplication. for the tweets that have a non-starting string "rt," the "rt" was removed. one member of each pair of tweets that contain the same tokens (words) in the same order was deleted. for example, of the two tweets below only one is kept in the database: (1) " test positive for tuberculosis at high school. http://t.co/ss qt epp #cnn" and (2) " test positive for tuberculosis at high school. http://t.co/m d rgzyai-@cnn." the clue-based classifier parses each tweet into a set of tokens and matches them with a corpus of personal clues. there is no available corpus of clues for personal versus news classification, so we used the subjective corpus mpqa [ ] instead, on the assumption that if the number of strongly subjective clues and weakly subjective clues in the tweet is beyond a certain threshold (e.g., two strongly subjective clues and one weakly subjective clue), the tweet can be regarded as a personal tweet; otherwise, it is a news tweet. the mpqa corpus contains a total of , words, including , adjectives, adverbs, , any-position words, , nouns, and , verbs.
as for the sentiment polarity, among all , words, , are negative, are neutral, , are positive, and can be both negative and positive. in terms of strength of subjectivity, , are strongly subjective words, and the other , are weakly subjective words. twitter users tend to express their personal opinions in a more casual way compared with other documents, such as news, online reviews, and article comments. it is expected that the existence of any profanity might lead to the conclusion that the tweet is a personal tweet. we therefore added a set of selected profanity words [ ] to the corpus described in the previous paragraph. us law, enforced by the federal communications commission, prohibits the use of a short list of profanity words in tv and radio broadcasts [ ]; thus, any word from this list in a tweet clearly indicates that the tweet is not a news item. we counted the number of strongly subjective terms and the number of weakly subjective terms, checked for the presence of profanity words in each tweet, and experimented with different thresholds. a tweet is labeled as personal if its count of subjective words surpasses the chosen threshold; otherwise it is labeled as a news tweet. if the threshold is set too low, the precision might not be good enough; on the other hand, if the threshold is set too high, the recall will decrease. the advantage of a clue-based classifier is that it is able to automatically extract personal tweets with more precision when the threshold is set to a higher value. because only the tweets fulfilling the threshold criteria are selected for training the "personal versus news" classifier, we would like to make sure that the selected tweets are indeed personal with high precision. thus the threshold that leads to the highest precision in terms of selecting personal tweets is the best threshold for this purpose. the performance of the clue-based approach with different thresholds on human-annotated test datasets is shown in table [ ]. among all the thresholds, s w ( strong, weak) achieves the highest precision on all three human-annotated datasets; in other words, when the threshold is set so that the minimum number of strongly subjective terms is and the minimum number of weakly subjective terms is , the precision of selecting personal tweets is the highest. to overcome the drawback of low recall in the clue-based approach, we combined the high precision of clue-based classification with machine learning-based classification in the personal versus news classification, as shown in fig. . suppose the collection of raw tweets of a unique type (e.g., tuberculosis) is t. after the preprocessing step, which filters out non-english tweets, re-tweets, and near-duplicate tweets, the resulting tweet dataset is t' = {tw_1, tw_2, tw_3, . . ., tw_n}, which is a subset of t and is used as the input for the clue-based method for automatically labeling datasets for training a personal versus news classifier, as shown in fig. , where the blue part is the automatic labeling of tweets with lexicons (highlighted in green) and the yellow part is the machine learning classifiers for classifying personal tweets from news tweets. we chose three machine learning classifiers, naïve bayes, multinomial naïve bayes, and support vector machine, as these classifiers achieved good results for similar tasks, as discussed in sect. . in the lexicon-based step for labeling training datasets, each tw_i of t' is compared with the mpqa dictionary [ ].
if tw_i contains at least three strongly subjective clues and at least three weakly subjective clues, tw_i is labeled as a personal tweet. similarly, tw_i is compared with a news stop word list [ ] and a profanity list [ ]. the news stop word list contains names of highly influential public health news sources, and the profanity list contains commonly used profanity words. if tw_i contains at least one word from the news stop word list and does not contain any profanity word, tw_i is labeled as a news tweet. for example, the tweet "un official: ebola epidemic could be defeated by end of - nbc news http://bit.ly/ j rq f #world#health" is labeled as a news tweet, because it contains at least one word from the news stop word list and does not contain any profanity word. we mark the set of labeled personal tweets as t_p and the set of labeled news tweets as t_n; note that t_p ∪ t_n ⊆ t'. the next step is the machine learning-based method. the two classes t_p and t_n from the clue-based labeling are used as training datasets to train the machine learning models. we used three popular models: naïve bayes, multinomial naïve bayes, and polynomial-kernel support vector machine. after the personal versus news classifier is trained, the classifier is used to make predictions on each tw_i in t', which is the preprocessed tweet dataset. the goal of personal versus news classification is to obtain the label for each tw_i in the tweet database t', where the label is either personal or nt (news tweet); personal could be pn or pnn. ji, chun, wei, and geller [ ] discussed automatic labeling of personal negative and personal non-negative tweets using a sequential approach. in this method, firstly, a profanity list is used to test if a tweet contains any word from the profanity list. if the tweet contains a profanity word, it is labeled as a personal negative tweet. secondly, for the tweets that do not contain a profanity word, negative and non-negative emoticon lists are used to test whether the tweet contains a negative emoticon or a non-negative emoticon. a partial list of emoticons is shown in table . if the tweet contains a negative emoticon, it is labeled as a personal negative tweet, and if the tweet contains a non-negative emoticon, it is labeled as a personal non-negative tweet. this approach has limitations in terms of coverage and sentiment strength. regarding coverage, this previous method only considered the existence of profanity and emoticons, but did not take their frequency into account. a single use of a profanity word is relatively common for twitter users expressing their emotions, but multiple uses of profanity words indicate a strong negative sentiment. in addition, the number of profanity words and emoticons is relatively small, since the profanity list contains only a limited set of words and the emoticon list consists of a limited set of emoticons; it is quite possible to miss potential personal negative or personal non-negative tweets with this approach. regarding sentiment strength, this previous method only considered the existence of profanity or emoticons, but did not consider the various sentiment strengths of words, which are good indicators for tweet sentiment detection. in this chapter, to address these previous limitations, we have developed a new personal negative versus personal non-negative labeling method. this new method uses metrics generated from the afinn lexicon, as shown in table , in addition to emoticons and profanities.
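returning to the first labeling step, a minimal sketch of the clue-based personal versus news labeling described above might look as follows; the mini-lexicons are tiny hypothetical stand-ins for the mpqa subjectivity clues, the news stop word list, and the profanity list, and only the two labeling rules from the text are implemented.

```python
# hypothetical mini-lexicons; the real mpqa, news-source, and profanity lists are far larger
STRONG_SUBJ = {"frightening", "terrified", "awful", "horrible"}
WEAK_SUBJ = {"apparently", "seems", "worried", "probably"}
NEWS_SOURCES = {"reuters", "nbc", "cnn", "bbc", "who"}
PROFANITY = {"damn"}  # placeholder entry

def auto_label(tweet, min_strong=3, min_weak=3):
    """return 'personal', 'news', or None (tweet left out of the training set)."""
    tokens = tweet.lower().split()
    strong = sum(t in STRONG_SUBJ for t in tokens)
    weak = sum(t in WEAK_SUBJ for t in tokens)
    # rule 1: enough strongly and weakly subjective clues -> personal tweet
    if strong >= min_strong and weak >= min_weak:
        return "personal"
    # rule 2: a news-source name and no profanity -> news tweet
    if any(t in NEWS_SOURCES for t in tokens) and not any(t in PROFANITY for t in tokens):
        return "news"
    return None

print(auto_label("update from nbc: new measles cases confirmed"))  # -> news
```

tweets matching neither rule stay unlabeled and are simply excluded from the automatically generated training data.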
afinn [ ] is a publicly available list of , english words and phrases rated for valence with an integer between −5 (negative) and +5 (positive). to label a personal tweet as personal negative or non-negative, we aggregated the frequencies of profanity words, emoticons, and afinn words into two metrics: a negative score and a non-negative score. subsequently, we used a threshold to determine the label of the tweet. the negative score and non-negative score are defined as follows:

negative score = (pr + n_c1 + n_c2 + n_c3 + ne) / n

non-negative score = (n_nc1 + n_nc2 + n_nc3 + nne) / n

in the above formulas, pr, n_c1, n_c2, n_c3, and ne are the numbers of profanity words, words having − valence in afinn, words having − valence, words having − valence, and the number of negative emoticons in a tweet. analogously, n_nc1, n_nc2, and n_nc3 are the numbers of words having + valence in afinn, + valence, and + valence, and nne is the number of non-negative emoticons in a tweet; n is the total number of words in the tweet. the two metrics express the proportion of negative words and the proportion of non-negative words, respectively. the threshold is set to . , which means that if the negative score of a tweet is greater than or equal to . , it is labeled as a personal negative tweet; a personal non-negative tweet is labeled similarly. these two metrics use a larger number of words to label tweets than our previous method. this allows us to generate a larger training dataset and to use sentiment strength to label tweets more accurately. the experimental results that compare the current method and our previous method are shown in sect. . . figure shows the process of negative versus non-negative classification, where the blue part represents the automatic labeling of tweets, the green part represents lexicons, and the yellow part represents the machine learning classifiers. in the rest of this section, negative is used to refer to the personal negative tweets and non-negative to the personal non-negative tweets. tweets were labeled as pn (personal negative) or pnn (personal non-negative) using the labeling method described above. these two categories (pn and pnn) of labeled tweets were combined into the training dataset tr-nn for negative versus non-negative classification; table shows examples of tweets in tr-nn. the set of labeled pn tweets is marked as t_ne and the set of labeled pnn tweets as t_nn, with (t_ne ∪ t_nn) ⊆ t'', where t'' denotes the set of personal tweets identified in the first step. t_ne and t_nn are used to train the negative versus non-negative classifier, and the classifier is used to make predictions on each tw_i in t''. the goal of negative versus non-negative classification is to obtain the label for each tw_i in the tweet dataset t'', where the label o(tw_i) is either pn (personal negative) or pnn (personal non-negative); there are no news tweets at this stage. after step 1 (personal tweets classification) and step 2 (sentiment classification), for a unique type of tweets (e.g., tuberculosis), the raw tweet dataset t is transformed into a series of tweet label datasets ts_i, where ts_i is the tweet label dataset for time i and ts_i = {ts_1, ts_2, ts_3, . . ., ts_n}, with the label of each ts_i being either pn (personal negative), pnn (personal non-negative), or nt (news tweet). the dataset was collected from march to june and was used in our previous work [ ]. the statistics of the collected datasets are shown in table . only english tweets are used in our experiments.
some datasets have a larger portion of non-english tweets, for example, influenza, swine flu, and tuberculosis, compared with other datasets. to compare the naïve bayes, two-step multinomial naïve bayes, and two-step polynomial-kernel support vector machine classifiers, we created a test dataset using human annotation. weka's implementations [ ] of these classifiers were used. we extracted three test data subsets by random sampling from all tweets of the three domains epidemic, clinical science, and mental health, collected in the year . each of these subsets contains tweets. note that the test tweets are independent of the training tweets, which were collected in the year . each tweet was annotated by three people out of a set of six contributors. in the evaluation, one numeric code was assigned to personal tweets and another to news tweets. if a tweet was labeled as a personal tweet, the annotator was asked to label it as a personal negative or personal non-negative tweet. fleiss' kappa [ ] was used to measure the inter-rater agreement between the three annotators, and table [ ] presents the agreement between the human annotators. for each tweet, if at least two out of three annotators agreed on a label (personal negative, personal non-negative, or news), we labeled the tweet with this sentiment. the results of the two-step classification approach are shown in this section. in order to evaluate the usability of the two-step classification, the personal versus news classification and the negative versus non-negative classification were also evaluated with human-annotated datasets. for personal versus news classification, we compared our method with three baseline methods:
• a random selection method
• the clue-based classification method described above
• a url-based method, in which a tweet that contains a url is classified as a news tweet, and otherwise as a personal tweet.
the classification accuracies of the different methods are presented in table [ ]. the results show that s-mnb and s-nb outperform all three baselines in most of the cases. overall, all methods exhibit a better performance on the epidemic dataset than on the other two datasets. in addition, comparing the ml-based approaches ( s-mnb, s-nb, s-svm), they outperform the baseline clue-based classification approach in most of the cases. some unigrams are learned by the ml-based methods and are shown to be useful for the classification. to better understand this effect, ablation experiments were carried out with the personal versus news classification on the human-annotated datasets. the classifier s-mnb was used, since it took much less time to train than the best classifier s-nb on the human-annotated test dataset. more precisely, it was trained with the automatically generated data from the epidemic, mental health, and clinical science domains collected in . the trained classifiers were used to classify the sentiments of the human-annotated datasets from the year , where unigrams were removed from the test dataset one at a time, in order to study each removed unigram's effect on accuracy. the change of accuracy was recorded each time, and the unigram that leads to the largest decrease in accuracy (when removed) is the most useful one for predictions. table shows the results of the ablation experiments for personal versus news classification. for example, the unigrams "i", "http", "app", and "url" are not in the mpqa corpus but are learned by the ml classifier s-mnb as the most important unigrams contributing to classification.
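to make the frequency-based fpea labeling described above concrete, a minimal sketch with hypothetical mini-lexicons might look like this; the valence grouping (here |valence| ≥ 3), the emoticon and profanity sets, and the threshold value are illustrative assumptions rather than the chapter's actual resources.

```python
# hypothetical mini-lexicons; afinn_mini maps words to valence ratings in [-5, +5]
AFINN_MINI = {"catastrophic": -4, "frightening": -3, "scared": -2,
              "good": 3, "wonderful": 4, "superb": 5}
NEG_EMOTICONS = {":(", ":'("}
NONNEG_EMOTICONS = {":)", ":d"}  # lowercased, since tokens are lowercased below
PROFANITY = {"damn"}  # placeholder entry

def fpea_label(tweet, threshold=0.2):
    """label a personal tweet as negative / non-negative, or None if undecided."""
    tokens = tweet.lower().split()
    n = max(len(tokens), 1)  # total number of words, guarding against empty input
    # counts of strongly negative / strongly positive afinn words (assumed cut-off)
    neg_words = sum(1 for t in tokens if AFINN_MINI.get(t, 0) <= -3)
    pos_words = sum(1 for t in tokens if AFINN_MINI.get(t, 0) >= 3)
    neg_score = (sum(t in PROFANITY for t in tokens) + neg_words
                 + sum(t in NEG_EMOTICONS for t in tokens)) / n
    nonneg_score = (pos_words + sum(t in NONNEG_EMOTICONS for t in tokens)) / n
    if neg_score >= threshold:
        return "personal_negative"
    if nonneg_score >= threshold:
        return "personal_non_negative"
    return None

print(fpea_label("this outbreak is frightening :("))  # -> personal_negative
```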
the second step in the two-step classification algorithm is to separate negative tweets from non-negative tweets. as discussed in sect. . . , the training datasets are automatically labeled if the tweet has a higher-than-threshold negative score or non-negative score. both scores are calculated by considering the occurrence frequency of words from the profanity list, words from afinn, and emoticons from the emoticon lists. we compare the performance of negative versus non-negative classification with the previous labeling method and with the current labeling method. in this chapter, we call the previous method epe, as it is an existence-based method using a profanity list and emoticons, and we call the current labeling method fpea, as it is a frequency-based method using profanity, emoticons, and afinn. the classifier is trained by each of the three models: multinomial naïve bayes (mnb), naïve bayes (nb), and support vector machine (svm). the accuracies of negative versus non-negative classification and the confusion matrices of the best classifiers for the human-annotated datasets are shown in tables and , respectively. overall, the frequency-based method with profanity, emoticons, and afinn increases the accuracy of the best classifier by % in the epidemic dataset and by % in the mental health dataset, compared with the previous existence-based method using a profanity list and emoticons. overall, s-mnb (fpea) achieved the best negative versus non-negative result in terms of accuracy while being faster than s-svm and s-nb. we analyzed the output of the sentiment classification. as discussed in sect. . . , we manually annotated tweets as personal negative, personal non-negative, and news. we used s-mnb, which achieved the best accuracy in our experiments described in sect. . . , to classify each of the manually annotated tweets as personal negative, personal non-negative, or news. then we analyzed the tweets that were assigned different labels by s-mnb and by the human annotators. for the personal versus news classification, we found two major types of errors. the first type of error is that the tweet is a personal tweet but is classified as a news tweet. by manually checking the content, we found that these tweets are often users' comments on news items (pointed to by a url) or users citing the news. there are out of all errors belonging to this type. one possible solution to reduce this type of error is to calculate what percentage of the tweet text appears in the web page pointed to by the url. if the percentage is low, it is probably a personal tweet, since most of the tweet text is the user's contribution; if the percentage is high, approaching %, it is more likely a news tweet, since tweeters often paste the title of a news article into their messages. the second type of error is that the tweet is in fact a news item but is classified as a personal tweet. in total, out of all errors are of this type. a suggested solution is to check the similarity between the tweet text and the title of the web page pointed to by the url; if both are very similar to each other, the tweet is more likely a news item. those two types of errors together cover % ( / ) of the errors in the personal versus news classification. for the negative versus non-negative classification, in % ( / ) of all errors the tweet is in fact negative but is classified as non-negative. one possible improvement is to incorporate "negative phrase identification" to complement the current machine learning paradigm.
the appearance of negative phrases such as "i did not like xyz at all" and "i will not do xyz any more" is a possible indicator of negative tweets. as pointed out in the work of bruns and stieglitz [ ], there are two questions to be addressed in terms of generalizing collected twitter data:
• does twitter data represent twitter?
• does twitter represent society?
according to the documentation of twitter [ ], the twitter streaming api returns at most % of all the tweets at any given time. once the number of tweets matching the given api parameters (keywords, geographical boundary, user id) goes beyond the % threshold, twitter will return a sample of the data to the user. to address this problem, we used domain-specific keywords (e.g., h n , h n ) for each tweet type (e.g., listeria) to increase the coverage of the collected data [ ]. as for the question whether twitter postings are representative of the society at large, mislove et al. [ ] have found that twitter users significantly over-represent the densely populated regions of the us. this might be due to the better availability of high-speed internet in large cities. twitter users are also overwhelmingly male, and highly biased with respect to race and ethnicity distribution. to reduce the first bias of the collected twitter data, we defined the measure of concern in relative terms: it depends on the fraction of all tweets that have been classified as "personal negative" tweets. we assume that as long as the sample of tweets is representative, the measure of concern, which is the personal negative portion of all tweets, should be similar across different sample sizes, e.g., , , %, etc. we are interested in making the sentiment classification results available for public health monitoring, especially the results of computing the measure of concern, to monitor public sentiments towards different types of diseases. the definitions of measure of concern, non-negative sentiment, news count, and peak are given below as english text; for a more formal treatment, refer to the work by ji et al. [ ]. definition a (measure of concern): the measure of concern moc_i is the square of the total number of personal negative tweets that are posted at time i, divided by the total number of raw tweets of a particular type at the same time i. the reason for including both the relative and the absolute growth of personal negative tweets in one measure is that, for example, a ratio of : of personal negative tweets to personal tweets appears high, but a lower ratio of : should contribute more to the measure of concern, because a greater number of the "tweeting public" is involved in this social media discourse. definition b (non-negative sentiment): similarly, the non-negative sentiment nn_i is the square of the total number of personal non-negative tweets that are posted at time i, divided by the total number of raw tweets of a particular type at the same time i. definition c (news count): finally, the news count ne_i is the total number of news tweets at the time i. note that the news count is not normalized by the total number of raw tweets; the reason is that we are interested in studying the relationship between sentiment trends and news popularity trends, and an absolute news count is better able to represent the popularity of news. definition (peak): given a timeline of numerical values, a value x_i on the timeline is defined as a peak if and only if x_i is the largest value in a given time interval around i, i.e., among x_(i−a), . . ., x_(i+b). the interval parameters a > 0 and b > 0 can be chosen according to each specific case to limit the number of peaks. peaks are defined for moc timelines, non-negative timelines, and news count timelines. the method for computing the quantitative correlation is shown in fig. .
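to complement the figure, a minimal sketch of the peak extraction and the lagged correlation of peak sets might look as follows; the window parameters a = b = 3 (a seven-day interval) and the ±3-day lag range follow the description in the text, while the toy series and variable names are invented for illustration.

```python
def peaks(series, a=3, b=3):
    """indices i whose value is the largest within the window [i-a, i+b]."""
    result = set()
    for i, x in enumerate(series):
        window = series[max(0, i - a): i + b + 1]
        if x == max(window):
            result.add(i)
    return result

def jaccard(p1, p2):
    # size of the intersection of two peak sets divided by the size of their union
    return len(p1 & p2) / len(p1 | p2) if (p1 | p2) else 0.0

def best_lagged_jaccard(sentiment_peaks, news_peaks, max_lag=3):
    # shift the sentiment peaks by -max_lag..+max_lag days and keep the best match
    return max(
        jaccard({i + t for i in sentiment_peaks}, news_peaks)
        for t in range(-max_lag, max_lag + 1)
    )

moc = [0.1, 0.9, 0.2, 0.1, 0.7, 0.2]  # toy daily measure-of-concern values
news = [2, 10, 3, 1, 9, 2]            # toy daily news counts
print(best_lagged_jaccard(peaks(moc), peaks(news)))
```

here two peaks match only when they fall on exactly the same (possibly shifted) day, mirroring the day-level matching convention described below.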
there are three inputs for the correlation process. the news tweets are the outputs of the first step, as shown in fig. ; the personal negative tweets and the personal non-negative tweets are the outputs of the second step, as shown in fig. . the jaccard coefficient is used for computing the correlation. after the two-step sentiment classification method has been applied to the raw tweets, we can produce three timelines: the measure of concern timeline, the non-negative sentiment timeline, and the news timeline. next, the peak sets p_1, p_2, and p_3 are generated from these three timelines, respectively. the time interval is set to seven days. we are interested in the correlation between p_3 and p_1 (peaks of news and peaks of moc), and the correlation between p_3 and p_2 (peaks of news and peaks of non-negative sentiments) (fig.: correlation between sentiment trends and news trends). we hypothesized that there might be a time delay between the sentiment peaks and the news timeline peaks. for example, an alarming news report might lead to many twitter users expressing their negative emotions. on the other hand, social media are nowadays often ahead of official news reports, as shown by broersma and graham [ ]: the news media now often pick up tweets for their news coverage. in the first case, news peaks would precede tweet sentiment peaks; in the second case, tweet sentiment peaks would precede news peaks. we attempted to quantify these alternative choices using the jaccard coefficient. thus we defined two correlations as follows:

jc(moc, news, t) = |p_(1,c+t) ∩ p_(3,c)| / |p_(1,c+t) ∪ p_(3,c)|

jc(nn, news, t) = |p_(2,c+t) ∩ p_(3,c)| / |p_(2,c+t) ∪ p_(3,c)|

p_(1,c+t) is meant to assign a time lag or time lead of t days (depending on the sign of t) to the collection of moc peaks; thus the news peak at date c will be compared with the moc peak at date c + t. similarly, p_(2,c+t) is meant to assign a time lag or time lead of t days to the collection of non-negative peaks; thus the news peak at date c will be compared with the non-negative peak at date c + t. by its definition, the jaccard coefficient has a value between 0 and 1; the closer the value is to 1, the better the two time series correlate with each other. to illustrate the calculation of the jaccard coefficient, we use the following example. assume a moc timeline and a news timeline, where the moc timeline has nine peaks and the news timeline has eight peaks, and three peaks of moc and three peaks of news are pair-wise matched. note that two peaks match each other if and only if they happen on exactly the same day; this is an arbitrary definition, which may be replaced by a finer grain size (hours) or possibly a larger grain size (e.g., weekends). the remaining six peaks of moc and the remaining five peaks of news are not matched with each other. then the jaccard coefficient between the peaks of moc and the peaks of news is calculated as the size of the intersection divided by the size of the union, i.e., 3 / (3 + 6 + 5) = 3/14 ≈ 0.21. therefore, the best jaccard coefficient between moc peaks and news peaks for a given dataset was computed as follows: first, we directly computed the jc between moc peaks and news peaks without any time delay or lead, and we recorded the result. then we added one, two, or three days of lead to the original moc, computed the correlation between the revised moc peaks and the original news peaks, and recorded these three results. thirdly, we added one, two, or three days of delay to the original moc, and we recorded three more results.
finally, we chose the highest measure from the above seven results as the best correlation between moc and news. the peaks of moc and the peaks of nn (non-negative) were correlated with the peaks of news in all datasets with a jaccard coefficient of . - . . in order to study how an observable increase in moc relates to an actual health event (e.g., a news count peak), we quantified the timeline trends of daily moc and daily news count for listeria, a potentially lethal foodborne illness, as shown in fig. . the news count is normalized, and the top most frequent hashtags for the peak date are shown. the news count (purple) peak occurred on march st, because on that same day several food items produced by parkers farm were recalled due to a listeria contamination. we note that there was an observable increase of moc (green) as well, which shows that the general public seemed to express negative emotions in reaction to the news during this event. we have developed a prototype system of esmos (epidemic sentiment monitoring system) to monitor the timeline and topic distribution of public concern, as a part of the epidemics outbreak and spread detection system [ ]. esmos displays (1) a concern timeline chart to track the public concern trends on the timeline; (2) a tag cloud for discovering the popular topics within a certain time period, with a capability to drill down to the individual tweets; and (3) a public health concern map to show the geographic distribution of particular disease concentrations at different granularities (e.g., state, county, or individual location level). figure shows the different visual analytics tools (fig.: esmos visual analytics tools for public concern monitoring: (a) sentiment timeline chart, (b) topics cloud, and (c) public concern distribution map). public health specialists can utilize the concern timeline chart, as shown in fig. a, to monitor (e.g., identify concern peaks) and compare public concern timeline trends for various diseases. then the specialists might be interested in what topics people are discussing on social media during the "unusual situations" discovered with the help of the concern timeline chart. to answer this question, they can use the tag cloud, as shown in fig. b, to browse the top topics within a certain time period for different diseases, as well as individual tweets. the public health concern heat map in fig. c shows the state-level public concern levels. the esmos prototype is currently implemented to monitor a limited set of diseases (cf. table ), but our proposed model can use a disease ontology, such as a dedicated epidemic ontology or a umls ontology, to monitor any disease of interest. in this chapter, we explored the potential of mining social network data, such as tweets, to provide a tool for public health specialists and government decision makers to gauge the measure of concern (moc) expressed by twitter users under the impact of diseases. to derive the moc from twitter, we developed a two-step classification approach to analyze sentiments in disease-related tweets. we first distinguished personal from news (non-personal) tweets; in the second stage, the sentiment analysis was applied only to personal tweets to distinguish negative from non-negative tweets. in order to evaluate the two-step classification method, we created a test dataset by human annotation for three domains: epidemic, clinical science, and mental health. the fleiss' kappa values between annotators were . , . , and . , respectively.
these moderate agreements illustrate the complexity of the sentiment classification task, since even humans exhibit relatively low agreement on the labels of tweets. our contributions are summarized as follows. (1) we developed a two-step sentiment classification method by combining clue-based labeling and machine learning (ml) methods, first automatically labeling the training datasets and then building classifiers for personal tweets and classifiers for tweet sentiments. the two-step classification method shows a % and % increase of accuracy over the clue-based method on the epidemic and mental health datasets, respectively, in personal versus non-personal classification. in negative versus non-negative classification, the frequency-based method fpea, which uses the afinn lexicon, increases the accuracy of the best classifier by % for the epidemic dataset and by % for the mental health dataset, compared with our previously used method epe, which had a list of profanities and a list of emoticons as the only sentiment clues. thus, the use of afinn resulted in a measurable improvement over our previous work. (2) we quantified the moc using the results of the sentiment classification and used it to reveal the timeline trends of sentiments of tweets. the peaks of moc and the peaks of nn (non-negative) correlated with the peaks of news with jaccard coefficients of . - . . (3) we applied our sentiment classification method and the measure of concern to other topical domains, such as mental health monitoring and crisis management. the experimental results support the hypothesis that our approach is generalizable to other domains. future work involves the following. (1) the measure of concern (moc) is currently based on the number of personal negative tweets and the total number of tweets on the same day; it was used to define the fraction of tweets that are personal negative tweets. we plan to fine-grain this definition to quantify the number of tweets expressing real concern. to achieve this goal, we need to extend the simplistic negative/non-negative categories to a wider range of well-recognized emotions, such as "concern", "surprise", "disgust", or "confusion". we plan to employ an ontology engineering approach to construct an emotion ontology from our collected twitter messages. the emotion ontology will contain the basic emotions, such as anger, confusion, disgust, fear, concern, sadness, etc., along with their representative words or phrases. with the constructed emotion ontology, we will be able to detect the tweets expressing real concern and to more accurately quantify the trend of the measure of concern. (2) to improve the performance of classification, we plan to extend the current feature set to include more features specific to micro-blogs, such as slang terms and intensifiers, to capture the unique language in micro-blogs. in personal versus news classification, we chose to work in the machine learning-based paradigm; however, we note that some lightweight knowledge-based approaches could possibly produce competitive results. for example, if the tweet is of the form "text url" and the text appears on the web page that the url points to, the tweet is likely a news tweet. the intuition behind this approach is that the title of a news article is often pasted into the tweet body, followed by the url of that news article. we plan to perform a quantitative comparison of these knowledge-based approaches with our ml approach in the future.
(3) the prototype esmos implementation needs to be scaled to detect and monitor diseases on a large scale using a full-scale disease ontology. the sentiment analysis should be performed as the data is captured, so that the tracking of public concerns happens in real time. public concern about health in general is not limited to infectious diseases: concerns may also be expressed about particular drugs or treatments, and even about current and proposed health policies. to promote an understanding of all the contexts related to a disease outbreak, the system needs to consider many different data sources. (4) although it is difficult to find the ground truth for sentiment trends, we would like to conduct a systematic experiment comparing the sentiments derived by our methods with the epidemic cases reported by other available tools and with authoritative data sources, such as healthmap and cdc reports. the sentiment trends for topics will also be studied by combining the sentiment analysis algorithms with topic modeling algorithms. (5) all our work so far used epidemics, mental health, and clinical science as domains; thus, all of our experiments were health-related. however, there are other areas where tweets and the traditional news compete with each other. these areas include politics (e.g., presidential candidate debates), the economy (e.g., a precipitous fall of the dow jones index), natural disasters (e.g., typhoons and hurricanes), acts of terrorism and war (e.g., roadside bombs), and spontaneous protests, as they were common during the arab spring. all these areas are excellent targets for testing theories about the interplay between news and social media.

references:
surveillance sans frontieres: internet-based emerging infectious disease intelligence and the healthmap project
syndromic classification of twitter messages
the use of twitter to track levels of disease activity and public concern in the u.s. during the influenza a h n pandemic
twitter catches the flu: detecting influenza epidemics using twitter
tracking the flu pandemic by monitoring the social web
changes in emotion of the chinese public in regard to the sars period
detecting influenza epidemics using search engine query data
twitter
twitter sentiment classification for measuring public health concerns
combining lexicon-based and learning-based methods for twitter sentiment analysis
a survey of opinion mining and sentiment analysis. mining text data
nrc-canada: building the state-of-the-art in sentiment analysis of tweets
evaluation datasets for twitter sentiment analysis
detecting public sentiment over pm . pollution hazards through analysis of chinese microblog
monitoring public health concerns using twitter sentiment classifications
the link prediction problem for social networks
cognitive systems engineering: new wine in new bottles
disease ontology: a backbone for disease semantic integration
evaluating ontologies based on the naturalness of their preferred terms
sentiment analysis and opinion mining
learning extraction patterns for subjective expressions
good friends, bad news - affect and virality in twitter. future information technology
sentiwordnet: a publicly available lexical resource for opinion mining
opinion mining and sentiment analysis
thumbs up?: sentiment classification using machine learning techniques
experiments with mood classification in blog posts
recognizing contextual polarity in phrase-level sentiment analysis
estimating citizen alertness in crises using social media monitoring and analysis
emotion classification of social media posts for estimating people's reactions to communicated alert messages during crises
automatic stopword generation using contextual semantics for sentiment analysis of twitter
an arabic twitter corpus for subjectivity and sentiment analysis
semantic sentiment analysis of twitter. the semantic web - iswc
robust sentiment detection on twitter from biased and noisy data
sentiment knowledge discovery in twitter streaming data
twitter as a corpus for sentiment analysis and opinion mining
target-dependent twitter sentiment classification
sentiment analysis on twitter through topic-based lexicon expansion. databases theory and applications
introduction to statistical pattern recognition
support-vector networks
libsvm: a library for support vector machines
assessing vaccination sentiments with online social media: implications for infectious disease dynamics and control
movie review mining and summarization
pandemics in the age of twitter: content analysis of tweets during the h n outbreak
from tweets to polls: linking text sentiment to public opinion time series
creating subjective and objective sentence classifiers from unannotated texts
annotating opinions in the world press
the weka data mining software: an update
measuring nominal scale agreement among many raters
twitter data: what do they represent?
is the sample good enough? comparing data from twitter's streaming api with twitter's firehose
understanding the demographics of twitter users
twitter as a news source: how dutch and british newspapers used tweets in their news coverage

key: cord- - twmcitu authors: mukhina, ksenia; visheratin, alexander; nasonov, denis title: spatiotemporal filtering pipeline for efficient social networks data processing algorithms date: - - journal: computational science - iccs doi: . / - - - - _ sha: doc_id: cord_uid: twmcitu

one of the areas that gathers momentum is the investigation of location-based social networks (lbsns), because the understanding of citizens' behavior on various scales can help to improve quality of living, enhance urban management, and advance the development of smart cities. but it is widely known that the performance of algorithms for data mining and analysis heavily relies on the quality of input data. the main aim of this paper is to help lbsn researchers perform the preliminary step of data preprocessing and thus increase the efficiency of their algorithms. to do that, we propose a spatiotemporal data processing pipeline that is general enough to fit most of the problems related to working with lbsns. the proposed pipeline includes four main stages: identification of suspicious profiles, background extraction, spatial context extraction, and fake transitions detection. the efficiency of the pipeline is demonstrated on three practical applications using different lbsns: touristic itinerary generation using facebook locations, sentiment analysis of an area with the help of twitter and vk.com, and multiscale event detection from instagram posts.
in today's world, the idea of studying cities and society through location-based social networks (lbsns) has become a standard for everyone who wants to get insights about people's behavior in a particular area in a social, cultural, or political context [ ]. nevertheless, there are several issues concerning data from lbsns in research. firstly, social networks can use either explicit (i.e., coordinates) or implicit (i.e., place names or toponyms) geographic references [ ]; it is a common practice to allow manual location selection and changing of the user's position. the twitter application relies on gps tracking, but the user can correct the position using the list of nearby locations, which causes potential errors from both the gps and the user side [ ]. another popular source of geo-tagged data, foursquare, also relies on a combination of gps and manual location selection and has the same problems as twitter. instagram provides a list of closely located points of interest [ ]; however, it is assumed that a person will type the title of the site manually, and the system will suggest a list of locations with a similar name. although this functionality gives flexibility to users, there is a high chance that a person mistypes the title of the place or selects the wrong one. in facebook, pages for places are created by the users [ ], so all data, including the title of the place, address, and coordinates, may be inaccurate. in addition to that, a user can post false data on purpose. the problem of detecting fake and compromised accounts has become a big issue in the last five years [ , ]. spammers misrepresent the real level of interest in a specific subject or the degree of activity in some place to promote their services; meanwhile, fake users spread unreliable or false information to influence people's opinion [ ]. if we look into any popular lbsn, like instagram or twitter, location data contains a lot of errors [ ]. thus, all studies based on social networks as a data source face two significant issues: wrong location information stored in the service (wrong coordinates, incorrect titles, duplicates, etc.) and false information provided by users (to hide an actual position or to promote their content). thus, in this paper, we propose a set of methods for data processing designed to obtain a clean dataset representing the data from real users. we performed experimental evaluations to demonstrate how the filtering pipeline can improve the results generated by data processing algorithms. with more and more data available every minute and with the rise of methods and models based on extensive data processing [ , ], it was shown that users' activity strongly correlates with human activities in the real world [ ]. for solving problems related to lbsn analysis, it is becoming vital to reduce the noise in input data and preserve relevant features at the same time [ ]. thus, there is no doubt that this problem gathers more and more attention in the big data era. on the one side, data provided by social media is more abundant than standard georeferenced data, since it contains several attributes (e.g., rating, comments, hashtags, popularity ranking) related to specific coordinates [ ]. on the other side, the information provided by users of social networks can be false, and even the users themselves may be fakes or bots.
in , goodchild [ ] raised questions concerning the quality of geospatial data: although hierarchical manual verification is the most reliable data verification method, it was stated that automatic methods can efficiently identify not only false but also questionable data. in paper [ ] , a method for pre-processing was presented, and only % of the initial dataset was kept after the filtering and cleaning process. one of the reasons for the emergence of fake geotags is location spoofing. in [ ] , the authors used the spatiotemporal cone to detect location spoofing on twitter. it was shown that in new york city, the majority of fake geotags are located in downtown manhattan, i.e., users tend to use popular places or locations in the city center as spoofing locations. a framework for location spoofing detection was presented in [ ] . latent dirichlet allocation was used for topic extraction. it was shown that message similarity for different users decreases as distance increases. next, the history of user check-ins is used to calculate the probability of a visit using a bayes model. the problem of fake user and bot identification has become highly important in recent years, since some bots are designed to distort reality and even to manipulate society [ ] . thus, for scientific studies, it is essential to exclude such profiles from the datasets. in [ ] , the authors observed tweets with specific hashtags to identify patterns in spammers' posts. it was shown that in terms of the age of an account, retweets, replies, or follower-to-friend ratio, there is no significant difference between legitimate and spammer accounts. however, the combination of different features of the user profile and the content made it possible to achieve a performance of . auc [ ] . it was also shown that the share of bots among active accounts varies between % and %. this work was later improved by including new features such as time zones and device metadata [ ] . in contrast, other social networks do not actively share this information through a public api. in [ ] , the data available from social network sites were studied, and the results showed that social networks usually provide information about likes, reposts, and contacts, and keep the data about deleted friends, dislikes, etc., private. thus, advanced models with high-level features are applicable only to twitter and cannot be used for social networks in general. more general methods for compromised account identification on facebook and twitter were presented in [ ] . the friends ratio, url ratio, message similarity, friend number, and other factors were used to identify spam accounts. some of these features were successfully used in later works. for example, in [ ] , seven features were selected to distinguish a regular user from a suspicious twitter account: mandatory (time, message source, language, and proximity) and optional (topics, links in the text, and user interactions). the model achieved a high value of precision with approximately % false positives. in [ ] , a random forest classifier was used for spammer identification on twitter, resulting in an accuracy of . %. this study was focused on five types of spam accounts: sole spammers, pornographic spammers, promotional spammers, fake profiles, and compromised accounts. nevertheless, these methods are user-centered, which means that full profile information is required for further analysis.
however, there is a common situation in which a full user profile is not available to researchers, for example, in spatial analysis tasks. for instance, in [ ] , the authors studied the differences between the public streaming api of twitter and the proprietary service twitter firehose. even though the public api was limited to a % sample of the data, it provided % of the geotagged data, but only % of the whole sample contained spatial information. in contrast, instagram users are on average times more likely to post data with a geotag compared to twitter users [ ] . thus, lbsn data processing requires separate and more sophisticated methods capable of identifying fake accounts from incomplete data. in addition, modern methods do not consider cases when a regular user tags a false location for some reason, but this should be taken into account as well. as discussed above, it is critical to use data that is as clean as possible for research. however, different tasks require different aspects of the data to be taken into consideration. in this work, we focus on the main features of lbsn data: space, time, and message content. first of all, any lbsn contains data with geotags and timestamps, so the proposed data processing methods are applicable to any lbsn. secondly, the logic and level of complexity of the data cleaning depend on the study goals. for example, if some research is dedicated to studying daily activity patterns in a city, it is essential to exclude all data with wrong coordinates or timestamps. in contrast, if someone is interested in exploring the emotional representation of a specific place in social media, the exact timestamp might be irrelevant. in fig. , the elements of the pipeline are presented along with the output data from each stage. as shown in the scheme, we start with general methods for large-scale analysis, which require fewer computations and can be applied at the city scale or higher. step by step, we eliminate accounts, places, and tags that may mislead scientists and distort results. suspicious profile identification. first, we identify suspicious accounts. the possibility of direct contact with potential customers attracts not only global brands and local businesses but also spammers, who try to behave like real persons while advertising their products. since their goal differs from that of real people, their geotags often differ from their actual location, and they use tags or specific words to advertise some service or product. thus, it is important to exclude such accounts from further analysis. the main idea behind this method is to group users with the same spatial activity patterns. for business profiles such as a store, gym, etc., one location will be prevalent among the others. meanwhile, for real people, there will be some distribution in space. however, it is a common situation that people tag only the city and not a particular place; depending on the city, the coordinates of the post might then be placed far from the user's real location, and the data will be lost among the others. thus, in the first stage, we exclude profiles that do not use geotags correctly from the dataset. we select users with more than ten posts with location to ensure that a person actively uses the geotag functionality and commutes across the city. users with fewer than ten posts do not provide enough data to correctly group profiles. in addition, they do not contribute sufficiently to the data [ ] .
then, we calculate all distances between two consecutive locations for each user and group them into bins of m, i.e., we count all distances that are less than km, all distances between and km, and so on. distances larger than km are united into one group. after that, we cluster users according to their spatial distribution. the cluster with a very low level of spatial variation, in which the vast majority of posts come from a single location, represents business profiles, and posts from these profiles can be excluded from the dataset. at the next step, we use a random forest (rf) classifier to identify bots, business profiles, and compromised accounts, i.e., profiles that do not represent real people and behave differently from them. it has been proven by many studies that the rf approach is efficient for bot and spam detection [ , ] . since we want to keep our methods as general as possible and our pipeline applicable to any social medium, we consider only the text message, timestamp, and location as feature sources for our model. we use all data that a particular user has posted in the studied area and extract the following spatial and temporal features: the number of unique locations marked by a user, the number of unique dates when a user has posted something, and the time difference in seconds between consecutive posts. for the time difference and the number of posts per date, we calculate the maximum, minimum, mean, and standard deviation. from the text caption, we include the maximum, minimum, average, mean, and standard deviation of the following metrics: number of emojis per post, number of hashtags per post, number of words per post, number of digits used in a post, number of urls per post, number of mail addresses per post, and number of user mentions per post. in addition, we extract money references, addresses, and phone numbers and include their maximum, minimum, average, mean, and standard deviation in the model. we also add the fraction of the user's favourite tag across all of their posts. thus, we have features in our model. as a result of this step, we obtain a list of accounts that do not represent normal users.
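as an illustration of this first stage, the sketch below builds a per-user histogram of distances between consecutive posts and clusters the histograms with k-means. this is a minimal sketch: the haversine helper, the bin width and distance cap, the toy users, and the cluster count are illustrative assumptions rather than the authors' exact settings, which are elided in this copy.

```python
# minimal sketch of stage 1: spatial activity histograms + k-means.
# bin width, cap, and cluster count are assumed values, not the paper's.
import numpy as np
from sklearn.cluster import KMeans

def haversine_km(lat1, lon1, lat2, lon2):
    """great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def distance_histogram(posts, bin_km=0.5, max_km=10.0):
    """normalised histogram of consecutive-post distances for one user.
    posts is a time-sorted list of (timestamp, lat, lon); distances beyond
    max_km are collapsed into the last bin."""
    d = [haversine_km(p[1], p[2], q[1], q[2]) for p, q in zip(posts, posts[1:])]
    hist, _ = np.histogram(np.clip(d, 0, max_km),
                           bins=np.arange(0.0, max_km + bin_km, bin_km))
    return hist / max(hist.sum(), 1)

# toy data: a business-like profile posting from one spot vs. a regular user
rng = np.random.default_rng(0)
users = {
    "business": [(i, 55.75 + 1e-4 * rng.standard_normal(),
                  37.62 + 1e-4 * rng.standard_normal()) for i in range(20)],
    "regular":  [(i, 55.75 + 0.05 * rng.standard_normal(),
                  37.62 + 0.05 * rng.standard_normal()) for i in range(20)],
}
active = {u: p for u, p in users.items() if len(p) > 10}  # >10 geotagged posts
X = np.array([distance_histogram(p) for p in active.values()])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(active, labels)))  # the low-variation cluster flags businesses
```

the rf classifier described above would then be trained on the remaining users' spatial, temporal, and text features.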
city background extraction. the next stage is dedicated to the extraction of basic city information, such as a list of typical tags for the whole city area and a set of general locations. general locations are places that represent large geographic areas rather than specific places. for example, in the web version of twitter, a user can only share the name of the city instead of particular coordinates. some social media, like instagram or foursquare, are based on a list of locations instead of exact coordinates, and some titles in this list represent generic places such as streets or cities. data from these places is useful when studying the whole area, but if someone is interested in studying actual temporal dynamics or spatial features, such data will distort the result. it should also be noted that even though throughout this paper we use the word 'city' to reference a particular geographic area, all stages are applicable at different scales, from city districts and metropolitan regions to states, countries, or continents. firstly, we extract the names of administrative areas from openstreetmap (osm). after that, we calculate the difference between the titles in the social media data and the data from osm with the help of the damerau-levenshtein distance. we consider a place to be general if the distance between its title and some item from the list of administrative objects is less than . these locations are excluded from the further analysis. for smaller scales, such as streets or parks, there are no general locations. then, we analyze the distribution of tag mentions in the whole area. the term 'tag' denotes an important word in the text that characterizes the whole message. usually, in lbsns, tags are represented as hashtags. however, they can also be named entities, topics, or terms. in this work, we use hashtags as an example of tags, but the concept can be extrapolated to tags of other types. the most popular hashtags are usually related to a general location (e.g., #nyc, #moscow), a popular type of content (#photo, #picsoftheday, #selfie), or an action (#travel, #shopping, etc.). such tags cannot be used to study separate places, and they are relevant neither to places nor to events, since they are actively used across the whole area. nevertheless, scientists interested in studying human behavior in general can use this set of popular tags, because it represents the most common patterns in the content. in this work, we consider a tag to be general if it was used in more than % of locations. this rule, however, could wrongly exclude tags related to public holidays. we want to avoid such situations and keep tags that have a large spatial distribution but a narrow peak in their temporal distribution. thus, we group all posts that mention a specific tag over the calendar year and compute their daily statistics. we then use the gini index g to identify tags that do not demonstrate constant behavior throughout the year. if g ≥ . , we consider the tag an event marker, because it means that the distribution of posts has peaks throughout the year. this pattern is common for national holidays and seasonal events such as sports games. thus, after the second stage, we obtain the dataset for further processing along with a list of common tags and general locations for the studied area.
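a minimal sketch of the event-marker test just described: compute the gini index of a tag's daily mention counts over a calendar year and compare it against a cutoff. the threshold value used below is a stand-in, since the paper's exact cutoff is elided in this copy.

```python
# minimal sketch: gini index of a tag's daily mention counts.
import numpy as np

def gini(counts):
    """gini index of a non-negative vector: 0 for a uniform distribution,
    approaching 1 when all mass concentrates on a few days."""
    x = np.sort(np.asarray(counts, dtype=float))
    n, total = len(x), x.sum()
    if total == 0:
        return 0.0
    lorenz = np.cumsum(x) / total
    return (n + 1 - 2 * lorenz.sum()) / n

steady = np.full(365, 10)                       # mentioned evenly all year
peaked = np.zeros(365); peaked[120:123] = 1000  # burst around a single event
THRESHOLD = 0.6  # stand-in value; the paper's exact cutoff is elided here
for name, counts in [("steady", steady), ("peaked", peaked)]:
    g = gini(counts)
    print(name, round(g, 2), "event marker" if g >= THRESHOLD else "general tag")
```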
spatial context extraction. using hashtags for event identification is a powerful strategy; however, there are situations in which it might fail. the main problem is that people often use hashtags to indicate their location, type of activity, objects in photos, etc. thus, it is important to exclude hashtags that are not related to a possible event. to do that, we group all hashtags by location, so we learn which tags are widely used throughout the city and which are place-related. if some tag is highly popular in one place, it is highly likely that the tag describes this place. excluding common place-related tags like #sea or #mall for each location, we keep only relevant tags for the following analysis. in other words, we get a list of tags that describe the normal state of particular places and their specific features. such tags, however, cannot be indicators of events. fake transition detection. the last stage of the pipeline is dedicated to suspicious post identification. sometimes people cannot share their thoughts or photos immediately. this leads to situations where even normal users have a number of posts that are not accurate in terms of location and timestamp. at this stage, we exclude posts that cannot represent a correct combination of coordinates and timestamp. this process is similar to the ideas behind location spoofing detection: we search for transitions that someone could not have made in time. the standard approach for the detection of fake transitions is to use space-time cones [ ] , but in this work, we suggest an improvement of this method: we use isochrones for fake transition identification. in urban studies, an isochrone is the area that can be reached from a specified point in equal time. isochrone calculation is based on the usage of real data about roads, which is why this method is more accurate than space-time cones. for isochrone calculation, we split the area into several zones depending on their distance from the observed point: a pedestrian walking area (all locations in a km radius), a car/public transport area (up to km), a train area ( - km), and a flight area (further than km). this distinction is used to define a maximum speed for every traveling distance. the time required for a specific transition is calculated by the following formula: $t = \sum_i s_i / v$, where $s_i$ is the length of road segment $i$ and $v$ is the maximum possible velocity depending on the inferred type of transport. the road data was extracted from osm. it is important to note that at each stage of the pipeline, we obtain output data that will be excluded, such as suspicious profiles, baseline tags, etc. however, this data can also be used, for example, for training novel models for fake account detection.
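the sketch below captures the spirit of this test with straight-line travel-time lower bounds. the distance bands and speeds are invented stand-ins for the paper's elided values, and a faithful implementation would build true isochrones from osm road and rail data rather than assume constant speeds.

```python
# minimal sketch of the fake-transition test: a pair of consecutive posts is
# suspicious when the observed time gap is smaller than a lower bound on the
# travel time. bands and speeds below are assumed, not the paper's values.
def max_speed_kmh(distance_km):
    if distance_km <= 5:      # pedestrian band
        return 6.0
    if distance_km <= 50:     # car / public transport band
        return 90.0
    if distance_km <= 700:    # train band
        return 250.0
    return 900.0              # flight band

def is_fake_transition(dist_km, gap_hours):
    required_hours = dist_km / max_speed_kmh(dist_km)
    return gap_hours < required_hours

print(is_fake_transition(300.0, 0.5))  # True: 300 km in 30 min is infeasible
print(is_fake_transition(2.0, 1.0))    # False: a 2 km walk fits in an hour
```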
the first experiment was designed to highlight the importance of general location extraction. to do that, we used the points-of-interest dataset for moscow, russia. the raw data was extracted from facebook using the places api and contained , places. the final dataset for moscow contained , places, and general sites were identified. it should be noted that among the general locations were 'russia' ( , , visitors), 'moscow, russia' ( , , visitors), and 'moscow oblast' ( , visitors). for comparison, the most popular non-general locations in moscow are sheremetyevo airport and red square, with only , and , check-ins, respectively. the itinerary construction is based on solving the orienteering problem with functional profits (opfp) with the help of the open-source framework fops [ ] . in this approach, locations are scored by their popularity and by farness distance. we used the following parameters for the ant colony optimization algorithm: ants per location and iterations of the algorithm, as stated in the original article. the time budget was set to h, red square was selected as the starting point, and vorobyovy gory was used as the finish point, since these are two highly popular tourist places in the city center. the resulting routes are presented in fig. . both routes contain extra places, including major parks in the city: gorky park and zaryadye park. however, there are several distinctions between these routes. the route based on the raw data contains four general places (fig. , left) , namely 'moscow', 'moscow, russia', 'russia', and 'khamovniki district', which do not correspond to actual places. thus, % of the locations in the route cannot be visited in real life. in contrast, in the case of the clean data (fig. , right) , instead of general places the algorithm was able to add real locations, such as the bolshoi theatre and the central children's store on lubyanka, with the largest clock mechanism in the world and an observation deck with a view of the kremlin. thus, the framework was able to construct a much better itinerary without any additional improvements to the algorithms or methods. to demonstrate the value of the background analysis and typical hashtag extraction stages, we investigated a scenario of analysing users' opinions in a geographical area via sentiment analysis. we used a combined dataset of twitter and vk.com posts taken in sochi, russia, during . sochi is one of the largest and most popular russian resorts. it was also the host of the winter olympics in . since twitter and vk.com provide geospatial data with exact coordinates, we created a square grid with a cell size equal to m. we then kept only the cells containing data (fig. , right) , cells in total. each cell was considered a separate location for the context extraction. the most popular tags in the area are presented in fig. (left). the tag '#sochi' was mentioned in / of the cells ( and cells for the russian and english versions of the tag, respectively). the follow-up tags '#sochifornia' (used in cells) and '#sea' (mentioned in cells) were half as popular. after that, we extracted typical tags for each cell. we considered a post to be relevant to a place if it contained at least one typical tag, so we can be confident that these posts represent the sentiment in that area. the sentiment analysis was executed in two stages. first, we prepare the text for polarity detection. to do that, we delete punctuation, split the text into words, and normalize the text with the help of [ ] . in the second step, we used the russian sentiment lexicon [ ] to get the polarity of each word (a positive value indicates a positive word and a negative value a negative word). the sentiment of a text is defined as positive if the sum of the polarities of all its words is greater than zero, and negative if the sum is less than zero. the sentiment of a cell is defined as the average sentiment of all its posts. in fig. , the results of the sentiment analysis are presented; cells with an average sentiment of magnitude less than . were marked as neutral. it can be seen from the maps that after the filtering process, more cells have a higher level of sentiment. for the sochi city center, the number of posts with sentiment |s| ≥ . increased by . %. it is also important that the number of uncertain cells with a sentiment rate . ≤ |s| ≤ . decreased by . %, from to cells. thus, we highlighted the strongly positive and negative areas and decreased the number of uncertain areas by applying the context extraction stage of the proposed pipeline.
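a minimal sketch of the cell-level sentiment computation, assuming a toy english lexicon in place of the russian sentiment lexicon used in the paper; the neutrality cutoff is likewise a stand-in for the elided value.

```python
# minimal sketch: lexicon-based sentiment aggregated per grid cell.
import re
from collections import defaultdict

lexicon = {"great": 1, "love": 1, "beautiful": 1, "awful": -1, "dirty": -1}

def post_sentiment(text):
    words = re.findall(r"[a-z']+", text.lower())
    s = sum(lexicon.get(w, 0) for w in words)
    return 0 if s == 0 else (1 if s > 0 else -1)

cells = defaultdict(list)  # cell id -> list of post sentiments
posts = [((3, 7), "love this beautiful beach"), ((3, 7), "awful dirty water"),
         ((1, 2), "great sunset")]
for cell, text in posts:
    cells[cell].append(post_sentiment(text))

NEUTRAL_BAND = 0.2  # stand-in for the paper's elided cutoff
for cell, ss in cells.items():
    avg = sum(ss) / len(ss)
    label = ("neutral" if abs(avg) < NEUTRAL_BAND
             else ("positive" if avg > 0 else "negative"))
    print(cell, round(avg, 2), label)
```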
in this experiment, we applied the full pipeline to instagram data. new york city was used as the target city in the event detection approach [ ] . we collected data from over , locations over a period of up to years. the total number of posts extracted from the new york city area is , , . in the first step, we try to exclude from the dataset all users who provide incorrect data, i.e., who use only a few locations instead of the whole variety. we group users with the help of the k-means clustering method. the appropriate number of clusters was obtained by calculating the distortion parameter. the deviant cluster contained , users out of , , . the shape of the deviant cluster can be seen in fig. : suspicious profiles mostly post in the same location, while regular users show variety in terms of places. after that, we trained our rf model using manually labelled data from both datasets. the training dataset contains profiles with ordinary users and fake users; the test data consists of profiles, including normal profiles and suspicious accounts. the model successfully distinguishes regular users from suspicious ones: normal users were detected correctly, and users were marked as suspicious. suspicious users out of were correctly identified. thus, % precision and % recall were obtained. since the goal of this work is to get clean data as a result, we are interested in a high value of recall, and precision is less critical. as a result, we obtained a list of , , profiles related to real people. in the next step, we used only data from these users to extract background information about the cities. titles of general locations were derived for new york. these places were excluded from further analysis. after that, we extracted general hashtags; an example of popular tags in a location before and after background tag extraction is presented in fig. . general tags contain mostly terms related to toponyms and universal themes such as beauty or life. then, we performed the context extraction for locations. for each location, typical hashtags were identified as the % most frequent tags among users. we consider all posts from one user in the same location as one post to avoid situations where someone tries to force their hashtag. we use the extracted lists to exclude typical tags from posts. after that, we calculated isochrones for each normal user to exclude suspicious posts from the data. in addition, locations with a high rate of suspicious posts ( % or more of the posts in a location detected as suspicious) were excluded as well. there were locations in new york city; the final dataset for new york consists of , locations. for event detection, we performed the same experiment as described in [ ] . in the original approach, a spike of activity in a particular cell of the grid is considered an event. to find these spikes in the data, historical grids are created using retrospective data for a calendar year. since we decreased the amount of data significantly, we set the threshold value to . we used data for to create the grids, and then took two weeks from for the result evaluation: a week with many events ( - of march) and an ordinary week with less massive events ( - february). the results of the recall evaluation are presented in table . as can be seen from the table, for the active week the recall increment was . %, and for the non-active week the recall value increased by . %. it is also important to note that some events that do not have specific coordinates, such as the snowfall in march or the saint patrick's day celebration, were detected in fewer places. this leads to a smaller number of events in total and a more significant contribution to the false positive rate. nevertheless, the largest and most important events, such as the nationwide protest '#enough! national school walkout' and the north american international toy fair, are still detected from the very beginning. in addition, due to the altered structure of the historical grids, we were able to discover new events, such as a concert of the canadian r&b duo 'dvsn' and the global engagement summit at the un headquarters. these events were covered by a low number of posts and stayed unnoticed in the original experiment. however, the usage of clean data helped to highlight small events, which are essential for understanding the current situation in the city.
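a minimal sketch of the train-and-evaluate loop behind the suspicious-account classifier used in the experiment above, with synthetic feature vectors standing in for the spatial, temporal, and text statistics described earlier; the sizes, labelling rule, and split are illustrative only.

```python
# minimal sketch: random forest on per-user features, scored by precision
# and recall. the synthetic data below is not the authors' labelled dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.standard_normal((600, 20))               # 600 users, 20 features
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)  # 1 = suspicious (toy rule)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
# recall matters most here: missed fake accounts contaminate the clean dataset
print("precision:", round(precision_score(y_te, pred), 2))
print("recall:   ", round(recall_score(y_te, pred), 2))
```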
in this work, we presented a spatiotemporal filtering pipeline for data preprocessing. the main goal of this process is to exclude data that is unreliable in terms of space and time. the pipeline consists of four stages. in the first stage, suspicious user profiles are extracted from the data with the help of k-means clustering and a random forest classifier. in the next stage, we exclude buzz words from the data and filter out locations related to large areas, such as islands or city districts. then, we identify the context of a particular place, expressed by its unique tags. in the last step, we find suspicious posts using the isochrone method. the stages of the pipeline can be used separately and for different tasks. for instance, in the case of touristic walking itinerary construction, we used only general location extraction, and the walking itinerary was improved by replacing % of the places. in the experiment dedicated to sentiment analysis, we used the context extraction method to keep posts related to the area where they were taken, and as a result, . % of the uncertain areas were identified either as neutral or as strongly positive or negative. in addition, for event detection, we performed all stages of the pipeline, and the recall of the event detection method increased by . %. nevertheless, there are ways to further improve this pipeline. in instagram, some famous places, such as times square, have several corresponding locations, including versions in other languages. this issue can be addressed by using the same method as in the general location identification stage: we can use string distance to find places with similar names. currently, we do not address repeated places in the data, since such places can belong to a retail chain, and some retail chains include over a hundred places all over the city. in some cases, it can be useful to interpret a chain store system as one place; however, if we want to preserve distinct places, more complex methods are required. despite this, the applicability of the spatiotemporal pipeline was shown using data from facebook, twitter, instagram, and vk.com. thus, the pipeline can be successfully used in various tasks relying on location-based social network data.
references:
"deep" learning for missing value imputation in tables with non-numerical data
"right time, right place" health communication on twitter: value and accuracy of location information
social media geographic information: why social is special when it goes spatial
building sentiment lexicons for all major languages
positional accuracy of twitter and instagram images in urban environments
a location spoofing detection method for social networks
compa: detecting compromised accounts on social networks
the rise of social bots
the quality of big (geo)data
urban computing leveraging location-based social network data: a survey
zooming into an instagram city: reading the local through social media
an agnotological analysis of apis: or, disconnectivity and the ideological limits of our knowledge of social media
advances in social media research: past, present and future
morphological analyzer and generator for russian and ukrainian languages
efficient pre-processing and feature selection for clustering of cancer tweets
analyzing user activities, demographics, social network structure and user-generated content on instagram
is the sample good enough? comparing data from twitter's streaming api with twitter's firehose
orienteering problem with functional profits for multi-source dynamic path construction
fake news detection on social media: a data mining perspective
who is who on twitter: spammer, fake or compromised account? a tool to reveal true identity in real-time
twitter as an indicator for whereabouts of people? correlating twitter with uk census data
detecting spammers on social networks
online human-bot interactions: detection, estimation, and characterization
multiscale event detection using convolutional quadtrees and adaptive geogrids
places nearby: facebook as a location-based social media platform
arming the public with artificial intelligence to counter social bots
detecting spam in a twitter network
true lies in geospatial big data: detecting location spoofing in social media
acknowledgement. this research is financially supported by the russian science foundation, agreement # - - .
key: cord- -sgu ayvw authors: kolic, blas; dyer, joel title: data-driven modeling of public risk perception and emotion on twitter during the covid- pandemic date: - - journal: nan doi: nan sha: doc_id: cord_uid: sgu ayvw successful navigation of the covid- pandemic is predicated on public cooperation with safety measures and appropriate perception of risk, in which emotion and attention play important roles. signatures of public emotion and attention are present in social media data, so natural language analysis of this text enables near-to-real-time monitoring of indicators of public risk perception. we compare key epidemiological indicators of the progression of the pandemic with indicators of the public perception of the pandemic constructed from approx. million unique covid- -related tweets from countries posted between th march and th june . we find evidence of psychophysical numbing: twitter users increasingly fixate on mortality, but in a decreasingly emotional and increasingly analytic tone. we find that the national attention on covid- mortality is modelled accurately as a logarithmic or power law function of national daily covid- death rates, implying generalisations of the weber-fechner and power law models of sensory perception to the collective. our parameter estimates for these models are consistent with estimates from psychological experiments, and indicate that users in this dataset exhibit differential sensitivity by country to the national covid- death rates. our work illustrates the potential utility of social media for monitoring public risk perception and guiding public communication during crisis scenarios. the covid- pandemic has brought about widespread disruption to human life. in many countries, public gatherings have been broadly forbidden, mass restrictions on human movement have been introduced, and entire industries have been paralysed in an attempt to lower the peak stress on healthcare systems [ ] . however, the degree to which these restrictions have been enforced by law has varied over time and by location, and their success in mitigating public health risks depends on the extent of cooperation on the part of the public. a key determinant of the public's behaviour, and of their cooperation with state-imposed social restrictions, is the public's emotional response to, and their perception of the risk presented by, the pandemic. however, the evolution of emotions and risk perception in response to disasters is not well understood, and there is a need for more longitudinal data on such responses with which this understanding can be improved [ ] . our goal is thus to contribute to improving this understanding, and we do so by exploring the empirical relationships present between the progression of the covid- pandemic and the public's perception of the risk posed by the pandemic.
we explain our findings in terms of the existing body of literature in cognitive psychology surrounding the public perception of risk, disasters, and human suffering. in particular, we draw from psychophysics, the field that studies the relationship between stimulus and subjective sensation and perception [ ] . the search for psychophysical "laws" of perception has existed since at least the mid- th century with the proposal of the weber-fechner law [ ] , which posits that the smallest perceptible change $ds$ in a physical stimulus of magnitude $s$ is proportional to $s$. thus, the perceived magnitude $p$ of such stimuli follows $dp \propto ds / s$ ( ). in the continuum limit, this implies that $p$ grows logarithmically with the physical magnitude $s$ of the stimulus. more recently, empirical studies by s. s. stevens [ ] supported, instead, a power law relationship between human perception of a stimulus and the physical magnitude of the stimulus: $p \propto s^{\beta}$. summers et al. [ ] extended this concept to human sensitivity to war death statistics and found that a power law with exponent β = . best fit the data. a number of further studies have corroborated the extension of these psychophysical laws, which describe the subjective perception of physical magnitudes, to subjective evaluations of human fatalities [ , , ] . in all of these, perception is a concave function of the stimulus, meaning that the larger the stimulus magnitude, the more it has to change in absolute terms to be equally noticeable. thus, perception is considered relative rather than absolute, implying that our judgments are comparative in nature. this observation has been shown to account for deviations from rationality in economic decision-making [ ] .
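to make the two candidate laws concrete, the snippet below evaluates a weber-fechner (logarithmic) response and a stevens power-law response on a rising death count. the parameter values are arbitrary and serve only to show the concavity that both laws share.

```python
# the two response curves discussed above, as plain functions.
import numpy as np

def weber_fechner(s, k=1.0, s0=1.0):
    return k * np.log(s / s0)

def power_law(s, nu=1.0, beta=0.3):
    return nu * s ** beta

deaths = np.array([1, 10, 100, 1000])
print(weber_fechner(deaths))  # equal ratios give equal perceptual increments
print(power_law(deaths))      # concave: later deaths add less perceived magnitude
```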
these proposed psychophysical laws of human perception present an opportunity for monitoring a population's response to a disaster scenario such as the covid- pandemic. by evaluating the goodness of fit of these models to data on the perception of the progression of the pandemic, and by determining the parameter values of such fits, we can describe the sensitivity of populations to the state of such crises, with important implications for risk communication and disaster management. to this end, we make use of a massive twitter dataset consisting of user-posted textual data to study the public's emotional and perceptual responses to the current public health crisis. twitter provides convenient access to the conversation amongst members of the public across the globe on a plethora of topics, and many authors are studying several aspects of the public's response to the pandemic with it. twitter is a particularly appropriate tool under conditions of physical distancing requirements and furlough schemes, where online communication has become more than ever a central feature of everyday life. moreover, results from psycholinguistics and advances in natural language processing techniques enable the extraction of psychologically meaningful attributes from textual data. with this dataset, our general approach is to offer a quantitative, spatiotemporal comparison between indicators of the state of the pandemic and the topics and psychologically meaningful linguistic features present in the discussion surrounding covid- on social media, on a country-by-country basis, for a selection of countries. our work is novel in that, to our knowledge, it is the first to use a large social media dataset spanning multiple countries to model the perceptual response of countries' citizens to the pandemic in the context of risk perception. to date, empirical validation of the aforementioned psychophysical laws has largely taken place in controlled laboratory settings, in which decisions, actions, and scenarios are artificial or hypothetical. our work thus contributes to the body of literature surrounding risk perception by investigating these laws in a naturalistic setting. there have, however, been numerous authors using social media to analyse the public response to the covid- pandemic. this includes work that has focused on the psychological burden of the social restrictions. for instance, stella et al. [ ] use the circumplex model of affect [ ] and the nrc lexicon [ ] to give a descriptive analysis of the public mood in italy from a twitter dataset collected during the week following the introduction of lockdown measures. in addition, venigalla et al. [ ] have developed a web portal for categorising tweets by emotion in order to track the mood in india on a daily basis. others have instead focused on negative emotions, as in the work of schild et al. [ ] , where they study the rise of hate speech and sinophobia as a result of the outbreaks. more specifically on perception, dryhurst et al. [ ] measured the perceived risk of the covid- pandemic by conducting surveys at a global scale (n ∼ ) and compared countries, finding that factors such as individualistic and pro-social values and trust in government and science were significant predictors of risk perception. de bruin and bennett [ ] perform similar work in the united states. the closest work we have been able to find to our own is that of barrios and hochberg [ ] , in which the authors combine internet search data with daily travel data to show that regions in the united states with a greater proportion of trump voters exhibit behaviours consistent with a lower perceived risk during the covid- pandemic. despite the above, we have been unable to find work that combines large-scale social media data with linguistic analysis to offer a spatiotemporal, quantitative analysis of emotion and risk perception during the covid- pandemic across multiple countries. beyond the covid- pandemic, our work is related to a small but growing body of literature on the use of data science in understanding human emotion and risk perception. in such work, natural language analysis has succeeded in supporting established linguistic theories, such as the importance of the distribution of words in a vocabulary as a proxy for knowledge [ ] , and regarding the relation between the uncertainty of events and the emotional response to their outcome [ , ] . for instance, using textual data from twitter, bhatia found that unexpected events elicit higher affective responses than those which are expected [ ] . in another instance, the same author conducted experiments with participants and predicted the perceived risk of several risk sources using a vector-space representation of natural language, concluding that the word distribution of language successfully captures the human perception of risk [ ] . similar work has been conducted by jaidka et al. [ ] in the area of monitoring public well-being, in which they compare word-based and data-driven methods for predicting ground-truth survey results for the subjective well-being of us citizens on a county-level basis using a . billion-tweet dataset constructed from to .
the remainder of this paper is laid out as follows. in section , we present the data set used in the subsequent analysis. in section , we provide further details on the approach followed to explore the relationships between indicators of the state of the pandemic and the public's perception of the pandemic, and we discuss possible explanations for our observations by drawing on the psychological literature. in section , we summarise and offer concluding remarks, along with a discussion of the limitations of the current work and suggestions for avenues of future work. in the following analysis, we make use of the set of tweets gathered by j. banda et al. [ ] , which is obtained and maintained using the twitter free stream api. at the time of writing, this data set consists of ∼ million original tweets spanning from march , to june , . data is collected according to the following query filters: "covid ", "coronavirus-pandemic", "covid- ", " ncov", "coronaoutbreak", "coronavirus", "wuhanvirus", "covid ", "coronaviruspandemic", "covid- ", " ncov", "coronaoutbreak", "wuhanvirus". for our analysis, we consider only the english and spanish tweets with a non-empty self-reported location field. we process every self-reported location using openstreetmap [ ] and remove nonsensical locations (e.g. "mars", "everywhere", "planet earth"). this allows us to group the remaining tweets by country and proceed with our analysis on a country-by-country basis. (the free stream api randomly samples around % of the total tweets for the given queries. a number of publicly available twitter datasets have emerged in relation to the pandemic; we chose to work with this dataset since it used the most generic query terms among all the publicly available datasets we considered, and we wanted the least amount of bias possible for our analysis.) to assure the statistical significance of our analysis, we keep the countries with the highest number of tweets for each language, resulting in a geolocated twitter dataset of ∼ million tweets posted by ∼ million users in different countries, which we summarise in table . we measure the progression of the pandemic with the numbers of covid- confirmed cases and deaths for all the countries in our analysis. the data was made publicly available by the our world in data repository [ ] . in particular, we take the daily covid- cases and deaths, both in linear and logarithmic scale, since these are the four epidemiological indicators that are most frequently used to summarise the state of the pandemic, and are therefore frequently encountered by the public.
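as a concrete illustration of the location-cleaning step described above, the sketch below geocodes self-reported profile locations and keeps only those that resolve to a real country. geopy's nominatim client, an openstreetmap geocoder, stands in for whatever osm tooling the authors used; the user agent and the country-extraction logic are assumptions, and real usage must respect nominatim's rate limits.

```python
# minimal sketch: resolve free-text profile locations to countries via
# an openstreetmap geocoder; unresolvable strings come back as None.
from geopy.geocoders import Nominatim

geocoder = Nominatim(user_agent="covid-tweet-cleaning-example")  # assumed name

def resolve_country(raw_location):
    hit = geocoder.geocode(raw_location, language="en", addressdetails=True)
    if hit is None:
        return None  # nonsensical locations fail to resolve
    return hit.raw.get("address", {}).get("country")

for loc in ["London, UK", "asdfqwerty", "CDMX"]:
    print(loc, "->", resolve_country(loc))
```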
in this section, we study the public's perception of the pandemic on a country-by-country basis, using the countries with the highest number of tweets in the observation period (see table ). we do this on a country-by-country basis since the pandemic has often invoked nation-level responses, making nation-level analysis the most natural geographic scale. our broad approach is to inspect and compare the linguistic features of the tweets released by users in the twitter dataset described in section . with the epidemiological data described in section . . our goal is to explore the public's perception of the pandemic. to do this, we analyse the linguistic features present in the textual data generated by twitter users, and map these features to psychologically meaningful categories that are indicative of the twitter users' perception. here, we are assuming that the words used by these twitter users are indicative of their internal cognitive and emotional states [ ] , an assumption supported by [ ] , in which the perception of risk is predicted using text data. thus, we quantify the linguistic content of each tweet using the linguistic inquiry and word count (liwc) program [ ] . liwc has been widely adopted in several text data analyses, and it has proven successful in applications ranging from measuring the perception of emotions [ ] to predicting the german federal elections using twitter [ ] . liwc operates as a text analysis program that reports the number of words in a document belonging to a set of predefined linguistically and psychologically meaningful categories [ ] . for our purposes, a document is a tweet $d_i^t$ posted on date $t$ by a user based in country $i$. liwc represents a document as an unordered set of words, and a liwc category $l$ is similarly a set of words associated with concept $l$. for a given document $d_i^t$, the linguistic score $p_l$ for category $l$ is the percentage of words in $d_i^t$ that belong to $l$: $p_l(d_i^t) = 100 \cdot |d_i^t \cap l| / |d_i^t|$ ( ). there are many such categories $l$, including family, work, and motion. we capitalise such category titles, and use the titles to refer either to the set of words associated with that category or to the category itself. linguistic scores from eq. ( ) for individual tweets will be noisy, as tweets are short documents. moreover, we are interested in the average response of the population of a country. for this reason, we group the tweets by country $i$ and by date $t$, and denote these sets of tweets as $D_i^t$. we then compute the national linguistic score (nls) for category $l$ as the average of the linguistic scores over the documents in $D_i^t$ relative to an empirically observed twitter base rate $p_b^l$: $p_i^l(t) = \frac{1}{|D_i^t|} \sum_{d \in D_i^t} p_l(d) / p_b^l$ ( ). the base rates $p_b^l$ for the use of words on twitter associated with category $l$ are given in [ ] . using eq. ( ) for all the selected linguistic categories, we construct multidimensional country-level time series that represent the evolution of the public perception of the pandemic, similar to the linguistic profiles introduced by tumasjan et al. [ ] .
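a minimal sketch of this scoring pipeline with a toy category dictionary and an assumed base rate. the real analysis uses the proprietary liwc dictionaries and the published twitter base rates, and the ratio-to-base-rate form below mirrors the nls definition as reconstructed above.

```python
# minimal sketch: liwc-style category score per tweet, averaged per
# country-day relative to a base rate. dictionary and base rate are toys.
death_words = {"death", "dead", "die", "dying", "kill", "fatal"}
BASE_RATE = 0.2  # assumed base rate for the category, in percent

def category_score(tweet, category):
    words = tweet.lower().split()
    return 100.0 * sum(w.strip(".,!?#") in category for w in words) / max(len(words), 1)

tweets_today = ["the death toll keeps rising", "stay home and stay safe",
                "so many people are dying"]
scores = [category_score(t, death_words) for t in tweets_today]
nls = (sum(scores) / len(scores)) / BASE_RATE  # relative to the base rate
print(round(nls, 2))
```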
in figure , we show the collection of nlss for a selection of relevant linguistic categories. we observe clear trends that, in most cases, are synchronized between countries and languages. in particular, most categories associated with emotion (notably affect, anger, anxiety, positive emotion, negative emotion, and swear words; swearing is associated with frustration and anger [ ] ) have their highest scores in mid-to-late march, when the world health organisation (who) announced the pandemic status of covid- and most western countries introduced more stringent social restrictions [ ] . these scores decay thereafter, indicating a relaxation of the emotional response in the conversation. this is consistent with results reported by bhatia regarding the affective response to unexpected events [ ] . a qualitatively similar trend can be seen in the social processes panel, the category involving "all non-first-person-singular personal pronouns as well as verbs that suggest human interaction (talking, sharing)" [ ] . we also observe that health-related categories such as death and health show an overall rising trend, with death rising most rapidly throughout march. these categories, with the exception of positive emotion and health, peak again in the united states at the end of may, coinciding with the murder of george floyd and the subsequent black lives matter protests. such universal trends are not apparent by visual inspection in the money, risk, and sadness panels. an additional feature of these plots is the absolute scale of the values: in all cases, there is a significant percentage change from the baseline values, with large percentage increases observed initially in the use of words associated with anxiety and later with death, and a moderate percentage increase in the use of words associated with risk. in this section, we explore the relationship between the nlss described in section . , which we use as a proxy for the public's perception, and the intensity of the pandemic, which we assume is the stimulus triggering this perception. our measure of the intensity of the pandemic is the number of covid- cases and deaths from the data described in section . . a straightforward way of approaching this relationship is to compute the correlations between the nlss and the epidemiological data on a per-country basis, and we show the average across countries of these per-country correlations in figure . on the one hand, we observe significant negative correlations in emotionally charged categories (e.g. swear words, anger, anxiety, affective processes), indicating a decay in emotion as the pandemic intensifies. conversely, categories related to health and mortality (death, health) and analytical thinking (analytic) show significant positive correlations. we believe the trends we observe in fig. and the correlations we observe in fig. are consistent with the notion of psychophysical numbing. this term was introduced by robert jay lifton [ ] , and developed by paul slovic [ , ] in the context of human perception of genocides and their associated death tolls, to describe the paradoxical phenomenon in which people exhibit growing indifference towards human suffering as the number of humans suffering increases. by inspecting the correlations between the nlss and the epidemiological indicators, we find that as the pandemic intensifies, in the sense of an increasing number of cases and deaths reported daily, our emotional response diminishes, as expected from a psychophysical numbing phenomenon. specifically, we observe negative correlations between almost all components of the nlss associated with affect (affective processes, anger, anxiety, negative emotion, positive emotion, and swear words) and the epidemiological data. by inspecting figure , we see that every country exhibits similar downward trends in these components, which, with the exception of anxiety, are all significantly lower than their baseline values throughout the observation period. this unusually low and decreasing affect word count is accompanied, conversely, by a growing awareness of the morbidity of the situation, in that we observe significant positive correlations between the death nlss and the daily national cases and deaths, indicating that the decrease in affect occurs simultaneously with, and despite, an attentional shift towards covid- -related mortality. we also observe a simultaneous increase in the analytic component of each english-language dataset over this same period, indicating a movement towards more logical and analytical, rather than intuitive and emotional, thinking. the potential implication of this is that the public is less perceptive of the risk that the pandemic poses to public health, since their emotional response is reduced and reducing [ ] .
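a minimal sketch of this correlation analysis on synthetic series: per-country pearson correlations between a death nls and the logarithm of daily deaths, then averaged across countries. the toy generator below is not the authors' data.

```python
# minimal sketch: per-country correlations, then a cross-country average.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
dates = pd.date_range("2020-03-15", periods=90)
corrs = {}
for country in ["gb", "us", "mx"]:
    daily_deaths = rng.poisson(np.linspace(1, 200, len(dates)))  # toy epidemic
    nls = 0.5 * np.log1p(daily_deaths) + 0.1 * rng.standard_normal(len(dates))
    frame = pd.DataFrame({"death_nls": nls,
                          "log_daily_deaths": np.log1p(daily_deaths)},
                         index=dates)
    corrs[country] = frame["death_nls"].corr(frame["log_daily_deaths"])

print(pd.Series(corrs))         # per-country pearson correlations
print(pd.Series(corrs).mean())  # cross-country average, as in the figure
```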
when analysing these correlations, we found that, overall, the cumulative cases and deaths correlate better with most linguistic categories than the daily data. however, while this is sensible in the early stages of the pandemic, it is unlikely to remain the case over a long time horizon due to humans' finite memory. for this reason, we proceeded with our comparison using the daily epidemiological data alone. the only exception is the cross-country average of the sadness component of the nlss, which is positively correlated with the epidemiological indicators and appears to be driven only by argentina's, chile's, and colombia's increasing use of words related to sadness. the remaining countries remain stationary at a lower-than-baseline value for this component. unfortunately, the spanish liwc dictionary does not yet have an analytic category. for example, van bavel et al. [ ] and loewenstein et al. [ ] describe that risk perception is driven more by association and affect-based processes than by analytic and reason-based processes, with the affect-based processes typically prevailing when there is disagreement between the two modes of thinking. the negative correlation between the intensity of the pandemic and affective processes, together with its positive correlation with the prevalence of analytic processes, suggests that public risk communication could be adjusted to re-balance the degree of affective and analytic thinking amongst members of the public, so as to achieve favourable risk-avoidance behaviour and, consequently, favourable public health outcomes. to support our claim that these observations are attributable to psychophysical numbing, we construct word co-occurrence networks using tweets in our dataset. given a set $\mathcal{T}$ of tweets, the word co-occurrence network $g(\mathcal{T})$ is represented by a weighted adjacency matrix $a(\mathcal{T})$ in which the nodes are words belonging to the death and affect liwc dictionaries. entry $a_{ij}(\mathcal{T})$ counts the number of co-occurrences between words $i$ and $j$ across all tweets in $\mathcal{T}$, and is computed as $a_{ij}(\mathcal{T}) = \sum_{t \in \mathcal{T}} b_{ti} b_{tj}$, where $b_{tk}$ counts the number of instances of word $k$ in tweet $t \in \mathcal{T}$. we ignore self-edges by imposing $a_{ii} = 0$, since it is the relationship between distinct words that is of interest. (see appendix b. for further details on the construction of these networks.) if the psychophysical numbing effect is legitimate, we expect that words in the death dictionary co-occur more frequently with other death-related words and less frequently with words in the affect dictionary. we construct three such networks by aggregating the word co-occurrences over three distinct periods: th march to th april, th april to rd may, and th may to th june. as we discussed previously, the first period coincides with the pandemic status of covid- declared by the who and has a high affect score but a low and increasing death score; the second has a high and relatively stable death score and a decreasing affect score; and the third has a high death score but one in which the affect score and some of its subcategories (e.g. anger, anxiety, negative emotion) increase again, which we attribute, at least partly, to the public response to the murder of george floyd and the subsequent black lives matter protests. in constructing these networks, we weight each country equally by taking a random sample of approximately , tweets from each country.
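a minimal sketch of this co-occurrence construction on a toy vocabulary: the per-tweet count matrix b is assembled first, the adjacency matrix follows the sum-of-products form given above, and the diagonal is zeroed to drop self-edges.

```python
# minimal sketch: word co-occurrence matrix over a set of tweets.
import numpy as np

vocab = ["death", "toll", "fear", "hope"]
idx = {w: i for i, w in enumerate(vocab)}
tweets = ["death toll death", "fear and hope", "death toll"]

b = np.zeros((len(tweets), len(vocab)))  # b[t, k]: count of word k in tweet t
for t, tweet in enumerate(tweets):
    for w in tweet.split():
        if w in idx:
            b[t, idx[w]] += 1

a = b.T @ b             # a_ij = sum_t b_ti * b_tj
np.fill_diagonal(a, 0)  # ignore self-edges (a_ii = 0)
print(a)
```

community detection (e.g. louvain modularity maximisation, as in the figures) would then be run on the resulting weighted graph.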
th march - th april. in this network (see fig. a ), we see two main clusters emerging. the first consists mostly of words associated with death (left), and the second of words associated with affect (right). the appearance of some of the affect-related words in the death cluster can be explained given the context of the pandemic. for example, the word "positive" is likely used in reference to the number of people that have tested positive for covid- , which is closely related to the conversation around covid- cases and deaths. similarly, the word "panic" likely reflects the early conversations around panic-buying of household goods, for example toilet paper and hand-sanitiser, and the word "isolat*" is likely used in calls for symptomatic individuals to self-isolate. thus, while some instances of affect-related words that appear in this predominantly death-based community are harder to explain without appealing to the existence of a true subjective experience of affect amongst the twitter users (e.g. "risk*"), the most important (in terms of node degree) of these affect-related words are more likely being used here in an affect-free sense and are appropriately grouped with death-related words given the context of the pandemic. the community structure we observe is thus consistent with our hypothesis of a separation between words belonging to the death and affect dictionaries. th april - rd may. in this network (see fig. b ), the two-cluster structure seen in the previous snapshot remains, with the cluster more centered on death on the left and a cluster corresponding to almost exclusive use of affect-related words on the right. the size of the death-related cluster has increased relative to the affect-based community, reflecting the higher death nls during this period. two new and important affect-related nodes appear in the death-based community for this time period: "care" and "fail*". these can once again be plausibly explained by the context of the pandemic. for example, the appearance of the word "care" in the death-related community can be explained in terms of the conversation surrounding the health care system and death care industries, the number of covid- patients being admitted to intensive care units, and, particularly for the united kingdom, the number of deaths that have occurred in care homes for the elderly. these are all clearly related to covid- deaths, and the word "care" in this context most likely constitutes part of the noun and topic of conversation rather than any expression of emotion. the word "fail*" could reflect the discussion around failures on the part of governments to respond with sufficient vigor to the public health crisis, e.g. a failure to impose social restrictions in a timely manner, or to meet testing quotas or quotas on the provision of personal protective equipment for key workers. for example, the polling company yougov finds that approximately % of respondents felt during this period that the us and uk governments had been handling the pandemic well, and that these numbers decreased throughout this period to approximately % [ ] . this does not, however, exclude the possibility that the appearance of "fail*" indicates a subjective emotional experience: it is possible that twitter users who fixate on government failures do so out of a sense of outrage with regard to these perceived failures. whether such outrage is motivated specifically by the human fatalities themselves or is merely a manifestation of broader political hostilities and polarisation in modern society remains open.
thus, while the appearance of "sure*", "fail*", and some other minor affect-related terms in the death community may be truly indicative of emotion in the conversation around covid- fatalities, the presence of many of the most highly co-occurring affect-related words in this predominantly death-related community can be explained by their appearance in common phrases related to covid- fatalities, e.g. the "death care" and "health care" industries, "care homes", "testing positive" for the virus, etc. these words, therefore, do not necessarily reveal emotion in the current context. we thus argue once again that this co-occurrence network and its community structure show that death- and affect-based words are well separated, consistent with our claim of psychophysical numbing. th may - th june. our argument remains unchanged for this period (see fig. c ). the only notable difference is that a significant proportion of the conversation surrounding death is focused on the political issues that inspired the black lives matter protests and on the protests themselves. this is apparent from the appearance of the word "protests" in the left-hand side's death-related community. altogether, this analysis demonstrates that words indicating a subjective emotional/affective experience and words related to death are well separated in this twitter data, which is consistent with the notion of psychophysical numbing as an explanation for the trends and correlations observed in figures and . for completeness, we include the equivalent co-occurrence graphs for the spanish-language tweets in appendix b. , from which similar conclusions can be drawn. in the previous section, we demonstrated our finding that as the pandemic intensifies, the proportion of words indicating emotion in the set of tweets posted in each country diminishes over time. this indicates that the actual emotional response to the pandemic diminishes as the intensity of the pandemic increases, implying a psychophysical numbing effect. we supported this explanation by showing that the word co-occurrence networks induced by our set of tweets exhibit a community structure that separates words in the death and affect dictionaries, suggesting that people do not talk about covid- deaths in a highly emotional tone. the following sections model the relationship between the progression of the covid- pandemic and the twitter users' perception using grounded theories of psychophysical numbing. our analysis suggests that the public's perception of the progression of the pandemic is logarithmic or, at least, sublinear. from figure , we observe that the correlation magnitudes between the nlss and the epidemiological data are generally larger in absolute value whenever the latter are taken in logarithmic scale. to exemplify this observation, we show in figure the z-scores of the death nlss and of the logarithms of the daily numbers of deaths and cases within each country. the general correspondence between all three normalised features in each country is striking. we propose that this can be explained in terms of the weber-fechner law [ ] , a quantitative statement, with its origins in psychology and psychophysics, regarding humans' perceived magnitude p of a stimulus with physical magnitude s.
it states that a human's perception of the magnitude of a stimulus varies as the logarithm of the physical magnitude s of the stimulus, meaning we are more sensitive to ratios when comparing different physical magnitudes than we are to absolute differences. in the continuum limit, eq. ( ) gives the following functional form for the weber-fechner law:

p(t) = k log( s(t) / s_0 ) + r(t),

where k and s_0 are real-valued parameters and r(t) is the residual. parameter k determines the sensitivity of perception to changes in the stimulus s, while s_0 determines the minimum threshold that the stimulus s must overcome in order to be perceived. the residual term r(t) is a random variable representing noise not directly captured by the stimulus. for instance, exogenous events can trigger abrupt peaks in the death score. this is the case, for example, with the murder of george floyd in the united states, or the peak in nigeria around april th, triggered by a number of prominent african figures dying from covid-19 around that day, including the nigerian president's top aide (see [ ]).

recall that the z-score of a sequence of observations y = (y_1, ..., y_T) is given by z_t = (y_t − µ_y)/σ_y, where µ_y and σ_y are the mean and standard deviation of y, respectively. we note that the correspondence is weaker for australia, nigeria, and south africa due to the relatively low number of cases in these countries (see the figure in the appendix for reference). the correspondence is also weaker in spain, for two reasons: due to its revision of the number of cases in late may, resulting in a day of "negative deaths", and due to its having recorded a day with no covid-19-related deaths, which was a significant event given that spain had seen many deaths until that point.

figure caption: snapshots of the word co-occurrences associated with death (green labels) and affect (red labels) for english-language tweets aggregated across all analyzed countries in three different time windows (sub-caption (c): may th to june th). the nodes are coloured according to their community label as obtained by maximising modularity with the louvain algorithm [ ]. edges with weight below a minimum number of co-occurrences were filtered out for visualisation purposes.

figure caption: the death nls p(t) and the national daily death rate (given in parentheses) for each country. data is smoothed with a moving average and standardized with their z-scores to make them visually comparable. vertical lines represent peaks in the death discourse caused by exogenous events (see main text for details), which we remove from the time series.

table caption: results from the fit of the weber-fechner law to the observed relationship between the death nls and the logarithm of the daily number of deaths in each country (see figure). overall, this model best describes the relationship between the daily number of deaths local to each country and the death nls.

in order to test the weber-fechner law, we fit a linear regression model between p_i^death(t), the death nls time series in country i, and log s_i(t), the logarithm of the daily number of deaths in the same country, and summarize the results of these fits in table . we find that eq. ( ) accurately models the data, with significant coefficients (p-value < . ) for all countries except spain. the sensitivity parameter k has the same order of magnitude for all significant countries. however, the country with the lowest k is several times less sensitive than the country with the highest, indicating that twitter users in different countries may react differently to the evolution of the pandemic.
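as an illustration of the fitting procedure just described, the following minimal sketch estimates k and s_0 for one country by ordinary least squares. it is not the authors' code: the array names `nls` and `deaths` are hypothetical stand-ins for the death nls series and the daily death counts, and recovering s_0 from the intercept assumes the functional form given above.

```python
import numpy as np
from scipy import stats

def fit_weber_fechner(nls, deaths):
    """Regress the death NLS on log daily deaths: p(t) = k*log(s(t)/s0) + r(t).

    nls, deaths: 1-D numpy arrays for one country; days with zero reported
    deaths are dropped, since the logarithm is undefined there.
    """
    mask = deaths > 0
    fit = stats.linregress(np.log(deaths[mask]), nls[mask])
    k = fit.slope                       # sensitivity to relative changes in s
    s0 = np.exp(-fit.intercept / k)     # threshold implied by the intercept
    return k, s0, fit.pvalue
```

the z-scoring shown in the figure only standardises the series for visual comparison; it does not change the significance of the fitted slope.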
the minimum stimulus threshold s_0, on the other hand, is always small: in most countries, except for the united states and the united kingdom, a single covid-19 death in a given day is enough to be perceived. conversely, the united states and the united kingdom need a somewhat larger number of daily deaths to be perceived, which is still small compared to the thousands of daily deaths registered in these countries during the observation period.

table caption: the results from the fit of a power law to the relationship between the death nls and the national daily death count. this is the best model in some cases, though it is outperformed by the weber-fechner law most of the time. *while we fit this model assuming a log-log relationship between p and s, we compute r² with linear p to make it comparable to the model implied by the weber-fechner law (see the corresponding equation in appendix a for details). this may cause negative values of r², as is the case for spain.

an alternative functional form for the relationship between human perception p of a stimulus and the physical magnitude s of the stimulus is a power-law relationship:

p(t) = ν s(t)^β + r̃(t),

where ν and β are parameters determining the perception from a stimulus of unit magnitude and the growth rate of the perception as a function of the stimulus magnitude, and r̃(t) is a residual term. this form has been shown to outperform the weber-fechner law in characterising human perception in a number of empirical studies [ ]. we therefore also fit this model to the relationship between the death nls p_i^death(t) and the national daily death counts s_i(t) for each country i, and report the results in table . in all cases, we observe sublinear exponents β for the perception of the daily deaths data, with significant exponents (p-value < . ) ranging between . and . . these exponents are of the same order of magnitude as the β of . reported in [ ], where psychophysical numbing in participants' perception of death statistics was measured in several laboratory experiments. as discussed previously, the data for spain is unusual for a number of reasons, so the model does not accurately describe the data in this instance. these results suggest that twitter users in certain countries are more sensitive to changes in the number of deaths than others.

both the weber-fechner law and the power-law relationship between the death nls and the daily number of reported deaths accurately model the data. each captures the phenomenon in which "the first few fatalities in an ongoing event elicit more concern than those occurring later on" [ ]. by way of comparison, we present in table the normalised root-mean-square error (nrmse) of these models, defined as

nrmse = sqrt( (1/n) Σ_t e(t)² ) / σ_p,

in addition to a linear model between p_i^death(t) and s_i(t) as a baseline "null" model. here, e(t) = p(t) − p̂(t) is the model residual, and n is the sample size. the models are directly comparable in this sense, since each involves only two parameters. bhatia [ ] made a similar model comparison to test psychophysical laws for subjective probability judgements of real-world events, in that case finding that the linear relationship was the best. in our case, however, a linear relationship between s and p is significantly worse than the present concave models of perception (see appendix a for the results of the linear model), reinforcing our hypothesis of psychophysical numbing. while the weber-fechner law is better than the power-law model overall, the difference in their goodness of fit, as measured by the nrmse, is marginal.
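a companion sketch for the power-law fit and the model comparison follows. as before, the names are hypothetical; the fit is done in log-log space, as the table footnote indicates, and normalising the rmse by the standard deviation of the observed series is our assumption, since the text does not spell out the normalising constant.

```python
import numpy as np
from scipy import stats

def fit_power_law(nls, deaths):
    """Fit p(t) = nu * s(t)**beta in log-log space (needs p > 0 and s > 0)."""
    mask = (deaths > 0) & (nls > 0)
    fit = stats.linregress(np.log(deaths[mask]), np.log(nls[mask]))
    return np.exp(fit.intercept), fit.slope, fit.pvalue   # nu, beta, p-value

def nrmse(p, p_hat):
    """Normalised RMSE of predictions p_hat against the observed series p."""
    e = p - p_hat
    return np.sqrt(np.mean(e ** 2)) / np.std(p)   # std-normalisation assumed
```

with both fits in hand, the comparison in the text amounts to evaluating `nrmse` for the weber-fechner prediction `k * np.log(deaths / s0)`, the power-law prediction `nu * deaths**beta`, and the linear baseline on the same days.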
both are reasonable descriptions of the observed relationship, and similar conclusions can be drawn from both. in particular, the parameters k and β from the weber-fechner law and power law, respectively, are analogous in their interpretation as measures of the sensitivity of a nation's twitter users to changes in the national covid-19 daily death rate. to illustrate this, we rank the countries in our dataset in order of sensitivity to changes in the local death rate, as measured separately by these two parameters, and plot the correlation between the countries' ranks in figure . here, low rank indicates high sensitivity to changes in the number of daily deaths nationally. the correlation between the two methods of ranking, according to k, the weber-fechner law slope parameter, and according to β, the power-law model exponent, is high. this shows that the sensitivity of each country is relatively robust between models. by both measures, therefore, twitter users tweeting in english and spanish from australia and argentina, respectively, appear to be the most sensitive to changes in the national daily death rate, while twitter users posting in english from south africa, india, and nigeria and in spanish from spain and chile appear to be the least sensitive to these changes.

figure caption: comparison of the rank of each country as determined by its k and β parameters in the weber-fechner and power-law fits, respectively, which measure the sensitivity of twitter users tweeting from each country to changes in the number of daily reported deaths. low rank indicates high sensitivity relative to the remaining countries. the correlation between countries' ranks from both measures is high.

we explored the country-by-country relationship between the linguistic features present in a large set of tweets posted in relation to the covid-19 pandemic and the progression/intensity of the pandemic as measured by the daily number of cases and deaths in each country we consider. by considering the change, relative to a baseline, in the percentage of words present in each tweet that are associated with a number of psychologically meaningful categories, here called linguistic scores, we observed significant trends that we believe are indicative of a psychophysical numbing effect [ ]. we found that the national linguistic scores (nlss, see eq. ( )) associated with emotion and affect decrease as the pandemic intensifies. this is in spite of a greater attentional focus on death and mortality and a simultaneous increase in the use of words indicating analytic reasoning. we showed, by constructing word co-occurrence networks on different time periods of the pandemic, that words related to death co-occur more frequently with other words related to death than they do with words indicating affect and emotion, and that this separation of affect from the conversation around death is also revealed by the community structure of this network. this is consistent with the notion of psychophysical numbing, which we believe explains these observations. we also showed that the psychophysical laws of weber-fechner and of power-law perception in humans accurately model the relationship between the frequency of words related to death and the actual daily number of covid-19 deaths in each country. we estimated sub-linear exponents in the power-law perception function that are of similar values to those previously estimated from psychological experiments [ ].
these exponents, together with parameter k of the weber-fechner law (see eq. ( )), tell us how sensitive the twitter users in each country are to their national covid-19 daily deaths, and were seen to vary by country, indicating inter-country differences in risk perception and sensitivity to death rates. such sensitivities were consistent across models (see fig. ), suggesting that these measures of a nation's twitter users' sensitivity to changes in the national death rate are robust features of the data. our findings illustrate the signaling power of twitter and demonstrate its potential use as a tool for monitoring public perception of risk during large-scale crisis scenarios. with the modelling and visualisation approaches we employ in this paper, policy-makers and public officials could track in near-to-real-time the public's attitudes towards threats to public well-being and the prevalence of factors important to public perception of risk, including the degree of outrage and the relative attentional focus on the threat. our findings also imply a functional form for agent perception of the system state in models of opinion dynamics. this will be instrumental for developing coupled opinion dynamics-epidemiological models, in which the bidirectional relationships between human perception, human behaviour, and epidemic progression are modelled endogenously.

a natural extension to this work would involve nowcasting and/or forecasting of certain economic indicators. our analysis has also been limited in that we assumed that only the national death rate is a significant predictor of perception. a more complete analysis should account for the effect of other countries' death statistics as a driver of local perception, or more broadly advance a process-level explanation of the cross-cultural differences we observe in the sensitivity to death statistics. this analysis could also be enhanced by relating these measures of risk perception to behavioural data, which, since "people's behavior is mediated by their perceptions of risk" [ ], may be useful for understanding the role of emotions in driving behaviours that are conducive to public health during crises. further, a deconstruction of the aggregate indicators we have developed to the state and regional level may be necessary to more accurately characterise the relationship between local crisis progression and human risk perception. we also stress that the results presented in this paper may be indicative only of the responses of twitter users posting from each of these countries in each of these languages, so extrapolating these results to the broader population will only be possible with a better understanding of the biases present in, and representativeness of, the dataset at hand.

bk acknowledges funding from the conacyt-sener: sustentabilidad energética fund, and jd acknowledges funding from the epsrc industrially focused mathematical modelling centre for doctoral training. the authors declare that they have no competing interests. the twitter data used in the manuscript is collected and maintained by banda et al. at the panacea lab [ ], and it is available at their website http://www.panacealab.org/covid19. the data on covid-19 confirmed cases and deaths were obtained from the "coronavirus pandemic (covid-19)" page of the our world in data website [ ], and the stable url for this data is https://covid.ourworldindata.org/data/owid-covid-data.csv.
bk and jd both conceived the idea, carried out the analysis, and wrote, read, and approved the final manuscript.

• liwc: linguistic inquiry and word count.
• who: world health organization.
• nls: national linguistic score.

in this section, we present further results of our models to give a more complete overview of their quality. besides the weber-fechner law and power-law models (see eqs. ( ) and ( )), we use the following linear relationship between s and p as our benchmark model:

p(t) = a s(t) + b + r(t),

where a and b are parameters. we summarize our results for the linear model in table . for all models, we compute the r² values

r² = 1 − Σ_t e(t)² / ((n − 1) σ_p²),

where e(t) = p(t) − p̂(t) is the model residual, σ_p² = Σ_{t=1}^{n} (p(t) − µ_p)² / (n − 1) is the variance of p(t), and n is the sample size. the r² values for all models are summarized in table . (note that as the power-law model implies a log-normal residual, the r² values can be negative.) from this table we see that, once again, the weber-fechner law is generally a better fit to the data across all countries, but that the power-law and weber-fechner models are often comparable and significantly better than the linear model. we also show in figures and scatterplots of the death nlss against the logarithm of the daily number of deaths in each country, with the y-axis in linear and log scale, respectively. red lines indicate the line of best fit, with slope equal to k and β in eqs. and , respectively.

b word co-occurrence analysis

in constructing the word co-occurrence networks presented in section . . , we perform basic text preprocessing, including taking the lower-case form of all letters, removing urls, removing punctuation, and removing the following small set of stopwords from the vocabulary: to, today, too, has, have, like. we retain hashtags, since liwc also recognises hashtags and because hashtags are an essential aspect of communication on twitter. it is also necessary to account for the fact that a number of "words" appearing in the liwc dictionary are in fact regular expressions to which many complete words in the twitter dataset map. for example, the "word" "isolat*" appears in the english liwc dictionary, to which each of the following words would map: "isolate", "isolated", "isolating". thus, construction of the word co-occurrence networks g_i involves a two-step procedure: first, constructing the raw word co-occurrence networks ĝ_i, in which the nodes are words exactly as they appear in the twitter dataset; and then reducing these to quotient graphs g_i by contracting nodes in ĝ_i that are matched by the same regular expression in the liwc dictionary. more formally: the liwc dictionary implies an equivalence relation ∼ on the vocabulary v implied by the twitter dataset, such that v ∼ u for words v, u ∈ v if both v and u are matched by the same regular expression in the liwc dictionary. the weight of the edge between equivalence classes V ⊂ v and U ⊂ v in g_i is then taken to be

w_{g_i}(V, U) = Σ_{x ∈ V} Σ_{y ∈ U} w_{ĝ_i}(x, y),

where w_g(x, y) is the weight of edge (x, y) in g. note that w_g(x, y) = w_g(y, x) and that w_g(x, y) = 0 if (x, y) is not an edge in g. (a sketch of this contraction is given at the end of this appendix.)

for completeness, we provide here the word co-occurrence graphs for the spanish-language tweets. we omit a discussion of the results, since similar conclusions can be drawn from these as from the english counterparts.

we include this section as a reference for the actual number of deaths in each country for the period we analysed throughout the paper, which we present in fig. .
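as referenced above, the two-step quotient-graph construction can be sketched as follows with networkx. the liwc patterns are treated as simple wildcard-suffix expressions (e.g. "isolat*"); collapsing a word to the first matching pattern, and dropping self-loops created by the contraction, are our own assumptions, as the text does not specify either choice.

```python
import re
import networkx as nx

def liwc_quotient_graph(G_raw, liwc_patterns):
    """Contract nodes of the raw co-occurrence graph G_raw that are matched
    by the same LIWC pattern, summing edge weights across class members."""
    compiled = [(p, re.compile(p.replace("*", r"\w*") + r"$"))
                for p in liwc_patterns]

    def canonical(word):
        for pattern, rx in compiled:
            if rx.match(word):
                return pattern       # merged node is labelled by the pattern
        return word                  # unmatched words remain themselves

    Q = nx.Graph()
    for u, v, data in G_raw.edges(data=True):
        cu, cv = canonical(u), canonical(v)
        if cu == cv:
            continue                 # drop self-loops created by contraction
        w = data.get("weight", 1)
        if Q.has_edge(cu, cv):
            Q[cu][cv]["weight"] += w
        else:
            Q.add_edge(cu, cv, weight=w)
    return Q
```

the communities shown in the figures can then be obtained with `nx.community.louvain_communities(Q, weight="weight")` (available in recent networkx releases) or with the python-louvain package.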
covid-19 government response tracker, blavatnik school of government
risk perception and behaviors: anticipating and responding to crises
psychophysical numbing: an empirical basis for perceptions of collective violence
"if i look at the mass i will never act": psychic numbing and genocide
insensitivity to the value of human life: a study of psychophysical numbing
psychophysical numbing: when lives are valued less as the lives at risk increase
perception matters: psychophysics for economists
#lockdown: network-enhanced emotional profiling at the times of covid-19
the circumplex model of affect: an integrative approach to affective neuroscience, cognitive development, and psychopathology
emotions evoked by common words and phrases: using mechanical turk to create an emotion lexicon
mood of india during covid-19 - an interactive web portal based on emotion analysis of twitter data
go eat a bat
risk perceptions of covid-19 around the world
relationships between initial covid-19 risk perceptions and protective health behaviors: a national survey
risk perception through the lens of politics in the time of the covid-19 pandemic
distributional structure
the effect of differential failure on expectation of success, reported anxiety, and response uncertainty
discrepancy from expectation in relation to affect and motivation: tests of mcclelland's hypothesis
affective responses to uncertain real-world outcomes: sentiment change on twitter
predicting risk perception: new insights from data science
estimating geographic subjective well-being from twitter: a comparison of dictionary and data-driven language methods
a large-scale covid-19 twitter chatter dataset for open scientific research - an international collaboration
openstreetmap contributors
coronavirus pandemic (covid-19)
the psychological meaning of words: liwc and computerized text analysis methods
the development and psychometric properties of liwc2015
anxious or angry? effects of discrete emotions on the perceived helpfulness of online reviews
predicting elections with twitter: what 140 characters reveal about political sentiment
the pragmatics of swearing
beyond psychic numbing: a call to awareness
responding to community outrage: strategies for effective risk communication
using social and behavioural science to support covid-19 pandemic response
risk as feelings
covid-19: government handling and confidence in health authorities
fast unfolding of communities in large networks
africa's top virus deaths
the cognitive psychology of sensitivity to human fatalities: implications for life-saving policies
vector space semantic models predict subjective probability judgments for real-world events

figure caption: snapshots of the word co-occurrences associated with death ("muerte", green labels) and affect ("afecto", red labels) for spanish-language tweets aggregated across all analyzed countries in three different time windows (see sub-captions). the nodes are coloured based on the community labels obtained by maximising modularity using the louvain algorithm. edges with weight below a minimum number of co-occurrences were filtered out for visualisation purposes.

the authors would like to thank mirta galesic, rodrigo leal cervantes, rita maria del rio chanona, françois lafond, and j. doyne farmer for helpful feedback, and the oxford inet complexity economics group for stimulating discussions.
key: cord- -f i sbwt authors: pastor-escuredo, david; tarazona, carlota title: characterizing information leaders in twitter during covid-19 crisis date: - - journal: nan doi: nan sha: doc_id: cord_uid: f i sbwt

(author affiliations: center of innovation and technology for development, technical university madrid, spain; lifed lab, madrid, spain.)

information is key during a crisis such as the current covid-19 pandemic, as it greatly shapes people's opinion, behaviour and even their psychological state. it has been acknowledged by the secretary-general of the united nations that the infodemic of misinformation is an important secondary crisis produced by the pandemic. infodemics can amplify the real negative consequences of the pandemic in different dimensions: social, economic and even sanitary. for instance, infodemics can lead to hatred between population groups that fragments the society, influencing its response, or result in negative habits that help the pandemic propagate. on the contrary, reliable and trustful information, along with messages of hope and solidarity, can be used to control the pandemic, build safety nets and help promote resilience and antifragility. we propose a framework to characterize leaders in twitter based on the analysis of the social graph derived from the activity in this social network. centrality metrics are used to identify relevant nodes that are further characterized in terms of user parameters managed by twitter. we then assess the resulting topology of clusters of leaders. although this tool may be used for surveillance of individuals, we propose it as the basis for a constructive application to empower users with a positive influence on the collective behaviour of the network and the propagation of information.

misinformation and fake news are a recurrent problem of our digital era [ ] [ ] [ ]. the volume of misinformation and its impact grow during large events, crises and hazards [ ]. when misinformation turns into a systemic pattern it becomes an infodemic [ , ]. infodemics are especially frequent in social networks, which are distributed systems of information generation and spreading. for this to happen, the content is not the only variable: the structure of the social network and the behavior of relevant people also contribute greatly [ ]. during a crisis such as the current covid-19 pandemic, information is key, as it greatly shapes people's opinion, behaviour and even their psychological state [ ] [ ] [ ]. however, the greater the impact, the greater the risk [ ]. it has been acknowledged by the secretary-general of the united nations that the infodemic of misinformation is an important secondary crisis produced by the pandemic. during a crisis, time is critical, so people need to be informed at the right time [ , ]. furthermore, information during a crisis leads to action, so the population needs to be properly informed to act right [ ]. thus, infodemics can amplify the real negative consequences of the pandemic in different dimensions: social, economic and even sanitary. for instance, infodemics can lead to hatred between population groups [ ] that fragments the society, influencing its response, or result in negative habits that help the pandemic propagate. on the contrary, reliable and trustful information along with messages of hope and solidarity can be used to control the pandemic, build safety nets and help promote resilience and antifragility. to fight misinformation and hate speech, content-based filtering is the most common approach taken [ , [ ] [ ] [ ] ].
the availability of deep learning tools makes this task easier and scalable [ ] [ ] [ ]. also, positioning in search engines is key to ensure that misinformation does not dominate the most relevant results of the searches. however, in social media, besides content, people's individual behavior and the network's properties, dynamics and topology are other relevant factors that determine the spread of information through the network [ ] [ ] [ ]. we propose a framework to characterize leaders in twitter based on the analysis of the social graph derived from the activity in this social network [ ]. centrality metrics are used to identify relevant nodes that are further characterized in terms of user parameters managed by twitter [ ] [ ] [ ] [ ] [ ]. although this tool may be used for surveillance of individuals, we propose it as the basis for a constructive application to empower users with a positive influence on the collective behaviour of the network and the propagation of information [ , ].

tweets were retrieved using the real-time streaming api of twitter. two concurrent filters were used for the streaming: location and keywords. location was restricted to a bounding box enclosing the city of madrid. each tweet was analyzed to extract mentioned users, retweeted users, quoted users or replied users. for each of these events the corresponding nodes were added to an undirected graph, as well as a corresponding edge initializing the edge property "flow". if the edge already existed, the property "flow" was incremented. this procedure was repeated for each tweet registered. the network was completed by adding the property "inverse flow", that is 1/flow, to each edge. the resulting network featured nodes and edges.

to compute centrality metrics, the network described above was filtered. first, users with a node degree (number of edges connected to the node) less than a given threshold (set experimentally) were removed from the network, as well as the edges connected to those nodes. the reason for this filtering was to reduce computation cost, as algorithms for centrality metrics are computationally expensive, and also to remove poorly connected nodes, as the network built comes from sparse data (retweets, mentions and quotes). however, it is desirable to minimize the amount of filtering performed, in order to study large-scale properties within the network. the resulting network featured nodes and edges. additionally, the network was filtered to be connected, which is a requirement for the computation of several of the centrality metrics described below. for this purpose the connected subnetworks were identified, and the largest connected subnetwork was selected as the target network for analysis. the resulting network featured nodes and edges.

several centrality metrics were computed: cfbetweenness, betweenness, closeness, cfcloseness, eigenvalue, degree and load. each of these centrality metrics highlights a specific relevance property of a node with regard to the whole flow through the network. descriptor explanations are summarized in table . besides the network-based metrics, twitter user parameters were collected: followers, following and favorites, so that their relationships with the relevance metrics could be assessed. we applied several statistical tools to characterize users in terms of the relevance metrics. we also implemented visualizations of different variables and of the network for a better understanding of the characterization and topology of leading nodes.
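a minimal networkx sketch of this pipeline is given below. the input format and the degree threshold are hypothetical, since the paper states only that the threshold was set experimentally; which edge attribute each centrality consumes is also not stated in the text, so the choices below (treating "inverse flow" as a distance and "flow" as a conductance-like weight) are our assumptions.

```python
import networkx as nx

def build_interaction_graph(tweet_events):
    """tweet_events: iterable of (author, counterpart) pairs, one pair per
    mention / retweet / quote / reply extracted from a tweet."""
    G = nx.Graph()
    for author, other in tweet_events:
        if G.has_edge(author, other):
            G[author][other]["flow"] += 1
        else:
            G.add_edge(author, other, flow=1)
    for _, _, data in G.edges(data=True):
        data["inverse flow"] = 1.0 / data["flow"]   # distance-like weight
    return G

def centrality_table(G, min_degree=5):
    """Drop weakly connected users, keep the largest connected component
    (current-flow metrics require connectivity; they also need scipy),
    and compute the centrality metrics used in the study."""
    core = G.subgraph([n for n, d in G.degree() if d >= min_degree])
    giant = core.subgraph(max(nx.connected_components(core), key=len)).copy()
    return giant, {
        "betweenness": nx.betweenness_centrality(giant, weight="inverse flow"),
        "cfbetweenness": nx.current_flow_betweenness_centrality(giant, weight="flow"),
        "closeness": nx.closeness_centrality(giant, distance="inverse flow"),
        "cfcloseness": nx.current_flow_closeness_centrality(giant, weight="flow"),
        "eigenvalue": nx.eigenvector_centrality(giant, weight="flow", max_iter=500),
        "degree": dict(giant.degree()),
        "load": nx.load_centrality(giant, weight="inverse flow"),
    }
```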
we compared the relevance in the network derived from the centrality metrics with the user profile variables of twitter: number of followers, number of following and retweet count. figure shows a scatter-plot matrix among all variables. the principal diagonal of the figure shows the distribution of each variable, which is typically characterized by a high concentration at low values and a very long tail. these distributions imply that a few nodes concentrate most of the relevance within the network. more surprisingly, the same kind of distribution is observed for twitter user parameters such as the number of followers or friends (following).

the load centrality of a node is the fraction of all shortest paths that pass through that node; load centrality is slightly different from betweenness. the scatter plots show that there is no significant correlation between variables, except for the pair of betweenness and load centralities, as expected given their similar definitions. this fact is remarkable, as different centrality metrics provide different perspectives on the leading nodes within the network, and relevance does not necessarily correlate with the number of related users but also depends on the content dynamics.

users were ranked using one variable as the reference. figure shows the ranking resulting from using the eigenvalue centrality as the reference. the values were saturated at a high percentile of the distribution to improve visualization and avoid the effect of single values that are far out of range. this visualization confirms the lack of correlation between variables and the highly asymmetric distribution of the descriptors. figure summarizes the values of each leader for each descriptor, showing that even within the top-ranked leaders there is very large variability. this means that some nodes are singular events within the network that require further analysis to be interpreted, as they could be leaders in society or just a product of the network dynamics. figure shows the ranking resulting from using current-flow betweenness centrality as the reference. in this case, the distribution of the reference variable is smoother and shows a more gradual behavior of leaders.

to assess how the nodes with high relevance are distributed, we projected the network into subgraphs by selecting the nodes above a certain level of relevance (a threshold on the network). the resulting graphs may therefore not be connected. the eigenvalue-ranked graph shows high connectivity and very big nodes (see fig. ). this is consistent with the definition of eigenvalue centrality, which highlights how a node is connected to nodes that are themselves highly connected. this structure has implications for the reinforcement of specific messages and information within highly connected clusters, which can act as promoters of solutions or may become lobbies of information. the current-flow betweenness shows an unconnected graph, which is very interesting, as decentralized nodes play a key role in transporting information through the network (see fig. ). the current-flow closeness also shows an unconnected graph, which means that the social network is rather homogeneously distributed overall, with parallel communities of information that do not necessarily interact with each other (see fig. ). by increasing the size of the graph, more clusters can be observed, especially in the eigenvalue-ranked network (fig. ). the ranking and projection procedure is sketched below.
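a minimal sketch of the ranking, saturation and relevance-thresholded projection described above follows, assuming the metric dictionaries produced earlier. the saturation percentile and the relevance quantile are placeholders, since the concrete values are elided in the source.

```python
import pandas as pd

def rank_and_saturate(metrics, reference="eigenvalue", pct=0.99):
    """metrics: dict of {metric_name: {user: value}} as returned above.
    Sort users by the reference metric and cap values at a high percentile
    so that single extreme values do not dominate the visualisation."""
    df = pd.DataFrame(metrics).sort_values(reference, ascending=False)
    return df.clip(upper=df.quantile(pct), axis=1)

def relevance_subgraph(G, scores, quantile=0.99):
    """Subgraph induced by users whose score exceeds the relevance threshold;
    the result need not be connected, as discussed in the text."""
    s = pd.Series(scores)
    return G.subgraph(s[s >= s.quantile(quantile)].index).copy()
```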
some clusters also appear for the current-flow betweenness and current-flow closeness (see figs. and ). these clusters may have a key role in establishing bridges between different communities of practice, knowledge or region-determined groups. as the edges of the network are characterized in terms of flows between users, these bridges can be understood in terms of the volume of information between communities.

the distributions of the centrality metrics indicate that there are some nodes with massive relevance. these nodes can be seen as events within the flow of communication through the network [ ] that require further contextualization to be interpreted. these nodes can propagate misinformation or make news or messages viral. further research is required to understand the cause of these massive-relevance events, for instance whether they are related to a relevant concept or message or whether they are an emerging product of the network dynamics and topology. another way to assess these nodes is to check whether they consistently behave this way over time or are only a temporal event. it may also be necessary to contextualize them with the type of content they normally spread in order to understand their exceptional relevance.

besides the existence of massive-relevance nodes, quantifying and understanding the distribution of highly relevant nodes has many potential applications for spreading messages that reach a wide number of users within the network. current-flow betweenness in particular seems a good indicator for identifying nodes with which to create a safety net in terms of information and positive messages. the distribution of the nodes could be studied for the general network or for different layers or subnetworks, isolated depending on several factors: type of interaction, type of content or some other behavioral pattern. experimental work is needed to test how a message, either positive or negative, spreads when started at, or close to, one of the relevant nodes. for this purpose we are working towards integrating a network of concepts with the network of leaders. understanding the dynamics of narratives and concept spreading is key for a responsible use of social media for building up resilience against crises. we also plan to build interactive graph visualizations to browse the relevance of the network and dynamically investigate how relevant nodes are connected and how specific parts of the graph are ranked, to really understand the distribution of the relevance variables, as statistical parameters alone are not suitable to characterize a common pattern. it is necessary to make a dynamic ethical assessment of the potential applications of this study: understanding the network can be used for control purposes. however, we consider it necessary that social media become the basis of a pro-active response in terms of conceptual content and information. digital technologies must play a key role in building up resilience and tackling crises.

fake news detection on social media: a data mining perspective
the science of fake news
fake news and the economy of emotions: problems, causes, solutions. digital journalism
social media and fake news in the 2016 election
viral modernity? epidemics, infodemics, and the 'bioinformational' paradigm
how to fight an infodemic. the lancet
the covid-19 social media infodemic
corona virus (covid-19) "infodemic" and emerging issues through a data lens: the case of china
"infodemic": leveraging high-volume twitter data to understand public sentiment for the covid-19 outbreak
infodemic and risk communication in the era of cov-19
information flow during crisis management: challenges to coordination in the emergency operations center
the signal code: a human rights approach to information during crisis
quantifying information flow during emergencies
measuring political polarization: twitter shows the two sides of venezuela
false news on social media: a data-driven survey
hate speech detection: challenges and solutions
an emotional analysis of false information in social media and news articles
declare: debunking fake news and false claims using evidence-aware deep learning
csi: a hybrid deep model for fake news detection
a deep neural network for fake news detection
dynamical strength of social ties in information spreading
impact of human activity patterns on the dynamics of information diffusion
efficiency of human activity on information spreading on twitter
multiple leaders on a multilayer social media
the ties that lead: a social network approach to leadership. the leadership quarterly
detecting opinion leaders and trends in online social networks
exploring the potential for collective leadership in a newly established hospital network
who takes the lead? social network analysis as a pioneering tool to investigate shared leadership within sports teams
discovering leaders from community actions
analyzing world leaders interactions on social media

we would like to thank the center of innovation and technology for development at technical university madrid for support and valuable input, especially xose ramil, sara romero and mónica del moral. thanks also to pedro j. zufiria, juan garbajosa, alejandro jarabo and carlos garcía-mauriño for collaboration.

key: cord- - r xx n authors: shanthakumar, swaroop gowdra; seetharam, anand; ramesh, arti title: understanding the socio-economic disruption in the united states during covid-19's early days date: - - journal: nan doi: nan sha: doc_id: cord_uid: r xx n

in this paper, we collect and study twitter communications to understand the socio-economic impact of covid-19 in the united states during the early days of the pandemic. our analysis reveals that covid-19 gripped the nation during this time, as is evidenced by the significant number of trending hashtags. with infections soaring rapidly, users took to twitter asking people to self-isolate and quarantine themselves. users also demanded the closure of schools, bars, and restaurants as well as the lockdown of cities and states. the communications reveal the ensuing panic buying and the unavailability of some essential goods, in particular toilet paper. we also observe users expressing their frustration in their communications as the virus spread continued. we methodically collect a total of , tweets by identifying and tracking trending covid-related hashtags. we then group the hashtags into six main categories, namely 1) general covid, 2) quarantine, 3) panic buying, 4) school closures, 5) lockdowns, and 6) frustration and hope, and study the temporal evolution of tweets in these hashtags. we conduct a linguistic analysis of words common to all the hashtag groups and specific to each hashtag group. our preliminary study presents a succinct and aggregated picture of people's response to the pandemic and lays the groundwork for future fine-grained linguistic and behavioral analysis.
our preliminary study presents a succinct and aggregated picture of people's response to the pandemic and lays the groundwork for future fine-grained linguistic and behavioral analysis. covid- (also known as the novel coronavirus) is the world's first global pandemic and has affected humans in all countries of the world. while humanity has seen numerous epidemics including a number of deadly ones over the last two decades (e.g., sars, mers, ebola), the grief and disruption that covid- will inflict is incomparable (and perhaps unimaginable). at the time of writing this paper, covid- is still rapidly spreading around the world and projections for the next few months are grim and extremely disconcerting. permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. copyrights for components of this work owned by others than acm must be honored. abstracting with credit is permitted. to copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. request permissions from permissions@acm.org. conference' , july , washington, dc, usa © association for computing machinery. acm isbn -x-xxxx-xxxx-x/yy/mm. . . $ . https://doi.org / . /nnnnnnn.nnnnnnn with no cure in sight and with the chances of covid- reemerging for a second (or multiple) time(s) even after the world manages to contain this first outbreak, it is critical that we understand and analyze the socio-economic disruptions of the first outbreak, so that we are better prepared to handle it in the future. additionally, with ever-increasing mobility of humans and goods, it is only prudent to assume that such epidemics are likely to occur in the future. the learnings from covid- will also enable humankind to prevent such epidemics from transforming into global pandemics and minimize the socio-economic disruption. in this preliminary work, our goal is to analyze the socio-economic disruption caused by covid- in the united states of america, understand the chain of events that occurred during the spread of the infection, and draw meaningful conclusions so that similar mistakes can be avoided in the future. though twitter data has previously been shown to be biased [ ] , twitter has emerged as the primary media for people to express their opinion especially during this time and our study offers a perspective into the impact as self-disclosed by people in a form that is easily understandable and can be acted upon. we summarize our main contributions below. • we collect , tweets from twitter between march th to march th , a time period when the virus significantly spread in the us and quantitatively demonstrate the socioeconomic disruption and distress experienced by the people. calls for closures started off with schools (e.g., #closenycschools), then moved on to bars and restaurants (e.g., #barsshut), and finally to entire cities and states (e.g., #lockdownusa). while these calls were initially mainly confined to the seattle, bay area, and ny regions (e.g., #seattleshutdown, #shutdownnyc), they later expanded to include other parts of the country (e.g., #shutdownflorida, #vegasshutdown). alongside, panic buying and hoarding escalated with essential items particularly toilet paper becoming unavailable in stores (e.g., #panicbuying, #toiletpapercrisis). 
• we observe increased calls for social distancing, quarantining, and working from home to limit the spread of the disease (e.g., #socialdistancingnow, #workfromhome). to slow the exponential increase in the number of infections, people also rallied for flattening the curve and staying at home for extended periods (e.g., #flattenthecurve). the challenges of working from home also surface in communications (e.g., #stayhomechallenge). with the passage of time, we see an increased fluctuation in emotions, with some people expressing their anger at individuals flouting social distancing calls (e.g., #covidiots), while others rallied people to fight the disease (e.g., #fightback) and to save workers (e.g., #saveworkers).

• we group the hashtags into six main categories, namely 1) general covid, 2) quarantine, 3) school closures, 4) panic buying, 5) lockdowns, and 6) frustration and hope, to quantitatively and qualitatively understand the chain of events. we observe that general covid and quarantine related messages remain trending throughout the duration of our study. we observe calls for closing schools and universities peaking in the middle of march and then reducing when the closures go into effect. we observe a similar trend with panic buying. lockdowns also have a significant number of tweets, with calls initially being focused on the closure of bars, followed by cities and then states. tweets in the frustration and hope hashtag group show an overall increasing trend as the struggle with the virus mounts.

• we then present a linguistic analysis of the tweets in the different hashtag groups and present the words that are representative of each group. we observe that words such as family, life, health and death are common across hashtag groups. we observe mentions of mental health, a possible consequence of social isolation. we also observe solidarity with essential workers and gratitude towards them (#saveworkers). our preliminary study unearths and summarizes the critical public responses surrounding covid-19, paving the way for more insightful fine-grained linguistic and graph analysis in the future.

in this section, we discuss our methodology for data collection from twitter to investigate the socio-economic distress and disruption in the united states caused by covid-19 during its early days. we collect data using the twitter search api. the results presented in this paper are based on the data collected from march to march . we track the trending covid-related hashtags every day and collect the tweets in those specific hashtags. we repeat this process to collect a total of , tweets during this time period. we group the hashtags into six main categories, namely 1) general covid, 2) quarantine, 3) school closures, 4) panic buying, 5) lockdowns, and 6) frustration and hope, to quantitatively and qualitatively understand the chain of events (a sketch of the grouped counting is given at the end of this section). we collect data on a per-day basis for the different hashtags as and when they become trending. tables and show the number of tweets in each category and the grouping of the hashtags by category. we observe that the total number of tweets as grouped by hashtags is higher than the total number of tweets. this is because tweets can contain multiple hashtags and thus the same tweet can be grouped into multiple categories. we present some example tweets in a table to illustrate the types of communications occurring on twitter during this period.
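as referenced above, the grouped counting can be sketched as follows before the categories are detailed below. the hashtag sets are illustrative excerpts from the groups named in the text, not the complete lists of the grouping table, and the input layout (a dataframe with a date and a list of hashtags per tweet) is an assumption.

```python
import pandas as pd

# Illustrative excerpts of the hashtag groups named in the text.
GROUPS = {
    "quarantine": {"#socialdistancingnow", "#workfromhome", "#flattenthecurve"},
    "panic buying": {"#panicbuying", "#toiletpapercrisis"},
    "lockdowns": {"#lockdownusa", "#shutdownnyc", "#seattleshutdown"},
    # ... remaining groups omitted for brevity
}

def daily_group_counts(tweets):
    """tweets: DataFrame with a 'date' column and a 'hashtags' column holding
    a list of hashtags per tweet. A tweet whose hashtags span several groups
    is counted once per group, which is why the group totals can exceed the
    number of distinct tweets."""
    out = []
    for group, tags in GROUPS.items():
        hit = tweets["hashtags"].apply(lambda hs: any(h.lower() in tags for h in hs))
        out.append(tweets.loc[hit].groupby("date").size().rename(group))
    return pd.concat(out, axis=1).fillna(0).astype(int)
```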
1. general covid: in this category, we group hashtags containing general covid-related messages, as it is the most discussed topic in conversations. this grouping is done by accumulating hashtags related to covid-19.

2. quarantine: calls for social distancing and quarantines flooded twitter during this outbreak. communications centered around quarantines, working from home and flattening the curve to slow the spread of the virus.

3. school closures: in this category, we collect data related to school closures. before states decided to close schools, users on twitter demanded that the government shut down public schools and universities. we collect data from a number of hashtags centered around this call for action.

4. panic buying: the spread of the virus also resulted in panic buying and hoarding. people rushed to shopping marts and there was a huge panic buying of sanitizers and toilet paper. this panic buying resulted in a severe shortage of toilet paper around the middle of march, an issue that remained unresolved till the first week of april.

5. lockdowns: with covid-19 spreading unabated, lockdowns of public stores, bars, restaurants, and cities began in many states. this resulted in a surge in tweets related to lockdowns.

6. frustration and hope: emotions ran high during these times, with people expressing anger and resentment towards those not abiding by social distancing and quarantine rules. alongside, people also rallied to support workers working hard to keep essential services running. with the beginning of april approaching, many people started to worry about their next month's rent.

due to the data collection limits imposed by twitter, we are able to collect and analyze only a portion of the tweets. though we started collecting data as quickly as we conceived of this project, we were unable to collect data during the first week of march. though we ran our script to collect data as far back as early march, because of the way twitter provides data, we obtained a limited number of tweets for that period. additionally, due to the rapidly evolving situation, it is likely that we have inadvertently missed some important hashtags, despite our best efforts. as is the case with most studies based on twitter data, we also acknowledge the presence of bias in data collection [ ]. having said that, the goal of this study is to provide a panoramic, summarized view of the impact of the pandemic on people's lives and aggregate public opinion as expressed by them. due to the nature of this study, we are confident that the results presented here help in appreciating the sequence of events that transpired and in better preparing ourselves for a possible second outbreak of covid-19 or another pandemic.

in this section, we present observations and results based on our preliminary analysis of the tweets. we study the popularity of individual hashtags and investigate how the number of tweets in particular hashtag groups evolves over time. we also explore the term frequency for each hashtag group to understand the main points of discussion. our analysis summarizes the critical public responses surrounding covid-19 and paves the way for more insightful and fine-grained linguistic analysis in the future.

example tweet (from the table of sample tweets): "when are we going to #cancelrent in this state? hundreds of thousands are filing for unemployment and can't pay rent. sure, we can't be evicted, but what's preventing companies from coming after us after this is over?"

figure a shows the top hashtags observed in our data.
as expected, we see that hashtags corresponding directly to covid or coronavirus are the most popular hashtags, as most communications are centered around them. we observe that hashtags around social isolation, staying at home, and quarantining are also popular. figure b shows the most popular hashtags by date. similar to figure a, we observe that hashtags related directly to covid and social distancing trend most on twitter. the figures and the number of tweets highlight how the pandemic gripped the united states with its rate of spread.

we investigate the evolution of the number of tweets in various hashtag groups over time. to calculate the number of tweets in each hashtag group, we count the number of mentions of hashtags in that group across all the tweets. if a tweet contains more than one hashtag, it is counted as part of all the hashtag groups mentioned in it. as the numbers of tweets for the hashtag groups vary significantly, we plot the counts on a logarithmic scale. interestingly, from figure b, we observe that panic buying and calls for school closures peak around the middle of march and then decrease as school closures and rationing of many essential goods such as toilet paper, cereal, and milk take effect. from figure c, we see that calls for lockdowns related to schools, bars, and cities peak in the middle of march. with the virus spreading unabated, we observe intense calls for lockdowns of cities and entire states around the beginning of the fourth week of march, resulting in an increased number of tweets in this category. with the passage of time, we observe people increasingly expressing their frustration and distress in communications, while some hashtags attempt to inject a more positive outlook.

in this section, we present results from a linguistic word-usage analysis across the different hashtag groups. first, we identify and present the most commonly used words across all the hashtags. to construct the first group of common words across all hashtags, we remove the words that are the same as or similar to the hashtags mentioned in the grouping table, as those words are redundant and tend to also be high in frequency. we also remove the names of places and governors, such as new york, massachusetts and andrew cuomo. after filtering out these words, we then rank the words based on their occurrence in multiple groups and their combined frequency across all the groups. we observe words such as family, health, death, life, work, help, thank, need, time, love, crisis. in table , we present some notable example tweets containing the common words. while one may think that health refers to virus-related health issues, we notice that many people also refer to mental health in their tweets as a possible consequence of social distancing and anxiety caused by the virus. we also observe the usage of words such as death and crisis to indicate the seriousness of the situation. supporting workers and showing gratitude toward them is another common tweet pattern worth mentioning. we plan to use these observations to guide our future exploration and fine-grained analysis.

example from the table of common words - death: "we must act very fast. first, we take care of the health care and emergency workers. then, we take care of whoever is in charge of keeping netflix and hulu running or it's going to get ugly #distancesocializing #coronavirus"

second, we present the most semantically meaningful and uniquely identifying words in each hashtag group. to do this, we remove the common words calculated in the above step from each group.
from the obtained list of words after the filtering, we then select the top words. due to lack of space, we only present results for four hashtag groups. figure gives the uniquely identifying and semantically meaningful words in each hashtag group. in the general covid group, we find words such as impact, response, resource, doctor. similarly, for school closures, we find words such as teacher, schedule, educator, book, class. the panic buying top words mostly echo the shortages experienced by people during this time, such as roll and tissue (referring to toilet paper), hoard, bidet (as an alternative to toilet paper), wipe, and water. top words in the lockdowns group include immigration, shelter, safety, court, and petition, signifying the different issues surrounding lockdown.

in this section, we outline existing research related to modeling and analyzing twitter and web data to understand the social, political, psychological and economic impacts of a variety of different events. due to the recent nature of the outbreak, as far as we are aware there are no published research results related to covid-19. due to space limitations, we only discuss social-media analysis work that is closely related to our work. twitter has been used to study political events and related stance [ , ], human trafficking [ ], and public health [ - , , ]. several works perform fine-grained linguistic analysis on social media data [ , , ].

in this paper, we studied twitter communications in the united states during the early days of the covid-19 outbreak. as the disease continued to spread, we observed that calls for closures of schools, bars, cities and entire states, as well as for social distancing and quarantining, quickly gained fervor. alongside, we observed an increase in panic buying and a lack of availability of essential items, in particular toilet paper. we also conducted a linguistic word-usage analysis and distinguished between words that are common to multiple hashtag groups and words that uniquely and semantically identify each hashtag group. in addition to words that represent the group, our analysis unearths many words related to emotion in the tweets. also, our qualitative analysis reveals that the words are used in multiple different contexts, which is worth delving further into. the research presented in this paper is preliminary, and its primary aim is to quantitatively outline the socio-economic distress already caused by covid-19 so that we as a society can learn from this experience and be better prepared if covid-19 (or maybe another pandemic) were to (re)emerge in the future. at the time of writing this paper, the infection spread has still not reached its peak in the united states. therefore, we plan to keep collecting data to understand and investigate the socio-economic and political impact of covid-19. we plan to expand on our current research by performing topic modeling and expanding our linguistic analysis to unearth the main topics being discussed in the tweets. we also plan to conduct sentiment analysis to understand the extent of positive and negative sentiments in the tweets.
detecting and characterizing mental health related self-disclosure in social media
predicting depression via social media
how social media will change public health
all i know about politics is what i read in twitter: weakly supervised models for extracting politicians' stances from twitter
bumps and bruises: mining presidential campaign announcements on twitter
characterizing sleep issues using twitter
discovering, assessing, and mitigating data bias in social media
using twitter to understand the human bowel disease community: exploratory analysis of key topics
weakly supervised cyberbullying detection using co-trained ensembles of embedding models
the impact of environmental stressors on human trafficking
a socio-linguistic model for cyberbullying detection
fine-grained analysis of cyberbullying using weakly-supervised topic models

key: cord- -ad avzd authors: gharavi, erfaneh; nazemi, neda; dadgostari, faraz title: early outbreak detection for proactive crisis management using twitter data: covid-19 a case study in the us date: - - journal: nan doi: nan sha: doc_id: cord_uid: ad avzd

during a disease outbreak, timely non-medical interventions are critical in preventing the disease from growing into an epidemic and ultimately a pandemic. however, taking quick measures requires the capability to detect the early warning signs of the outbreak. this work collects twitter posts surrounding the covid-19 pandemic expressing the most common symptoms of covid-19, including cough and fever, geolocated to the united states. through examining the variation in twitter activities at the state level, we observed a temporal lag between the rises in the number of symptom-reporting tweets and officially reported positive cases, which varies between and days.

"starting the new year off right with a cough and fever!"
"starting the new year off right, sick as a dog with a high fever and a nasty cough. craptastic."
"starting with a fever and flu like symptoms is not how i pictured this decade starting"
"my ribs hurt when i cough so i don't want to cough but i have to cough i hate it here"

these are only a few examples of the many twitter messages (known as tweets) that people posted in early 2020 in the united states, complaining about intense flu-like symptoms such as dry cough and fever that were later recognized as the most common symptoms of covid-19. sars-cov-2, the virus that causes covid-19, is thought to have first transmitted from an animal host to humans in wuhan, china in late 2019. on march 11, 2020, after the rapid increase of cases outside china, the world health organization (who) eventually declared covid-19 a pandemic (who). as of april, it is officially reported that more than three million people are infected by this virus in countries and territories around the world and on international conveyances (worldometer). during a pandemic with a high infection rate, prompt mitigatory actions play a crucial role in decelerating the spread and preventing new hotspots of the disease. however, taking immediate actions requires the capability to detect the early warning signs of the outbreak and to characterize the dynamics of the spread in a near real-time fashion. in the case of the covid-19 pandemic, delays in developing test kits, the limited number of kits, complicated bureaucratic health care systems, and a lack of transparency in data collection procedures were the major origins of the postponement of effective preventive interventions and mitigatory measures (washington post; achrekar et al.).
show that if non-pharmaceutical interventions could have been conducted one week, two weeks, or three weeks earlier in china, cases would have been reduced by %, %, and %, respectively, together with significantly reducing the number of affected areas (lai et al. ) . to fill this gap, epidemic intelligence (ei) is being used to explore alternative, mostly informal, sources of data to gather information regarding disease activity, early warning, and infectious disease outbreaks (de quincey and kostkova ) . human activities and interactions on the web are one of these informal sources. for instance, google flu trends exploits web search queries to estimate flu activity ("google flu trends" ). social media content is another powerful tool that provides invaluable crowd-sourced near real-time data for sensing health trends. twitter is a microblogging service with around million monthly active users that lets users communicate through short messages (tweets) (salman aslam ). twitter permits third parties to explore tweets and collect data about posters and their locations. it provides the opportunity to harness tweet data to detect early signs of outbreaks, which can ultimately support decision-makers in taking more informed actions (grover and aujla ) . in this paper, we explore twitter data right before and during the covid- pandemic across the united states at the state level, for the most common symptoms of covid- , including cough and fever. to offer a framework for outbreak early detection, the results of the twitter data analysis are compared to the formal dataset provided by johns hopkins university, which is openly available to the public for educational and academic research purposes . the rest of this paper is organized as follows: section reviews the related literature that harnesses twitter data to analyze, detect and predict outbreaks. in section , we present our methodology for extracting relevant information from twitter and preprocessing and analyzing the collected data. elaborated results for six states are presented in section . in section , we discuss the results, key findings, potential applications, limitations and further steps of this study. finally, we conclude in section . several studies have been conducted on the use of twitter data to explore outbreak trends, aiming to develop models for disease outbreak prediction. achrekar et al. present a framework that monitors messages posted on twitter with a mention of flu indicators to track and predict the emergence and spread of an influenza epidemic in a population (achrekar et al. ) . similarly, chen et al. propose an approach to aggregate users' states in a geographical region for a better estimation of flu trends (chen et al. ) . smith et al. offer a method to distinguish personal flu infection tweets from general awareness tweets (i.e. tweets expressing concern regarding a flu outbreak) (smith et al. ) . the twitter content during the h n outbreak is analyzed by (chew and eysenbach ) . besides flu, twitter data has also been leveraged to analyze other epidemics such as malaria, zika virus, dengue and ebola. for instance, masri et al. utilize twitter data to improve zika virus surveillance in the united states. in this study, we propose a conceptual framework for investigating the temporal trends in twitter users' posts. this framework has three main modules: data collection, data preprocessing and data analysis. the framework schema is depicted in figure . in the following section, we will further elaborate each module. we use the getoldtweets python package to retrieve historical tweets.
by employing this package, the query can be restricted to get tweets containing determined keywords, during a particular time frame and within a specific region. we collect tweets containing the keywords fever or cough, as the main symptoms of coronavirus, from the beginning of september to april , . the query limits the retrieved tweets to be within a -mile radius of kansas city, which covers all the states in america. then, we use the twitter api (application programming interface) to retrieve the corresponding tweets using the given ids, to access the precise geographical information of the user. the statistics of the number of tweets per state are shown in figure . the data is available on our github data repository . to compare the results with the formal cases, we use time series data of covid- cases reported by johns hopkins university. the data is reported from january to april , . during the preprocessing, all the variations of the location name in the twitter data within a state are integrated into a unique token with the following format: "state_name, usa". this is the same naming format as in the johns hopkins dataset. the number of confirmed cases in the johns hopkins dataset was reported separately for different counties within a state; the data has been aggregated over all counties in each state. for data preparation, continuous time series of the daily number of tweets and confirmed cases are calculated for each state. the data analysis steps are illustrated in figure using colorado state data as an example. step : to compare the time series of the tweets containing covid- symptoms and the number of confirmed cases, we plot the data from the beginning of december. we assign zero to all the dates before january , for the case data that was not available in the johns hopkins dataset. step : the date of the formal outbreak is defined as the date on which the number of confirmed cases in a state exceeds (hartfield and alizon ) . we refer to this date as the beginning of the outbreak in a given state and show it with a vertical red line. as illustrated in figure , the formal outbreak date for colorado state was march . step : for colorado state and all other states, the tweet time series shows a linear growth trend from the beginning of december up to march th, followed by exponential growth. to model the temporal trend, a regression-based estimator is fitted to the tweet data during this period, represented by the black line. step : finally, we detect the date of the informal outbreak, defined as the beginning of the exponential growth phase in tweets containing symptom keywords, by estimating the initial nonlinearity in the tweet time series. the vertical green line represents the informal outbreak in figure . in this section, we present the results for six highly affected states. as explained in section . , these plots (figure ) exhibit the number of tweets over time compared to the number of confirmed cases. the specified formal outbreak and the estimated informal outbreak are also shown for these states. this figure shows that there is a time lag between the estimated informal outbreak and the formal outbreak, which varies across the states. table summarizes the time lags observed for the six states shown in figure . the longest and shortest time lags were detected as and days for maryland and new york, respectively. for most of the states, the lag length is estimated at around two weeks.
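to make step concrete, the following is a minimal sketch of one way to estimate the informal outbreak date, assuming the daily tweet counts for a state are held in a pandas series indexed by date; the function name, the z-score threshold and the consecutive-day rule are illustrative assumptions, not the authors' exact estimator.

import numpy as np
import pandas as pd

def detect_informal_outbreak(counts: pd.Series, train_end: str,
                             z_thresh: float = 3.0, run_len: int = 5):
    # fit a linear trend to the early (pre-outbreak) window, then flag the
    # first date where the observed counts exceed the linear prediction by
    # z_thresh residual standard deviations for run_len consecutive days
    t = np.arange(len(counts), dtype=float)
    train = counts.index <= pd.Timestamp(train_end)
    slope, intercept = np.polyfit(t[train], counts.values[train], deg=1)
    pred = intercept + slope * t
    resid_sd = np.std(counts.values[train] - pred[train])
    above = (counts.values - pred) > z_thresh * resid_sd
    for i in range(len(above) - run_len + 1):
        if above[i:i + run_len].all():
            return counts.index[i]  # estimated start of exponential growth
    return None

# usage (hypothetical series name): detect_informal_outbreak(colorado_counts, "2020-03-01")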
based on epidemic models, in the early stages of an epidemic we usually expect an exponential growth trend in the number of cases of the disease (martcheva ) . however, it is difficult to monitor the growth trends of the infection in real time and detect the outbreak without significant delay. this delay is often caused by the time-consuming and bureaucratic procedures of diagnosis, including test development, test processing time, and reporting time (rong et al. ) . in this study we examined the possibility of filling this gap by detecting the early signs of an outbreak using twitter content. we collected tweets containing the common symptoms of covid- , right before and after the formal outbreak. a challenging issue in analyzing these data is differentiating between general public concern regarding the outbreak and personal infection by covid- when detecting any anomaly in the tweet trend. to address this issue, smith et al. use nlp to classify flu-related tweets into two categories: personal infection tweets and tweets that express an awareness of influenza. they show the temporal trends of these two categories are very different (smith et al. ) . in this study, we assume that the general awareness tweets, prior to an outbreak in a given state, increase linearly, following the increase in the volume of epidemic-related news from other countries or other states. on the other hand, we expect to observe exponential growth in the volume of personal infection tweets when an outbreak is happening in the given state, even prior to the official detection of the outbreak by formal medical procedures. looking into the temporal trends in the twitter data in states of the us, prior to the official detection of the epidemic outbreak in any of the given states, as expected, we observe linear growth in the number of tweets following the news media reporting on the outbreak in china and later in western europe. however, for each state there is a tipping point that happens before any official reports of the outbreak in those states, where the growth trend changes from linear to exponential, implying that the number of personal infection tweets not only dominates the general awareness tweets, but also defines the growth behavior of the aggregate number of tweets (implying that an outbreak is happening on top of a general awareness growth). in the case of the covid- pandemic, lai et al. show that if non-pharmaceutical interventions had been conducted one week, two weeks, or three weeks earlier in china, cases would have been reduced by %, %, and %, respectively, together with significantly reducing the number of affected areas (lai et al. ) . the observations of the current study, as a proof of concept, suggest that the behavioral patterns of an epidemic outbreak emerge in the temporal trends of informal data streams like twitter data, as an early sign of an outbreak at the local level. in sum, this approach has the potential to be used as a decision support system to inform policy makers in deploying intervention policies in a timely manner. for future work, we suggest validating the results of this study using a classifier that better differentiates relevant from irrelevant tweets, for example to exclude tweets containing 'baby fever' or 'cough cough'. moreover, a model can be trained to monitor the fluctuation of symptom keyword usage and predict the pandemic in advance. in this paper, we investigated the possibility of using twitter content to detect and track the covid- outbreak in each state across the united states.
we used a simple analysis of the temporal trends of the relevant tweets. our results have shown that the trend of tweets containing the common covid- symptoms, such as cough and fever, is highly correlated with the official cdc dataset. however, a significant temporal lag, between and days, was observed between the exponential growth phase of tweets and the confirmed cases, which could be related to inherent delays in the testing and diagnosis procedures. therefore, we conclude that twitter data provides a near real-time assessment of an outbreak, which can be utilized as an early warning system to increase public awareness. it also can be used as a decision support system to inform policy makers in taking timelier mitigatory and preventive actions.
predicting flu trends using twitter data
syndromic surveillance of flu on twitter using weakly supervised temporal topic models
pandemics in the age of twitter: content analysis of tweets during the h n outbreak
google flu trends
prediction model for influenza epidemic based on twitter data
introducing the outbreak threshold in epidemiology
effect of non-pharmaceutical interventions for containing the covid- . medrxiv
use of twitter data to improve zika virus surveillance in the united states during the epidemic
malaria epidemic prediction model by using twitter data and precipitation volume in nigeria
early warning and outbreak detection using social networking websites: the potential of twitter
effect of delay in diagnosis on transmission of covid-
twitter by the numbers: stats, demographics & fun facts
towards real-time measurement of public epidemic awareness: monitoring influenza awareness through twitter
a timeline of coronavirus testing in the u.s. - the washington post
rolling updates on coronavirus disease (covid- )
coronavirus update (live): , , cases and , deaths from covid- virus pandemic
key: cord- -zl txjqx authors: liu, junhua; singhal, trisha; blessing, lucienne t.m.; wood, kristin l.; lim, kwan hui title: epic m: an epidemics corpus of over million relevant tweets date: - - journal: nan doi: nan sha: doc_id: cord_uid: zl txjqx since the start of covid- , several relevant corpora from various sources have been presented in the literature that contain millions of data points. while these corpora are valuable in supporting many analyses on this specific pandemic, researchers require additional benchmark corpora that contain other epidemics to facilitate cross-epidemic pattern recognition and trend analysis tasks. during our other efforts on covid- related work, we discovered very few disease-related corpora in the literature that are sizable and rich enough to support such cross-epidemic analysis tasks. in this paper, we present epic m, a large-scale epidemic corpus that contains millions of micro-blog posts, i.e., tweets crawled from twitter, from year to . epic m contains a subset of . million tweets related to three general diseases, namely ebola, cholera and swine flu, and another subset of . million tweets of six global epidemic outbreaks, including h n swine flu, haiti cholera, middle-east respiratory syndrome (mers), west african ebola, yemen cholera and kivu ebola. furthermore, we explore and discuss the properties of the corpus with statistics of key terms and hashtags and trend analysis for each subset. finally, we demonstrate the value and impact that epic m could create through a discussion of multiple use cases of cross-epidemic research topics that attract growing interest in recent years.
these use cases span multiple research areas, such as epidemiological modeling, pattern recognition, natural language understanding and economic modeling. the coronavirus disease has spread around the globe since the beginning of the year , affecting around countries and everyone's life. to date, the highly contagious disease has caused over . million confirmed and suspected cases and thousand deaths. in times of crisis caused by epidemics, we realize the necessity of rigorous arrangements, quick responses, and credible, up-to-date information during the early phases of such epidemics [ ] . social media platforms, such as twitter, play an important role in communicating the latest epidemic status and announcing public policies in a timely manner. facilitating the posting of over half a billion tweets daily [ ] , twitter emerges as a hub for information exchange among individuals, companies, and governments, especially in times of epidemics when economies are placed in hibernation mode and citizens are kept isolated at home. such platforms help tremendously to raise situational awareness and provide actionable information [ ] . recently, numerous covid- related corpora from various sources have been presented that contain millions of data points [ , ] . while these corpora are valuable in supporting many analyses on this specific pandemic, researchers require additional benchmark corpora that contain other epidemics to facilitate cross-epidemic pattern recognition and trend analysis tasks. during our other efforts on covid- related work, we discovered very few disease-related corpora in the literature that are sizable and rich enough to support such cross-epidemic analysis tasks. we conduct several exploratory analyses to study the properties of the corpus, such as word cloud visualization and time series trend analysis. several interesting findings are discovered through these analyses. for instance, we find that a large proportion of topics are related to specific locations; cross-epidemic topics, i.e., topics that involve more than one epidemic-related hashtag, appear frequently in several classes; and several hashtags related to non-epidemic events, such as warfare, have relatively high ranks in the list. furthermore, a time-series analysis also suggests that some of the epidemics, i.e. haiti cholera and kivu ebola, show a surge in tweets before the respective start dates of the outbreaks, which signifies the importance of leveraging social media to conduct early signal detection. we also observe that an epidemic outbreak not only leads to rapid discussion of its own, but also triggers exchanges about other diseases. epic m fills the gap in the literature, where epidemic-related corpora are either unavailable or not sizable enough to support cross-epidemic analysis tasks. through discussing various potential use cases, we anticipate that epic m brings great value and impact to various fast-growing computer science communities, especially in natural language processing, data science and computational social science. we also foresee that epic m is able to contribute partially to cross-disciplinary research topics, such as economic modeling and humanities studies. while epic m includes tweets posted throughout the course of each outbreak available in the corpora, we expect that epic m may serve as a timeless cross-epidemic benchmark. in this section, we discuss the existing twitter corpora for several domains, such as covid- , disasters, and others.
these corpora have attracted substantial interest and enabled a large amount of research in their respective domains; we believe epic m can generalize to a similar level of impact in the epidemic domain. corpora of covid- . recently, the covid- pandemic spread across the globe and generated enormous economic and social impact. throughout the pandemic, numerous related corpora have been released. for instance, chen et al. [ ] released a multi-lingual corpus that consists of million tweets, including tweet ids and their timestamps, across over languages. similarly, banda et al. [ ] presented a large-scale covid- chatter corpus that consists of over m tweets with retweets and another version of million tweets without retweets. english corpora of disasters. there are several disaster-related corpora presented in the literature that are utilized for multiple works. crisislex [ ] consists of thousand tweets that are related to six natural disaster events, queried based on relevant keywords and locations during the crisis periods. the tweets are labelled as relevant or not-relevant through crowdsourcing. olteanu et al. [ ] conduct a comprehensive study of tweets to analyze crisis events from to . the paper analyzes about k tweets based on crisis and content dimensions, which include hazard type (natural or human-induced), temporal development (instantaneous or progressive), and geographic spread (focalized or diffused). the content dimensions are represented by several features such as informativeness, types and sources. imran et al. [ ] release a collection of over million tweets, of which thousand are human-annotated tweets related to natural crisis events. the work also presents pre-trained word2vec embeddings with a set of out-of-vocabulary (oov) words and their normalizations, contributing to spreading situational awareness and improving response time for humanitarian efforts during crises. phillips [ ] releases a set of million tweets related to hurricane harvey. littman [ ] publishes a corpus containing tweet ids of over million tweets related to hurricanes irma and harvey. non-english corpora of disasters. numerous non-english crisis corpora are also found in the literature. for instance, cresci et al. [ ] released a corpus of . thousand italian tweets from to during four different disasters. the features include informativeness (damage or no damage) and relevance (relevant or not relevant). similarly, alharbi and lee [ ] compiled a set of thousand arabic tweets, manually labelled for relatedness and information-type for four high-risk flood events in . alam et al. [ ] released a twitter corpus composed of thousand manually-annotated tweets and thousand images collected during seven natural disasters (earthquakes, hurricanes, wildfires, and floods) that occurred in . the features of the datasets include informativeness, humanitarian categories, and damage severity categories. other twitter corpora. apart from crisis-related corpora, several twitter datasets are used for analysis related to politics, news, abusive behaviour and misinformation, trolls, movie ratings, weather forecasting, etc. for instance, fraisier et al. [ ] propose a large and complex dataset with over thousand operative twitter profiles during the french presidential campaign, with their corresponding tweets, tweet ids, retweets, and mentions. the profiles were annotated manually based on their political party affiliation, their nature, and gender.
we also find twitter corpora that are related to other domains, such as politics [ , ] , cyberbullying [ ] , and misinformation [ , , ] . this section describes the data collection process for crawling epic m. epidemic outbreaks. epic m includes six epidemic outbreaks in the st century, recorded by the world health organization and listed in table . we intentionally exclude the recent covid- pandemic outbreak to avoid producing redundant work, as there are already numerous covid- datasets released by different parties with multi-million data points. search queries. for each outbreak, we initialize with a large collection of keywords used as the search queries, with the aim of retrieving the most relevant tweets from twitter. we use a combination of keywords for each outbreak, as listed in table , to fetch the related tweets. two types of keywords are used, namely (a) general disease-related terms, such as ebola, cholera and swine flu; and (b) specific outbreak-related terms with a combination of location and disease, such as africa ebola and yemen cholera. general epidemics. besides the outbreaks set, we extend epic m by including a subset of three general diseases, namely cholera, ebola and swine flu. the tweets related to these diseases are crawled from their respective first occurrence until may . we expect the general epidemic subset to act as an additional benchmark and contribute substantially to various research topics, such as pattern recognition and trend analysis. to gain a general overview of epic m, we first conduct hashtag analysis for each epidemic and plot the results on a by grid, as shown in figure . the first row (fig. a) represents the three general diseases, whereas the second and third rows (fig. b) represent the six outbreak classes in chronological order. each word cloud contains the top hashtags in its respective class, where the sizes represent their frequencies. through observation, we identify several interesting phenomena, such as: ( ) key terms provide semantic indication of the crises, in addition to possible cross-epidemic indicators, such as pandemic, epidemic, healthcare, vaccine, disease, sanitation, and others; ( ) location-related hashtags, such as #yemen, #haiti and #sierraleone, appear in all classes and occupy the majority of the keywords, suggesting that location is the feature of highest concern; ( ) several classes include hashtags of other diseases, i.e., #covid in the _yemen_cholera class and #malaria in the cholera class, which implies that discussions on cross-epidemic matters are popular; and ( ) some hashtags refer to non-epidemic related events, such as # yearsofwaronyemen and #earthquake appearing in the _yemen_cholera and _haiti_cholera sets respectively. subsequently, we conduct trend analysis in an attempt to identify time-variant patterns in the corpus. for the three general classes (fig. a) , we plot each class as a line chart, where the x-axis represents the time in yearly dates and the y-axis represents the corresponding number of tweets. for the six outbreak classes (fig. b) , the x-axis of each line chart uses the number of days offset from the start date of the outbreak, whereas the y-axis represents the number of tweets normalized to between and . through the time-series line plots, we observe that some of the epidemics, i.e. haiti cholera and kivu ebola, show a surge in tweets before the respective official start dates of the outbreaks, which signifies the importance of leveraging social media to conduct early signal detection.
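a minimal sketch of the hashtag counting and trend normalization behind these plots, assuming each tweet is a record with 'text' and 'class' fields; the field names and the top-k cut-off are illustrative assumptions.

import re
from collections import Counter

HASHTAG_RE = re.compile(r"#\w+")

def top_hashtags(tweets, class_label, k=20):
    # frequency of hashtags within one epidemic class; these counts are
    # what a per-class word cloud would be drawn from
    counts = Counter()
    for tweet in tweets:
        if tweet["class"] == class_label:
            counts.update(tag.lower() for tag in HASHTAG_RE.findall(tweet["text"]))
    return counts.most_common(k)

def normalize(daily_counts):
    # scale a daily tweet-count series to the [0, 1] range used in the
    # outbreak trend plots
    lo, hi = min(daily_counts), max(daily_counts)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in daily_counts]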
we also observe that an epidemic outbreak not only leads to rapid discussion of its own, but also triggers exchanges about other diseases. finally, the time-series analyses also show clear dynamic properties or trends, with exponential increases (shocks or spikes) in tweet volume and a temporal persistence after an initial shock [ ] . other dynamic properties that may be of interest include local cycles and trends. such dynamic effects, when paired with semantic content (such as healthcare-related terms), may provide potential indicators of the onset of a crisis. while twitter has an enormous volume and frequency of information exchange, i.e. over half a billion tweets posted daily, such rich data potentially exposes information on epidemic events through substantial analysis. in this section, we demonstrate the value and impact that epic m could create by discussing multiple use cases of cross-epidemic research topics that attract growing interest in recent years. these use cases span multiple research areas, such as epidemiological modeling, pattern recognition, natural language processing and economic modeling. we claim that epic m fills the gap in the literature, where few disease-related corpora are sizable and rich enough to support such cross-epidemic analysis tasks. epic m supplies benchmarks of multiple epidemics to facilitate a wide range of cross-epidemic research topics. epidemiological modeling. epidemiological modeling provides various potential applications to understand twitter dynamics during and after outbreaks, such as compartmental modeling [ ] and misinformation detection [ ] . to name a few, jin et al. [ ] use twitter data to detect false rumors, applying a susceptible-exposed-infected-skeptic (seiz) model to group users into four compartments. skaza and blais [ ] use susceptible-infectious-recovered (sir) epidemic models on twitter hashtags to compute the infectiousness of a trending topic. in the recent event of covid- , these models are repeatedly applied to address specific questions, such as chen et al.'s [ ] proposal of using a time-dependent sir model to estimate the total number of infected persons and the outcomes, i.e., recovery or death. trend analysis and pattern recognition. extensive prior works leverage social media data to perform trend analysis and pattern recognition tasks. for instance, kostkova et al. [ ] study the swine-flu outbreak and demonstrate the potential of twitter to act as an early warning system up to a period of two or three weeks. similarly, joshi et al. [ ] predict alerts of the west africa ebola epidemic three months earlier than the official announcement. while early detection and warning systems for crisis events may reduce overall damage and negative impacts [ ] , epic m provides high-volume and timely information that facilitates trend analysis and pattern recognition tasks for epidemic events. sentiment and opinion mining. the observation of social sentiments and public opinions plays an important part in benchmarking the effect of releasing public policy amendments or new initiatives. several prior works leverage sentiment analysis and opinion mining to extract the contextual meaning of social media content. for instance, beigi et al. [ ] provide an overview of the relationship among social media, disaster relief and situational awareness in times of crisis, and neppalli et al. [ ] perform location-based sentiment analysis on tweets for hurricane sandy in .
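to make the compartmental-modeling use case concrete, the following is a minimal sketch of the classic sir dynamics mentioned above, integrated with scipy; the parameter values are illustrative assumptions and are not taken from any of the cited studies.

import numpy as np
from scipy.integrate import odeint

def sir(y, t, beta, gamma):
    # classic sir right-hand side: s' = -beta*s*i/n, i' = beta*s*i/n - gamma*i, r' = gamma*i
    s, i, r = y
    n = s + i + r
    return [-beta * s * i / n, beta * s * i / n - gamma * i, gamma * i]

n, i0 = 1_000_000, 1            # assumed population size and initial cases
beta, gamma = 0.3, 0.1          # assumed transmission and recovery rates
t = np.linspace(0, 160, 161)    # days
s, i, r = odeint(sir, [n - i0, i0, 0], t, args=(beta, gamma)).T
print(f"peak infections: {i.max():.0f} on day {t[i.argmax()]:.0f}")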
topic detection. topic detection or modeling may enable authorities to anticipate a crisis and take action during it. the technique helps in recognizing hidden patterns; understanding semantic and syntactic relations; and annotating, analyzing, organizing, and summarizing huge collections of textual information. considering this, several researchers have implemented these approaches on crisis datasets to detect and categorize potential topics. chen et al. [ ] suggest two topic modeling prototypes to improve trend estimation by capturing the underlying states of a user from a sequence of tweets and aggregating them within a geographical area. in [ ] , researchers perform optimized topic modeling using community detection methods on three crisis datasets [ , , ] to identify the discussion topics. natural language processing. several works leverage twitter datasets to conduct natural language processing (nlp) tasks. as a challenging downstream task of nlp, automatic text summarization techniques extract latent information from text documents, where the models generate a brief, precise, and coherent summary from lengthy documents. text summarization is applicable to various real-world activities during crises, such as generating news headlines, delivering compact instructions for rescue operations and identifying affected locations. prior works demonstrate such applications in times of crisis. for instance, rudra et al. [ ] and [ ] propose two relevant methods that classify and summarize tweet fragments to derive situational information. more recently, sharma et al. [ ] propose a system that produces highly accurate summaries from twitter content during man-made disasters. several other works focus on nlp subtasks of social media, such as information retrieval [ , ] and text classification [ , ] . disease classification. applications of machine learning and deep learning in the healthcare sector have gathered growing interest in recent years. for instance, krieck et al. [ ] analyze the relevance of twitter content for disease surveillance and activity tracking, which helps alert health officials to public health threats. lee et al. [ ] conduct text mining on twitter data and deploy a real-time disease tracking system for flu and cancer using spatial and temporal information. ashok et al. [ ] develop a disease surveillance system to cluster and visualise disease-related tweets. crisis-time economic modeling. estimating the economic impact of crises, such as epidemic outbreaks, is a crucial task for policy makers and business leaders, who must adjust operational strategies [ ] and make the right decisions for their organizations in times of crisis. several studies address this domain. for instance, okuyama [ ] provides an overview and a critical analysis of the methodologies used for estimating the economic impact of disasters; avelino and hewings [ ] propose the generalized dynamic input-output framework (gdio) to dynamically model higher-order economic impacts of disruptive events. such studies correlate disaster events with economic impact, relying on disaster-related data and financial market data, respectively. we believe that epic m is able to contribute to future economic modeling studies for epidemic events. health informatics.
compared to the cases above, a more general use case area is healthcare informatics, i.e., "the integration of healthcare sciences, computer science, information science, and cognitive science to assist in the management of healthcare information" [ , , ] . while social media and online sources are used to connect with patients and provide reliable educational content in health informatics, there is growing interest in using twitter and other feeds to study and understand indicators for health trends or particular behaviors or diseases. for example, nambisan et al. [ ] utilize twitter content to study the behavior of depression. epic m contains behavioral information across various diseases and how the populace behaves with the onset and persistence of the diseases. multiple disease cases will allow such research to correlate behavioral information across instances. news and fake news. with the proliferation of news content through the internet and virtual media, there is a growing interest in developing an understanding of the science of news and fake news [ ] . data mining algorithms are advancing to study news content [ ] . epic m contains real news content that grows over time from social lay-person terminology to technical and professionally based information and opinion. it likewise includes fact-based information as well as distorted or fake content. through multiple cases over time, the field will have a rich source to study news content, especially when correlating with reliable news sources for particular snapshots of time. all in all, we believe that epic m provides a set of rich benchmarks and is able to facilitate extensions of the above-mentioned works at a higher order, e.g., in cross-epidemic settings. as a result, the research findings are more robust and closer to real-world scenarios. conclusion. during our other efforts on covid- related work, we discovered very few disease-related corpora in the literature that are sizable and rich enough to support such cross-epidemic analysis tasks. in this paper, we present epic m, a large-scale epidemic corpus that contains . million tweets from to . the corpus includes a subset of tweets related to three ( ) general diseases and another subset related to six ( ) epidemic outbreaks. we conduct exploratory analysis to study the properties of the corpus and identify several phenomena, such as a strong correlation between epidemics and locations, frequent cross-epidemic topics, and a surge of discussion before the occurrence of the outbreaks. finally, we discuss a wide range of use cases that epic m can potentially facilitate. we anticipate that epic m brings substantial value and impact to both fast-growing computer science communities, such as natural language processing, data science and computational social science, and multi-disciplinary areas, such as economic modeling, health informatics and the science of news and fake news. future work. for some epidemic outbreaks, such as h n swine flu and west africa ebola, epic m includes relevant tweets posted throughout the respective duration of the epidemics. we expect the data of these few classes could serve as strong and timeless cross-epidemic and cross-disease benchmarks. on the other hand, several epidemics, such as kivu ebola and yemen cholera, are still ongoing. we intend to extend the corpus by actively or periodically crawling tweets in addition to the current version.
furthermore, we plan to further develop the corpus with additional epidemic outbreak classes that happened more recently, such as the multi-national measles outbreaks in the dr congo, new zealand, philippines and malaysia, the dengue fever epidemic in asia-pacific and latin america, and the kerala nipah virus outbreak. lastly, we also intend to develop an active crawling web service that automatically updates epic m, and to migrate to cloud-based relational database services to ensure its availability and accessibility. the corpus is available at https://www.github.com/junhua/epic. this research is funded in part by the singapore university of technology and design under grant srg-istd- - .
crisismmd: multimodal twitter datasets from natural disasters
crisis detection from arabic tweets
compartmental modeling and tracer kinetics
a machine learning approach for disease surveillance and visualization using twitter data
the challenge of estimating the impact of disasters: many approaches, many limitations and a compromise
system and method for integrated learning and understanding of healthcare informatics
yuning ding, and gerardo chowell. a large-scale covid- twitter chatter dataset for open scientific research - an international collaboration
an overview of sentiment analysis in social media and its applications in disaster relief
analyzing discourse communities with distributional semantic models
computer-aided mind map generation via crowdsourcing and machine learning
covid- : the first public coronavirus twitter dataset
syndromic surveillance of flu on twitter using weakly supervised temporal topic models. data mining and knowledge discovery
a linguistically-driven approach to cross-event damage assessment of natural disasters from social media messages
large scale crowdsourcing and characterization of twitter abusive behavior
# Élysée fr: the french presidential campaign on twitter
time series analysis
the hoaxy misinformation and fact-checking diffusion network
aidr: artificial intelligence for disaster response
extracting information nuggets from disaster-related messages in social media
twitter as a lifeline: human-annotated twitter corpora for nlp of crisis-related messages
epidemiological modeling of news and rumors on twitter
automated monitoring of tweets for early detection of the ebola epidemic
# swineflu: the use of twitter as an early warning and risk communication tool in the swine flu pandemic
a new age of public health: identifying disease outbreaks by analyzing tweets
the science of fake news
real-time disease surveillance using twitter data: demonstration on flu and cancer
clustop: a clustering-based topic modelling algorithm for twitter using word networks
hurricanes harvey and irma tweet ids
self-evolving adaptive learning for personalized education
ipod: an industrial and professional occupations dataset and its applications to occupational data mining and analysis
crisisbert: a robust transformer for crisis classification and contextual crisis embedding
strategic and crowd-aware itinerary recommendation
understanding the perception of covid- policies by mining a multilanguage twitter dataset
essentials of nursing informatics
social media, big data, and public health informatics: ruminating behavior of depression revealed through twitter
sentiment analysis during hurricane sandy in emergency response
critical review of methodologies on disaster impact estimation
crisislex: a lexicon for collecting and filtering microblogged communications in crises
what to expect when the unexpected happens: social media communications across crises
managing epidemics: key facts about major deadly diseases. world health organization
automatic classification of disaster-related tweets
why we're sharing million russian troll tweets. fivethirtyeight
summarizing situational tweets in crisis scenario
extracting situational information from microblogs during disaster events: a classification-summarization approach
going beyond content richness: verified information aware summarization of crisis-related microblogs
fake news detection on social media: a data mining perspective
mobile healthcare informatics. medical informatics and the internet in medicine
modeling the infectiousness of twitter hashtags
u.s. congressional election tweet ids
mining misinformation in social media
analysing how people orient to and spread rumours in social media by looking at conversational threads
key: cord- - a sriq authors: saleh, sameh n.; lehmann, christoph u.; mcdonald, samuel a.; basit, mujeeb a.; medford, richard j. title: understanding public perception of coronavirus disease (covid- ) social distancing on twitter date: - - journal: infection control and hospital epidemiology doi: . /ice. . sha: doc_id: cord_uid: a sriq objective: social distancing policies are key in curtailing severe acute respiratory syndrome coronavirus (sars-cov- ) spread, but their effectiveness is heavily contingent on public understanding and collective adherence. we studied public perception of social distancing through organic, large-scale discussion on twitter. design: retrospective cross-sectional study. methods: between march and april , , we retrieved english-only tweets matching two trending social distancing hashtags, #socialdistancing and #stayathome. we analyzed the tweets using natural language processing and machine-learning models, and we conducted a sentiment analysis to identify emotions and polarity. we evaluated the subjectivity of tweets and estimated the frequency of discussion of social distancing rules. we then identified clusters of discussion using topic modeling and associated sentiments. results: we studied a sample of , tweets. for both hashtags, polarity was positive (mean, . ; sd, . ); only % of tweets had negative polarity. tweets were more likely to be objective (median, . ; iqr, – . ) with ~ % of tweets labeled as completely objective (labeled as in range from to ). approximately half of tweets ( . %) primarily expressed joy and one-fifth expressed fear and surprise. each correlated well with topic clusters identified by frequency, including leisure and community support (ie, joy), concerns about food insecurity and quarantine effects (ie, fear), and unpredictability of coronavirus disease (covid- ) and its implications (ie, surprise). conclusions: considering the positive sentiment, preponderance of objective tweets, and topics supporting coping mechanisms, we concluded that twitter users generally supported social distancing in the early stages of its implementation. on march , , the world health organization (who) declared the novel coronavirus (covid- ) outbreak a pandemic and emphasized the need for global governmental commitment to control the threat, citing then , confirmed cases and , deaths worldwide. to contain severe acute respiratory syndrome coronavirus (sars-cov- ), countries closed their international borders. despite travel restrictions, global cases continued to increase, requiring the enactment of key community mitigation strategies, which garnered significant public attention.
these mitigation strategies, named nonpharmaceutical interventions (npis), are approaches outside medications, therapies, and vaccines to prevent further spread of sars-cov- and to reduce the strain on the healthcare system. npis fall under main categories: personal, environmental, and community. personal npis refer to behaviors like staying home when sick, coughing or sneezing into a tissue or elbow, wearing a mask, and washing hands with soap and water or using hand sanitizer. environmental npis refer to appropriate surface cleaning of high-throughput areas and commonly used objects. community npis refer to social distancing and the closure of areas where large gatherings may occur, such as schools, businesses, parks, and sporting events. used previously for other viral outbreaks such as influenza, social distancing or physical distancing refers to increasing the space between individuals and avoiding larger gatherings in an attempt to reduce viral transmission. this community npi has been one of the main components of effectively fighting the covid- pandemic. [ ] [ ] [ ] managing and changing public opinion and behavior are vital for social distancing to successfully slow transmission of covid- , preserve hospital resources, and prevent exceeding the healthcare system's capacity. to affect public opinion, one must first examine and understand it. social media, specifically the microblogging platform twitter, serves as an ideal medium to provide this understanding. twitter has > million daily active users and allows individuals to post, repost, like, and comment on 'tweets' of up to characters. analysis of twitter has been used previously within the healthcare realm to understand public sentiment and opinion on topics ranging from diabetes and cancer therapy to novel healthcare policies such as the affordable care act. within the field of emerging infectious diseases, twitter analysis has been used to study public opinion and sentiment on measles, influenza, and zika virus outbreaks. we hypothesized that performing sentiment, emotion, and content analysis of tweets related to social distancing on twitter during the covid- pandemic could provide valuable insight into the public's beliefs and opinions on this policy. we further hypothesized that the knowledge gained could prove valuable for public health communication as well as the dissemination and refinement of information strategies. from march to april , , we extracted daily relevant samples of english-only tweets related to social distancing and created a -week cross-sectional data set of social media activity. we used the rtweet package to access twitter's application programming interface (api) via rstudio version . . (r foundation for statistical computing, vienna, austria). the hashtags #socialdistancing and #stayathome, which were the top trending social distancing hashtags at the time of data extraction, were used to identify tweets related to social distancing. we used of the collected tweet metadata variables in our analysis (table s online). we cleaned the tweets by removing characters and words of no or little analytical value and transforming text to its root form. we used python version . . software (python software foundation, wilmington, de) for all data processing and analyses. further details are discussed in appendix a (online). institutional review board approval was not required because this study used only publicly available data.
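a minimal sketch of the kind of cleaning pipeline described above, assuming tweets are plain strings; the regular expressions, stop-word list and the use of nltk's porter stemmer are illustrative choices rather than the authors' exact procedure.

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words("english"))  # requires nltk's stopwords data
STEMMER = PorterStemmer()

def clean_tweet(text):
    # strip urls, mentions and the '#' symbol, drop stop words, and reduce
    # the remaining words to their root (stemmed) form
    text = re.sub(r"http\S+|@\w+|#", " ", text.lower())
    tokens = re.findall(r"[a-z']+", text)
    return " ".join(STEMMER.stem(tok) for tok in tokens if tok not in STOPWORDS)

print(clean_tweet("Practicing #SocialDistancing while walking the dog https://t.co/x"))
# -> e.g. 'practic socialdistanc walk dog'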
we used python's textblob library to perform sentiment analysis for all tweets through natural language processing and text analysis to identify and classify emotions (positive, negative, or neutral) and content topics. textblob applies the afinn sentiment lexicon on a polarity scale from − (most negative) to (most positive). we visualized the polarity distribution using bins for strongly negative (− to − . ), negative (− . to − . ), neutral ( ), positive ( . to . ), and strongly positive ( . to ). we used a recurrent neural network model developed by colneric and demsar to label the primary emotion for each tweet based on ekman's emotional classification (anger, disgust, fear, joy, sadness, or surprise). using χ² testing and bonferroni correction to adjust for multiple comparisons, we compared the proportion of each sentiment polarity and emotion for each hashtag. we evaluated changes in effect size between hashtags using the absolute difference in percentage points. we used python's textblob library to perform subjectivity analysis and labeled each tweet on a range from (objective) to (subjective). objective tweets relay factual information, whereas subjective tweets typically communicate an opinion or belief. for the hashtags #stayathome and #socialdistancing, we visualized sentiment using a histogram of values and compared the median sentiment between hashtags using the mann-whitney u test. through terminology matching, we used key words present in social distancing rules (eg, "stay at least feet [ meters] from other people" or "avoid large gatherings") to identify tweets with potentially objective information about these rules (table s online). we manually reviewed % of the resulting tweet subset to verify what percentage of these tweets truly included information about social distancing rules and extrapolated the prevalence for the full subset of tweets. to understand the major topics being discussed in our tweet sample, we applied an unsupervised machine-learning algorithm called latent dirichlet allocation (lda) using the gensim python library. lda is a commonly used topic-modeling approach to identify clusters of documents (in our case, tweets) by a representative set of words. the most highly weighted words in each cluster provide insight into the content of each topic. lda requires users to input the number of expected topics. to determine the optimal number of topics, we trained multiple lda models using different numbers of topics ranging from to and computed a topic coherence score (produced by comparing the semantic similarity of a topic's most highly weighted words) for each lda model. selecting the lda model with the highest score, we ultimately chose topics for the final model. an author without access or insight into the topic model initially labeled the topics using the most frequently used terms ordered by weight. all authors then reached consensus on the topic labels. we identified the prevalence of topics by labeling tweets according to their most dominant topic. we identified example tweets whose content pertained > % to a specific topic (table , last column).
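a minimal sketch of the topic-number selection step described above, using gensim; the variable names and the range of candidate topic counts are illustrative assumptions.

from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

def best_lda(docs, k_range=range(2, 16)):
    # docs: list of token lists, one per cleaned tweet; train one lda model
    # per candidate topic count and keep the model with the highest c_v
    # topic coherence, mirroring the selection procedure described above
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    best_model, best_score = None, float("-inf")
    for k in k_range:
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       random_state=42, passes=5)
        score = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                               coherence="c_v").get_coherence()
        if score > best_score:
            best_model, best_score = lda, score
    return best_model, best_score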
twitter for iphone was the most commonly used platform ( %), followed by twitter for android ( . %). moreover, < % of tweets had media (image or video) and more than one-third had a hyperlink. the median user had > , posts and > followers at the time of tweeting. also, % of accounts were verified, signified by a blue badge next to a user's profile name indicating that an account of public interest is authentic. our tweet data set contained , , words and , , characters. the most frequently used words associated with each hashtag before processing are illustrated in fig. . after processing, for both #socialdistancing and #stayathome, the most common word was 'day' ( , and , times, respectively). the next most frequent word for #socialdistancing was 'practice' ( , ). there was net positive sentiment polarity toward both #socialdistancing and #stayathome, with mean polarity scores of . (standard deviation [sd], . ) and . (sd, . ), respectively. positive and neutral tweets accounted for . % and . % of tweets, respectively (fig. ). moreover, < % of tweets were negative and < % were strongly negative. although statistical differences between polarity categories were detected due to the large sample sizes, the differences in effect sizes were minimal (fig. ). neutral and positive tweets had the largest absolute differences. compared to #stayathome, #socialdistancing had . % fewer neutral tweets and . % more positive tweets. tweets tended to be more objective in nature and ~ % demonstrated near or complete objectivity (fig. ). the median subjectivity scores were similar for #socialdistancing ( . ; interquartile range [iqr], - . ) and #stayathome ( . ; iqr, - . ; p = . ). we matched , tweets that included key words related to social distancing rules and manually reviewed of them. of the tweets, were confirmed to be related to social distancing rules, yielding a rate of . %. extrapolating this to all social distancing tweets, we estimate that , ( . % of all) tweets referenced social distancing rules. joy was the predominant emotion, expressed in > % of tweets, with topics ranging from enjoying recreational activities and connecting with family members to working from home. examples: if you are lucky enough to have even a small garden, now is the time to spend sprucing it up. our spring gardening feature has helpful advice and new ideas to try, to help you make the most of it and #stayathome
"public opinion and values" and "quarantine measures and effects" had the lowest mean polarity of . . mean subjectivity scores for all topics ranged from . to . , with "public opinion and values" having the highest subjectivity score. understanding the beliefs, attitudes, and thoughts of individuals and populations can aid public health organizations (eg, the who) and government institutions to identify public perception and gaps in communication and knowledge. we analyzed twitter activity around the most common social distancing trending hashtags at the study time to understand emotions, sentiment polarity, subjectivity, and topics discussed related to this npi. tweets predominantly showed positive sentiment polarity. tweets were primarily linked to emotions of joy (~ %), fear, and surprise. anger and disgust were the least common emotions expressed. analyzing key words, we demonstrated that tweets were primarily objective in nature and were used to disseminate public health information. we identified and labeled main topics demonstrating insight into the thoughts and perceptions of the public. social media data and channels provide a rich platform to perform public sentiment analysis and have already been used to examine covid- perceptions. one study leveraged social media to distribute a survey to nearly , individuals in the united states. another large study surveyed , participants in the . emotion analysis for all tweets and stratified by tweets with the hashtag #socialdistancing and #stayathome. comparison between the two hashtags is done using χ testing. bonferroni correction was used to define statistical significance at a threshold of p = . ( . /n, where n = since comparisons were completed). united kingdom and the united states. despite the robust combined sample size of , participants, there were inherent limitations to the design. these studies utilize nonprobability sampling like convenience and snowball sampling that are plagued by significant selection bias as well as potential reporting bias, making them prone to sampling error. through probability sampling from the twitter api, we analyzed nearly , english tweets across , users, providing a broader understanding of public perception that is likely more representational of the population. using a machine-learning approach, we also explored topics and perceptions without introducing predefined researcher notions, thus limiting the risk of biases inherent to the question design. recent public opinion polls from a similar time period have shown that the overwhelming majority of us citizens favored the continuation of social distancing measures. , the positive attitude is clearly reflected in the sentiments found in the analyzed tweet sample. most tweets were either positive or neutral in nature. as public sentiment shifts, we would expect this to be reflected in tweet sentiment as well. for government and public health officials, tweet sentiments may be an important measure to determine the public willingness to continue distancing, which in turn could inform future infection prediction models and social distancing policies. many tweets tend to express an opinion; however, tweets associated with #socialdistancing and #stayathome were predominately objective suggesting that these hashtags were used to transmit objective information potentially serving an important public health function. combined with the large volume of tweets and the finding that . 
% described social distancing rules, twitter has the potential to fulfill an important educational function for public health messages. joy, fear, and surprise were the dominant emotions for the early phase of social distancing. this correlated well with the topics we discovered, which included leisure activities, community support, and messages of hope (ie, joy), concerns about food insecurity, spreading of the infection, effects of the quarantine (ie, fear), unpredictability of covid and its unforeseen implications (ie, surprise). as time progresses and the effects of social distancing become more prominent, we would anticipate that other themes such as loss of income, unemployment, inflation, and financial burden would increase in frequency. the topics we discovered can be aggregated into larger domains. activities that can be performed during social distancing included topics: media and entertainment, activities, and music and media sharing. tweets concerning the actual rationale and effect of the social distancing included topics: public opinions and values, quarantine measures and effects, and quarantine and isolation. two of these were the most prevalent topics. one domain covered the logistics of staying at home falling under a single topic: supplies, food, and orders. the last domain pertained to messages of support and cheering up others: thank healthcare and reduce spread, community support and businesses, spring and good sentiments. our study has several limitations. first, we used social media data and specifically twitter for our analysis. although there are > million monthly active twitter users, our methodology likely introduced some sampling bias to those with internet and technology access. second, we used noncomprehensive trending hashtags to identify the most relevant social distancing tweets. we may have missed alternative terminology or key words such as "self isolation" and "corona lockdown," which appeared as weighted terms in our topic modeling. however, given that these hashtags were the top-trending social distancing hashtags, we expect that these were representative of social distancing during the study period. we recognize that the study period serves as an initial snapshot, rather than a complete evolution, of public perception towards social distancing and that sentiment and topics likely have changed over time. a longitudinal analysis will be a part of future directions. third, despite analyzing a large number of tweets, we used only a subset of tweets during this time frame, which may have resulted in selection bias. having analyzed only english tweets, our conclusions may not be generalizable to non-english speaking populations. since most tweets do not have geolocation, we are also limited in making conclusions based on geographic areas or countries. fourth, a study found that between % and % of all twitter accounts are bots, which may have affected our analysis. we used the twitter bot analyzer botometer to analyze a random sample of , users in our dataset. we found that % of users have a < % chance of being a bot. figure s shows the complete probability distribution. excluding the remaining % of users did not change sentiment, emotion, or subjectivity analysis. finally, we recognize the risk of labeling bias through assignment of topic themes to weighted terms. we attempted to prevent this by having authors perform the topic modeling and author independently perform the labeling task. 
in the early phases of social distancing, we were able to successfully obtain and analyze a representative subset of tweets related to this topic. performing sentiment, emotion, and content analysis of tweets provided valuable insight into the public's beliefs and opinions on social distancing. tweets were predominantly objective, with joy, fear, and surprise as the leading emotions, and more than % of tweets contained social distancing instructions. in the early phases of social distancing, tweets were skewed toward leisure activities and discussion of the rationale and effects of social distancing. as social distancing progresses and is eventually lifted, we anticipate that sentiment and topics will change. although "attitude is only one antecedent of behavior," the positive emotions, the preponderance of objective tweets, and the topics supporting coping mechanisms led us to conclude that twitter users generally supported the social distancing measure. analyzing tweets about nonpharmaceutical interventions such as social distancing based on content, sentiment, and emotion may prove valuable for public health communication, knowledge dissemination, and the adjustment of mitigation policies in the future. future research implementing this analysis in real time using the twitter streaming api could augment directed messaging based on user interest and emotion.
acknowledgments. financial support: no financial support was provided relevant to this article. supplementary material: to view supplementary material for this article, please visit https://doi.org/ . /ice. .

key: cord- -peakgsyp authors: walsh, james p title: social media and moral panics: assessing the effects of technological change on societal reaction date: - - journal: nan doi: . / sha: doc_id: cord_uid: peakgsyp abstract: answering calls for deeper consideration of the relationship between moral panics and emergent media systems, this exploratory article assesses the effects of social media - web-based venues that enable and encourage the production and exchange of user-generated content. contra claims of their empowering and deflationary consequences, it finds that, on balance, recent technological transformations unleash and intensify collective alarm. whether generating fear about social change, sharpening social distance, or offering new opportunities for vilifying outsiders, distorting communications, manipulating public opinion, and mobilizing embittered individuals, digital platforms and communications constitute significant targets, facilitators, and instruments of panic production. the conceptual implications of these findings are considered.

a number of recent studies have examined moral panics in relation to digital technologies (for example, see flores-yeffal et al., ; hier, ; marwick, ; wright, ). despite their insight and contributions, knowledge of social media's diverse effects remains scattered and fragmentary. thus, while some of this article's propositions can be gleaned from existing studies, it offers a systematic elaboration that aims to promote analytic balance and encourage productive exchanges that can orient future scholarship. after revisiting the media-moral panic relationship, this article assesses how social media escalate the frequency and intensity of overwrought reactions. while it addresses several concrete examples - particularly the role of digital communications in promoting extremist agendas, as recent events concerning trumpism, brexit, the alt-right, and 'fake news' have shattered myths regarding social media's positive and empowering qualities - the focus of this article is more on general claims than particular findings. accordingly, rather than a final, definitive statement, it presents developmental suggestions and a heuristic that can, and should, be subjected to further scrutiny and debate. in the end, such preliminary efforts are significant as 'before we can pose questions of explanation, we must be aware of the character of the phenomenon we wish to explain' (smelser, : ). while the identification and policing of deviance are perennial features of human groups, moral panics are 'unthinkable without the media' and are distinctive to modern, mass societies (critcher, : ). in many respects, cohen and his contemporaries (cohen and young, ; hall et al., ; pearson, ) were the first to articulate the essential role of news-making in constructing social problems. beyond generating surplus visibility and making otherwise marginal behaviours appear pernicious and pervasive, the media represent an independent voice. by delineating moral boundaries and circulating dire predictions about monstrous others, the histrionic tenor of reporting sensitizes audiences, culminating in hardened sentiment and unbridled punitiveness (wright, ).
moreover, coverage translates 'stereotypes into actuality', elevating the actual and perceived severity of deviance (young, : ). here, identifying affronts to moral order triggers virulent hostility, further marginalizing folk devils and amplifying their deviant attachments and identities. as a control culture is institutionalized, surveillance and intervention intensify, exposing additional deviance, confirming popular stereotypes and justifying further crackdowns (garland, ). since cohen's research nearly a half-century ago, media systems have undergone sweeping transformation, leading many to question the continued relevance of his work. a particularly influential critique in these regards comes from mcrobbie and thornton ( ). for them ( : ), cohen's emphasis on mass-broadcasting and its social and institutional correlates - a univocal press, hierarchical information flows, monolithic audiences - is untenable in the context of 'multi-mediated social worlds'. specifically, it is held that the proliferation of media sources encourages exposure to alternative, if not dissenting, claims and reactions, ensuring that 'hard and fast boundaries between "normal" and "deviant"' are less common (mcrobbie and thornton, : - ; cf. tiffen, ). moreover, expanded access to media technologies - portable camcorders, personal computers, editing software and so on - broadens the remit of expression, giving rise to media sources inflected with the interests of marginalized groups (coleman and ross, ). able to 'produce their own media' and defended by 'niche and micromedia' (mcrobbie and thornton, : ), folk devils are no longer powerless victims and can 'fight back' (mcrobbie, ; cf. deyoung, ; thornton, ). consequently, deviant outsiders and their supporters display greater capacity to contest and short-circuit panicked reactions, outcomes that render the success of moral crusades 'much less certain' (mcrobbie and thornton, : ). focused on the diversification of conventional media space, mcrobbie and thornton conducted their stock-taking precisely as media systems were being further destabilized. with the onset of the 21st century, digital platforms not only underpin but also constitute social life in affluent societies, with individuals' identities and relations at least partly cultivated through computing infrastructures (lupton, ). among the most significant manifestations of 'digital societies' are social media. whether as social networking (facebook), micro-blogging (twitter), or photo- (instagram) and video-sharing (youtube) sites, social media have profoundly reconfigured the production and exchange of information. as 'many-to-many' systems of communication, they promote vernacular discourse and creativity, permitting ordinary users to produce and distribute staggering quantities of 'user-generated content' (keane, ; yar, ). digital platforms are also displacing the mass media as an information source. finally, as loosely coupled networks of users, their structure not only promotes virality - the rapid and unpredictable diffusion of content - but also fosters an expansive virtual sociality (baym, ). here, various attributes - 'likes', 'retweets', hashtags (#), mentions (@) and so on - index and anchor communications, promoting awareness of others and uniting spatially dispersed users into communities of shared interest and identity (murthy, ).
while mcrobbie and thornton could not have anticipated these momentous shifts, contemporary scholarship assumes, either overtly or implicitly, that their corrective remains as, if not more, relevant today (for example, see carlson, ; carrabine, ; fischel, ; marres, ). with information control representing a critical axis of power, social media are frequently depicted as an elite-challenging 'microphone for the masses' (murthy, ; cf. gerbaudo, ; jenkins, ). here, the accessibility and sophistication of digital platforms is believed to empower ordinary citizens to make their own news, name issues as public concerns, and shape collective sentiment (coleman and ross, ; turner, ). with knowledge production and image-making increasingly steered by non-experts, many perceive citizen journalism as breeding accounts of reality rooted in public-mindedness rather than sensationalism or commercial considerations (goode, ). in light of such developments, noted panic scholars claim digital media are shifting 'the locus of definitional power', ensuring 'more voices are heard' (critcher, ) and generating 'new possibilities for resistance' (lindgren, : ). thus, the increasingly nodal configuration of media space has attenuated moral guardians' influence, ensuring that panics are 'more likely to be blunted and scattered among competing narratives' (goode and ben-yehuda, : ; cf. le grand, ). while mcrobbie and thornton's claims remain influential, their ability to convincingly order the evidence is considerably more limited than recent analysis suggests. in accentuating social media's progressive consequences - information pluralism and robust opportunities for citizens to access the public sphere and defuse frenzied reactions - existing scholarship neglects how digital platforms are 'underdetermined' and double-edged (monahan, ). informed by such issues, the following offers a counterpoint, detailing how social media's affordances intensify the proclivity to panic. whether as objects of unease, sources of acrimonious division, or venues for staging moral contests, on balance, contemporary media systems promote febrile anxiety. changing communicative and informational conditions frequently incite moral restiveness. as cohen himself intimates ( [ ]: xvii), societies are regularly gripped by fears that, if improperly governed, new media will have deleterious effects on younger generations. the latest iteration of so-called 'media panics' (drotner, ) or 'techno-panics' (marwick, ), reactions to social media encapsulate deep-seated anxieties about social change and the types of people it begets. like prior episodes involving 'dangerous' media, including 'penny dreadfuls', pinball machines, comic books and 'video nasties', youth are ambivalently constructed as threatened and threatening (springhall, ). while anxieties have surfaced around vulnerability stemming from, inter alia, online predators, sexting, cyber-bullying, and exposure to violent and pornographic content (barak, ; gabriel, ; lynch, ; milosevic, ), youth are also positioned as undisciplined and pathological, with social media branded a leading culprit. alongside being blamed for moral failings - obesity, addiction, disengagement, cultural vacuity, solipsism (baym, ; thurlow, ; szablewicz, ) - multi-media platforms have been linked to violent criminality.
whether in relation to video game violence, the possibility of obtaining information about weaponry and prior incidents, or the promise of celebrity immortality offered by documenting their grievances and attacks, digital media have been maligned for encouraging school shootings and associated massacres (ruddock, ; sternheimer, ). further, during the england riots, journalists and politicians referenced blackberry and twitter 'mobs', claiming teenage gangs employed digital communications to evade authorities, publicize lawlessness and coordinate anti-social behaviour (crump, ; fuchs, ). such fears have frequently culminated in attempts by adult society to intensify surveillance, censorship, and control over online platforms. for such crusaders, who often utilize the very technologies they condemn to whip up outrage, techno-panics provide an alibi for manning the 'moral barricades' and reasserting the hegemony of their values (sternheimer, ). thus, while they may empower grassroots actors and disturb social hierarchies, technological changes equally engender moral backlash and nostalgia. social media also reconfigure the external environment wherein panics occur. frequently valorized for encouraging connectedness and encounters with diverse others, upon closer inspection digital platforms exert centripetal force, producing 'filter bubbles' (pariser, ) and 'information silos' (mcintyre, ) which narrow social horizons and increase the likelihood of engaging with affective, and often acerbic, content. as news is increasingly digitally mediated, such dynamics reveal, pace mcrobbie and thornton ( ), that there is no one-to-one correspondence between media and message pluralism. able to curate content at the expense of professional gatekeepers, social media allow users to construct information ecologies that are personalized and restricted (sunstein, ). such outcomes are exacerbated by social media's 'aggregative functionalities' (gerbaudo, ): the use of promotional algorithms to deliver tailored content (rogers, ). for example, by assessing the volume of 'clicks' (likes, shares, mentions, etc.) that communications receive, facebook's customized news feed determines what is worthy of users' attention, filtering out stories deviating from extrapolations of their interests and preferences (mcintyre, ). as this and related examples suggest, by amplifying users' biases and aversions, social media encourage confirmation bias and isomorphic social relations (powers, ). social media also favour content likely to generate significant emotion and outrage. by promoting communications based on predicted popularity, they prioritize and reward virality and intensity of reaction rather than veracity or the public interest (van dijck, ; yardi and boyd, ). the result is the proliferation of 'click-bait', deliberately sensationalized content that captivates through affective arousal (vaidhyanathan, ). more significantly, new media systems privilege incendiary communications. research suggests that, even for the most staid users, the frisson of disgust is too alluring, as content unleashing fear and anger about out-groups is considerably more likely to garner attention and 'trend' (berger and milkman, ; vosoughi et al., ). these dynamics ultimately appear contagious, as messages' emotional valence 'infects' other users, influencing their subsequent interactions and escalating bitterness and antipathy within online environments (kramer et al., ; stark, ). together such conditions promote anxious alarm.
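a toy illustration of the engagement-driven ranking described above is sketched below; the weights and the scoring rule are invented for illustration and do not reflect any platform's actual algorithm.

```python
# toy feed ranking: posts ordered by a weighted sum of interaction counts,
# so high-arousal content that attracts shares rises to the top of the feed
posts = [
    {"id": "a", "likes": 120, "shares": 10, "comments": 4},
    {"id": "b", "likes": 35,  "shares": 90, "comments": 60},  # divisive, widely shared
    {"id": "c", "likes": 200, "shares": 2,  "comments": 1},
]

def engagement_score(post, w_like=1.0, w_share=5.0, w_comment=3.0):
    # weights are hypothetical; shares are weighted heavily because they
    # propagate content to new audiences
    return (w_like * post["likes"] + w_share * post["shares"]
            + w_comment * post["comments"])

feed = sorted(posts, key=engagement_score, reverse=True)
print([p["id"] for p in feed])  # ['b', 'c', 'a']
```

even in this crude form, the divisive post "b" outranks posts with far more likes, which is the dynamic the passage attributes to engagement-optimized feeds.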
by allowing users to remain cloistered within their preferred tribes and visions of reality, digital platforms encourage misrecognition and distort understanding of social issues, making the acceptance of bloated rhetoric more likely (albright, ). accordingly, they obstruct heterogeneous interactions and exposure to opposing perspectives, dynamics long identified as precluding the root causes of panics - intolerance and hostility (murthy, ). finally, by inflating the visibility of inflammatory content, social media mobilize animosity towards common enemies and transform uneasy concern into full-blown panic. alongside breeding fissiparous societies, multi-media platforms can be wielded to engineer crises. historically, panics have required the mass media to generate sufficient concern and indignation. social media expand the pathways of panic production. as detailed below, by allowing ordinary netizens to identify and sanction transgression, they unleash participatory, crowd-sourced panics. additionally, as architectures of amplification, their structural features can be commandeered to promote moral contests that are surreptitious, automated, and finely calibrated in their transmission and targeting. conventional wisdom suggests that panics are spearheaded by seasoned and advantageously positioned activists and elites. by expanding capacities of media production and distribution, digital communications permit citizens to directly publicize issues and promote collective action. typically this has been associated with amateur news-making and attempts to document injustice and promote transparency and accountability (coleman and ross, ; walsh, a; yar, ), but scholars have recently documented opposing trends, where social media are appropriated to define and enforce public morality. as lay actors increasingly participate in the exposure and sanctioning of deviance, distinctions between the media, the public and moral entrepreneurs are blurring, ensuring that panics stem from unorthodox sources and display new discursive and interactional contours. on the one hand, social media enable micro-crusades that, while lacking broad public appeal and support, are sustained by dispersed groups of devoted and technologically equipped citizens. whether employed to advance claims that harry potter promotes satanism and the occult to impressionable youth (sternheimer, ) or to discredit public officials and assert a link between vaccinations and autism (erbschloe, ), digital environments offer optimal arenas for uniting the conspiratorial. given their accessibility and ease-of-use, they obviate the need for elite participation, promoting patterns of mobilization around issues where all citizens potentially emerge as crusaders (hier, ). moreover, social media's 'mob-ocratic' tendencies can activate collective effervescence (gerbaudo, ), producing panics driven by mass collaboration. falling into this category are online 'firestorms' (johnen et al., ), spontaneous and electric outbursts where the documentation and exposure of moral breaches - petty theft, public outbursts, drug use, sexual promiscuity, etc. - are rapidly disseminated, igniting interactive cascades of denigration (trottier, ; wright, ). such episodes often culminate in digital vigilantism: forms of extra-judicial punitiveness - ostracism, doxing, harassment, job loss, physical attacks, death threats - that emerge from below (powell et al., ; trottier, ).
consequently, alongside increasing the frequency and velocity of panics, online environments appear to promote heightened virulence and excoriation. while underpinned by emergent technologies, forms of digitally mediated opprobrium are inseparable from late-modern social conditions, as they offer a palliative for ontological precarity and allow otherwise atomized individuals to police social boundaries (ingraham and reeves, ; cf. bauman, ). beyond expanding the profile of moral entrepreneurs, the networked and digital configuration of social media can also be marshalled to distort information flows, promote incendiary content, and channel user experience and engagement. in such instances, digital platforms constitute architectures of amplification that allow interested parties to punch well above their weight.
'attention hacking' and media manipulation. on the one hand, digital platforms permit highly energized and sustained groups to sculpt public sentiment by maximizing the visibility of 'information pollution' and 'fake news' - arresting, sensational and morally tinged content designed to distort and agitate (kalsnes, ). whether by steering communications, creating fake accounts, or exploiting digital interactions, techniques of 'attention hacking' can strategically influence engagement patterns and produce wildly disproportionate effects (marwick and lewis, ). ultimately, by allowing users to eliminate ambiguity and delineate moral boundaries in publicly visible ways, sites like twitter and facebook generate new types of agency that can rapidly propel the ideas and identities of various outsiders into prominence (joosse, ). with their cacophonous character making it difficult to vet the integrity of content, digital platforms have been inundated with captivating, tendentious and skewed, if not entirely spurious, communications (news stories, videos, memes, blog posts, hashtags, etc.) that distort online conversations and mobilize receptive users. an exemplary case of digitally mediated crusades appeared during the american election, as dedicated members of the 'alt-right', as well as digital mercenaries employed by the internet research agency (ira), a russia-backed 'troll farm', devoted considerable energy and resources to shaping political communication and behaviour. central to their efforts was the creation, sharing, liking and promotion of misinformation and provocative discourse about contentious sociocultural issues, including race relations, gun control, abortion, islamophobia and men's rights (bradshaw and howard, ; nagle, ; singer and brooking, ). armed with an appreciation of digital platforms' value in shifting the parameters of public discourse, such actors succeeded in generating virality, obtaining mainstream press coverage, and inciting considerable outcry and anxiety (phillips, ). more recently, the role of digital communication in spreading fake news and inciting panic was on full display in initial reactions to the novel coronavirus (covid- ), an infectious respiratory disease of zoonotic origin. following its emergence in wuhan, china, in january , widespread scapegoating and fear-mongering erupted across social media. in relation to the former, the virus was racialized, with numerous messages linking it to the ostensibly exotic dietary practices and unsanitary behaviour of chinese populations, and with representations depicting them as folk devils and dangerous, impure others (yang, ).
reflecting a 'politics of substitution' (jenkins, ), such claims-making diverted attention from considerably more deadly (and preventable) diseases (e.g. malaria), as well as the structural conditions - media censorship, political corruption, weakly enforced health and safety standards - underlying the emergence and rapid spread of the disease. digital platforms were also used to circulate misinformation and dire, if not apocalyptic, predictions, with various rumours - whether false reports of positive cases and contaminated chinese imports, stories of individuals absconding from quarantine zones, or claims that the virus was a bioweapon developed by the chinese or american governments - outpacing official information during the early stages of the outbreak (bogle, ). by contributing to a broader climate of suspicion, such communications appear to be reactivating fears of a 'yellow peril', as well as producing emergency measures (enhanced surveillance, quarantines, travel bans, etc.) and everyday expressions of racism and anti-chinese sentiment (dingwall, ; palmer, ; yang, ). as this example reveals, like prior epidemics (sars, aids, etc.) where media coverage promoted fear and opprobrium about various outsiders (gay men, drug users, foreigners; see muzzatti, ; ungar, ; watney, ), digital communications also play a significant role in distorting understanding and encouraging over-reaction. the episode equally suggests, however, that social media's anonymous, horizontal structure ensures that messages travel exponentially faster, lack clear origins and feature palpable vitriol, outcomes that escalate the impetus and excess of alarm (miller, ). the spread of information pollution frequently hinges on perceptions of social media as the embodiment of the vox populi (gerbaudo, ). here, fake accounts are utilized to raise awareness and bolster the credibility of favoured content. on the one hand, advances in artificial intelligence allow bots - machine-led communications tools that mimic human users and perform simple, structurally repetitive tasks - to spread 'computational propaganda' (bradshaw and howard, ; ferrara et al., ). as social machines and artificial voices, bots automate and accelerate diffusion and engagement, creating, liking, sharing, and following content at rates vastly surpassing human capabilities. thus, they facilitate viral engineering: expanding the momentum of certain messages and, in the process, altering information flows. to exude authority and authenticity, content is also circulated by bogus 'sockpuppet' accounts posing as those of accredited experts (scientists, journalists, etc.) or ordinary citizens belonging to various groups (women, blue-collar workers, police officers, urban youth, etc.) and appearing to possess folk wisdom (bastos and mercea, ; marwick and lewis, ). whether manual or automated, techniques of media manipulation also control narratives by reducing the visibility of unwanted and objectionable content. here, keywords and hashtags affiliated with opposing perspectives can be 'hijacked' as platforms are flooded with nonsense or negative messages to disrupt and drown out specific communications, denuding them of their salience and influence (woolley and howard, ). a recent example of such efforts is found in twitter communications concerning the intensity of the - australian bushfires, an outcome widely linked to the longer fire seasons produced by climate change.
the preliminary results of research conducted by graham and keller ( ) suggest that, at the height of the crisis, a coordinated misinformation campaign was waged by a sprawling network of troll and bot accounts to advance broader narratives of climate denial. by flooding social media with hashtags like #arsonemergency (in place of #climateemergency) and co-opting those already trending (e.g. #australiafire, #bushfireaustralia), such actors sought to publicize conspiracies that criminal elements - whether arsonists, radical environmentalists, or isis fighters - were responsible for the blazes and that climate change is an elite-engineered hoax and form of population control (knaus, ). finally, the propagation of misinformation involves attempts to harness social interaction and collective sense-making. studies suggest that distorted, emotionally charged content is considerably more likely to be shared by ordinary users who unwittingly enlarge its sphere of influence (albright, ; tanz, ). by bearing the imprimatur of whoever shared it, whether a relative, colleague, neighbour, or opinion leader, the substance of messages is validated and appears authentic as it spreads laterally across users' networks (van der linden, ). for instance, on several occasions, accounts linked to the alt-right and russian operatives have successfully 'seeded' content, goading journalists, bloggers, activists, and politicians (including president trump) into endorsing particular communications and providing broader platforms (phillips, ). since messages distributed through formal channels and hierarchical apparatuses are frequently perceived as self-serving and inauthentic, media manipulation provides a powerful vehicle of promotion. by engineering popularity and relevance, the discursive swarms unleashed by bots and fake accounts can generate an impression of credibility, unanimity and common sense, an outcome essential to normalizing particular modes of thought (chen, ). ultimately, by concealing the authors and agendas behind communications, such practices facilitate shadow crusades and astroturfing (rubin, ). while applicable to numerous topics, digitally mediated crusades are distinctly prominent in relation to issues - migration, crime and policing, or terrorism - identified as leading and recurrent sources of panic (hall et al., ; kidd-hewitt and osborne, ; odartey-wellington, ; walsh, ; welch and schuster, ), as well as central topics in online conversations during critical political moments (benkler et al., ; evolvi, ). for instance, in their recent study of anti-immigrant crusades, flores-yeffal et al. ( ) observed how the indexing of social media communications through hashtags like #illegalsarecriminals and #wakeupamerica fostered networked discourses and connectedness, helping to construct scapegoats, circulate calls for action, and ensure that xenophobic rhetoric echoed throughout cyber-space (see also morgan and shaffer, ). additionally, preceding the brexit referendum, supporters of the far-right uk independence party utilized digital platforms to trigger and inflate fears about foreigners, circulating contentious claims about workforce competition, cultural displacement, crime and terrorist infiltration (vaidhyanathan, ).
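graham and keller's method is not reproduced here, but one simple, illustrative coordination signal for hashtag hijacking of the #arsonemergency kind is to measure how concentrated a hashtag's volume is among its most active accounts; the function and sample data below are assumptions for demonstration only.

```python
# illustrative coordination signal: organic hashtags tend to be spread across
# many users, while amplified ones are often dominated by a small cluster of
# high-frequency (bot or troll) accounts
from collections import Counter

def top_k_share(tweets, hashtag, k=10):
    authors = [t["user"] for t in tweets if hashtag in t["text"].lower()]
    counts = Counter(authors)
    total = sum(counts.values())
    top_k = sum(n for _, n in counts.most_common(k))
    return top_k / total if total else 0.0

sample = [
    {"user": "acct1", "text": "#ArsonEmergency wake up"},
    {"user": "acct1", "text": "#ArsonEmergency it was arsonists"},
    {"user": "acct2", "text": "#ArsonEmergency not climate"},
    {"user": "acct3", "text": "#AustraliaFire donations needed"},
]
print(top_k_share(sample, "#arsonemergency", k=1))  # 2/3 of volume from one account
```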
computational crusades. finally, social media unleash crusades that are data-driven, granular, and highly dynamic in their transmission and targeting. here, the digital surveillance and marketing infrastructures that underpin social media's profitability permit computational modelling of user data, promising greater awareness of audiences and encouraging claims-making practices involving extensive narrowcasting; behavioural and psychometric profiling; and the production of predictive knowledge. while empowering users as participants and agents of communication, digital platforms also render them legible, as vast tranches of information about their attributes (e.g. gender, race, income), activities (e.g. hobbies, movements, browsing habits), and associations (e.g. relational ties, organizational memberships) are continuously scrutinized for commercial, legal and political purposes (nissenbaum, ). once harvested, user data undergoes deep profiling, producing digital dossiers which sort individuals based on dozens, and potentially hundreds, of variables. consequently, audiences are less collectivities to be influenced en masse than individually calculable units, arrangements that permit those possessing the necessary resources and technological literacy to target users with highly customized messages (zuboff, ). accompanying geodemographic criteria, algorithms can identify and calculate expressive energies and subjective orientations - moods, sensibilities, and emotions. with advances in machine learning and sentiment analysis, digital communications can be analysed to map meaning structures and discern personality traits on scales previously unimaginable (andrejevic, ; stark, ). for example, cambridge analytica, a consulting firm hired to assist the trump campaign's online messaging, harvested data concerning online engagement for over million facebook users, pooling it with other information to develop a sprawling collection of psychographic profiles on potential voters and gauge their receptiveness to various messaging strategies (vaidhyanathan, ). heralding the rise of communications that, while reaching immense audiences, are highly differentiated, it is estimated that, with the assistance of big data analytics, trump's campaign disseminated over million distinct online ads, with variations of individual messages at times surpassing , (singer and brooking, ). big data also yields inferential and predictive knowledge, with computer models unearthing correlations, extrapolating information about users, and forecasting reactions. here, digital enclosures are mined to identify regularities against which users are continuously compared, outcomes that allow claims makers to anticipate content's likely resonance and develop flexible outreach strategies (baym, ). practices of dataveillance are also recursive, as feedback in the form of engagement patterns is reflexively monitored to elaborate correlations and deepen knowledge of users (neuman et al., ). accordingly, digital communications double as iterative experiments where multiple messages can be distributed simultaneously to survey reactions and refine techniques of persuasion (andrejevic, ). in relation to panics, profiling user data liberates crusaders from 'monolithic mass-appeal, broadcast approach[es]' to issue mobilization (tufekci, ). rather than attracting support through unifying, 'big tent' issues, dataveillance facilitates agile micro-targeted crusades.
able to cleave populations into demographic and affective types, moral guardians can precisely 'hail' subjectivities, allowing them to combine mass transmission with individual connection and overcome what has traditionally been a hobson's choice between maximal exposure and intimate resonance. consequently, moral contests promise to become exponentially more sophisticated, ensuring that overwrought discourse reaches, motivates and energizes its intended targets. moreover, given the expressive contours of panics, and the importance of emotions - anxiety, hostility, even hysteria - as levers of action (walby and spencer, ), the mining, measurement and classification of affective states allows crusaders to viscerally connect with audiences and strengthen their messaging. as a distinct species of collective behaviour, moral panics represent contentious and intensely affective campaigns to police the parameters of public knowledge and morality. as such, they are necessarily dependent upon and constituted by claims-making, with interested parties historically seeking to actuate alarm by influencing the imagery and representations of the mainstream press, arrangements disrupted by recent upheavals in media space. to illuminate the complex relationship between panics and the broader socio-technical context in which they unfold, this article has surveyed the impact of digital communications, presenting a taxonomy of social media's effects on the issues, conditions and practices that incite collective alarm. while displaying elite-challenging potential, social media are ultimately janus-faced and contradictory. alongside providing emergent sources of unease, they cultivate facilitating conditions and offer ideal venues for constructing social problems. specifically, by elevating agitational discourse and promoting homophily, social media generate social friction and hostility. moreover, as instruments of panic production, new technologies reshape the identification and construction of deviance, both permitting lay participation and allowing various parties to manipulate public communications in ways that produce outsized, imperceptible and highly efficient influence. while gauging the precise effects of social media requires more rigorous scrutiny than can be provided here, the available evidence indicates that, all things considered, they inflate the incidence and severity of panics. on the one hand, various studies suggest that, as architectures of amplification, digital platforms reduce transaction costs and transform peripheral (as well as automated and artificial) voices into conspicuous claimants (vaidhyanathan, ). they also appear to enhance the spread of information pollution, with scholarship revealing that, whether transmitted by algorithms or human agents, 'misinformation, polarizing, and conspiratorial content' (howard et al., : ) not only 'diffuse[s] significantly further, faster, [and] deeper' on social media (vosoughi et al., : ; albright, ) but also, during the final days of the election, represented the most popular informational content on facebook, leading many to speculate that it played a decisive role in trump's victory (waisbord, ). finally, evidence surrounding the extent to which gross distortions, extremist views and readily falsifiable conspiracies (such as the views that climate change is a manufactured crisis, violent crime is at historic highs, or undocumented migrants are overwhelmingly violent criminals) are being normalized as public idiom gives considerable cause for concern (mcintyre, ; scheufele and krause, ).
beyond advancing understanding of the media-moral panic relationship, an important task in its own right, by initiating dialogue between theoretical expectations and empirical instances, the preceding analysis promotes conceptual refinement and renewal. specifically, accounting for social media's effects on panic production illuminates significant mutations surrounding the interactants, functions and communicative patterns that define contemporary crusades. first, as many-to-many systems of communication, social media promote novel patterns of participation, offering ordinary persons a greater role, facilitating spontaneous outbursts driven by multitudes and introducing automated, machine-led campaigns. additionally, in enabling new techniques of media manipulation, digital platforms contribute to the weaponization of panics. while conventional wisdom suggests that panics represent domestic affairs, oriented towards mobilizing support, acquiring power and status or manufacturing consent, the case of russia and information warfare suggests that normative conflict may be exogenously engineered to provoke significant social and psychological disruption. finally, in place of uniform messages and mass appeal, the combination of data-mining and behavioural profiling unleashes claims-making techniques that are inhabited and hyper-targeted. drawing attention to these features exposes significant transformations and bolsters the versatility and explanatory capacity of cohen's paradigm. thus, mirroring other recent interventions (falkof, ; joosse, ; wright, ), by accounting for emergent social conditions, this article advances a nuanced, flexible framework rather than a fixed, uniform model. ultimately, exposing anomalous findings that push the limits of existing perspectives extends the concept's range of applicability, promoting a more robust framework capable of accommodating pivotal shifts in media space and the social relations they engender. alongside laying the foundation for further empirical applications, given the depth and rapidity of social change, such conceptual dexterity is an asset rather than a liability (jewkes, ). as an account of reaction and social problems construction, moral panic theory has traditionally emphasized the mass media's role in sculpting collective knowledge, arbitrating between the real and represented, and generating significant discrepancy between risk and response. this article suggests that, while the legacy press continues to play a significant role, with the ubiquity of digital platforms and technologies, the emergence and spread of panics is being reconstituted. in particular, scholars can further refine and expand the concept's range and impact by engaging with social media's diverse and far-reaching effects on the contours of collective alarm. while it is admittedly premature to predict what new attributes media systems will assume, and there is too much contingency to suggest that future developments will follow an inexorable path, it is hoped that, by taking technological change into account, the idea of moral panic will continue to influence understandings of how fear and transgression are mobilized for varied purposes. the author received no financial support for the research, authorship, and/or publication of this article.
notes:
in a striking example of online shaming and digitally mediated outrage, moral entrepreneurs associated with anti-paedophile activism in canada, the uk, and russia have all employed digital platforms to investigate, identify, expose and censure suspected sex offenders (favarel-garrigues, ; trottier, ).
citing donald trump's rise as a charismatic political maverick, joosse ( ) argues that non-traditional media are ideally suited for producing and reiterating simplistic and highly resonant moral categories, outcomes that can endow otherwise peripheral parties with significant power and influence.
while a full discussion exceeds the scope of this article, the alt-right encompasses an ill-defined amalgam of actors (white nationalists, men's rights activists, palaeo-conservatives, nativists, etc.) united by opposition to 'identity politics', multiculturalism, and perceived 'political correctness' (hawley, ; nagle, ).
for instance, videos of chinese citizens eating bats, rodents, snakes and other 'dirty' or 'exotic' wildlife were quickly posted and widely distributed across various social networking sites (palmer, ).
surveys from the usa reveal one-quarter of respondents have knowingly shared misinformation on social media (barthel et al., ).
russian operatives also contributed to such efforts, distributing content and even organizing protests through fake twitter and facebook accounts (singer and brooking, ).
research reveals, for instance, that various attributes - sexuality, religiosity, education, etc. - can be reliably predicted from patterns involving the single data point of facebook 'likes' (markovikj et al., ).
for example, during the election, content from just six russian-backed facebook accounts garnered million shares and nearly million interactions on the platform (matsakis, ). additionally, whether deployed by foreign agents or domestic extremists, bots produced one-third of posts concerning the brexit vote, despite representing just % of active twitter accounts (narayanan et al., ).
for instance, over two-thirds of americans claim that fake news has left them disoriented and confused about basic facts (barthel et al., ), while another survey revealed % of americans familiar with a fake news headline thought it was accurate (roozenbeek and van der linden, ).
mcrobbie and thornton's ( ) critique continues to be cited as a core 'dimension of dispute'.
facebook's . billion active users leave roughly , comments per minute and share over billion posts per day.
two-thirds of american adults obtained some of their news from social media (shearer and gottfried, ), while, for british and north american youth, it represents their primary news source.
an exemplary case is pekka-eric auvinen, a finnish shooter deemed the 'youtube gunman' after using the video-sharing site to publicize his actions, espouse nihilistic views, and share a final message immediately before killing eight people.
one study of facebook found the 'click-through' rate for socially divisive content exceeded typical ads by tenfold.
references:
welcome to the era of fake news
infoglut: how too much information is changing the way we think and know
sexual harassment on the internet
liquid fear
many americans believe fake news is sowing confusion
the brexit botnet and user-generated hyperpartisan news
data not seen: the uses and shortcomings of social media metrics
personal connections in the digital age
network propaganda: manipulation, disinformation, and radicalization in american politics
what makes online content viral
coronavirus misinformation is running rampant on social media. abc science
social network sites as networked publics: affordances, dynamics, and implications
troops, trolls and troublemakers: a global inventory of organized social media manipulation. the computational propaganda project
moral panic, moral breach: bernhard goetz, george zimmerman, and racialized news reporting in contested cases of self-defense
crime, culture and the media
digital objects, digital subjects
the agency. new york times magazine
folk devils and moral panics
the manufacture of news: a reader
the media and the public
moral panics
what are the police doing on twitter? social media, the police and the public
the idea of moral panic: ten dimensions of dispute
considering the agency of folk devils
we should deescalate the war on the coronavirus. wired
dangerous media? panic discourses and dilemmas of modernity
extremist propaganda in social media
#islamexit: inter-group antagonism on twitter. information
on moral panic: some directions for further development
digital vigilantism and anti-paedophile activism in russia. global crime, epub ahead of print october
the rise of social bots
sex and harm in the age of consent
#wakeupamerica #illegalsarecriminals: the role of the cyber public sphere in the perpetuation of the latino cyber-moral panic in the us
social media: a critical introduction
sexting, selfies and self-harm: young people, social media and the performance of self-development
on the concept of moral panic. crime
social media and populism: an elective affinity? media
social news, citizen journalism and democracy
bushfires, bots and arson claims: australia flung in the global disinformation spotlight. the conversation
making sense of the alt-right
interview with stuart hall. communication and critical/cultural studies
moral panics and digital-media logic: notes on a changing research agenda. crime, media, culture, epub ahead of print
social media, news and political information during the us election. arxiv, epub ahead of print
new media, new panics
fans, bloggers, and gamers: exploring participatory culture
intimate enemies
media and crime
the digital outcry: what incites participation behaviour in an online firestorm? new media & society
expanding moral panic theory to include the agency of charismatic entrepreneurs
fake news
democracy and media decadence
bots and trolls spread false arson claims in australian fires 'disinformation campaign'. the guardian
experimental evidence of massive-scale emotional contagion through social networks
the ashgate research companion to moral panics. farnham: ashgate
moralising discourse and the dialectical formation of class identities
youtube gunmen? mapping participatory media discourse on school shooting videos. media
pirate panics: comparing news and blog discourse on illegal file sharing in sweden. information
pedophiles and cyber-predators as contaminating forces. association for the advancement of artificial intelligence
digital sociology: the reinvention of social research
to catch a predator? the myspace moral panic
media manipulation and disinformation online
the ftc is officially investigating facebook's data practices
crime, media and moral panic in an expanding european union
folk devils fight back
rethinking 'moral panic' for multi-mediated social worlds
it plays to our worst fears': coronavirus misinformation fuelled by social media. cbc
cyberbullying in us mainstream media
surveillance in the time of insecurity
sockpuppets, secessionists, and breitbart. medium, march
twitter: microphone for the masses? media
bits of falling sky and global pandemics: moral panic and severe acute respiratory syndrome (sars). illness
kill all normies: online culture wars from chan and tumblr to trump and the alt-right
russian involvement and junk news during brexit. the computational propaganda project (comprop) data memo
the dynamics of public attention: agenda-setting theory meets big data
privacy in context
racial profiling and moral panic: operation thread and the al-qaeda sleeper cell that never was
don't blame bat soup for the wuhan virus
the filter bubble
hooligan: a history of respectable fears
the oxygen of amplification
not everyone in advanced economies is using social media
digital criminology: crime and justice in digital society
my news feed is filtered? awareness of news personalization among college students
on microtargeting socially divisive ads: a case study of russia-linked ad campaigns on facebook
the fake news game: actively inoculating against the risk of misinformation
deception detection and rumor debunking for social media
youth and media
science audiences, misinformation, and fake news
news use across social media platforms
likewar: the weaponization of social media
theory of collective behavior
youth, popular culture and moral panics
algorithmic psychometrics and the scalable subject
pop culture panics
#republic: divided democracy in the age of social media
the ill effects of 'opium for the spirit': a critical cultural analysis of china's internet addiction
moral panic
journalism fights for survival in the post-truth era
club cultures
from statistical panic to moral panic: the metadiscursive construction and popular exaggeration of new media language in the print media
moral panics
digital vigilantism as weaponisation of visibility
coming to terms with shame: exploring mediated visibility against transgressions
engineering the public: big data, surveillance and computational politics
ordinary people and the media: the demotic turn
is this one it? viral moral panics
beating the hell out of fake news
the culture of connectivity
the spread of true and false news online
truth is what happens to news: on journalism, fake news, and post-truth
social media 'outstrips tv' as news source for young people
how emotions matter to moral panics
moral panics by design: the case of terrorism
the handbook of social control
desired panic: the folk devil as provocateur. deviant behavior
social media and border security: twitter use by migration policing agencies. policing and society
social media and policing: a review of recent research
policing desire
detention of asylum seekers in the uk and usa: deciphering noisy and quiet constructions
automation, algorithms, and politics| political communication, computational propaganda, and autonomous agents: introduction
moral panics as enacted melodramas
making sense of moral panics
a new virus stirs up ancient hatred. cnn
the cultural imaginary of the internet
dynamic debates: an analysis of group polarization over time on twitter
the drugtakers. london: macgibbon and kee
big other: surveillance capitalism and the prospects of an information civilization

james p walsh is an assistant professor of criminology at the university of ontario institute of technology. in addition to moral panics, his research focuses on crime and media; surveillance; and border security and migration policing.

key: cord- -pnjt aa authors: ordun, catherine; purushotham, sanjay; raff, edward title: exploratory analysis of covid- tweets using topic modeling, umap, and digraphs date: - - journal: nan doi: nan sha: doc_id: cord_uid: pnjt aa abstract: this paper illustrates five different techniques to assess the distinctiveness of topics, key terms and features, speed of information dissemination, and network behaviors for covid tweets. first, we use pattern matching and, second, topic modeling through latent dirichlet allocation (lda) to generate twenty different topics that discuss case spread, healthcare workers, and personal protective equipment (ppe). one topic specific to u.s. cases would start to uptick immediately after live white house coronavirus task force briefings, implying that many twitter users are paying attention to government announcements. we contribute machine learning methods not previously reported in the covid twitter literature. this includes our third method, uniform manifold approximation and projection (umap), which identifies unique clustering-behavior of distinct topics to improve our understanding of important themes in the corpus and help assess the quality of generated topics. fourth, we calculated retweeting times to understand how fast information about covid propagates on twitter. our analysis indicates that the median retweeting time of covid for a sample corpus in march was . hours, approximately minutes faster than repostings from chinese social media about h n in march . lastly, we sought to understand retweet cascades by visualizing the connections of users over time, from fast to slow retweeting. as the time to retweet increases, the density of connections also increases; in our sample, we found distinct users dominating the attention of covid retweeters. one of the simplest highlights of this analysis is that early-stage descriptive methods like regular expressions can successfully identify high-level themes which were consistently verified as important through every subsequent analysis.
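the abstract's fourth and fifth methods - retweeting times and retweet digraphs - can be sketched as follows. the field names follow the twitter v1.1 payload (created_at, retweeted_status, user), while the helper functions and everything else are illustrative assumptions rather than the authors' code.

```python
# sketch: time-to-retweet and a retweet cascade digraph
from datetime import datetime
import networkx as nx

FMT = "%a %b %d %H:%M:%S %z %Y"  # v1.1 created_at format

def retweet_delay_hours(rt):
    # a retweet object embeds the original tweet under retweeted_status
    t_rt = datetime.strptime(rt["created_at"], FMT)
    t_orig = datetime.strptime(rt["retweeted_status"]["created_at"], FMT)
    return (t_rt - t_orig).total_seconds() / 3600.0

def cascade_graph(retweets):
    g = nx.DiGraph()
    for rt in retweets:
        src = rt["user"]["screen_name"]                      # retweeter
        dst = rt["retweeted_status"]["user"]["screen_name"]  # original author
        g.add_edge(src, dst)
    return g

# users whose messages dominate attention show up with high in-degree, e.g.:
# top = sorted(cascade_graph(rts).in_degree, key=lambda x: x[1], reverse=True)
```

the median of the per-retweet delays gives the summary statistic quoted in the abstract, and binning edges by delay (fast to slow) yields the sequence of increasingly dense digraphs the paper describes.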
monitoring public conversations on twitter about healthcare and policy issues provides one barometer of american and global sentiment about covid . this is particularly valuable as the situation with covid changes every day and is unpredictable during these unprecedented times. twitter has been used as an early warning notifier, emergency communication channel, public perception monitor, and proxy public health surveillance data source in a variety of disaster and disease outbreaks, from hurricanes [ ], terrorist bombings [ ], tsunamis [ ], earthquakes [ ], seasonal influenza [ ], swine flu [ ], and ebola [ ]. in this paper, we conduct an exploratory analysis of topics and network dynamics of covid tweets. since january , there have been a growing number of papers that analyze twitter activity during the covid pandemic in the united states. we provide a sample of papers published since january in table i.
table i. sample of related covid- twitter studies (study and data collection window):
twitter [ ] - jan. to apr.
medford, et al. [ ] - jan. to jan.
singh, et al. [ ] - jan. to mar.
lopez, et al. [ ] - jan. to mar.
cinelli, et al. [ ] - jan. to feb.
kouzy, et al. [ ] - feb.
alshaabi, et al. [ ] - mar. to mar.
sharma, et al. [ ] - mar. to mar.
chen, et al. [ ] - mar. to mar.
schild [ ] - nov. to mar.
yang, et al. [ ] - mar. to mar.
ours - mar. to apr.
yasin-kabir, et al. [ ] - mar. to apr.
chen, et al. analyzed the frequency of different keywords such as "coronavirus", "corona", "cdc", "wuhan", "sinophobia", and "covid- " across million tweets collected from january to march [ ]. thelwall also published an analysis of topics for english-language tweets from march - [ ]. singh et al. [ ] analyzed the distribution of languages and the propagation of myths, sharma et al. [ ] implemented sentiment modeling to understand perception of public policy, and cinelli et al. [ ] compared twitter against other social media platforms to model information spread. our contributions apply machine learning methods not previously reported in the covid twitter literature: mainly, uniform manifold approximation and projection (umap) to visualize lda-generated topics, and directed-graph visualizations of covid retweet cascades. topics generated by lda can be difficult to interpret, and while there exist coherence values [ ] intended to score the interpretability of topics, these scores remain hard to interpret and are subjective. as a result, we apply umap, a dimensionality reduction algorithm and visualization tool that "clusters" documents by topic. vectorizing the tweets using term-frequency inverse-document-frequency (tf-idf) and plotting a umap visualization with the assigned topics from lda allowed us to identify strongly localized and distinct topics. we then visualized "retweet cascades", which describe how a social media network propagates information [ ], through the use of graph models to understand how dense networks become over time and which users dominate the covid conversations. in our retweeting time analysis, we found that the median time for covid messages to be retweeted is approximately minutes faster than for h n messages during a march outbreak in china, possibly indicating the global nature, volume, and intensity of the covid pandemic. our keyword analysis and topic modeling were also rigorously explored, where we found that specific topics were triggered to uptick by live white house briefings, implying that covid twitter users are highly attuned to government broadcasts. we think this is important because it highlights how other researchers have identified that government agencies play a critical role in sharing information via twitter to improve situational awareness and disaster response [ ].
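a minimal sketch of the tf-idf plus umap visualization described above follows; the vectorizer settings, umap hyperparameters, toy tweets, and topic labels are illustrative assumptions rather than the authors' configuration.

```python
# sketch: project TF-IDF tweet vectors to 2-D with UMAP, colored by LDA topic
import matplotlib.pyplot as plt
import umap                                  # umap-learn package
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["ventilators needed in hospitals", "masks for nurses and doctors",
        "cases rising in new york", "ppe shortage for healthcare workers"]
lda_topic = [0, 1, 2, 1]                     # hypothetical topic id per tweet from LDA

X = TfidfVectorizer(min_df=1).fit_transform(docs)   # sparse tweet-term matrix
embedding = umap.UMAP(n_components=2, n_neighbors=2,  # tiny toy corpus; real settings differ
                      random_state=42).fit_transform(X)

# tweets sharing a topic should form localized clusters if topics are distinct
plt.scatter(embedding[:, 0], embedding[:, 1], c=lda_topic, cmap="tab20", s=8)
plt.title("UMAP projection of TF-IDF vectors, colored by LDA topic")
plt.show()
```

well-separated, same-colored clusters in the scatter plot are the visual signal of "strongly localized and distinct topics" that the paper uses to judge topic quality.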
we think this is important because it highlights how other researchers have identified that government agencies play a critical role in sharing information via twitter to improve situational awareness and disaster response [ ] . our lda models confirm that topics detected by thelwall et al. [ ] and sharma et al. [ ] , who analyzed twitter during a similar period of time, were also identified in our dataset, which emphasized healthcare providers, personal protective equipment such as masks and ventilators, and cases of death. this paper studies five research questions: 1) what high-level trends can be inferred from covid-19 tweets? 2) are there any events that lead to spikes in covid-19 twitter activity? 3) which topics are distinct from each other? 4) how does the speed of retweeting in covid-19 compare to other emergencies, and especially similar infectious disease outbreaks? 5) how do covid-19 networks behave as information spreads? the paper begins with data collection, followed by the five stages of our analysis: keyword trend analysis, topic modeling, umap, time-to-retweet analysis, and network analysis. our methods and results are explained in each section. the paper concludes with the limitations of our analysis, and the appendix provides additional graphs as supporting evidence.

ii. data collection

similar to the researchers in table i, we collected twitter data by leveraging the free streaming api. from march to april, we collected , , ( gb) tweets. note that in this paper we refer to the twitter data interchangeably as both "dataset" and "corpora", and refer to the posts as "tweets". our dataset is a collection of tweets from the different time periods shown in table v. using the twitter api through tweepy, a python twitter mining and authentication api, we first queried the twitter track on twelve query terms to capture a healthcare-focused dataset: 'icu beds', 'ppe', 'masks', 'long hours', 'deaths', 'hospitalized', 'cases', 'ventilators', 'respiratory', 'hospitals', '#covid', and '#coronavirus'. for the keyword analysis, topic modeling, and umap tasks, we analyzed only non-retweets, which brought the corpus down to , , tweets. for the time-to-retweet and network analyses, we included retweets but selected a sample of , tweets out of the larger . million corpus. our preprocessing steps are described in the data analysis section that follows.

prior to applying keyword analysis, we first had to preprocess the corpus on the "text" field. first, we removed retweets using regular expressions, in order to focus the text on original tweets and authorship, as opposed to retweets that can inflate the number of messages in the corpus. we use the non-retweeted corpora for both the keyword trend analysis and the topic modeling and umap analyses. further, we formatted datetimes to utc, removed digits and short words below a minimum character length, extended the nltk stopwords list to also exclude "coronavirus" and "covid", removed "https:" hyperlinks, removed "@" signs for usernames, removed non-latin characters such as arabic or chinese characters, and implemented lower-casing, stemming, and tokenization. finally, using regular expressions, we extracted tweets matching the keyword groups listed in table vi; the frequencies of tweets per minute are reported in table ii.
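a minimal sketch of this preprocessing pipeline, assuming the nltk stopword corpus is available; the character-length threshold and the porter stemmer are assumptions, since the exact threshold and stemmer are not preserved in the text.

```python
import re
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer        # assumption: the stemmer is not named in the text

STOP = set(stopwords.words("english")) | {"coronavirus", "covid"}
STEM = PorterStemmer()

def is_retweet(text: str) -> bool:
    """regex test used to drop retweets and keep original authorship."""
    return bool(re.match(r"^rt @", text.lower()))

def preprocess(text: str) -> list:
    """clean one tweet's "text" field as described above."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove hyperlinks
    text = re.sub(r"@\w+", " ", text)           # remove @usernames
    text = re.sub(r"[^a-z\s]", " ", text)       # keep latin letters; drops digits too
    # assumption: minimum word length of 4 characters (the exact threshold is elided)
    tokens = [t for t in text.split() if len(t) >= 4 and t not in STOP]
    return [STEM.stem(t) for t in tokens]

print(preprocess("Nurses need more PPE and masks at 3 hospitals https://t.co/x"))
```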
the greatest rate of tweets occurred for tweets containing the term "mask" (mean . per minute in table ii), followed by "hospital" (mean . ) and "vent" (mean . ). lower mean tweets per minute came from the groups about testing positive, being in serious condition, exposure, cough, and fever. this may indicate that people are discussing the issues around covid-19 more frequently than symptoms and health conditions in this dataset. we will later find that several themes consistent with these keyword findings are mentioned again in topic modeling, including personal protective equipment (ppe) like ventilators and masks, and healthcare workers like nurses and doctors.

lda models are mixture models, meaning that documents can belong to multiple topics and membership is fractional [ ] . further, each topic is a mixture of words, where words can be shared among topics. this allows for a "fuzzy" form of unsupervised clustering in which a single document can belong to multiple topics, each with an associated probability. lda is a bag-of-words model where each vector is a count of terms, and lda requires the number of topics to be specified. similar to the methods described by syed et al. [ ] , we ran different lda experiments varying the number of topics and selected the model with the highest coherence value score; we selected the lda model that generated topics, with a medium coherence value score of . . roder et al. [ ] developed the coherence value as a metric that aggregates the agreement of word pairs and word subsets, together with their associated word probabilities, into a single score. in general, topics are interpreted as being coherent if all or most of their terms are related. our final model generated topics using the default parameters; the accompanying figure lists the terms generated and each topic's coherence score measuring interpretability. similar to the high-level trends inferred from extracting keywords, themes about ppe and healthcare workers dominate the topics. the terms generated also indicate emerging words in public conversation, including "hydroxychloroquine" and "asymptomatic". our results also show four topics that are in non-english languages. in our preprocessing, we removed non-latin characters in order to filter out a high volume of arabic and chinese characters. twitter provides a tweet object metadata field, "lang", that can be used to filter tweets by a specific language such as english ("en"). however, we decided not to filter on the "lang" element because, upon observation, approximately . % of the dataset consisted of an "undefined" language tag, meaning that no language was indicated. although this appears to be a small fraction, removing even the "undefined" tweets would have removed several thousand tweets, and some tweets tagged as "undefined" are in english but contain hashtags, emojis, and arabic characters. as a result, we did not filter for english, leading our topics to be a mix of english, spanish, italian, french, and portuguese. although this introduced challenges in interpretation, we feel it demonstrates the global nature of worldwide conversations about covid-19 occurring on twitter. this is consistent with the variety of languages that singh et al. [ ] reported in covid-19 tweets upon analyzing over million tweets. as a result, we labeled the four topics by the language of the terms in the respective topics: "spanish", "portuguese", "italian" and "french". we used google translate to infer the language of the terms.
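a minimal sketch of this model-selection loop, assuming tokenized tweets from the preprocessing step; the candidate topic counts are placeholders because the exact range tested is not preserved in the text.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore
from gensim.models.coherencemodel import CoherenceModel

def best_lda(token_lists, topic_counts=range(5, 45, 5), workers=4):
    """fit one lda model per candidate topic count and keep the most coherent one."""
    dictionary = Dictionary(token_lists)
    corpus = [dictionary.doc2bow(toks) for toks in token_lists]
    best = None
    for k in topic_counts:  # placeholder range; the paper's exact candidates are elided
        lda = LdaMulticore(corpus, num_topics=k, id2word=dictionary, workers=workers)
        score = CoherenceModel(model=lda, texts=token_lists, dictionary=dictionary,
                               coherence="c_v").get_coherence()
        if best is None or score > best[0]:
            best = (score, k, lda)
    return best  # (coherence value, topic count, fitted model)
```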
when examining the distribution of the topics across the corpora, the topics "potus", "case.death.new", "mask.ppe.ventil", and "like.look.work" were among the top five in the entire corpora. for each plot, we labeled each topic with its first three terms for interpretability. in our trend analysis, we summed the number of tweets per minute and then applied a moving weighted average, using one window for the march topics and a longer window for the topics from late march to april. we provided two different plots in order to visualize smaller time frames. the two figures show similar trends on a per-minute time-series basis across the entire corpora of , , tweets. these plots are drawn in a "broken axes" style (using the brokenaxes package, https://github.com/bendichter/brokenaxes) to indicate that the corpora are not continuous periods of time, but discrete time frames, which we selected to plot on one axis for convenience and legibility. we direct the reader to table v for the start and end datetimes, which are in utc format, so please adjust accordingly for time zone. the x-axis denotes the number of minutes, where the entire corpora spans the total minutes of tweets. the first figure shows that for the corpora of march, topic "potus" and topic "mask.ppe.ventil" (denoted in hash-marked lines) trended greatest. for the later time periods of march and april in the second figure, topic "potus" and topic "mask.ppe.ventil" (also in hash-marked lines) continued to trend high. it is also interesting that the "potus" topic was never replaced as the top trending topic across a span of days, potentially because it served as a proxy for active government listening. the time series would decrease in frequency during overnight hours.

we applied change point detection to the time series of tweets per minute for the "potus" topic in the march and april datasets to identify whether the live press briefings coincided with inflections in time. using the ruptures python package [ ] , which contains a variety of change point detection methods, we used binary segmentation [ ] , a standard method for change point detection. given a sequence of data $y_{1:n} = (y_1, \dots, y_n)$, the model will have $m$ changepoints with positions $\tau_{1:m} = (\tau_1, \dots, \tau_m)$, where each changepoint position is an integer between $1$ and $n-1$. the $m$ changepoints split the time series into $m+1$ segments, with the $i$th segment containing $y_{(\tau_{i-1}+1):\tau_i}$. changepoints are identified by minimizing a cost of the form

$$\sum_{i=1}^{m+1} C\left(y_{(\tau_{i-1}+1):\tau_i}\right) + \beta f(m),$$

where $C$ is the cost function for a given segment and $\beta f(m)$ is a penalty to prevent overfitting; twice the negative log-likelihood is a commonly used cost function. binary segmentation detects multiple changepoints across the time series by repeatedly testing on different subsets of the sequence. it checks whether a $\tau$ exists that satisfies

$$C(y_{1:\tau}) + C(y_{(\tau+1):n}) + \beta < C(y_{1:n}).$$

if not, then no changepoint is detected and the method stops. if a changepoint is detected, the data are split into two segments consisting of the time series before (blue in the figure) and after (pink) the changepoint. we can clearly see in the figure that the timing of the white house briefing indicates a changepoint, giving us the intuition that the briefing influenced an uptick in the number of tweets. we provide additional examples in the appendix.
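a minimal sketch of the binary segmentation step, assuming a per-minute count series for one topic; the "l2" least-squares cost and the single breakpoint are assumptions, since the exact "l" model and n_bkps value are not preserved in this copy of the text (ruptures also offers "l1" and "rbf" costs).

```python
import numpy as np
import ruptures as rpt

# hypothetical stand-in for the tweets-per-minute series of the "potus" topic
signal = np.loadtxt("potus_tweets_per_minute.txt")

# binary segmentation; "l2" (least-squares) is one of ruptures' standard costs
algo = rpt.Binseg(model="l2").fit(signal)
segments = algo.predict(n_bkps=1)  # assumption: searching for a single changepoint

# predict() returns the end index of each segment, the last being len(signal)
print("changepoint at minute:", segments[0])
```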
our topic findings are consistent with the published analyses of covid-19 and twitter, such as [ ] , who found major themes of healthcare and illness and international dialogue, as we noticed in our four non-english topics. they are also similar to thelwall et al. [ ] , who manually reviewed tweets from a corpus of million tweets occurring earlier than and overlapping our dataset. similar topics between their findings and ours include "lockdown life", "politics", "safety messages", "people with covid-19", "support for key workers", "work", and "covid-19 facts/news". further, our dataset of covid-19 tweets from march to april occurred during a month of exponential case growth; by the end of our data collection period, the number of cases had increased substantially [ ] . the key topics we identified using our multiple methods were representative of the public conversations being had in news outlets during march and april.

term-frequency inverse-document-frequency (tf-idf) [ ] is a weight that signifies how valuable a term is within a document in a corpus, and it can be calculated at the n-gram level. tf-idf has been widely applied for feature extraction on tweets used for text classification [ ] [ ] , analyzing sentiment [ ] , and text matching in political rumor detection [ ] . with tf-idf, unique words carry greater information and value than common, high-frequency words across the corpus. tf-idf can be calculated as follows:

$$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right),$$

where $i$ is the term, $j$ is the document, and $N$ is the total number of documents in the corpus. tf-idf multiplies the term frequency $tf_{i,j}$ by the log of the inverse document frequency $N/df_i$. the term frequency $tf_{i,j}$ is the frequency of $i$ in $j$ divided by the total count of terms in $j$, and the inverse document frequency is the log of the total number of documents $N$ divided by $df_i$, the number of documents containing term $i$. using the scikit-learn implementation of tfidfvectorizer and setting max_features, we transformed our corpus of , , tweets into a sparse matrix in $\mathbb{R}^{n \times k}$. note that prior to fitting the vectorizer, our corpus of tweets was pre-processed during the keyword analysis stage.

we chose to visualize how the topics grouped together using uniform manifold approximation and projection (umap) [ ] . umap is a dimension reduction algorithm that finds a low dimensional representation of data with similar topological properties as the high dimensional space. it measures the local distance of points across a neighborhood graph of the high dimensional data, capturing what is called a fuzzy topological representation of the data. optimization is then used to find the closest fuzzy topological structure by first approximating nearest neighbors using the nearest-neighbor-descent algorithm and then minimizing local distances of the approximate topology using stochastic gradient descent [ ] . when compared to t-distributed stochastic neighbor embedding (t-sne), umap has been observed to be faster [ ] with clearer separation of groups. due to compute limitations in fitting the entire high dimensional matrix of nearly . m records, we randomly sampled one million records. we created a two-component embedding of the vectors, fitting the umap model with the hellinger metric, which compares distances between probability distributions:

$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}.$$

we visualized the word vectors with their respective labels, which were the topics assigned by the lda model.
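a minimal sketch of this vectorize-then-embed pipeline, assuming a file of preprocessed tweets; the max_features value is a placeholder because the exact setting is elided, and n_neighbors/min_dist are left at the library defaults as the text describes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import umap  # umap-learn

# hypothetical loader: stands in for the one-million-tweet random sample
tweets = open("tweets_preprocessed.txt").read().splitlines()

vectorizer = TfidfVectorizer(max_features=5000)  # placeholder max_features value
X = vectorizer.fit_transform(tweets)             # sparse (n_tweets, n_features)

# two components, hellinger metric, library defaults for n_neighbors and min_dist
embedding = umap.UMAP(n_components=2, metric="hellinger").fit_transform(X)
print(embedding.shape)  # (n_tweets, 2), ready to scatter-plot colored by lda topic
```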
we used the library's default parameters for n_neighbors and min_dist. the resulting figure presents the visualization of the tf-idf vectors for each of the million tweets with their labeled topics. umap is intended to preserve both local and global structure of the data, unlike t-sne, which separates groups but does not preserve global structure. as a result, umap visualizations are intended to allow the reader to interpret distances between groups as meaningful. in the figure, each point is color-coded by its respective topic. the umap plots appear to provide further evidence of the quality and number of topics generated. our observation is that many of these topic "clusters" have a single dominant color, indicating distinct grouping. there is strong local clustering for topics that were also prominent in the keyword analysis and the topic modeling time series plots. a very distinct and separated mass of purple tweets represents the "n/a" topic, which is an undefined topic. this means that the lda model output equal scores across all topics for any single one of these tweets; as a result, we could not assign a topic because the scores were uniform. but this visualization informs us that the contents of these tweets were distinct from the others. examples of tweets in this "n/a" category include "see, #democrats are always guilty of whatever", "why are people still getting in cruise ships?!?", "thank you mike you are always helping others and sponsoring anchors media shows.", "we cannot let this woman's brave and courageous actions go to waste! #chinaliedpeopledied #chinaneedstopay", and "i wish people in this country would just stay the hell home instead of going to the beach". other observations reveal that the mask-related topic in purple, and potentially a combination of topics in red, are distinct from the mass of noisy topics in the center of the plot. we can also see distinct separation of the aqua-colored "potus" topic and potentially the topics in yellow. we refer the reader to other examples where umap has been leveraged for twitter analysis, including darwish et al. [ ] for identifying clusters of twitter users with controversial topic similarity, vargas [ ] for event detection, darwish et al. [ ] for political polarization, and [ ] for estimating the political leaning of users.

retweeting is a special activity on twitter whereby any user can "retweet" messages, allowing rapid dissemination to that user's followers. further, a highly retweeted tweet might signal that an issue has attracted attention in the highly competitive twitter environment, and may give insight into issues that resonate with the public [ ] . whereas in the first three analyses we used no retweets, in the time-series and network modeling that follow we exclusively use retweets. we began by measuring time-to-retweet. wang et al. [ ] call this "response time" and used it to measure response efficiency and the speed of information dissemination during hurricane sandy; wang analyzed , tweets and found that % of re-tweets occur within h [ ] . we researched how fast users retweet in other emergency situations, such as what spiro [ ] reported for natural disasters, and the seconds that earle [ ] reported for retweeting about an earthquake. we extracted metadata from our corpora for the tweet, user, and entities objects. for reference, we direct the reader to the twitter developer guide, which provides a detailed overview of each object [ ] .
due to compute limitations, we selected a sample of , tweets, including retweets, from the march corpora. however, since we were only focused on retweets, we reduced this sample of , tweets to the , ( %) that were retweets. the metadata we used for both the time-to-retweet and directed graph analyses in the next section included:

1) created_at (string): the utc time when this tweet was created.
2) text (string): the actual utf-8 text of the status update (see twitter-text for details on which characters are currently considered valid).
3) from the user object, id_str (string): the string representation of the unique identifier for this user.
4) from the retweeted_status object (a tweet): created_at, the utc time when the original, retweeted message was created.
5) from the retweeted_status object (a tweet): id_str, the unique identifier associated with the original message's author.

we used the corpus of retweets and analyzed the time between the tweet's created_at and the retweeted status' created_at. here, rt_object is the utc datetime at which the message that was retweeted was originally posted, and tw_object is the utc datetime at which the current tweet was posted. as a result, the datetime for rt_object is older than the datetime for the current tweet, and the difference measures the time it took the author of the current tweet to retweet the originating message. this is similar to kuang et al. [ ] , who defined the response time of a retweet as the time difference between the time of the first retweet and that of the origin tweet. further, spiro et al. [ ] call these "waiting times". the median time-to-retweet for our corpus was . hours, meaning that half of the retweets occurred within this time (less than what wang reported), and the mean was . hours. one figure shows the histogram of the number of tweets by their time-to-retweet in seconds, and another shows it in hours. further, we found that compared to the avian influenza outbreak (h7n9) in china described by zhang et al. [ ] , covid-19 retweeters sent more messages earlier. zhang analyzed the log distribution of , h7n9-related posts during april and plotted the reposting times of messages on sina weibo, a chinese twitter-like platform and one of the largest microblogging sites in china. zhang found that h7n9 reposting occurred with a median time of minutes and a longer mean. compared to zhang's study, our median retweet time was about minutes faster than the h7n9 reposting time. when comparing the two distributions, it appears that covid-19 retweeting does not completely slow down until hours later, whereas h7n9 reposting appears to slow down much earlier. unfortunately, few studies appear to document retweeting times during infectious disease outbreaks, which made it hard to compare covid-19 retweeting behavior against similar situations. further, the h7n9 outbreak in china occurred seven years earlier and may not be a comparable set of data for numerous reasons: chinese social media may not exhibit behaviors similar to american twitter, and this analysis does not take into account multiple factors that influence retweeting behavior, including the context, the user's position, and the time the tweet was posted [ ] .
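a minimal sketch of this waiting-time calculation, assuming the two created_at values have been flattened into a dataframe; the column names and example timestamps are illustrative.

```python
import pandas as pd

# illustrative data: tw_created is the current tweet's created_at,
# rt_created is the retweeted_status (original message) created_at
df = pd.DataFrame({
    "tw_created": ["2020-03-25 14:10:00+00:00", "2020-03-25 15:00:00+00:00"],
    "rt_created": ["2020-03-25 13:55:00+00:00", "2020-03-25 09:00:00+00:00"],
})
df["tw_created"] = pd.to_datetime(df["tw_created"], utc=True)
df["rt_created"] = pd.to_datetime(df["rt_created"], utc=True)

# time-to-retweet: current posting time minus original posting time
df["ttr_seconds"] = (df["tw_created"] - df["rt_created"]).dt.total_seconds()
print("median time-to-retweet (hours):", df["ttr_seconds"].median() / 3600)
```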
we also analyzed what rapid retweeters, those retweeting messages even faster than the median (in less than , seconds), were saying, plotting the top tf-idf features for the text of the retweets ranked by score. it is intuitive to see that urls are being retweeted quickly, given the presence of "https" in the body of the retweeted text. this is also consistent with studies by suh et al. [ ] , who indicated that tweets with urls were a significant factor impacting retweetability. we found terms that were frequently mentioned during the early-stage keyword analysis and topic modeling mentioned again: "cases", "ventilators", "hospitals", "deaths", "masks", "test", "american", "cuomo", "york", "president", "china", and "news". when analyzing the descriptions of the users who were retweeted, we ran the tf-idf vectorizer on bigrams in order to elicit more interpretable terms. user accounts whose tweets were rapidly retweeted appeared to describe themselves as political, news-related, or some form of social media account, all of which are difficult to verify as real or fake.

vii. network modeling

we analyzed the network dynamics of nine different time periods within the march covid-19 dataset and visualized them based on their speed of retweeting. these types of graphs have been referred to as "retweet cascades", which describe how a social media network propagates information [ ] . similar methods have been applied for visualizing rumor propagation by jin et al. [ ] . we wanted to analyze how covid-19 retweeting behaves at different time points, and we used published disaster retweeting times as benchmarks for selecting the time periods. as a result, the graphs are plotted by the retweeting times of known benchmarks: from the median time to retweet after an earthquake, which implies rapid notification, through the median time to retweet after a funnel cloud has been seen, all the way out to a one-day time period. we did this to visualize a retweet cascade from fast to slow information propagation. we used the median retweeting times published by spiro et al. [ ] for the time it took users to retweet messages containing hazard keywords like "funnel cloud", "aftershock", and "mudslide". we also used the h7n9 reposting time that zhang et al. [ ] published. we generated a directed graph for each of the nine time periods, where the network consisted of a source, the author of the tweet (user object id_str), and a target, the original retweeter, as shown in table iv. the goal was to analyze how connections change as the retweeting window lengthens. graphs were plotted using networkx and drawn using the kamada-kawai layout [ ] , a force-directed algorithm. we modeled a limited number of users for each graph, as we found that more nodes became too difficult to interpret. the size of a node indicates its number of degrees, that is, the number of users it is connected to; a large node can mean that the user was retweeted by others several times, or that the user itself retweeted others several times. the density of each network increases over time, as shown in the network figures.
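a minimal sketch of one retweet-cascade graph, assuming an edge list of (author, retweeter) pairs filtered to a given time-to-retweet window; the pairs shown are hypothetical.

```python
import networkx as nx
import matplotlib.pyplot as plt

# hypothetical (source, target) pairs: original author -> retweeting user
edges = [("user_a", "user_b"), ("user_a", "user_c"), ("user_d", "user_a")]

G = nx.DiGraph()
G.add_edges_from(edges)

# node size scaled by degree, mirroring the description above
sizes = [100 * (G.degree(n) + 1) for n in G.nodes()]
pos = nx.kamada_kawai_layout(G)  # force-directed layout named in the text
nx.draw(G, pos, node_size=sizes, with_labels=True, arrows=True)
print("density:", nx.density(G))
plt.show()
```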
very rapid retweeters, within the time it takes to retweet after an earthquake, start off as a sparse network with a few nodes in the center being the focus of retweets (panel a). by panel d, the retweeted users are much more clustered in the center and there are more connections and activity. the top retweeted user in our median-time network (panel g) was a news network that tweeted "the team took less than a week to take the ventilator from the drawing board to working prototype, so that it can". by hours out, in panel h, we see a concentrated set of users being retweeted, and by panel i one account appears to dominate the space, being retweeted many times. this account was retweeting the following message several times: "she was doing #chemotherapy couldn't leave the house because of the threat of #coronavirus so her line sisters...". in addition, the number of nodes generally decreased from "earthquake" time to one week, and the density generally increased, as shown in table iv. these retweet cascade graphs provide only an exploratory analysis. network structures like these have been used to predict the virality of messages, for example memes over time as a message diffuses across networks [ ] . but analyzing them further could enable 1) an improved understanding of how covid-19 information diffusion differs from other outbreaks or global events, 2) insight into how information is transmitted differently from region to region across the world, and 3) knowledge of which users and messages are concentrated on over time. this would support strategies to improve government communications, emergency messaging, the dispelling of medical rumors, and the tailoring of public health announcements.

there are several limitations to this study. first, our dataset is discontinuous, and trends spanning an interruption in time should be taken with caution: although there appears to be a trend between one discrete time and another, without the missing data it is impossible to confirm it as a trend. as a result, it would be valuable to apply these techniques to a larger and continuous corpus without any time breaks, and we aim to repeat the methods of this study on a longer continuous stream of twitter data in the near future. next, the corpus we analyzed was already pre-filtered with thirteen "track" terms from the twitter streaming api that focused the dataset towards healthcare-related concerns. this may be the reason why the high-level keywords extracted in the first round of analysis were consistently mentioned throughout the different stages of modeling. however, after reviewing the similar papers indicated in table i, we found that despite having filtered the corpus on healthcare-related terms, our topics still appear to be consistent with analyses whose corpora were filtered on limited terms like "#coronavirus". third, the users and conversations on twitter are not a direct representation of the u.s. or global population. the pew research foundation found that only % of american adults use twitter [ ] and that this group differs from the majority of u.s. adults: they are on average younger, more likely to identify as democrats, more highly educated, and possess higher incomes [ ] . the users were also not verified and should be considered a possible mixture of human and bot accounts. fourth, we reduced our corpus by removing retweets for the keyword and topic modeling analyses, since retweets can obscure the message by introducing virality and altering the perception of the information [ ] . as a result, this reduced the size of our corpus by nearly % from , , tweets to , , tweets. however, there appears to be variability in corpora sizes across the twitter analysis literature, as seen in table i. fifth, our compute limitations prohibited us from analyzing a larger corpus for the umap, time-series, and network modeling. for the lda models, we leveraged the gensim ldamulticore model, which allowed us to use multiprocessing across workers.
but for umap and the network modeling, we were constrained to use a cpu; however, as stated above, visualizing more nodes in our graph models was uninterpretable, and applying our methods across the entire . million corpora for umap and the network models may yield more meaningful results. sixth, we were only able to iterate over different lda models by changing the number of topics, whereas syed et al. [ ] iterated over many models to select coherent ones. we believe that applying a manual gridsearch of the lda parameters, such as iterations, alpha, gamma threshold, chunksize, and number of passes, would lead to a more diverse set of lda models and possibly more coherent topics. seventh, it was challenging to identify papers that analyzed twitter networks according to their speed of retweets for public health emergencies and disease outbreaks. zhang et al. [ ] point out that there are not enough studies of the temporal measurement of public response to health emergencies. we were lucky to find the papers by zhang et al. [ ] and spiro et al. [ ] on disaster waiting times. chew et al. [ ] and szomszor et al. [ ] have published twitter analyses of h1n1 and the swine flu, respectively: chew analyzed the volume of h1n1 tweets and categorized different types of messages such as humor and concern, while szomszor correlated tweets with uk national surveillance data. tang et al. [ ] generated a semantic network of tweets on measles during an outbreak to understand keywords mentioned about news updates, public health, vaccines, and politics. however, it was difficult to compare our findings against other disease outbreaks due to the lack of similar modeling and published retweet cascade times and network models.

we answered five research questions about covid-19 tweets during march to april. first, we found high-level trends that could be inferred from keyword analysis. second, we found that live white house coronavirus briefings led to spikes in the "potus" topic. third, using umap, we found strong local "clustering" of topics representing ppe, healthcare workers, and government concerns; umap allowed for an improved understanding of the distinct topics generated by lda. fourth, we used retweets to calculate the speed of retweeting, finding a median retweeting time of . hours. fifth, using directed graphs, we plotted the networks of covid-19 retweeting communities from rapid to longer retweeting times; the density of each network increased over time as the number of nodes generally decreased. lastly, we recommend trying all the techniques indicated in table i to gain an overall understanding of covid-19 twitter data. while multiple methods make for an exploratory strategy, there is no technical guarantee that the same combination of five methods analyzed in this paper will yield insights on a different time period of data. as a result, researchers should attempt multiple techniques and draw on the existing literature.

appendix. change point models were calculated using the ruptures python package. we also applied exponential weighted moving average smoothing using the pandas ewm function, with one span for the march dataset and another span for the april datasets. our parameters for binary segmentation included selecting an "l"-norm cost model to fit the points for the "potus" topic, using n_bkps breakpoints.
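a minimal sketch of the smoothing step named in the appendix; the span value is a placeholder, since the exact spans used per dataset are not preserved in the text.

```python
import pandas as pd

# hypothetical per-minute counts for one topic
counts = pd.Series([3, 8, 5, 12, 30, 22, 9])

# exponentially weighted moving average; span=10 is a placeholder value
smoothed = counts.ewm(span=10).mean()
print(smoothed.round(2).tolist())
```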
references:
crisis information distribution on twitter: a content analysis of tweets during hurricane sandy
evaluating public response to the boston marathon bombing and other acts of terrorism through twitter
twitter tsunami early warning network: a social network analysis of twitter information flows
twitter earthquake detection: earthquake monitoring in a social world
a case study of the new york city influenza season with daily geocoded twitter data from temporal and spatiotemporal perspectives
what can we learn about the ebola outbreak from tweets?
covid-19: the first public coronavirus twitter dataset
retweeting for covid-19: consensus building, information sharing, dissent, and lockdown life
a first look at covid-19 information and misinformation sharing on twitter
coronavirus on social media: analyzing misinformation in twitter conversations
the covid-19 social media infodemic
using twitter and web news mining to predict covid-19 outbreak
a large-scale covid-19 twitter chatter dataset for open scientific research: an international collaboration
an "infodemic": leveraging high-volume twitter data to understand public sentiment for the covid-19 outbreak
understanding the perception of covid-19 policies by mining a multilanguage twitter dataset
coronavirus goes viral: quantifying the covid-19 misinformation epidemic on twitter
how the world's collective attention is being paid to a pandemic: covid-19 related -gram time series for languages on twitter
an early look on the emergence of sinophobic behavior on web communities in the face of covid-19
prevalence of low-credibility information on twitter during the covid-19 outbreak
coronavis: a real-time covid-19 tweets analyzer
exploring the space of topic coherence measures
detection and analysis of us presidential election related rumors on twitter
analysis of twitter users' sharing of official new york storm response messages
latent dirichlet allocation
full-text or abstract? examining topic coherence scores using latent dirichlet allocation
selective review of offline change point detection methods
optimal detection of changepoints with a linear computational cost
get your mass gatherings or large community events ready
trump says fda will fast-track treatments for novel coronavirus, but there are still months of research ahead
the white house: presidential memoranda
using tf-idf to determine word relevance in document queries
twitter trending topic classification
predicting popular messages in twitter
opinion mining and sentiment polarity on twitter and correlation between events and sentiment
umap: uniform manifold approximation and projection for dimension reduction
how umap works
understanding umap
unsupervised user stance detection on twitter
event detection in colombian security twitter news using fine-grained latent topic analysis
predicting the topical stance of media and popular twitter users
bad news travel fast: a content-based analysis of interestingness on twitter
waiting for a retweet: modeling waiting times in information propagation
omg earthquake! can twitter improve earthquake response?
introduction to tweet json (twitter developers)
predicting the times of retweeting in microblogs
social media as amplification station: factors that influence the speed of online public response to health emergencies
want to be retweeted? large scale analytics on factors impacting retweet in twitter network
an algorithm for drawing general undirected graphs
virality prediction and community structure in social networks
share of u.s. adults using social media, including facebook, is mostly unchanged since
how twitter users compare to the general public
retweets are trash
characterizing diabetes, diet, exercise, and obesity comments on twitter
comparing twitter and traditional media using topic models
empirical study of topic modeling in twitter
characterizing twitter discussions about hpv vaccines using topic modeling and community detection
topic modeling in twitter: aggregating tweets by conversations
twitter-network topic model: a full bayesian treatment for social network and text modeling
pandemics in the age of twitter: content analysis of tweets during the h1n1 outbreak
tweeting about measles during stages of an outbreak: a semantic network approach to the framing of an emerging infectious disease
software framework for topic modelling with large corpora

lda model parameters (appendix figure legend; topic labels): patient; china.thank.lockdown; case.spread.slow; day.case.week; test.case.hosp; die.world.peopl; mask.face.wear; make.home.stay; hospit.nurs.le; case.death.new; mask.ppe.ventil; portuguese; case.death.number; italian; great.god.news; potus; spanish; like.look.work; hospit.realli.patient

the authors would like to acknowledge john larson from booz allen hamilton for his support and review of this article. topic modeling was performed with the gensim framework [ , ] , which provides four different coherence metrics; we used the "c_v" coherence metric developed by roder [ ] . coherence metrics are used to rate the quality and human interpretability of a generated topic. all models were run with the default parameters, using an ldamulticore model with parallel computing across workers and the default gamma threshold, chunksize, iterations, and number of passes. note: sudden decreases in the figure signals may be due to temporary internet disconnection.

key: cord- - k j kjv authors: kawchuk, greg; hartvigsen, jan; harsted, steen; nim, casper glissmann; nyirö, luana title: misinformation about spinal manipulation and boosting immunity: an analysis of twitter activity during the covid-19 crisis date: - - journal: chiropr man therap doi: . /s - - - sha: doc_id: cord_uid: k j kjv background: social media has become an increasingly important tool for monitoring the onset and spread of infectious diseases globally, as well as for monitoring the spread of information about those diseases. this includes the spread of misinformation, which has been documented within the context of the emerging covid-19 crisis. understanding the creation, spread and uptake of social media misinformation is of critical importance to public safety. in this descriptive study, we detail twitter activity regarding spinal manipulative therapy (smt) and claims that it increases, or "boosts", immunity. spinal manipulation is a common intervention used by many health professions, most commonly by chiropractors. there is no clinical evidence that smt improves human immunity. methods: social media searching software (talkwalker quick search) was used to describe twitter activity regarding smt and improving or boosting immunity. searches were performed for the months and months before march, using terms related to 1) smt, 2) the professions that most often provide smt and 3) immunity.
from these searches, we determined the magnitude and time course of twitter activity, then coded this activity into content that promoted or refuted a smt/immunity link. content themes, high-influence users and user demographics were then stratified as either promoting or refuting this linkage. results: twitter misinformation regarding a smt/immunity link increased dramatically during the onset of the covid-19 crisis. activity levels (number of tweets) and engagement scores (likes + retweets) were roughly equal between content promoting or refuting a smt/immunity link; however, the potential reach (audience) of tweets refuting a smt/immunity link was times higher than that of those promoting a link. users with the greatest influence on twitter, as either promoters or refuters, were individuals, not institutions or organizations. the majority of tweets promoting a smt/immunity link were generated in the usa, while the majority of refuting tweets originated from canada. conclusion: twitter activity about smt and immunity increased during the covid-19 crisis. results from this work have the potential to help policy makers and others understand the impact of smt misinformation and devise strategies to mitigate its impact.

more than half of all persons on earth ( . %) are estimated to now have regular internet access, with % in low-middle income countries and . % in high income countries [ ] . with this level of penetration, the internet is the most influential tool on earth for distributing information, whether accurate or otherwise. therefore, understanding the creation, spread and uptake of internet misinformation is of critical importance [ ] , given that misinformation can be given credibility and create negative impacts [ , ] . social media has been used in recent decades to anticipate various health events, including the spread of infectious disease [ ] and new cases of back pain [ ] . with recent advances in social media analytics, it is now possible not only to apply these tools to anticipate the onset and spread of various health conditions, but also to identify the onset and spread of information about those conditions. specifically, various studies have been conducted that show how social media can be used in this regard [ ] , how social media is consumed [ ] and how it can be used to set agendas [ , ] . importantly, social media is not always a positive force. many publications now document how social media can create and disseminate misinformation [ ] [ ] [ ] [ ] . even in the short time since the covid-19 crisis was declared a pandemic in march [ ] , several publications have documented various types of misinformation arising during the covid-19 crisis [ ] [ ] [ ] , including potential treatments, methods of prevention and protection, dietary recommendations and disease transmission [ ] . while all misinformation is concerning, the public does not expect misinformation to be propagated by regulated health professions whose activities are overseen for public protection. unfortunately, this has not been the case during the covid-19 outbreak.
claims that personal immunity can be improved or "boosted" through spinal manipulative therapy (axén i, bergström c, bronson m, côté p, glissman cn, goncalves g, hebert j, hertel aj, innes s, larsen ko, meyer a, perle sm, o'neill s, weber k, young k, leboeuf-yde c: putting lives at risk: misinformation, chiropractic and the covid-19 pandemic, in submission), an intervention applied by many professions but most commonly by chiropractors [ ] , appeared on social media as the covid-19 crisis evolved. not only is there no clinical evidence for this claim [ ] , but major organizations representing those who provide smt reacted immediately to condemn the promotion of this idea as potentially dangerous to public health [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . in this descriptive study, we detail how twitter activity can be used not only to document the magnitude and time course of misinformation describing a link between spinal manipulative therapy (smt) and boosting immunity, but also how social media activity promotes or refutes these claims. specifically, our study aimed to answer the following research questions: has twitter activity describing a relation between smt and "boosting" immunity increased during the covid-19 crisis? what is the magnitude and engagement of twitter activity that promotes or refutes a smt/immunity link? does twitter activity differ between the health professions mentioned in relation to smt and immunity? what are the demographics (i.e. language, country) of twitter authors who promote or refute a smt/immunity link? we anticipate that knowledge gained from answering these questions will be important not only for predicting future internet misinformation about smt, but also for preventing and/or mitigating its impact.

social media searching was performed using talkwalker quick search (luxembourg, luxembourg). similar to tools used for searching the health literature (e.g. embase), talkwalker performs searches of specific internet content, including social media platforms, news agencies, forums and blogs. talkwalker's functionality allows searching to be limited to specific content sources, date ranges, electronic devices and many other parameters using standard boolean syntax. analysis of search results can be performed in several ways, including descriptive metrics generated by talkwalker using existing data (e.g. sex distribution), derived metrics generated by talkwalker using artificial intelligence algorithms (e.g. sentiment), and user-generated metrics obtained by downloading raw search results directly into other software (e.g. excel, spss). for this project, talkwalker searches were performed exclusively on twitter for the months before march. twitter was searched preferentially for the following reasons. first, the entirety of twitter is searchable (except for direct messaging, which is a private discussion between twitter users), compared to sources such as facebook, whose users must purposefully make their activity available for searching. second, twitter is a one-to-one communication model where direct dialogue is possible between all users, compared to news media, where unbalanced communication occurs through a one-to-many model. finally, twitter activity is unmoderated, creating the potential for a full range of conversation (except for content excluded by twitter's rules and policies).
our primary search was constructed of three main components using boolean syntax: 1) [procedure] terms related to smt, 2) [profession] terms related to the professions most often associated with smt, and 3) an immunity term [immun*]. in this study, we limited the professions to those that most often provide smt (chiropractic, physiotherapy, naturopathy, osteopathy and naprapathy). no additional filters were used (e.g. language). procedure terms included wildcard representations of words commonly used to describe smt, including manipulation, adjustment and smt. profession terms included wildcard representations of chiropractic, physical therapy, naturopathy, osteopathy and naprapathy. as talkwalker lacks the ability to perform boolean operations between searches (i.e. union, intersection, difference), we performed additional searches to explore how the search terms contributed to the primary search: two searches to understand the respective impact of procedures and professions on the main search, further searches to understand whether procedure terms occurred more frequently for specific professions, and additional searches to understand how individual professions were linked specifically to immunity. finally, the primary search was performed again for the months before march, as this is the longest period talkwalker can search backwards in time (not listed in the table). the above searches identified tweets that contained the search terms in the body of the tweet as words and/or hashtags (e.g. #chiropractic). for each individual tweet identified, multiple attributes describing its content were provided, including date, creator, content, country of origin, language, likes, retweets, followers etc. a glossary of twitter-related terms such as #hashtag can be found in the table below.
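a minimal sketch of how the three-component boolean query could be assembled; the wildcard terms shown are illustrative assumptions based on the examples named above, not the study's exact term lists.

```python
# illustrative wildcard terms based on the examples named in the text
procedure_terms = ["manipulat*", "adjust*", "smt"]
profession_terms = ["chiropract*", "physiotherap*", "physical therap*",
                    "naturopath*", "osteopath*", "naprapath*"]
immunity_term = "immun*"

def or_group(terms):
    """join terms into a parenthesized boolean or-group, quoting phrases."""
    return "(" + " OR ".join(f'"{t}"' if " " in t else t for t in terms) + ")"

# primary search: procedures AND professions AND immunity
query = " AND ".join([or_group(procedure_terms),
                      or_group(profession_terms),
                      immunity_term])
print(query)
```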
the above searches resulted in mentions over time that were then tallied and plotted. tweets arising from the primary search were first coded for their tone using the twitter tone index (tti). the tti is a nominal index constructed for the purpose of this paper from a training set of tweets, resulting in four coding options: 1) promoting a relation between smt, and/or a profession providing smt, and improved immunity, 2) refuting that same relation, 3) neutral content, or 4) irrelevant content. this sample of tweets was scored independently by four evaluators (ln, sh, cn, jw) to calibrate their use of the tti. this calibration resulted in % of tweets having at least three raters in agreement, and a fleiss kappa score of . , interpreted as 'almost perfect agreement' [ ] . these same evaluators then independently assessed each tweet arising from the primary search using the tti; tweets not having a sufficient number of evaluators in agreement were discussed to agree on a majority tti rating, and unresolved ties were broken by a fifth evaluator (gk). additionally, the sentiment score of each tweet was determined by a proprietary talkwalker artificial intelligence algorithm that scores tweets using positive or negative integers; the sentiment score for a topic is a rolling sum of the sentiment values of its individual tweets. following tti scoring, the four evaluators (ln, sh, cn, jw) individually scored tweets arising from the primary search for the professions mentioned within each tweet (chiropractic, physical therapy, naturopathy, osteopathy, naprapathy). tweets that did not mention a relevant profession were coded as "none mentioned". again, tweets without sufficient evaluator agreement were discussed to agree on a majority rating, with unresolved ties broken by the fifth evaluator (gk). importantly, it was only possible to code whether tweets mentioned a profession; it was not possible to determine if or how the author was associated with a specific profession. the content of all tweets obtained from the primary search was pooled, analyzed for word frequency by a public website [ ] , then separated by tti value (promoting or refuting). influencers were considered to be tweet authors having an engagement score (retweets + likes) greater than zero; tweets from each author were segregated by their tti value and sorted by engagement score. finally, descriptive statistics were derived for each twitter user, including language and country of origin using geographical coordinates.
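a minimal sketch of the inter-rater agreement calculation used in the calibration above, assuming a matrix of tti codes with one row per tweet and one column per evaluator; statsmodels supplies the fleiss kappa implementation, and the ratings shown are hypothetical.

```python
import numpy as np
from statsmodels.stats import inter_rater as irr

# hypothetical ratings: rows are tweets, columns are the four evaluators,
# values are tti codes (1 promoting, 2 refuting, 3 neutral, 4 irrelevant)
ratings = np.array([
    [1, 1, 1, 3],
    [2, 2, 2, 2],
    [4, 4, 3, 4],
])

# convert rater-per-column data to category counts per tweet, then score agreement
counts, categories = irr.aggregate_raters(ratings)
kappa = irr.fleiss_kappa(counts, method="fleiss")
print("fleiss kappa:", round(kappa, 2))
```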
from all mentions of professions ( ), chiropractic was mentioned most often ( ( %)) compared to naturopathy ( ( %)). tweets mentioning chiropractic had a potential reach of , , twitter users with a total engagement of and a total sentiment score of − while for naturopathy, the potential reach was , with a total engagement of and a total sentiment score of + . when analyzing mentions of profession for tweets that either promoted ( mentions) or refuted ( procedures are terms related to smt where health professions include chiropractic, physiotherapy, naturopathy, osteopathy and naprapthy mentions) a link between smt and immunity, chiropractic was mentioned / times ( %) in promoting tweets and / times ( %) in refuting tweets. naturopathy was the next-most mentioned profession with / ( %) mentions in promoting tweets and / ( %) mentions in refuting tweets. the major themes (frequent words) contained within the tweets from search # are presented in table . terms related to chiropractic and the term "boost" were the most common themes with "evidence" mentioned only in the refuting themes. the expression "adjustment" was used more frequently than the expression "manipulation" or "spinal manipulation". in total, there were twitter authors having engagement scores of > for the study period. table stratifies these authors into those creating promoting or refuting tweets. while total engagement was similar between both these groups, the potential reach in the refuting group was . times larger. demographics from search # were segregated by tti value (promoting or refuting) and are displayed in table . for both promoting and refuting tweets, the majority of authors were male. english was the predominant language. the country of origin differed between promoting and refuting tweets. tweets promoting a link between spinal manipulation and immunity were created most often in the united states. canada generated the greatest number tweets refuting this link (table ) . figures and were plotted using longitude and latitude data associated with each tweet. this paper presents the novel finding that twitter misinformation regarding a smt/immunity link increased dramatically during the onset of the covid crisis. further, activity levels and engagement were roughly equal between tweets promoting a smt/immunity link and tweets refuting this claim. interestingly, the potential audience (reach) of tweets refuting these claims was times higher than those promoting these claims. the majority of search results (i.e. mentions) from search # were coded as not relevant on the tti and did not mention a specific profession ( ( %)). combined with tweets having a neutral tone ( ( %)), the vast majority of mentions from search # were not relevant to our analysis. while our search terms could have been made more restrictive to reduce this number of irrelevant mentions (e.g. using "spinal manip*"), we preferred to err on the side of having too many search results that were then coded by our team rather than construct too narrow a search that potentially missed relevant tweets. clearly, twitter mentions about a smt/immunity link increased during the onset of the covid- crisis with peak activity being almost x higher on march , ( . k mentions) compared to any other peak activity in the prior months (e.g. september , , . k mentions). this suggests that mentions during the covid- crisis were intentional and not an aberration of baseline activity. 
to further assess baseline twitter activity, we evaluated the second largest peak of mentions in the preceding months (september , , . k mentions). this activity consisted almost entirely of twitter content unrelated to the aims of the paper. however, our analysis did reveal a smaller activity peak on october , that appeared to be related to an automated message delivered from a web content subscription service. "chiropractic care can improve your immune system, mobility, strength, and so much more. if you want to see a positive change in your health, schedule an appointment with us". this specific tweet appeared in / unique tweets on october , within hours of each other. these tweets generated a total potential reach of users and an engagement score of (retweets + likes). in contrast, a single tweet in the same time period that refuted this message generated a potential reach of users with an engagement score of . when a tweet is made, it automatically goes out to all persons who follow (i.e. subscribe) the author's twitter account. while sometimes the potential reach of that author is in the thousands or even millions, there is no guarantee that their followers open their device and see the tweet let alone read it. therefore, the number of followers, or the potential reach of an author is a measure of the potential impact of a tweet. in contrast, if someone acknowledges a tweet by giving it a like or retweeting it (i.e. rebroadcasting it to their own followers), this confirms that the original tweet was both read and acknowledged indicating a true interaction between users. considering this, tweets that refute a smt/immunity link had almost times the potential reach compared to those that promoted this link although the engagement between these two groups was similar. this is an important finding as it suggests that promoting tweets create as much engagement as refuting tweets but with the important note that refuting tweets have the potential of reaching many more persons with their message. still, it is highly likely that the engagement and potential reach of promoting and refuting tweets have differing audiences who are unaligned in their belief systems about smt and immunity [ ] . regarding professions mentioned in tweets, our coding revealed that our initial wildcard search terms for physiotherapy and osteopathy were too broad resulting in tweets having topics related to physiology and osteology for example. following coding to eliminate these tweets, chiropractic was the profession most often referenced with times more mentions than the next profession (naturopathy). these data suggest that the majority of twitter activity regarding a smt/immunity link is associated with the chiropractic profession with the total number of posts being roughly equal between those promoting and those refuting this link. tweet themes do not appear to be a good indicator of the impact of specific content as the frequency of the theme is not related to the potential reach or engagement associated with the message; an infrequent theme may be posted in a tweet with far greater reach and engagement than higher ranked themes with lower reach and engagement. the top influencers for tweets promoting and refuting a smt/immunity link each had engagement scores that were~ points higher than the next influencer. this shows influence distribution is not equal within each group. even more so, top influencers appear to be individuals and not academic institutions, regulatory bodies or professional organizations. 
the majority of those promoting or refuting an smt/immunity link were male and english speakers. interestingly, tweets promoting an smt/immunity link most commonly originated in the united states. although tweets were rarely affiliated with specific institutions, we note that the majority of chiropractic, naturopathic and osteopathic schools in the world are in the united states. in contrast, the majority of tweets refuting an smt/immunity link were from canada, which suggests that geographic proximity between countries is not a factor in establishing a position on this topic. these data likely reflect the distribution of twitter use around the world: the united states is the number one user of twitter, with japan in second place and canada in th place [ ]. results from this work have the potential to help policy makers and others understand the impact of smt misinformation and devise strategies to mitigate its impact. specifically, our results suggest that while the potential reach of messaging that refutes misinformation about smt was substantial, very few institutions added to this total. assuming that most institutions related to smt stand to gain from combating misinformation about smt (educational programs, associations, regulators, health care administrators, etc.), these same institutions should re-evaluate their social media strategies, lest their silence be taken as complicity with misinformation or lead to an erosion of public trust in them. the results reported here differ from those presented previously by investigators who explored chiropractic messaging on twitter in december of [ ]. in this prior work, tweets refuting claims about questionable benefits from smt, including changes in immunity, appeared to be proportionally fewer than those promoting such claims. possible explanations for these incongruent results include the methodologies used, the year/month of data collection, and an increasing awareness of social media misinformation, especially during the covid-19 crisis. while talkwalker can assess other electronic data sources, only twitter provides full access to its "firehose", the entirety of its activity except for direct messaging between users (a private channel of communication). as a result, the data from this paper are presumed to be robust in that they represent all activity taking place on a single social media platform, although search results from talkwalker have not been compared against other services/techniques for accessing twitter data. although twitter provides a window into conversations within a social media community, it is limited in that it does not represent all persons in the world; presently, twitter ranks th in total monthly users, and facebook has . billion active monthly users compared to twitter's million [ ]. some of the data used in this study were obtained from proprietary algorithms available in talkwalker quick search, whose methods of calculation were not available to us (e.g. sentiment scores).
similarly, talkwalker quick search uses artificial intelligence to derive some demographic information not directly included in twitter user profiles (age, occupation and interests). these proprietary user-profiling metrics were not used in our analysis. twitter activity regarding misinformation about spinal manipulation and immunity increased above baseline levels during the covid-19 crisis. direct twitter activity (posts, likes, retweets, engagement) was similar between tweets promoting and refuting an smt/immunity link. importantly, tweets refuting an smt/immunity link had the potential to be viewed by times more people than tweets promoting this link. whether promoting or refuting in tone, the chiropractic profession was mentioned most often in tweets compared to other professions associated with smt provision. results from this work have the potential to help policy makers and others understand the impact of smt misinformation and devise strategies to mitigate its impact.
references:
- in: international telecommunication union (itu)
- public health and online misinformation: challenges and recommendations
- expressions of pro- and anti-vaccine sentiment on youtube
- twitter, #alternativefacts, careless whispers and rheumatology
- applications of google search trends for risk communication in infectious disease management: a case study of covid-19 outbreak in taiwan
- tweeting back: predicting new cases of back pain with mass social media data
- twitter as a tool for health research: a systematic review
- are public health organizations tweeting to the choir? understanding local health department twitter followership
- social media and flu: media twitter accounts as agenda setters
- micro agenda setters: the effect of social media on young adults' exposure to and attitude toward news. social media society
- how organisations promoting vaccination respond to misinformation on social media: a qualitative investigation
- how do we respond to the challenge of vaccine misinformation? perspect public health
- russiagate and propaganda: disinformation in the age of social media. routledge
- fact or fiction? misinformation and social media in the era of covid-19: an infodemiology study (preprint)
- who director-general's opening remarks at the media briefing on covid-19. in: who [internet]
- coronavirus: the spread of misinformation
- blocking information on covid-19 can fuel the spread of misinformation
- covid-19: emerging compassion, courage and resilience in the face of misinformation and adversity
- covid-19 related misinformation on social media: a qualitative study from iran
- epidemiology: spinal manipulation utilization
- the effect of spinal adjustment / manipulation on immunity and the immune system: a rapid review of relevant literature
- rcc research bulletins - spinal manipulation and the immune system
- vetenskapliga rådets kommentar på ica-dokumentet [the scientific council's comment on the ica document]
- the canadian chiropractic association
- false and misleading advertising on covid-19
- covid-19: false and misleading advertising
- urgent covid-19 statement
- let's work together to protect and serve our patients, staff, families and communities
- talkwalker
- the measurement of observer agreement for categorical data
- online-utility.org
- how health care professionals use social media to create virtual communities: an integrative review
- statistica [internet]
- chiropractic and spinal manipulation therapy on twitter: case study examining the presence of critiques and debates
- social media users by platform
acknowledgments. the authors would like to thank the staff at talkwalker for their technical support and expertise, khilesh jairamdas for his assistance with our figures, and jessica wong (jw) for her feedback. all authors (gk, jh, sh, cn, ln) developed, wrote, edited and proofread this work. the author(s) read and approved the final manuscript. funding: funds from the canadian chiropractic research foundation were used to purchase talkwalker access. availability of data and materials: all data generated or analysed during this study are included in this published article. ethics approval and consent to participate: approval for this project was provided by the university of alberta human research ethics board (pro ). not applicable.
key: cord- - a flu authors: pandya, abhinay; oussalah, mourad; kostakos, panos; fatima, ummul title: mated: metadata-assisted twitter event detection system date: - - journal: information processing and management of uncertainty in knowledge-based systems doi: . / - - - - _ sha: doc_id: cord_uid: a flu
due to its asynchronous message-sharing and real-time capabilities, twitter offers a valuable opportunity to detect events in a timely manner. existing approaches to event detection have mainly focused on building a temporal profile of named entities and detecting unusually large bursts in their usage to signify an event. we extend this line of research by incorporating external knowledge bases such as dbpedia and wordnet, and by exploiting specific features of twitter for efficient event detection. we show that our system, utilizing temporal, social, and twitter-specific features, yields improvements in precision, recall, and derate on the benchmarked events2012 corpus compared to the state-of-the-art approaches. social media serve as an important social sensor capturing the zeitgeist of society. with real-time, online, asynchronous message-sharing supporting text, audio, video and images, these platforms offer valuable opportunities to detect events such as natural disasters and terrorist activity in a timely manner. twitter is certainly a leader among microblogging platforms.
with over . billion users, about million messages (called tweets) posted per day, and over billion api calls per day, twitter provides a massive source for detecting event occurrences as they happen. tweets are restricted to a maximum of characters in length, and it is this design choice that makes information sharing extremely fast and real-time. owing to this scale and speed of information updates, twitter is a natural focus of attention for new event detection. also, twitter's social network features allow user interactions, which further help in gaining insight into an event's reception by people (by analysing the sentiments/opinions expressed in their tweets). while twitter analytics in general may entail a comprehensive multi-modal approach, for example to harness relevant information from media (images, videos, audio), we argue that since the textual part of the message is authored by the creator of the message, it is a more reliable, authentic, and credible personal source for gaining insight into the information shared. however, while analyzing language is in itself challenging because of the inherent lack of structure in its expression, twitter exacerbates this by supporting short message text, slang and emojis, user-created meta-information tags (hashtags), and urls. traditional methods in natural language processing (nlp) were designed to work with large discourses and perform poorly on social media text. this is owing to the following:
- tweets are short, mainly because of the restrictions imposed by the underlying platform (i.e. twitter), but sometimes also under the influence of the cultural norms of cyberspace;
- tweets are often ungrammatical. indeed, under the restriction of typing short messages from a mobile phone, grammar often takes a back seat. also, because social media have grown to a global scale, a significant number of users are non-native speakers of english;
- tweets are highly contextual, i.e., the message cannot be understood without its context: geopolitical, cultural, topical (current affairs), and even conversational (e.g. replies to previous messages);
- tweets use a lot of slang and emojis; exacerbating this, new slang words and emojis are invented and popularized every day, and their meanings are volatile.
indeed, tweets contain polluted content [ ] and rumors [ ], which negatively affect the performance of event detection algorithms. besides, only very few tweets actually carry a message about a newsworthy event [ ]. these limitations motivate the current work, which aims to develop an enhanced twitter event detection system that integrates external sources of information such as dbpedia [ ] and wordnet [ ], exploits the social network features of twitter, and integrates knowledge obtained from annotated resources (such as urls cited in the tweets pointing to external web pages). the topic detection and tracking (tdt) project [ ] defines an event as "some unique thing that happens at some point in time". [ ] defines an event as "a real-world occurrence e with an associated time period t_e and a time-ordered stream of twitter messages m_e of substantial volume, discussing the occurrence and published during time t_e". we use these definitions as a guideline in developing our system for event detection from twitter. we exploit specific features of twitter that allow users to share others' tweets by retweeting, quoting, and liking them, and to embed hashtags and urls.
inspired by existing event detection methods for detecting and tracking event-related segments [ , , ], our system is based on identifying important phrases from individual tweets and creating a temporal profile of these phrases to identify whether they are bursty enough to signify the occurrence of an event. however, our system differs from the state-of-the-art in the following:
- we utilize dbpedia to efficiently identify named entities of import.
- we use wordnet to assist in identifying event-specific words and phrases.
- we expand the urls embedded in tweets to find important phrases in the titles of the web pages that these urls point to.
- we harness twitter meta-data such as 'quoted status', 'retweet', 'liked', and 'in reply to', and also the social network of the twitter user.
the rest of the paper is organized as follows. in sect. , we discuss the work closely related to ours. section describes our system for twitter event detection. section explains our experimental setup and results, followed by a discussion. finally, in sect. , we present our conclusions. event detection from twitter has been extensively studied in the past, as evidenced by a rich body of work [ , , , ], among others. in particular, techniques for event detection from twitter can be classified according to the event type (specified or unspecified events), detection method (supervised or unsupervised learning), and detection task (new event detection or retrospective event detection). however, most of the techniques described in the aforementioned surveys lack rigorous evaluation. a major acknowledged obstacle in measuring the accuracy and performance of event detection methods is the lack of large-scale, benchmarked corpora; some authors have created their own manually annotated datasets and made them publicly available [ , ]. our work focuses on unsupervised, retrospective detection of unspecified events from a large body of tweets. among the many approaches to event detection from twitter, such as keyword-volume approaches, topic modeling, and sentiment-analysis-based methods, our work is based on keyphrase/segment detection and tracking, which aims to identify keyphrases/segments whose occurrences grow unusually within the corpus [ , , , , , , ]. some of the works most related to ours are [ , , , ]. edcow [ ] proposed a three-step approach: first, a wavelet transform and auto-correlation are applied to measure the bursty energy of each word, and words associated with high energies are retained as event features; then, the similarity between each pair of event features is measured using cross-correlation; finally, modularity-based graph partitioning is used to detect the events, each of which contains a set of words with high cross-correlation. [ ] presented a system called twevent that analyzes tweets by breaking them into non-overlapping segments and subsequently identifying bursty segments; these bursty segments are then clustered to obtain event-related segments. [ ] contributed a large manually labeled corpus of million tweets containing events in categories; they used a locality sensitive hashing (lsh) technique followed by cluster summarization, and employed wikipedia as an external knowledge source. [ ] employed a statistical analysis of the historical usage of words to find bursty words: those with a burstiness degree more than two standard deviations above the mean are selected and clustered. however, their method was used to find localized events only.
[ ] proposed a mention-anomaly-based approach incorporating the social aspect of tweets by leveraging the creation frequency of the mentions that users insert in tweets to engage in discussion. [ ] advocated the importance of named entities in twitter event detection; they used a clustering technique that partitions tweets based upon the entities they contain, together with burst detection and cluster selection techniques, to extract clusters related to ongoing real-world events. recently, [ ] extracted a structured representation from tweet text using nlp and integrated it with dbpedia and wordnet in an rdf knowledge graph; their system enabled security analysts to describe events of interest precisely and declaratively using sparql queries over the graph. our system, metadata-assisted twitter event detection (mated), is an extension of the previous work twevent [ ]; however, our system makes use of several other features of twitter ignored in previous research. figure shows the architecture of mated, which consists of four components: i) detection of important phrases from tweets; ii) creation of temporal profiles of these phrases to identify bursty phrases; iii) clustering of bursty phrases with the aim of grouping related phrases about an event; and iv) characterization of an event from the clusters obtained above. we parse the tweet json object after receiving it from a stream, and the first component of our system identifies important segments/phrases not just from the tweet text but also from the titles of the webpages that url links in the tweet point to. since event-related phrases are mostly named entities, we harness dbpedia to extract such phrases. the resultant phrases, along with tweet timestamps, are then fed to the next component of our system, which estimates their burstiness behavior using statistical modeling of their occurrence frequency. subsequently, we group the event-related phrases using a graph-based clustering algorithm. in the rest of this section, we describe each component in detail, following the order of their usage in our framework. in the first component, we parse the tweet json object to obtain the tweet text, hashtags, urls, user mentions and other available metadata. we then create a set of phrases/keywords to be monitored, consisting of the following items:
- the list of named entities obtained after inputting the tweet text (after preprocessing and cleaning) to the dbpedia spotlight [ ] web service.
- the list of words related to an action or activity present in the original tweet message. these are identified as words that are either a direct or indirect hyponym of the 'event.n.01' synset of wordnet.
- the list of hashtags included in the tweet.
- for each url cited in the tweet, the title text of the web page that the url points to. we submit this title text to our locally running dbpedia spotlight web service to obtain named entities and include these in the list of items to be monitored.
- for each user mention, the 'name' that the user mention handle is associated with.
figure illustrates the overall process. the tweet text "overall fatalities caused by the disease rose to from on tuesday. the daily death toll reached a record in #spain." does not mention coronavirus, which is an important entity for event detection, tracking and monitoring. our system fetches and processes the title of the webpage linked to by the url cited in the tweet (http://bit.ly/ xohjz). concurrently, finding words in the tweet text that are direct or indirect hyponyms of the event.n.01 synset of wordnet identifies important words to track/monitor ('cause', 'reach', and 'record').
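a minimal sketch of this phrase-extraction step is shown below; it assumes the public dbpedia spotlight rest endpoint, nltk's wordnet interface (after nltk.download('wordnet')), and field names as used in twitter's v1.1 tweet json. the helper names are illustrative, not the system's published code, and the hyponym check below inspects noun senses only.

```python
import requests
from bs4 import BeautifulSoup
from nltk.corpus import wordnet as wn

EVENT = wn.synset('event.n.01')

def spotlight_entities(text, confidence=0.5):
    # query a dbpedia spotlight endpoint for named-entity surface forms
    resp = requests.get(
        'https://api.dbpedia-spotlight.org/en/annotate',
        params={'text': text, 'confidence': confidence},
        headers={'Accept': 'application/json'})
    resp.raise_for_status()
    return [r['@surfaceForm'] for r in resp.json().get('Resources', [])]

def is_event_word(word):
    # true if any noun sense of `word` is a (transitive) hyponym of event.n.01
    for syn in wn.synsets(word, pos=wn.NOUN):
        if syn == EVENT or EVENT in syn.closure(lambda s: s.hypernyms()):
            return True
    return False

def url_title(url, timeout=5):
    # fetch the linked page and return its <title> text, if any
    try:
        html = requests.get(url, timeout=timeout).text
        title = BeautifulSoup(html, 'html.parser').title
        return title.get_text(strip=True) if title else ''
    except requests.RequestException:
        return ''

def monitored_phrases(tweet):
    # tweet: parsed tweet json (dict) as delivered by the streaming api
    text = tweet['text']
    phrases = set(spotlight_entities(text))
    phrases.update(w for w in text.lower().split() if is_event_word(w))
    phrases.update(h['text'] for h in tweet['entities'].get('hashtags', []))
    for u in tweet['entities'].get('urls', []):
        phrases.update(spotlight_entities(url_title(u['expanded_url'])))
    phrases.update(m['name'] for m in tweet['entities'].get('user_mentions', []))
    return phrases
```

extending the hyponym check to verbs via derivationally related forms is straightforward but omitted here for brevity.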
after creating a set of phrases from the dataset as indicated above, we find bursty phrases potentially indicative of an event using the model proposed by twevent [ ]. however, our model includes several factors in the burstiness score of a phrase beyond those considered in [ ]. below we outline our method. let $n_t$ denote the number of tweets published within the current time window $t$, and let $n_{i,t}$ be the number of tweets containing phrase $i$ in $t$. the probability of observing $i$ with a frequency $n_{i,t}$ can be modeled by a binomial distribution $b(n_t, p_i)$, where $p_i$ is the expected probability of observing phrase $i$ in a random time window. since $n_t$ is very large in the case of a twitter stream, this distribution can be approximated by a normal distribution with mean $e[i|t] = n_t\,p_i$ and standard deviation $\sigma[i|t] = \sqrt{n_t\,p_i\,(1 - p_i)}$. we consider a phrase $i$ as bursty if $n_{i,t} \geq e[i|t]$. following the definition in [ ], the burstiness probability of phrase $i$ in time window $t$ is
$$p_b(i, t) = s\!\left(c \cdot \frac{n_{i,t} - e[i|t]}{\sigma[i|t]}\right),$$
where $s(\cdot)$ is the sigmoid function and $c$ is a scaling constant, introduced because $s(x)$ only varies smoothly for $x$ in a bounded range around zero. in addition to scoring the importance of a phrase by how many times it is used in the given time window, we assign further weight based on who authored the phrase and how many times it was retweeted, quoted, liked, and replied to. more formally, let $u_{i,t}$ denote the number of distinct users authoring phrase $i$ in time window $t$. let the retweet count $rt_{i,t}$ of phrase $i$ in $t$ be the sum of the retweet counts of all tweets containing $i$ in $t$; similarly, let $l_{i,t}$ be the like count, $q_{i,t}$ the quote count, and $rp_{i,t}$ the reply count. also, in order to assign a higher importance to phrases used by twitter users who have a significant following, we define $fc_{i,t}$ as the sum of the follower counts of all users using phrase $i$ in $t$. incorporating all of the above, the burstiness weight $w_b(i, t)$ for a phrase $i$ in $t$ is defined by combining $p_b(i, t)$ with these engagement and audience counts. after computing the burstiness weight for all phrases, the top $k$ are selected in decreasing order of weight. empirically, we find that decreasing $k$ results in low recall, while increasing $k$ brings in significant noise; [ ] suggest setting $k$ to $\sqrt{n_t}$.
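to make the burstiness scoring concrete, here is a small sketch. the sigmoid scaling constant and the exact combination of engagement counts into $w_b$ are elided in the text above, so the value of c and the log-scaled product used below are illustrative assumptions rather than the paper's formula.

```python
import math

def burstiness_probability(n_it, N_t, p_i, c=10.0):
    # normal approximation to binomial(N_t, p_i), squashed through a sigmoid;
    # the scaling constant c is an assumption (the paper's value is elided)
    mean = N_t * p_i
    sigma = math.sqrt(N_t * p_i * (1.0 - p_i))
    z = c * (n_it - mean) / sigma
    return 1.0 / (1.0 + math.exp(-z))

def burstiness_weight(p_b, users, retweets, likes, quotes, replies, followers):
    # illustrative combination only: boost p_b by log-scaled engagement counts
    engagement = users + retweets + likes + quotes + replies
    return p_b * math.log1p(engagement) * math.log1p(followers)

def top_k_bursty(phrase_stats, N_t):
    # phrase_stats: {phrase: {'n', 'p', 'users', 'rt', 'likes', 'quotes',
    #                         'replies', 'followers'}}; keep top sqrt(N_t)
    k = int(math.sqrt(N_t))
    scored = {}
    for phrase, s in phrase_stats.items():
        if s['n'] >= N_t * s['p']:  # bursty only if count exceeds expectation
            p_b = burstiness_probability(s['n'], N_t, s['p'])
            scored[phrase] = burstiness_weight(
                p_b, s['users'], s['rt'], s['likes'],
                s['quotes'], s['replies'], s['followers'])
    return sorted(scored, key=scored.get, reverse=True)[:k]
```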
we adopt the approach of [ ] without any modification to group bursty phrases into event-related clusters. each time window $t$ is evenly split into $m$ subwindows $t_1, \ldots, t_m$. let $n_t(i, m)$ be the tweet frequency of phrase $i$ in subwindow $t_m$, and let $t_t(i, m)$ be the concatenation of all the tweets in subwindow $t_m$ that contain phrase $i$. the similarity between phrases $i_a$ and $i_b$ in time window $t$ is calculated as
$$sim_t(i_a, i_b) = \sum_{m=1}^{M} w_t(i_a, m)\, w_t(i_b, m)\, sim\big(t_t(i_a, m), t_t(i_b, m)\big),$$
where $sim(t_t(i_a, m), t_t(i_b, m))$ is the tf-idf similarity between tweets $t_t(i_a, m)$ and $t_t(i_b, m)$, and $w_t(i_a, m)$ is the fraction of the frequency of segment $i_a$ that falls in subwindow $t_m$:
$$w_t(i_a, m) = \frac{n_t(i_a, m)}{\sum_{m'=1}^{M} n_t(i_a, m')}.$$
using the above similarity measure, all the bursty phrases are clustered using a graph-based clustering algorithm [ ]. in this method, all bursty phrases are considered as nodes and, initially, all nodes are disconnected. an edge is added between phrases $i_a$ and $i_b$ if the k-nearest neighbors of $i_a$ contain $i_b$ and vice versa. all connected components of the resulting graph are considered candidate event clusters; each connected component is essentially a set of phrases related to a single event. disconnected nodes (phrases) are discarded as non-significant. we characterize an event as the group of phrases associated with it. to visualize this, we adopt the approach of mabed [ ]: an interface allows us to view the list of relevant tweets defining an event by clicking on the event name. we use the events2012 corpus collected by [ ], which contains million tweets and labeled events. these tweets were collected from oct till nov , and were filtered to remove tweets containing more than hashtags, user mentions, or urls, discarding them as spam [ ]. however, not all tweets were still available, because some users' data could not be retrieved due to account deactivation, privacy setting changes, etc. our final dataset contains ∼ million tweets, of which , are related to events. it should be noted that these tweets are limited to a maximum of 140 characters in length, since the increased length (up to 280 characters) was only introduced in late 2017. in table we show some of the important events in the events2012 [ ] dataset which have more than , tweets associated with them. we perform the following steps sequentially to preprocess the tweet text and the titles of the web pages linked by the urls cited in the tweet message:
1. we use the stanford tokenizer to tokenize the tweets.
2. spellings like cooooolll and awesommmme are sometimes used in tweets to emphasize emotion, and we use a simple trick to normalize such occurrences. namely, let n denote the number of letters that have three or more consecutive occurrences in a given word. we first replace three or more consecutive occurrences of the same character with two occurrences. then we generate the n prototypes that are at edit distance one (using only the delete operation, deleting only a repeated character) and look each prototype up in the dictionary to find the word. for example, coooooolllll → cooll → cool.
3. we use an acronym dictionary from an online resource to find expansions of tokens such as gr8, lol, rotfl, etc.
after the preprocessing task, to obtain a list of named entities we submit the text to the dbpedia spotlight [ ] web service. we chose dbpedia over other named-entity recognition software (such as stanford ner [ ], opennlp [ ], nltk [ ], etc.) because employing such tools yields phrases that introduce noise into the resulting system. as evaluation metrics, we used precision, recall, and derate (duplicate event rate, proposed by [ ]). precision conveys the fraction of the detected events that are related to a realistic event. recall indicates the fraction of events detected from the manually labeled ground-truth set of events. however, if two detected events are related to the same realistic event within the same time window, then both are considered correct in terms of precision, but only one realistic event is counted for recall. therefore, [ ] defined the metric derate to denote the fraction of events (in percentage terms) that are duplicately detected among all events detected. in order to evaluate our proposal, we compare our approach with closely related works: edcow [ ], twevent [ ], need [ ], and mabed [ ]. table shows the comparative performance of our system relative to selected state-of-the-art approaches. for mabed, we modified their publicly available code to use hashtags instead of user mentions to measure anomaly (referred to as mabed+ht). also, for our system, in order to observe the effect of wordnet words related to events, we conducted two sets of experiments, where mated-wn is the system without using the wordnet words.
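the evaluation metrics just defined are easy to compute once each detected event has been matched (or not) against a ground-truth event id; a minimal sketch, with the matching itself assumed to be done elsewhere:

```python
def evaluate(detections, num_ground_truth):
    # detections: list of matched ground-truth ids (None when spurious)
    matched = [gt for gt in detections if gt is not None]
    unique_events = set(matched)
    precision = len(matched) / len(detections) if detections else 0.0
    # recall counts each realistic event once, even if detected repeatedly
    recall = len(unique_events) / num_ground_truth
    # derate: percentage of detections duplicating an already-detected event
    duplicates = len(matched) - len(unique_events)
    derate = 100.0 * duplicates / len(detections) if detections else 0.0
    return precision, recall, derate
```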
we share our source code and dataset online. table shows the results obtained compared to the baseline methods, and table shows some of the events detected by mated that were not detected by any of the above systems. several parameters impact the performance of the resulting system, and the results shown in table were obtained with an optimal combination of them. it is evident from table that the performance of existing bursty-segment-detection-based systems is enhanced by including the social and twitter-specific features incorporated in our system. in particular, we notice a significant improvement in recall from including the title texts of the web pages pointed to by the urls in the tweets. a tweet is often a comment on the web page being shared and, therefore, by including the title text, the system captures a better context for the tweet. further, because of misspellings, dbpedia spotlight sometimes fails to find a named entity; in such cases, tracking event-specific words from wordnet ( total words) helps identify an event. for example, in table , the event on . . about ford would be missed if the word 'fault' were not included in the list of important key-phrases to be considered. better results are observed for mabed+ht than for the original mabed, owing to the fact that hashtags are better indicators of events than user mentions. we attribute our system's lower precision compared to [ ] to the inclusion of several more event-specific phrases from the web page titles, hashtags, and event-specific words from wordnet, resulting in a higher recall at a slight loss of precision ( . as opposed to . for [ ]). finally, we also noticed that many events were not reported in the crowd-sourced ground-truth events corpus; the event on . . about the amanda todd suicide is one example of many events we found that were not included in the corpus. phenomenal growth in online social network services generates massive amounts of data, posing many challenges owing to the volume, variety, velocity, and veracity of the data. concurrently, methods to detect events from social streams in an efficient, accurate, and timely manner are also evolving. in this paper, we built on an existing system, twevent [ ], by incorporating the external knowledge bases dbpedia and wordnet together with the user mentions and hashtags contained in twitter messages for efficient event detection. in addition, harnessing the fact that a tweet is often just a remark/comment on the news/information shared via the url cited in it, we improve event detection performance by detecting and tracking important event-related phrases from the titles of the web pages linked to by the urls. we examined the effect of adding our novel features incrementally and concluded that our model outperforms the state-of-the-art on the benchmarked events2012 [ ] corpus. future research includes investigating the use of distributed semantics (e.g., word embeddings) within a larger deep-learning-inspired framework towards achieving higher accuracy for event detection on massive-scale collections of social media messages.
references:
- eventweet: online localized event detection from twitter
- topic detection and tracking: event-based information organization
- a survey of techniques for event detection in twitter
- dbpedia: a nucleus for a web of open data
- beyond trending topics: real-world event identification on twitter
- information credibility on twitter
- wordnet. the encyclopedia of applied linguistics
- parameter free bursty events detection in text streams
- discovery of significant emerging trends
- mention-anomaly-based event detection and tracking in twitter
- analyzing feature trajectories for event detection
- bursty feature representation for clustering text streams
- searching twitter: separating the tweet from the chaff
- processing social media messages in mass emergency: a survey
- clustering using a similarity measure based on shared near neighbors
- bursty and hierarchical structure in streams
- a survey of emerging trend detection in textual data mining
- seven months with the devils: a long-term study of content polluters on twitter
- twevent: segment-based event detection from tweets
- nltk: the natural language toolkit
- the stanford corenlp natural language processing toolkit
- real-time entity-based event detection for twitter
- building a large-scale corpus for evaluating event detection on twitter
- dbpedia spotlight: shedding light on the web of documents
- detecting events in online social networks: definitions, trends and challenges
- real-time event detection in massive streams
- breaking news detection and tracking in twitter
- armatweet: detecting events by semantic tweet analysis
- mining correlated bursty topic patterns from coordinated text streams
- survey and experimental analysis of event detection techniques for twitter
- event detection in twitter
- text annotation with opennlp and uima
acknowledgment. this work is partly supported by the eu youngres project (# ) on polarization detection.
key: cord- -k f cmyn authors: shahrezaye, morteza; meckel, miriam; steinacker, léa; suter, viktor title: covid-19's (mis)information ecosystem on twitter: how partisanship boosts the spread of conspiracy narratives on german speaking twitter date: - - journal: nan doi: nan sha: doc_id: cord_uid: k f cmyn
in late 2019, the gravest pandemic in a century began spreading across the world. a state of uncertainty related to what has become known as sars-cov-2 has since fueled conspiracy narratives on social media about the origin, transmission and medical treatment of, and vaccination against, the resulting disease, covid-19. using social media intelligence to monitor and understand the proliferation of conspiracy narratives is one way to analyze the distribution of misinformation about the pandemic. we analyzed more than . m german language tweets about covid-19. the results show that only about . % of all those tweets deal with conspiracy theory narratives. we also found that the political orientation of users correlates with the volume of content they contribute to the dissemination of conspiracy narratives, implying that partisan communicators have a higher motivation to take part in conspiratorial discussions on twitter. finally, we showed that, contrary to other studies, automated accounts do not significantly influence the spread of misinformation in the german speaking twitter sphere; they only account for about . % of all conspiracy-related activities in our database. in november 2019, a febrile respiratory illness caused by sars-cov-2 infected people in the city of wuhan, china. on january 30th, the world health organization (who) declared the outbreak a public health emergency of international concern [bbc, ]. shortly after, the who reported multiple covid-19-related knowledge gaps relating to its origin, transmission, vaccinations, clinical considerations, and concerns regarding the safety of healthcare workers [who, a].
the organization warned of an "infodemic", defined as "an overabundance of information and the rapid spread of misleading or fabricated news, images, and videos" [who, b]. by august , more than million people worldwide had contracted the virus [who, c]. the organization for economic co-operation and development (oecd) put forward estimates of negative gdp growth for all member countries in due to the crisis [oecd, ]. covid-19's indomitable dissemination around the globe, combined with a lack of effective medical remedies [guo et al., ; xie et al., ] and its psychological and economic side effects [oecd, ; ho et al., ; rajkumar, ], created fertile conditions for the spread of conspiracy narratives. there are two main conditions conducive to the emergence of conspiracy narratives: individuals' psychological traits and socio-political factors. regarding psychological traits, numerous laboratory studies demonstrate the correlation between conspiracy beliefs and psychological features like a negative attitude toward authorities [imhoff and bruder, ], self-esteem [abalakina-paap et al., ], paranoia and threat [mancosu et al., ], powerlessness [abalakina-paap et al., ], education, gender and age [van prooijen, ], level of agreeableness [swami et al., ], and death-related anxiety [newheiser et al., ]. another line of reasoning sees conspiracy mentality as a generalized political attitude [imhoff and bruder, ] and correlates conspiracy beliefs with socio-political factors like political orientation. enders et al. showed that conspiracy beliefs can be a product of partisanship [enders et al., ]. several other studies show a quadratic correlation between partisanship and belief in certain conspiracy theories [van prooijen et al., ]. these insights imply that extremists on both sides of the political spectrum are more prone to believe in and discuss conspiracy narratives. we define conspiracy narratives as part of the overall phenomenon of misinformation on the internet. we use misinformation as the broader concept of fake or inaccurate information that is not necessarily intentionally produced (distinguished from disinformation, which is typically based on an intention to mislead the recipients). among all conspiracy narratives, we are interested in those propagated in times of pandemic crises. the spread of health-related conspiracy theories is not a new phenomenon [geissler and sprinkle, ; bogart et al., ; klofstad et al., ], but it seems to have accelerated in a world connected via social media. the covid-19 pandemic's unknown features, its psychological and economic side effects, the ubiquitous availability of online social networks (osns) [pew, ], and high levels of political polarization in many countries [fletcher et al., ; yang et al., ] make this pandemic a potential breeding ground for the spread of conspiracy narratives. from the outset of the crisis, "misleading rumors and conspiracy theories about the origin circulated the globe paired with fear-mongering, racism, and the mass purchase of face masks [...]. the social media panic travelled faster than the covid-19 spread" [depoux et al., ]. such conspiracy narratives can obstruct efforts to properly inform the general public via medical and scientific findings [grimes, ]. therefore, investigating the origins and circulation of conspiracy narratives, as well as the potential political motives supporting their spread on osns, is of vital public relevance.
with this objective, we analyzed more than . m german language tweets about covid-19 to answer the following research questions:
research question 1: what volume of german speaking twitter activity comprises covid-19 conspiracy discussions, and how much of this content is removed from twitter?
research question 2: does engagement with covid-19 conspiracy narratives on twitter correlate with the political orientation of users?
research question 3: to what degree do automated accounts contribute to the circulation of conspiracy narratives in the german speaking twitter sphere?
we collected the data for this study during the early phase of the crisis, namely between march 11th, the day on which the who declared the spread of the sars-cov-2 virus a pandemic [bbc, ], and may st, . the data was downloaded using the twitter streaming api by looking for the following keywords: "covid", "covid19", "corona", and "coronavirus". only tweets posted by german speaking users or written in german were included. the final dataset comprises more than . m tweets, from which two categories of conspiracy narratives were selected: conspiracy narratives about the origin of the covid-19 illness (table ) and those about its potential treatments (table ). the conspiracy narratives about the origin of the covid-19 illness were selected based on shahsavari et al., who automatically detected significant circulation of the underlying conspiracy theories on twitter using machine learning methods [shahsavari et al., ]. the second group of conspiracy narratives was chosen because these narratives were at the center of attention in german media [tagesschau, ] and a considerable number of tweets discussed them [netzpolitik, ]. table indicates the number of tweets belonging to each conspiracy narrative and the keywords used to filter them out. there were , tweets in total discussing the underlying conspiracy narratives; figure shows the timeline of these tweets. in addition to the six conspiracy narratives, tweets were randomly extracted from the dataset to serve as a control group. to answer research question 2, a list was extracted from official party websites; this list contains the members of parliament (mps) who are active on twitter and belong to one of the six political parties in germany's federal legislature. each party also runs several official twitter pages, which were added to the list for that political party; for example, the official twitter page of the social democratic party (spd) in the federal state of bavaria, called "bayernspd", was added to the spd list. for each twitter account in the extracted list, a maximum of tweet handles were downloaded from the twitter api. table shows the relevant statistics on the political tweets. in the next step, for each of the , users spreading conspiracy narratives (table ), the lists of their tweet handles were downloaded (table ). finally, for each of them we counted the number of times they retweeted one of the political tweets in table . based on boyd et al., retweets are mainly a form of endorsement [boyd et al., ]. we therefore assume that if a discernible number of a user's retweets are of tweets from members of a certain political party, this user most likely shares the corresponding political orientation. this method of inferring users' political orientation has been applied in similar studies [garimella et al., ].
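a minimal sketch of this orientation-inference step, assuming we already hold, per user, the ids of the accounts whose tweets they retweeted and the party-to-account mapping described above; the thresholds are illustrative assumptions:

```python
from collections import Counter

def infer_orientation(retweeted_ids, party_accounts, min_share=0.5, min_count=5):
    # retweeted_ids: iterable of account ids whose tweets the user retweeted
    # party_accounts: dict mapping party name -> set of mp/party account ids
    hits = Counter()
    for acc in retweeted_ids:
        for party, ids in party_accounts.items():
            if acc in ids:
                hits[party] += 1
    if not hits:
        return None  # no political retweets observed
    party, count = hits.most_common(1)[0]
    # require a discernible number and share of political retweets
    if count >= min_count and count / sum(hits.values()) >= min_share:
        return party
    return None
```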
there are multiple studies showing that exposure to misinformation can have persistent negative effects on citizens. the respondents in one study adjusted their judgment in proportion to their cognitive ability after they realized that their initial evaluation was based on inaccurate information; in other words, respondents with lower levels of cognitive ability tend to keep biased judgments even after exposure to the truth [keersmaecker and roets, ]. in another study, tangherlini et al. found that conspiracy narratives stabilize through the alignment of various narratives, domains, people, and places, such that the removal of one or some of these entities would cause the conspiracy narrative to quickly fall apart [tangherlini et al., ]. imhoff and lamberty have shown that believing covid-19 to be a hoax correlated negatively with self-reported compliance with infection-reducing, containment-related behavior [imhoff and lamberty, ]. on that account, to assess whether a democratic information ecosystem is balanced towards reliable information rather than misinformation, we need to monitor and estimate whether covid-19 conspiracy theory narratives circulate significantly on twitter. based on a survey in mid-march , about % of respondents stated that they had seen some pieces of likely misinformation about covid-19 [pew, ]. shahsavari et al. used automated machine learning methods to detect covid-19 conspiracy narratives in reddit, 4chan, and news data [shahsavari et al., ]. multiple other studies found evidence of the spread of covid-19 misinformation on different osns [boberg et al., ; ahmed et al., ; serrano et al., ]. to address public concerns, many of the service providers stated that they would remove or tag this sort of content on their platforms. on march th, facebook, microsoft, google, twitter and reddit said they were teaming up to combat covid-19 misinformation on their platforms [bloomberg, ]. on april nd, twitter stated that it had removed over tweets containing misleading and potentially harmful covid-19-related content [twitter, ]. on june th, we examined how many of the german conspiracy-related tweets still existed on twitter, in order to understand whether conspiracy-related tweets tend to persist on twitter for a shorter time than non-conspiracy-related tweets. table shows the results. based on table , only about . % of all german covid-19 tweets concern one of the conspiracy narratives under consideration. these tweets were posted by more than , unique twitter users. while . % is small in magnitude, it still comprises a relevant number of citizens. it is important to note, though, that this finding does not imply that only about , twitter users believe in conspiracy theories: while our data show the spread of conspiracy narratives, they do not reveal a user's stance towards the respective content. in terms of content moderation by twitter, on average . % of conspiracy narrative tweets were deleted after a certain period of time, which is significantly higher than the % of tweets deleted in the control group. we speculate that more of the conspiracy-related tweets were deleted because of twitter's content moderation efforts, which have been stepped up due to recent public debates about misinformation on osns.
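such a persistence check can be run by re-querying the tweet ids; a sketch follows, using the twitter v1.1 statuses/lookup endpoint that was available at the time (deleted or suspended tweets are simply omitted from its response). the credentials and batching below are illustrative assumptions:

```python
import requests

LOOKUP_URL = "https://api.twitter.com/1.1/statuses/lookup.json"

def surviving_fraction(tweet_ids, bearer_token):
    # returns the fraction of tweet_ids still retrievable from twitter;
    # statuses/lookup accepted up to 100 comma-separated ids per request
    headers = {"Authorization": f"Bearer {bearer_token}"}
    alive = 0
    for start in range(0, len(tweet_ids), 100):
        batch = tweet_ids[start:start + 100]
        resp = requests.get(
            LOOKUP_URL,
            params={"id": ",".join(map(str, batch))},
            headers=headers)
        resp.raise_for_status()
        alive += len(resp.json())  # missing ids indicate removed tweets
    return alive / len(tweet_ids) if tweet_ids else 0.0

# per-group deletion rate is then 1 - surviving_fraction(group_ids, token)
```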
there is a long list of laboratory studies that show a correlation between conspiracy mentality and extreme political orientation [enders et al., ; van prooijen et al., ]. in this study we answer the slightly different question of whether the partisanship of twitter users correlates with their contribution to conspiracy theory narrative discussions. table shows the distribution of the political orientations of users who discuss each of the underlying conspiracy narratives. it demonstrates that users who are likely supporters of the afd and the spd most actively discuss and spread covid-related conspiracy narratives on twitter. to check whether contributions to conspiracy narratives are correlated with the political orientation of users, we ran a saturated poisson log-linear model on the contingency table. the model treats the counts as independent observations of a poisson random variable and includes both the main effects of, and the interaction between, conspiracy narratives and the political orientation of users [agresti, ]:
$$\log \mu_{ij} = \lambda + \lambda^{n}_{i} + \lambda^{p}_{j} + \lambda^{np}_{ij},$$
where $\mu_{ij} = e(n_{ij})$ represents the expected counts, the $\lambda$s are parameters to be estimated, and $n$ and $p$ stand for narrative and political orientation, respectively. the $\lambda^{np}_{ij}$s correspond to the interaction and association between conspiracy narratives and political orientation, and reflect the departure from independence [agresti, ]. since we suspect that beliefs in certain mutually contradictory conspiracy theories can be positively correlated [wood et al., ], we aggregated the six conspiracy theory cases into two, according to the category each belongs to, and formed table to remove any possible correlation. table shows the anova analysis of the underlying saturated poisson log-linear model applied to table . the last line of resulting p-values in table shows that the interaction term $\lambda^{np}_{ij}$ is statistically significant. therefore, we can reject the hypothesis that the contribution to conspiracy narratives is independent of the political orientation of users. the fact that there is evidence of a correlation between the contribution to conspiracy narratives and the political orientation of users does not, however, imply any causality. to further estimate the relative effect of political orientation on the contribution to conspiracy narratives on twitter, we applied six chi-square goodness-of-fit tests, comparing the control group with each of the six conspiracy narratives. for all six tests the p-values were significantly less than . , which suggests that the distributions of the contributions to the six different conspiracy narratives are statistically different from the control group. figure shows the distribution of the tests' residuals. the last column of figure shows that twitter users without a certain political orientation contributed relatively less to conspiracy narratives in comparison to the control group; in other words, compared to the control group, users with certain political orientations contributed more to the circulation of conspiracy narratives.
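both tests can be reproduced with standard tooling; a sketch with statsmodels and scipy follows, where all counts are placeholders rather than the paper's data:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chisquare

# placeholder contingency counts: narrative category x political orientation
df = pd.DataFrame({
    "narrative": ["origin"] * 3 + ["treatment"] * 3,
    "party": ["A", "B", "none"] * 2,
    "count": [120, 80, 300, 60, 40, 150],
})

# saturated poisson log-linear model: main effects plus interaction
saturated = smf.glm("count ~ narrative * party", data=df,
                    family=sm.families.Poisson()).fit()
reduced = smf.glm("count ~ narrative + party", data=df,
                  family=sm.families.Poisson()).fit()
# likelihood-ratio test of the interaction term (departure from independence)
lr_stat = 2 * (saturated.llf - reduced.llf)
print("LR statistic for interaction:", lr_stat)

# chi-square goodness of fit of one narrative's distribution vs the control
control = [300, 150, 900]    # placeholder control-group counts per party
observed = [120, 60, 200]    # placeholder narrative counts per party
total = sum(observed)
expected = [c / sum(control) * total for c in control]
print(chisquare(observed, f_exp=expected))
```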
automated accounts, or users who post programmatically, make up a significant share of between % and % of twitter users worldwide [davis et al., ]. multiple studies hold automated accounts responsible for political manipulation and undue influence on the political agenda [shao et al., ]. however, more recent studies qualify these earlier results and show that the influence of automated accounts is overestimated. ferrara finds that automated accounts comprise less than % of users who post generally about covid-19 [ferrara, ]. there are multiple methods for automatically detecting automated accounts on osns [alothali et al., ]. for this study, we used the botornot method [davis et al., ], which applies random forest classification trees to more than a thousand pieces of public meta-data available via the twitter api and to other human-engineered features.
(figure : distribution of residuals of the chi-square goodness-of-fit tests.)
table displays the percentage of automated accounts (users with a complete automation probability higher than . ) and verified users who contribute to conspiracy narratives. based on this analysis, . % of covid-19 conspiracy narrative tweets are suspected to be posted by automated accounts. this number is significantly lower than in many other studies of bot activity on twitter. we speculate that this occurs for three reasons. first, the importance of the topic might have captured so much public attention that significantly more users discussed covid-19-related topics than take part in usual twitter discussions. second, many service providers, including twitter, had started to combat covid-19 misinformation in response to widespread warnings. finally, we concentrated on german tweets, while the earlier estimates apply to tweets in english. in this study we analyzed more than . m german language tweets and showed that the volume of tweets that discuss one of the six considered conspiracy narratives represents about . % of all covid-19 tweets. this translates to more than , unique german speaking twitter users. imhoff and lamberty found that "believing that covid-19 was a hoax was a strong negative prediction of containment-related behaviors like hand washing and keeping physical distance". to provide the public with accurate information about the importance of such measures, social media intelligence can help alleviate potential pitfalls of the twitter information ecosystem. using more than , tweets and , unique twitter users, we formed the contingency table of political orientation and contribution to covid-19 conspiracy narratives (table ). we then applied a saturated poisson log-linear regression and showed that independence between the underlying variables can be statistically rejected. this implies that partisans have a higher motivation for taking part in covid-19-related conspiracy discussions, and it shows that politically polarized citizens increase the spread of health misinformation on twitter. finally, we employed an automated-account detection tool and showed that, on average, about . % of the users who discuss covid-19 conspiracy narratives are potentially automated accounts or bots. this number is much lower than estimates of general bot activity on twitter, which run up to % [davis et al., ]. this study holds new insights as well as some limitations:
• our results shed light on the problem of misinformation on twitter in times of crisis for a certain cultural and language context: germany. we showed that the political orientation of politically polarized users translates into higher circulation of health-related conspiracy narratives on twitter. further research could compare the results of this study with other countries and language realms on twitter.
• we also offer indications of a link between political or ideological partisanship and engagement in the dissemination of misinformation on twitter. in this study we examined whether political partisanship motivates individuals to take part in conspiracy discussions; in other words, we did not distinguish between tweets promoting the conspiracy narratives and those rejecting them. one could extend the analysis and study the effect of partisanship on promoting conspiracy theories. further research will also need to combine quantitative data analysis and qualitative content analysis to better understand the underlying motivations for engaging in conspiracy communication on osns.
• finally, we offer a more nuanced view of the role of automated tweets regarding a highly emotionally charged topic. there are numerous studies showing contradictory estimates of bot activity on osns. we found that only about . % of the users who spread covid-19 conspiracy tweets are potentially bots. this number is much lower than many of those put forward by other researchers. further research could investigate this result in order to understand why this estimate is lower than in other case studies.
references:
- the origin, transmission and clinical therapies on coronavirus disease (covid-19) outbreak - an update on the status
- severe covid-19: a review of recent progress with a look toward the future
- mental health strategies to combat the psychological impact of covid-19 beyond paranoia and panic
- covid-19 and mental health: a review of the existing literature. asian journal of psychiatry
- covid-19 concerns and worries in patients with mental illness
- lacking control increases illusory pattern perception
- the open society and its enemies: hegel and marx. routledge
- groups as epistemic providers: need for closure and the unfolding of group-centrism
- speaking (un-)truth to power: conspiracy mentality as a generalised political attitude
- beliefs in conspiracies
- believing in conspiracy theories: evidence from an exploratory analysis of italian survey data
- why education predicts decreased belief in conspiracy theories
- conspiracist ideation in britain and austria: evidence of a monological belief system and associations between individual psychological differences and real-world and fictitious conspiracy theories
- the functional nature of conspiracy beliefs: examining the underpinnings of belief in the da vinci code conspiracy
- are all 'birthers' conspiracy theorists? on the relationship between conspiratorial thinking and political orientations
- political extremism predicts belief in conspiracy theories
- disinformation squared: was the hiv-from-fort-detrick myth a stasi success?
- conspiracy beliefs about hiv are related to antiretroviral treatment nonadherence among african american men with hiv
- what drives people to believe in zika conspiracy theories?
- how polarized are online and offline news audiences? a comparative analysis of twelve countries
- why are "others" so polarized? perceived political polarization and media use in countries
- the pandemic of social media panic travels faster than the covid-19 outbreak
- on the viability of conspiratorial beliefs
- conspiracy in the time of corona: automatic detection of covid-19 conspiracy theories in social media and the news
- tweet, tweet, retweet: conversational aspects of retweeting on twitter
- aristides gionis, and michael mathioudakis. mary, mary, quite contrary: exposing twitter users to contrarian news
- incorrect, but hard to correct. the role of cognitive ability on the impact of false information on social impressions. intelligence
- an automated pipeline for the discovery of conspiracy and conspiracy theory narrative frameworks: bridgegate, pizzagate and storytelling on the web
- a bioweapon or a hoax? the link between distinct conspiracy beliefs about the coronavirus disease (covid-19) outbreak and pandemic behavior
- pandemic populism: facebook pages of alternative news media and the corona crisis - a computational content analysis
- covid-19 and the 5g conspiracy theory: social network analysis of twitter data
- nlp-based feature extraction for the detection of covid-19 misinformation videos on youtube
- categorical data analysis
- dead and alive: beliefs in contradictory conspiracy theories
- botornot: a system to evaluate social bots
- the spread of fake news by social bots
- the rise of social bots
- what types of covid-19 conspiracies are populated by twitter bots? first monday
- detecting social bots on twitter: a literature review
- online human-bot interactions: detection, estimation, and characterization
key: cord- -bfc h b authors: shanthakumar, swaroop gowdra; seetharam, anand; ramesh, arti title: analyzing societal impact of covid-19: a study during the early days of the pandemic date: - - journal: nan doi: nan sha: doc_id: cord_uid: bfc h b
in this paper, we collect and study twitter communications to understand the societal impact of covid-19 in the united states during the early days of the pandemic. with infections soaring rapidly, users took to twitter asking people to self-isolate and quarantine themselves. users also demanded the closure of schools, bars, and restaurants, as well as the lockdown of cities and states. we methodically collect tweets by identifying and tracking trending covid-related hashtags. we first manually group the hashtags into six main categories, namely 1) general covid, 2) quarantine, 3) panic buying, 4) school closures, 5) lockdowns, and 6) frustration and hope, and study the temporal evolution of tweets in these hashtags. we conduct a linguistic analysis of words common to all hashtag groups and specific to each hashtag group, and identify the chief concerns of people as the pandemic gripped the nation (e.g., exploring bidets as an alternative to toilet paper). we conduct sentiment analysis, and our investigation reveals that people reacted positively to school closures and negatively to the lack of availability of essential goods due to panic buying. we adopt a state-of-the-art semantic role labeling approach to identify the action words and then leverage an lstm-based dependency parsing model to analyze the context of action words (e.g., the verb deal is accompanied by nouns such as anxiety, stress, and crisis). finally, we develop a scalable seeded topic modeling approach to automatically categorize and isolate tweets into hashtag groups and experimentally validate that our topic model provides a grouping similar to our manual grouping. our study presents a systematic way to construct an aggregated picture of people's response to the pandemic and lays the groundwork for future fine-grained linguistic and behavioral analysis. covid-19 (also known as the novel coronavirus disease) is a truly global pandemic and has affected humans in all countries of the world. while humanity has seen numerous epidemics, including a number of deadly ones over the last two decades (e.g., sars, mers, ebola), the grief and disruption that covid-19 has already inflicted is incomparable. at the time of writing this paper, covid-19 is still rapidly spreading around the world, and projections for the next few months are grim and extremely disconcerting. the learnings from covid-19 will also enable humankind to prevent such epidemics from transforming into global pandemics and to minimize the socio-economic disruption.
in this work, our goal is to analyze the societal impact of covid-19 in the united states of america during its early days, understand the chain of events that occurred during the spread of the infection, and draw meaningful conclusions so that similar mistakes can be avoided in the future. though twitter data has previously been shown to be biased [ ] , twitter has emerged as the primary medium for people to express their opinion, especially during this time, and our study offers a perspective into the impact as self-disclosed by people in a form that is easily understandable and can be acted upon. we summarize our main contributions below.
• we collect , tweets from twitter between march th to march th , a time period when the virus made its first significant inroads into the us, and quantitatively demonstrate the disruption and distress experienced by the people.
• we group the hashtags into six main categories, namely 1) general covid, 2) quarantine, 3) school closures, 4) panic buying, 5) lockdowns, and 6) frustration and hope, to quantitatively and qualitatively understand the chain of events. we observe that general covid and quarantine-related messages remain trending throughout the duration of our study. in comparison, we observe calls for closing schools and universities peaking in the middle of march and then reducing when the closures go into effect (e.g., #closenycschools). we also observe a similar trend with panic buying, with essential items, particularly toilet paper, becoming unavailable in stores (e.g., #panicbuying, #toiletpapercrisis).
• we conduct a linguistic analysis of the tweets in the different hashtag groups and present the words that are representative of each group. we observe that words such as family, life, health, and death are common across hashtag groups. additionally, for example, if we consider the school closures category, we observe that unigrams (e.g., teacher, learn) and bigrams (e.g., home school, kid home) reflect the most discussed issues.
• we also conduct sentiment analysis to unearth the overall sentiment of the people. our investigation reveals that people reacted positively to school closures and negatively to the lack of availability of essential goods due to panic buying.
• we next adopt a state-of-the-art semantic role labeling approach to identify the action words (e.g., fear, test) that are uniquely representative in each hashtag group. these action words help understand people's actions in each group. we leverage an lstm dependency parsing model to analyze the context of the above-mentioned action words (e.g., the verb deal is accompanied by nouns such as anxiety and stress).
• finally, we develop a scalable seeded topic modeling (seeded lda) approach to automatically categorize tweets into specific topics of interest, especially when the topics are rarer in the dataset. we experimentally validate our seeded lda model and observe that it provides a grouping similar to our manual grouping.
our study summarizes the critical public responses surrounding covid-19, paving the way for future fine-grained linguistic and graph analysis.

in this section, we discuss our methodology for data collection from twitter to investigate the societal impact of covid-19 in the united states during its early days. we collect data using the twitter search api. the results presented in this paper are based on the data collected from march to march , . we track the trending covid-related hashtags every day and collect the tweets in those specific hashtags.
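as an illustration of this collection step, the following is a minimal sketch of the daily hashtag-tracking filter; the per-day trending sets and the tweet record layout are assumptions for illustration, not the authors' released code.

```python
from collections import defaultdict

# hypothetical per-day sets of trending covid-related hashtags
TRENDING_BY_DAY = {
    "2020-03-15": {"#quarantinelife", "#closenycschools"},
    "2020-03-16": {"#panicbuying", "#toiletpapercrisis"},
}

def collect_daily(tweet_stream):
    """Keep a tweet if it carries any hashtag trending on its posting day."""
    kept = defaultdict(list)
    for tweet in tweet_stream:
        day = tweet["created_at"][:10]  # "YYYY-MM-DD" prefix assumed
        tags = {t.lower() for t in tweet["hashtags"]}
        if tags & TRENDING_BY_DAY.get(day, set()):
            kept[day].append(tweet)
    return kept
```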
we repeat this process to collect a total of , tweets during this time period. we group the hashtags into six main categories, namely 1) general covid, 2) quarantine, 3) school closures, 4) panic buying, 5) lockdowns, and 6) frustration and hope, to quantitatively and qualitatively understand the chain of events. we collect data on a per-day basis for the different hashtags as and when they become trending. table i shows the number of tweets in each category, while table ii shows the grouping of some of the representative hashtags by category. we observe that the total number of tweets as grouped by hashtags is , , which is higher than the total number of tweets. this is because tweets can contain multiple hashtags, and thus the same tweet can be grouped into multiple categories. we present some example tweets in table iii to illustrate the types of communications occurring on twitter during this period. alongside, people also rallied to support workers working hard to keep essential services running. with the beginning of april approaching, many people started to worry about their next month's rent.

due to the data collection limits imposed by twitter, we are able to only collect and analyze a portion of the tweets. though we started collecting data as quickly as we conceived of this project, we were unable to collect data during the first week of march. though we ran our script to collect data as far back as march , because of the way twitter provides data, we obtained a limited number of tweets from march to march . additionally, due to the rapidly evolving situation, it is likely that we have inadvertently missed some important hashtags, despite our best efforts. as is the case with most studies based on twitter data, we also acknowledge the presence of bias in data collection [ ] . having said that, the goal of this study is to provide a panoramic, summarized view of the impact of the pandemic on people's lives and aggregate public opinion as expressed by them. due to the nature of this study, we are confident that the results presented here help in appreciating the sequence of events that transpired and in better preparing ourselves for possible future waves of covid-19 or another pandemic.

in this section, we present observations and results based on our linguistic analysis of the tweets. we study the popularity and temporal evolution of individual hashtags and hashtag groups. we explore the word-usage (i.e., unigram and bigram) frequencies for each hashtag group to understand the main points of discussion. we then conduct a sentiment analysis to understand the prevailing sentiments in the tweets. we adopt a semantic role labeling approach to identify the action words (i.e., verbs) as well as the corresponding contextual analysis of these action words. finally, we develop a scalable seeded-lda based topic model to automatically group tweets and validate its effectiveness against our manual grouping.

figure a shows the top hashtags observed in our data. as expected, we see that hashtags corresponding directly to covid or coronavirus are the most popular hashtags, as most communications are centered around them. we observe that hashtags around social isolation, staying at home, and quarantining are also popular. figure b shows the most popular hashtags by date.
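the grouping logic described above can be sketched as follows; the group lexicon is a small illustrative subset of table ii, and a tweet carrying hashtags from several groups is counted once per matching group, which is why the group totals exceed the total number of tweets.

```python
from collections import Counter

# illustrative subset of the hashtag-to-group mapping of table ii
HASHTAG_GROUPS = {
    "quarantine":      {"#quarantine", "#selfisolation", "#stayathome"},
    "school_closures": {"#closenycschools", "#schoolclosures"},
    "panic_buying":    {"#panicbuying", "#toiletpapercrisis"},
    "lockdowns":       {"#lockdown", "#shelterinplace"},
}

def group_counts(tweets):
    """Count group mentions; a tweet may contribute to several groups."""
    counts = Counter()
    for tweet in tweets:
        tags = {t.lower() for t in tweet["hashtags"]}
        for group, members in HASHTAG_GROUPS.items():
            if tags & members:
                counts[group] += 1
    return counts
```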
similar to figure a, we observe that hashtags related directly to covid and social distancing trend most on twitter. the figures and the number of tweets highlight how the pandemic gripped the united states with its rate of spread. we investigate the evolution of the number of tweets in various hashtag groups over time. to calculate the number of tweets in each hashtag group, we count the number of mentions of hashtags in that group across all the tweets. if a tweet contains more than one hashtag, it is counted as part of all the groups whose hashtags it mentions. as the numbers of tweets for the hashtag groups vary significantly, we plot the groups that have similar numbers of tweets together. similar to figure , we observe from figure a that the total numbers of tweets in the general covid and quarantine categories are relatively high throughout the time period of the study. interestingly, from figure b, we observe that panic buying and calls for school closures peak around the middle of march.

in this section, we present results from a linguistic word-usage analysis across the different hashtag groups. our goal is to identify the words that are uniquely representative of each particular group. to accomplish this, first, we identify and present the most commonly used words across all the hashtags. to construct the group of common words across all hashtags, we remove the words that are the same as or similar to the hashtags mentioned in table ii, as those words are redundant and tend to also be high in frequency. we also remove the names of places and governors, such as new york, massachusetts, and andrew cuomo. after filtering out these words, we then rank the words based on their occurrence in multiple groups and their combined frequency across all the groups. we observe words such as family, health, death, life, work, help, thank, need, time, love, and crisis. in table iv, we present some notable example tweets containing the common words. while one may think that health refers to virus-related health issues, we notice that many people also refer to mental health in their tweets as a possible consequence of social distancing and anxiety caused by the virus. we also observe the usage of words such as death and crisis to indicate the seriousness of the situation. supporting workers and showing gratitude toward them is another common tweet pattern that is worth mentioning.

second, we present the most semantically meaningful and uniquely identifying words in each hashtag group. to do this, we remove the common words calculated in the above step from each group. from the obtained list of words after the filtering, we then select the top words. due to space constraints, we only present results for four hashtag groups. figure gives us the uniquely identifying and semantically meaningful words in each hashtag group. in the general covid group, we find words such as impact, response, resource, and doctor. similarly, for school closures, we find words such as teacher, schedule, educator, book, and class. the panic buying top words mostly echo the shortages experienced by people, such as roll and tissue (referring to toilet paper), hoard, bidet (as an alternative to toilet paper), wipe, and water.
top words in the lockdown group include immigration, shelter, safety, court, and petition, signifying the different issues surrounding lockdown. we analyze words that co-occur to understand the contextual information surrounding the words. co-occurring bigrams capture pairs of words that frequently co-occur in each group. to do this, we first filter out stop words and perform stemming and lemmatization. we calculate the overall frequency of each word and its frequency within each class, and calculate the bigram association using pearson's chi-squared independence test, which determines if pairs of words occur together more often than they would at random. we select the top bigrams with the highest collocation statistics that are most intuitive for the human reader. figure shows the top bigrams for each group. we can clearly see how bigrams give a better understanding compared to unigrams. bigrams such as 'toilet paper', 'panic buy', and 'wash hand' clearly articulate the intents of the tweets in the panic buying group. similarly, in the lockdown group, we see 'stay home', 'work home', and 'minimize spread' emerging as top bigrams, capturing what people are talking about in that group.

to understand the sentiment across the different hashtag groups, we perform a comparative sentiment analysis. we use a pre-trained sentiment analysis model [ ] , which has . % accuracy on the stanford sst test dataset, and apply it to our dataset. our model, a roberta-base model, classifies the data into five sentiment categories: strongly positive, positive, neutral, negative, and strongly negative. we present the results in figure . since the neutral category is not useful for our analysis, we exclude it and scale the rest of the categories to 100%, normalizing for the number of tweets in each category. we notice that the school closures group has a significantly higher number of positive tweets, which capture the overall positive sentiment around the closure of schools. in contrast, the panic buying group has a higher number of negative tweets, showing the frustration in relation to panic buying. overall, we observe more strongly positive tweets than strongly negative tweets in all categories. this is especially interesting in the quarantine and frustration and hope groups, where more tweets are showing support for quarantine and hopefulness.

we use an allennlp bert-based model [ ] to run semantic role labeling and identify the action words (verbs), which capture the actions people are referring to in the tweets. to identify the uniquely representative verbs in each group, we identify all the verbs in each group and use tf-idf vectorization to remove the common verbs across the groups. we then compute the verb frequency of the remaining verbs in each group. figure shows the verbs and their frequencies. the results capture the top verbs defining each group. for example, the school closures group has close as its top verb, which signifies the closing of schools, while learn, read, and teach emphasize the actions corresponding to learning online because of the pandemic. in comparison, some words such as mean and post are challenging to understand without additional context, so we present example tweets containing these words in table v to understand the contexts in which they are used. all tweets with mean have a similar context, but post is used in two different contexts: one refers to send, and the other refers to the post-pandemic period. along the same lines, verbs in other groups also signify people's actions during the pandemic.
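the bigram-collocation step can be sketched with nltk's chi-squared association measure; the tokenized input and the frequency cut-off below are illustrative assumptions rather than the paper's exact configuration.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

def top_bigrams(tokens, n=10, min_freq=5):
    """Rank co-occurring word pairs by Pearson's chi-squared statistic."""
    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(min_freq)  # drop rare, noisy pairs first
    return finder.nbest(BigramAssocMeasures.chi_sq, n)

# e.g., top_bigrams(panic_buying_tokens) might yield ("toilet", "paper"), ...
```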
the panic buying group captures actions such as buy, wash, hoard, and sell. to further analyze the context in which the action words discussed in section iii-e are used, we analyze the words associated with them using dependency parsing. dependency parsing breaks down each sentence into linguistic dependency structures organized in the form of a tree. we focus on identifying the nouns that are connected to the action words. in the dependency parse, the action words/verbs form the root of the parse, and the dependencies are in the left and right subtrees. to identify the nouns associated with the verbs, we traverse the dependency parse to the sub-tree where the action word of interest is present and then extract the corresponding noun. we also analyze the link between the noun and the verb and find that "nsubj" (nominal subject), "pobj" (object of a preposition), and "dobj" (direct object) are the link tags that most often connect nouns to the action words. figure gives some notable dependency-parse subtrees with the action word and the corresponding noun. we can see that by decoding the parse structure, we can identify additional contextual information, such as the nouns the action words refer to. we use the allennlp implementation of a neural model for dependency parsing using biaffine classifiers on top of a bidirectional lstm [ ] . we parse the sentences associated with the top verbs in figure and find their associated nouns to understand what the action verbs are used to signify.

tables vi, vii, viii, and ix present the different nouns associated with the most prominent action words in each group, respectively. general covid, being the most diverse group, contains a myriad of tweets, from offering support to the fear of getting infected. apart from the verb-noun combinations that we expect to see in the group (such as test virus, confirm case, offer support), the other most notable verb-noun combinations in this group are: deal stress, deal anxiety, fear system, and fear safety. in the other groups, the verb-noun combinations narrow down to the specific actions relevant to each group. for instance, in school closures, the action word close mostly talks about closing the schools for the benefit of students, and the action word offer co-occurs with teaching aids offered through online sources. in the panic buying group, tweets about the panic experienced by people are captured by verb-noun pairs such as stop madness, buy paper, and find store. in the lockdown group, some interesting combinations surface, such as believe information and guess trust, which capture the possible distrust people have in the lockdown measures. nouns such as insanity also help in capturing people's reactions to the lockdown measures.

in this section, we use seeded lda [ ] to categorize the tweets and check the closeness of these automatically obtained groups to our manual grouping using the hashtags. as we are specifically interested in isolating tweets on specific topics of our interest rather than the general topics identified by a topic model, we leverage a seeded variant of lda, seeded lda [ ] , to guide the topic model to discover them. seeded lda allows seeding of topics by providing a small set of keywords to guide topic discovery, influencing both the document-topic and the topic-word distributions. the seed words need not be exhaustive, as the model is able to detect other words in the same category via co-occurrence in the dataset.
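as a compact illustration of the verb-noun extraction, the sketch below uses spacy's dependency parser in place of the allennlp biaffine model used in the paper; the traversal mirrors the nsubj/dobj/pobj links discussed above.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in for the biaffine parser

def verb_noun_pairs(text, action_words):
    """Collect (verb, noun) pairs linked via nsubj, dobj, or pobj."""
    pairs = []
    for tok in nlp(text):
        if tok.dep_ in ("nsubj", "dobj") and tok.head.lemma_ in action_words:
            pairs.append((tok.head.lemma_, tok.lemma_))
        elif tok.dep_ == "pobj" and tok.head.head.lemma_ in action_words:
            # object of a preposition hangs off the verb via the preposition
            pairs.append((tok.head.head.lemma_, tok.lemma_))
    return pairs

verb_noun_pairs("People deal with anxiety and stress daily.", {"deal"})
# -> [('deal', 'people'), ('deal', 'anxiety')] with the small english model;
#    conjuncts such as 'stress' would need extra handling.
```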
our goal with seeded lda is to i) present a way to automatically categorize tweets into specific topics of interest, especially when the topics are rarer in the dataset, ii) passively evaluate the effectiveness of our word analysis thus far, and iii) develop a scalable approach that can be extended to millions of tweets with minimal manual intervention. we develop a seeded lda model to categorize tweets into the five hashtag groups: i) general covid, ii) school closures, iii) panic buying, iv) lockdowns, and v) quarantine, by seeding each group with seed words from our analysis in section iii-b. we leave out the frustration and hope topic due to the inherently polarizing nature of its keywords and the lack of identifying keywords that are unique to the topic. we select the top few words from our word analysis in figure as seed words for our seeded lda model. table x gives the seed words for the different covid categories. we include k unseeded topics in our model to account for messages that do not fall into these topic categories. after experimenting with different values of k and manually evaluating the topics, we find that k = gives us the best separation and categorization. we use α = . and β = . to give us sparse document-topic and topic-word distributions, where fewer topics and words with high values emerge, so we can classify the tweets to the predominant category. we train the seeded lda models for iterations. we first use the document-topic distribution to get the best topic for each tweet. if the best topic of the message is one of the seeded topics, which correspond to the categories, then we classify the tweet into that category. in the event that a clear best topic does not emerge, we randomly assign the tweet to one of the topics that have the same document-topic probability.

1) analyzing the effectiveness of the seeded lda model: to check how closely the hashtag groups match the seeded lda groups, we measure the accuracy by comparing the document-topic distribution from the lda against the grouping determined by the hashtags. we do this by calculating the confusion matrix, which gives us four counts, namely true positives, true negatives, false positives, and false negatives, from which we further calculate accuracy, precision, recall, and f1 scores, which quantify correctness. the results we obtained are shown in table xi. this endeavor helps in determining the effectiveness of our word analysis (the seeds) and of our seeded lda model. also, to verify that our model had the best results in the groups that we are interested in, we calculate the precision, recall, and f1 scores for the school closures, panic buying, lockdowns, and quarantine groups. we exclude the general covid and frustration and hope groups, as they are too general and we are interested in isolating the more specific covid groups. table xii shows the results for each group. by examining the results, we observe that the manual grouping of the hashtags has a significant match with the seeded lda groups. we also note that the seeded lda model is able to correctly isolate the tweets in rarer groups where there is less data, such as the school closures group. this shows the effectiveness of our model in analyzing rarer groups in the data. additionally, from the precision of classification for the quarantine group, we observe that the false positives were significantly low, which further adds credibility to our model.
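the seeding and hard assignment can be sketched with the open-source guidedlda package (a seeded fork of the collapsed-gibbs `lda` sampler); the seed words, priors, and keyword arguments below follow that package's documented example and are assumptions rather than the authors' exact setup.

```python
import guidedlda  # pip install guidedlda

# X: document-term count matrix; vocab: column-index-to-term list (assumed given)
seed_topic_list = [
    ["virus", "case", "test"],                # general covid
    ["school", "teacher", "class", "learn"],  # school closures
    ["toilet", "paper", "hoard", "store"],    # panic buying
    ["lockdown", "shelter", "order"],         # lockdowns
    ["quarantine", "isolation", "home"],      # quarantine
]
word2id = {w: i for i, w in enumerate(vocab)}
seed_topics = {word2id[w]: t
               for t, words in enumerate(seed_topic_list)
               for w in words if w in word2id}

# small alpha/eta priors keep document-topic and topic-word distributions sparse
model = guidedlda.GuidedLDA(n_topics=len(seed_topic_list) + 2,  # +2 unseeded
                            n_iter=1000, alpha=0.1, eta=0.01, random_state=7)
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)
best_topic = model.doc_topic_.argmax(axis=1)  # hard topic label per tweet
```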
iv. related work. in this section, we outline existing research related to modeling and analyzing twitter and web data to understand the social, political, psychological, and economic impacts of a variety of different events. due to the recent nature of the outbreak, there is little to no published work on covid-19. we primarily focus on discussing work that analyzes twitter communications. ahmed et al. focus on the conspiracy theories surrounding the novel coronavirus, especially in relation to 5g [ ] . the authors analyze twitter communications and discuss the possibility of using bots for propagating misinformation and political conspiracies during the pandemic [ ] , [ ] . in comparison, the authors in [ ] conduct infodemiology studies on twitter communications to understand how information is spreading during this time, while the stigma created by referencing the novel coronavirus as the "chinese virus" is investigated in [ ] . twitter has been used to study political events and related stance [ ] , [ ] , human trafficking [ ] , and public health [ ] , [ ] , [ ] , [ ] , [ ] . several works perform fine-grained linguistic analysis on social media data [ ] , [ ] , [ ] .

v. discussion and concluding remarks. in this paper, we studied twitter communications in the united states during the early days of the covid-19 outbreak. as the disease continued to spread, we observed panic buying as well as calls for closures of schools, bars, and cities, for social distancing, and for quarantining. we conducted a linguistic word-usage analysis and identified the most frequently occurring unigrams and bigrams in each group, which give us an idea of the main discussion points. we conducted sentiment analysis to understand the extent of positive and negative sentiments in the tweets. we then performed semantic role labeling to identify the key action words and obtained the corresponding contextual words using dependency parsing. finally, we designed a scalable seeded topic modeling approach to automatically identify the key topics in the tweets.

references:
• discovering, assessing, and mitigating data bias in social media
• roberta: a robustly optimized bert pretraining approach
• simple bert models for relation extraction and semantic role labeling
• deep biaffine attention for neural dependency parsing
• incorporating lexical priors into topic models
• covid-19 and the 5g conspiracy theory: social network analysis of twitter data
• coronavirus goes viral: quantifying the covid-19 misinformation epidemic on twitter
• what types of covid-19 conspiracies are populated by twitter bots? first monday
• creating covid-19 stigma by referencing the novel coronavirus as the "chinese virus" on twitter: quantitative analysis of social media data
• conversations and medical news frames on twitter: infodemiological study on covid-19 in south korea
• all i know about politics is what i read in twitter: weakly supervised models for extracting politicians' stances from twitter
• bumps and bruises: mining presidential campaign announcements on twitter
• the impact of environmental stressors on human trafficking
• using twitter to understand the human bowel disease community: exploratory analysis of key topics
• how social media will change public health
• detecting and characterizing mental health related self-disclosure in social media
• predicting depression via social media
• characterizing sleep issues using twitter
• fine-grained analysis of cyberbullying using weakly-supervised topic models
• a socio-linguistic model for cyberbullying detection
• weakly supervised cyberbullying detection using co-trained ensembles of embedding models

key: cord- -rx cux i authors: sarker, abeed; lakamana, sahithi; hogg-bremer, whitney; xie, angel; al-garadi, mohammed ali; yang, yuan-chi title: self-reported covid-19 symptoms on twitter: an analysis and a research resource date: - - journal: j am med inform assoc doi: . /jamia/ocaa sha: doc_id: cord_uid: rx cux i

objective: to mine twitter and quantitatively analyze covid-19 symptoms self-reported by users, compare symptom distributions across studies, and create a symptom lexicon for future research. materials and methods: we retrieved tweets using covid-19-related keywords, and performed semiautomatic filtering to curate self-reports of positive-tested users. we extracted covid-19-related symptoms mentioned by the users, mapped them to standard concept ids in the unified medical language system, and compared the distributions to those reported in early studies from clinical settings. results: we identified positive-tested users who reported symptoms using unique expressions. the most frequently-reported symptoms were fever/pyrexia ( . %), cough ( . %), body ache/pain ( . %), fatigue ( . %), headache ( . %), and dyspnea ( . %) amongst users who reported at least 1 symptom. mild symptoms, such as anosmia ( . %) and ageusia ( . %), were frequently reported on twitter, but not in clinical studies. conclusion: the spectrum of covid-19 symptoms identified from twitter may complement those identified in clinical settings.

the outbreak of the coronavirus disease (covid-19) is one of the worst pandemics in known world history. as of may , , over million confirmed positive cases have been reported globally, causing over deaths. as the pandemic continues to ravage the world, numerous research studies are being conducted, whose focuses range from trialing possible vaccines and predicting the trajectory of the outbreak to investigating the characteristics of the virus by studying infected patients. early studies focusing on identifying the symptoms experienced by those infected by the virus mostly included patients who were hospitalized or received clinical care. [ ] [ ] [ ] many infected people only experience mild symptoms or are asymptomatic and do not seek clinical care, although the specific portion of asymptomatic carriers is unknown. [ ] [ ] [ ] to better understand the full spectrum of symptoms experienced by infected people, there is a need to look beyond hospital- or clinic-focused studies. with this in mind, we explored the possibility of using social media, namely twitter, to study symptoms self-reported by users who tested positive for covid-19.
our primary goals were to (i) verify that users report their experiences with covid-19, including their positive test results and the symptoms they experienced, on twitter, and (ii) compare the distribution of self-reported symptoms with those reported in studies conducted in clinical settings. our secondary objectives were to (i) create a covid-19 symptom corpus that captures the multitude of ways in which users express symptoms, so that natural language processing (nlp) systems may be developed for automated symptom detection, and (ii) collect a cohort of covid-19-positive twitter users whose longitudinal self-reported information may be studied in the future. to the best of our knowledge, this is the first study that focuses on extracting covid-19 symptoms from public social media. we have made the symptom corpus public with this article to assist the research community, and it will be part of a larger, maintained data resource, a social media covid-19 data bundle (https://sarkerlab.org/covid_sm_data_bundle/).

we collected tweets, including texts and metadata, from twitter via its public streaming application programming interface. first, we used a set of keywords/phrases related to the coronavirus to detect tweets through the interface: covid, covid19, covid-19, coronavirus, and corona and virus, including their hashtag equivalents (eg, #covid19). due to the high global interest in this topic, these keywords retrieved very large numbers of tweets. therefore, we applied a first level of filtering to only keep tweets that also mentioned at least 1 of the following terms: positive, negative, test, and tested, along with at least 1 of the personal pronouns: i, my, us, we, and me; and only these tweets were stored in our database. to discover users who self-reported positive covid-19 tests with high precision, we applied another layer of filtering using regular expressions. we used the expressions "i.*test[ed] positive," "we.*test[ed] positive," "test.*came back positive," "my.*[covid|coronavirus|covid19].*symptoms," and "[covid|coronavirus|covid19].*[test|tested].*us." we also collected tweets from a publicly available twitter dataset that contained ids of over million covid-19-related tweets and applied the same layers of filters. three authors manually reviewed the tweets and profiles to identify true self-reports, while discarding the clear false positives (eg, ". . . i dreamt that i tested positive for covid . . ."). we further removed users from our covid-19-positive set if their self-reports were deemed to be fake or were duplicates of posts from other users, or if they stated that their tests had come back negative despite their initial beliefs about contracting the virus. these multiple layers of filtering gave us a manageable set of potential covid-19-positive users (a few hundred) whose tweets we could analyze semiautomatically. the filtering decisions were made iteratively by collecting sample data for hours and days and then updating the collection strategy based on analyses of the collected data. for all the covid-19-positive users identified, we collected all their past posts dating back to february , . we excluded non-english tweets and those posted earlier than the mentioned date. we assumed that symptoms posted prior to february were unlikely to be related to covid-19, particularly because our data collection started in late february, and most of the positive test announcements we detected were from late march to early april.
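a self-contained sketch of the regular-expression pre-filter is given below; the patterns adapt the listed expressions, and the added word boundaries and optional "ed" are our own tightening, not the original implementation.

```python
import re

SELF_REPORT_PATTERNS = [
    re.compile(p, re.IGNORECASE) for p in (
        r"\bi\b.*\btest(ed)? positive\b",
        r"\bwe\b.*\btest(ed)? positive\b",
        r"\btest\w*\b.*came back positive",
        r"\bmy\b.*\b(covid|coronavirus|covid19)\b.*\bsymptoms\b",
        r"\b(covid|coronavirus|covid19)\b.*\b(test|tested)\b.*\bus\b",
    )
]

def is_candidate_self_report(text):
    """High-recall pre-filter; every match still goes to manual review."""
    return any(p.search(text) for p in SELF_REPORT_PATTERNS)
```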
since we were interested only in identifying patient-reported symptoms in this study, we attempted to shortlist tweets that were likely to mention symptoms. to perform this, we first created a meta-lexicon by combining meddra, the consumer health vocabulary (chv), and sider. lexicon-based approaches are known to have low recall, particularly for social media data, since social media expressions are often nonstandard and contain misspellings. therefore, instead of searching the tweets for exact expressions from the lexicon, we performed inexact matching using a string similarity metric. specifically, for every symptom in the lexicon, we searched windows of term sequences in each tweet, computed their similarities with the symptom, and extracted sequences that had similarity values above a prespecified threshold. we used the levenshtein ratio as the similarity metric, computed as $1 - \frac{\mathit{lev.\ dist.}}{\max(\mathit{length})}$, where lev. dist. represents the levenshtein distance between the strings and max(length) represents the length of the longer string. our intent was to attain high recall, so that we were unlikely to miss possible expressions of symptoms while filtering out many tweets that were completely off topic. we set the threshold via trial and error over sample tweets, and because of the focus on high recall, this approach still retrieved many false positives (eg, tweets mentioning body parts but not in the context of an illness or a symptom). after running this inexact matching approach on approximately user profiles, we manually extracted the true positive expressions (ie, those that expressed symptoms in the context of covid-19) and added them to the meta-lexicon. following these multiple filtering methods, we manually reviewed all the posts from all the users, identified each true symptom expressed, and removed the false positives.

we semiautomatically mapped the expressions to standardized concept ids in the unified medical language system using the meta-lexicon we developed and the national center for biomedical ontology bioportal. in the absence of exact matches, we searched the bioportal to find the most appropriate mappings. using twitter's web interface, we manually reviewed all the profiles, paying particularly close attention to those with less than potential symptom-containing tweets, to identify possible false negatives left by the similarity-based matching algorithms. all annotations and mappings were reviewed, and the reviewers' questions were discussed at meetings. in general, we found that it was easy for annotators to detect expressions of symptoms, even when the expressions were nonstandard (eg, "pounding in my head" = headache). each detected symptom was reviewed by at least authors, and the first author of the article reviewed all the annotations. once the annotations were completed, we computed the frequencies of the patient-reported symptoms on twitter and compared them with several other recent studies that used data from other sources. we also identified users who reported that they had tested positive and also specifically stated that they showed "no symptoms." we excluded nonspecific statements about symptoms, such as "feeling sick" and "signs of pneumonia." when computing the frequencies and percentages of symptoms, we used 2 models: (i) computing raw frequencies over all the detected users, and (ii) computing frequencies for only those users who reported at least 1 symptom or explicitly stated that they had no symptoms.
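the inexact matching described above can be sketched end-to-end as follows; the window sizes and the 0.85 threshold are placeholders, since the tuned threshold is not reported here.

```python
def lev_dist(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def lev_ratio(a, b):
    """Similarity in [0, 1]: 1 - distance / length of the longer string."""
    return 1 - lev_dist(a, b) / max(len(a), len(b))

def find_symptom_spans(tweet, lexicon, threshold=0.85):
    """Slide 1-3 token windows over the tweet; keep spans whose similarity
    to any lexicon entry clears the threshold (recall-oriented)."""
    tokens = tweet.lower().split()
    hits = []
    for n in (1, 2, 3):
        for i in range(len(tokens) - n + 1):
            span = " ".join(tokens[i:i + n])
            for symptom in lexicon:
                if lev_ratio(span, symptom.lower()) >= threshold:
                    hits.append((span, symptom))
    return hits
```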
we believe the frequency distribution for (ii) was more reliable since, for users who reported no specific symptoms, we could not verify if they had actually not experienced any symptoms (ie, asymptomatic) or just did not share any symptoms over twitter. our initial keyword-based data collection and filtering from the different sources retrieved millions of tweets, excluding retweets. we found many duplicate tweets, which were mostly reposts (not retweets) of tweets posted by celebrities. removing duplicates left us with users ( tweets). of them were labeled as "negatives": users who stated that their tests had come back negative, removed their original covid-19-positive self-reports, or posted fake information about testing positive (eg, we found some users claiming they tested positive as an april fools' joke). this left us with covid-19-positive users with tweets since february . the similarity-based symptom detection approach reduced the number of unique tweets to review to . the users expressed total symptoms (mean: . ; median: ) using unique expressions, which we grouped into categories, including a "no symptoms" category (table ). users expressed at least 1 symptom or stated that they were asymptomatic ( . %). ( . %) users did not mention any symptoms or only expressed generic symptoms, which we did not include in the counts (we provide these expressions in the lexicon accompanying this paper). users explicitly mentioned that they experienced no symptoms. as table shows, fever/pyrexia was the most commonly reported symptom, followed by cough, body ache & pain, headache, fatigue, dyspnea, chills, anosmia, ageusia, throat pain, and chest pain, each mentioned by over % of the users who reported at least 1 symptom. figure illustrates the first detected report of each symptom from the cohort members on a timeline, and figure shows the distribution of the number of symptoms reported by the cohort. table compares the symptom percentages reported by our twitter cohort with several early studies conducted in clinical settings (ie, patients who were either hospitalized or visited hospitals/clinics for treatment). the top symptoms remained fairly consistent across the studies: fever/pyrexia, cough, dyspnea, headache, body ache, and fatigue. the percentage of fever ( %), though the highest in our dataset, is lower than in all the studies conducted in clinical settings. in our study, we distinguished, where possible, between myalgia and arthralgia, and combined pain (any pain other than those explicitly specified) and body ache. combining all these into 1 category, as some studies had done, would result in a higher proportion. we found considerable numbers of reports of anosmia ( %) and ageusia ( %), with approximately one-fourth of our cohort reporting these symptoms. reports of these symptoms, however, were missing from the referenced studies conducted in clinical settings.

our study revealed that there were many self-reports of covid-19-positive tests on twitter, although such reports are buried in large amounts of noise. we observed a common trend among twitter users of describing their day-to-day disease progression since the onset of symptoms. this trend perhaps became popular as celebrities started describing their symptoms on twitter. we saw many reports from users who reported to have tested positive but initially showed no symptoms, and some who expressed anosmia and/or ageusia (first reported on march ) as the only symptoms, which were undocumented in the comparison studies.
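the 2 counting models can be made concrete with a short sketch, assuming one record per user with the set of extracted symptoms and an explicit no-symptoms flag; the layout is hypothetical.

```python
from collections import Counter

def symptom_percentages(users):
    """users: iterable of dicts like
    {"symptoms": {"fever", "cough"}, "no_symptoms": False}.
    Returns (percent over all users, percent over reporting users)."""
    counts = Counter(s for u in users for s in u["symptoms"])
    reporting = [u for u in users if u["symptoms"] or u["no_symptoms"]]
    pct_all = {s: 100 * c / len(users) for s, c in counts.items()}
    pct_reporting = {s: 100 * c / len(reporting) for s, c in counts.items()}
    return pct_all, pct_reporting
```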
there are some studies that suggest that anosmia and ageusia may be the only symptoms of covid-19 among otherwise asymptomatic patients. [ ] [ ] [ ] the most likely explanation behind the differences between symptoms reported on twitter and in the clinical studies is that the former were reported mostly by users who had milder infections, while people who visited hospitals often went there to receive treatment for serious symptoms. also, the median ages of the patients studied in clinical studies tended to be much higher than the median age of twitter users (in the us, the median twitter user age is ). in contrast to the clinical studies, in our cohort, some users expressed mental health-related consequences (eg, stress/anxiety) of testing positive. it was difficult in many cases to ascertain if the mental health issues were directly related to covid-19 or whether the users had prior histories of such conditions. to the best of our knowledge, this is the first study to have utilized twitter to curate symptoms posted by covid-19-positive users. in the interest of community-driven research, we have made the symptom lexicon available with this publication. the cohort of users detected over social media will enable us to conduct targeted studies in the future, and to study relatively unexplored topics such as the mental health impacts of the pandemic and the long-term health-related consequences for those infected by the virus.

the work reported in this article was supported by funding from emory university, school of medicine. funding for computational

[table : symptom comparison with early clinical studies (huang et al., chen et al., wang et al., chen et al., and guan et al.). notes: percentages are for users who expressed at least 1 symptom or expressed that they did not have any symptoms; *the study provided a combined number for myalgia and fatigue; headache and dizziness were combined for this study; the reported number is for myalgia/muscle ache and/or arthralgia, whereas our study separated myalgia, arthralgia, body ache, and pain.]

references:
• who director-general's opening remarks at the media briefing on covid-19
• effects of the covid-19 pandemic on the world population: lessons to adopt from past years' global pandemics
• covid-19 map - johns hopkins coronavirus resource center
• clinical characteristics of coronavirus disease in china
• clinical progression of patients with covid-19 in shanghai
• clinical features of patients infected with novel coronavirus in wuhan, china
• presumed asymptomatic carrier transmission of covid-19
• a familial cluster of pneumonia associated with the novel coronavirus indicating person-to-person transmission: a study of a family cluster
• early transmission dynamics in wuhan, china, of novel coronavirus-infected pneumonia
• tracking social media discourse about the covid-19 pandemic: development of a public
• meddra: an overview of the medical dictionary for regulatory activities
• exploring and developing consumer health vocabularies
• the sider database of drugs and side effects
• semi-supervised approach to monitoring clinical depressive symptoms in social media
• combining lexicon-based and learning-based methods for twitter sentiment analysis
• welcome to the ncbo bioportal | ncbo bioportal
• clinical characteristics of hospitalized patients with novel coronavirus-infected pneumonia in wuhan, china
• epidemiological and clinical characteristics of cases of novel coronavirus pneumonia in wuhan, china: a descriptive study
• report of the who-china joint mission on coronavirus disease
• isolated sudden onset anosmia in covid-19 infection. a novel syndrome?
• loss of smell or taste as the only symptom of covid-19
• covid-19 med nedsatt lukte- og smakssans som eneste symptom [covid-19 with impaired smell and taste as the only symptom]
• anosmia and dysgeusia in the absence of other respiratory diseases: should covid-19 infection be considered?
• sizing up twitter users | pew research center

none declared. a.s. designed the study and the data collection/filtering strategies. all authors contributed to the analyses, the annotation process, and the writing of the manuscript. the authors would like to acknowledge the feedback provided by collaborators from emory university and the georgia department of public health (gdph) through the emory-gdph partnership for covid-19.

key: cord- -wes my e authors: masud, sarah; dutta, subhabrata; makkar, sakshi; jain, chhavi; goyal, vikram; das, amitava; chakraborty, tanmoy title: hate is the new infodemic: a topic-aware modeling of hate speech diffusion on twitter date: - - journal: nan doi: nan sha: doc_id: cord_uid: wes my e

online hate speech, particularly over microblogging platforms like twitter, has emerged as arguably the most severe issue of the past decade. several countries have reported a steep rise in hate crimes fueled by malicious hate campaigns. while the detection of hate speech is one of the emerging research areas, the generation and spread of topic-dependent hate in the information network remain under-explored. in this work, we focus on exploring the user behaviour which triggers the genesis of hate speech on twitter and how it diffuses via retweets. we crawl a large-scale dataset of tweets, retweets, user activity history, and follower networks, comprising over million tweets from more than million unique users. we also collect over k contemporary news articles published online. we characterize different signals of information that govern these dynamics. our analyses differentiate the diffusion dynamics in the presence of hate from usual information diffusion. this motivates us to formulate the modelling problem in a topic-aware setting with real-world knowledge. for predicting the initiation of hate speech for any given hashtag, we propose multiple feature-rich models, with the best performing one achieving a macro f1 score of . . meanwhile, to predict the retweet dynamics on twitter, we propose retina, a novel neural architecture that incorporates exogenous influence using scaled dot-product attention. retina achieves a macro f1-score of . , outperforming multiple state-of-the-art models. our analysis reveals the superlative power of retina to predict the retweet dynamics of hateful content compared to the existing diffusion models.
[ ] pointed it out, even manual identification of hate speech comes with ambiguity due to the differences in the definition of hate. also, an important signal of hate speech is the presence of specific words/phrases, which vary significantly across topics/domains. tracking such a diverse socio-linguistic phenomenon in realtime is impossible for automated, large-scale platforms. an alternative approach can be to track potential groups of users who have a history of spreading hate. as matthew et al. [ ] suggested, such users are often a very small fraction of the total users but generate a sizeable portion of the content. moreover, the severity of hate speech lies in the degree of its spread, and an early prediction of the diffusion dynamics may help combat online hate speech to a new extent altogether. however, a tiny fraction of the existing literature seeks to explore the problem quantitatively. matthew et al. [ ] put up an insightful foundation for this problem by analyzing the dynamics of hate diffusion in gab . however, they do not tackle the problem of modeling the diffusion and restrict themselves to identifying different characteristics of hate speech in gab. hate speech on twitter: twitter, as one of the largest micro-blogging platforms with a worldwide user base, has a long history of accommodating hate speech, cyberbullying, and toxic behavior. recently, it has come hard at such contents multiple times , and a certain fraction of hateful tweets are often removed upon identification. however, a large majority of such tweets still circumvent twitter's filtering. in this work, we choose to focus on the dynamics of hate speech on twitter mainly due to two reasons: (i) the wide-spread usage of twitter compared to other platforms provides scope to grasp the hate diffusion dynamics in a more realistic manifestation, and (ii) understanding how hate speech emerges and spreads even in the presence of some top-down checking measures, compared to unmoderated platforms like gab. diffusion patterns of hate vs. non-hate on twitter: hate speech is often characterized by the formation of echochambers, i.e., only a small group of people engaging with such contents repeatedly. in figure , we compare the temporal diffusion dynamics of hateful vs. non-hate tweets (see sections vi-a and vi-b for the details of our dataset and hate detection methods, respectively). following the standard information diffusion terminology, the set of susceptible nodes at any time instance of the spread is defined by all such nodes which have been exposed to the information (followers of those who have posted/retweeted the tweet) up to that instant but did not participate in spreading (did not retweet/like/comment). while hateful tweets are retweeted in a significantly higher magnitude compared to non-hateful ones (see figure (a)), they tend to create lesser number of susceptible users over time (see figure (b)). this is directly linked to two major phenomena: primarily, one can relate this to the formation of hate echo-chambers -hateful contents are distributed among a well-connected set of users. secondarily, as we define susceptibility in terms of follower relations, hateful contents, therefore, might have been diffusing among connections beyond the follow network -through paid promotion, etc. 
also, one can observe the differences in early growth for the two types of information: while hateful tweets acquire most of their retweets and susceptible nodes in a very short time and stall later on, non-hateful ones tend to maintain the spread, though at a lower rate, for a longer time. this characteristic can again be linked to organized spreaders of hate, who tend to disseminate hate as early as possible.

topic-dependence of twitter hate: hateful contents show strong topic-affinity: topics related to politics and social issues, for example, incur much more hateful content compared to sports or science. hashtags in twitter provide an overall mapping of tweets to topics of discussion. as shown in figure , the degree of hateful content varies significantly for different hashtags. even when different hashtags share a common theme (such as the topic of discussion of #jamiaunderattack, #jamiaviolence, and #jamiacctv), they may still incur a different degree of hate. previous studies [ ] tend to denote users as hate-preachers irrespective of the topic of discussion. however, as evident in figure , the degree of hatefulness expressed by a user is dependent on the topic as well. for example, while some users resort to hate speech concerning covid-19 and china, others focus on topics around the protests against the citizenship amendment act in india.

[figure : the color of a cell, corresponding to a user and a hashtag, signifies the ratio of hateful to non-hate tweets posted by that user using that specific hashtag.]

exogenous driving forces: with the increasing entanglement of virtual and real social processes, it is only natural that events happening outside the social media platforms tend to shape the platforms' discourse. though a small number of existing studies attempt to inquire into such inter-dependencies [ ] , [ ] , the findings are substantially motivating for problems related to modeling information diffusion and user engagement on twitter and other platforms. in the case of hate speech, exogenous signals offer an even more crucial attribute to look into, which is global context. for both detecting and predicting the spread of hate speech over short tweets, the knowledge of context is likely to play a decisive role.

present work: based on the findings of the existing literature and the analysis presented above, here we attempt to model the dynamics of hate speech spread on twitter. we separate the process of spread into hate generation (asking who will start a hate campaign) and retweet diffusion of hate (who will spread an already started hate campaign via retweeting). to the best of our knowledge, this is the very first attempt to delve into the predictive modeling of online hate speech. our contributions can be summarized as follows:
1) we formalize the dynamics of hate generation and retweet spread on twitter, subsuming the activity history of each user, signals propagated by the localized structural properties of the information network of twitter induced by follower connections, as well as global endogenous and exogenous signals (events happening inside and outside of twitter) (see section iii).
2) we present a large dataset of tweets, retweets, user activity history, and the information network of twitter covering versatile hashtags that trended very recently. we manually annotate a significant subset of the data for hate speech. we also provide a corpus of contemporary news articles published online (see section vi-a for more details).
3) we unsheathe a rich set of features manifesting the signals mentioned above to design multiple prediction frameworks which forecast, given a user and a contemporary hashtag, whether the user will write a hateful post or not (section iv). we provide an in-depth feature ablation and ensemble methods to analyze our proposed models' predictive capability, with the best performing one resulting in a macro f1-score of . .
4) we propose retina (retweeter identifier network with exogenous attention), a neural architecture to predict potential retweeters given a tweet (section v-b). retina encompasses an attention mechanism which dictates the prediction of retweeters based on a stream of contemporary news articles published online. features representing hateful behavior encoded within the given tweet as well as the activity history of the users further help retina to achieve a macro f1-score of . , significantly outperforming several state-of-the-art retweet prediction models.
we have made public our datasets and code along with the necessary instructions and parameters, available at https://github.com/lcs -iiitd/retina.

hate speech detection. in recent years, the research community has been keenly interested in better understanding, detecting, and combating hate speech on online media. starting with basic feature-engineered logistic regression models [ ] , [ ] to the latest ones employing neural architectures [ ] , a variety of automatic online hate speech detection models have been proposed across languages [ ] . to determine hateful text, most of these models utilize a static-lexicon based approach and consider each post/comment in isolation. with the lack of context (both in the form of an individual's prior indulgence in the offense and the current world view), models trained on previous trends perform poorly on new datasets. while linguistic and contextual features are essential factors of a hateful message, the destructive power of hate speech lies in its ability to spread across the network. however, only recently have researchers started using network-level information for hate speech detection [ ] , [ ] . rathpise and adji [ ] proposed methods to handle class imbalance in hate speech classification. a recent work showed how anti-social behavior on social media during covid-19 led to the spread of hate speech. awal et al. [ ] coined the term 'disability hate speech' and showed its social, cultural, and political contexts. ziems et al. [ ] explained how covid-19 tweets increased racism, hate, and xenophobia in social media. while our work does not involve building a new hate speech detection model, hate detection underpins any work on hate diffusion in the first place. inspired by existing research, we also incorporate hate lexicons as a feature for the diffusion model. the lexicon is curated from multiple sources and manually pruned to suit the indian context [ ] . meanwhile, to overcome the problem of context, we utilize the timeline of a user to determine her propensity towards hate speech.

information diffusion and microscopic prediction. predicting the spread of information on online platforms is crucial in understanding the network dynamics, with applications in marketing campaigns, rumor spreading/stalling, route optimization, etc., the latest in the family of diffusion models being the chassis model [ ] . on the other end of the spectrum, the sir model [ ] effectively captures the presence of r (recovered) nodes in the system, which are no longer active due to information fatigue.
even though limited in scope, the sir model serves as an essential baseline for all diffusion models. among other techniques, a host of studies employ social media data for both macroscopic (size and popularity) and microscopic (next user(s) in the information cascade) prediction. while highly popular, both deepcas [ ] and deephawkes [ ] focus only on the size of the overall cascade. similarly, khosla et al. [ ] utilized social cues to determine the popularity of an image on flickr. while independent cascade (ic) based embedding models [ ] , [ ] led the initial work in supervised-learning based microscopic cascade prediction, they failed to capture the cascade's temporal history (either directly or indirectly). meanwhile, yang et al. [ ] presented a neural diffusion model for microscopic prediction, which employs a recurrent neural architecture to capture the history of the cascade. these models focus on predicting the next user in the cascade from a host of potential candidates. in this regard, topolstm [ ] considers only the previously seen nodes in any cascade as the next candidates, without using timestamps as a feature. this approximation works well under limited availability of network information and in the absence of cascade metadata. meanwhile, forest [ ] considers all the users in the global graph (irrespective of one-hop) as potential users, employing a time-window based approach. the work by wang et al. [ ] lies midway between topolstm and forest, in that it does not consider any external global graph as input, but employs a temporal, two-level attention mechanism to predict the next node in the cascade. zhou et al. [ ] compiled a detailed outline of recent advances in cascade prediction. compared to the models discussed above for microscopic cascade prediction, which aim to answer who will be the next participant in the cascade, our work aims to determine whether a follower of a user will retweet (participate in the cascade) or not. this converts our use case into a binary classification problem and adds negative sampling (in the form of inactive nodes), taking the proposed model closer to the real-world scenario consisting of active and passive social media users.

[table i (excerpt): the probability of u_i retweeting (static vs. j-th interval); x_t, x_n: feature tensors for the tweet and the news; x_{t,n}: output from the exogenous attention.]
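matching the x_t, x_n, x_{t,n} notation above, a minimal pytorch sketch of scaled dot-product attention over news features is shown below; the shapes and the omission of learned query/key/value projections are illustrative assumptions, not retina's exact architecture.

```python
import torch
import torch.nn.functional as F

def exogenous_attention(x_t, x_n):
    """x_t: (batch, d) tweet features; x_n: (batch, m, d) features of m
    contemporary news articles. Returns x_tn: (batch, d), a news summary
    weighted by relevance to the tweet."""
    d = x_t.size(-1)
    q = x_t.unsqueeze(1)                           # (batch, 1, d) query
    scores = q @ x_n.transpose(1, 2) / d ** 0.5    # (batch, 1, m)
    attn = F.softmax(scores, dim=-1)               # attention weights
    return (attn @ x_n).squeeze(1)                 # (batch, d)
```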
from employing world news data for enhancing language models [ ] to boosting the impact of online advertisement campaigns [ ] , exogenous influence has been successfully applied in a wide variety of tasks. concerning social media discourse, both de et al. [ ] in opinion mining and dutta et al. [ ] in chatter prediction corroborated the superiority of models that consider exogenous signals. since our data on twitter was collected based on trending indian hashtags, it becomes crucial to model exogenous signals, some of which may have triggered a trend in the first place. while a one-to-one mapping of news keywords to trending keywords is challenging to obtain, we collate the most recent (time-window) news w.r.t. a source tweet as our ground truth. to our knowledge, this is the first retweet prediction model to consider external influence. an information network of twitter can be defined as a directed graph g = (u, e), where every user corresponds to a unique node u_i ∈ u, and there exists an ordered pair (u_i, u_j) ∈ e if and only if the user corresponding to u_j follows user u_i. (table i summarizes important notations and denotations.) typically, the visible information network of twitter does not associate the follow relation with any further attributes; therefore, any two edges in e are indistinguishable from each other. we associate unit weight with every e ∈ e. every user in the network acts as an agent of content generation (tweeting) and diffusion (retweeting). for every user u_i at time t, we associate an activity history h_{i,t}. the information received by user u_i has three different sources: (a) peer signals (s_i^p): the information network g governs the flow of information from node to node, such that any tweet posted by u_i is visible to every user u_j if (u_i, u_j) ∈ e; (b) non-peer endogenous signals (s^en): trending hashtags, promoted contents, etc. that show up on the user's feed even in the absence of a peer connection; (c) exogenous signals (s^ex): apart from the twitter feed, every user interacts with external world events directly (as a participant) or indirectly (via news, blogs, etc.). hate generation. the problem of modeling hate generation can be formulated as assigning to each user a probability that signifies their likelihood to post a hateful tweet. with our hypothesis of hateful behavior being a topic-dependent phenomenon, we formalize the modeling problem as learning a parametric function p(u_i|t) = f_θ(h_{i,t}, s^en, s^ex) over d-dimensional input features (eq. ), where t is a given topic, t_0 is the instance up to which we obtain the observable history of u_i, d is the dimensionality of the input feature space, and θ is the set of learnable parameters. though ideally p(u_i|t) should depend on s_i^p as well, the complete follower network of twitter remains mostly unavailable due to account settings, privacy constraints, inefficient crawling, etc. hate diffusion. as already stated, we characterize diffusion as the dynamic process of retweeting in our context. given a tweet τ(t_0) posted by some user u_i, we formulate the problem as predicting the potential retweeters within the interval [t_0, t_0 + ∆t]. assuming the probability density of a user u_j retweeting τ at time t to be p(t), the retweet prediction problem translates to learning another parametric function of the available signals (eq. ), the general form of a parametric equation describing retweet prediction. in our setting, the signal components s_j^p, h_{j,t}, and the features representing the tweet τ incorporate the knowledge of hatefulness.
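to make the formulation concrete, the following is a minimal sketch (ours, not the authors' released code) of the information network and the peer signal using networkx; all variable and function names are hypothetical illustrations.

```python
import networkx as nx

# directed information network: an edge (u_i, u_j) exists iff u_j follows u_i,
# so content posted by u_i is visible to u_j; every edge carries unit weight.
g = nx.DiGraph()
g.add_edge("u1", "u2", weight=1.0)
g.add_edge("u2", "u3", weight=1.0)

def peer_sources(graph, user):
    """sources of the peer signal s^p for `user`: the users whose tweets
    reach `user` through the follow relation (the in-neighbors of `user`)."""
    return list(graph.predecessors(user))

# activity history h_{i,t}: everything user i posted or retweeted up to time t
history = {"u1": [("2019-11-02T10:00:00", "tweet", "example text")]}

print(peer_sources(g, "u3"))  # -> ['u2']
```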
henceforth, we call τ the root tweet and u_i the root user. it should be noted that the features representing the peer, non-peer endogenous, and exogenous signals in eqs. and may differ due to the difference in problem setting. beyond organic diffusion. the task of identifying potential retweeters of a post on twitter is not straightforward. in retrospect, the event of a user retweeting a tweet implies that the user must have been an audience of the tweet at some point in time (similar to 'susceptible' nodes of contagion spread in the sir/sis models [ ] , [ ] ). for any user, if at least one of his/her followees engages with the retweet cascade, then the subject user becomes susceptible. that is, in an organic diffusion, between any two users u_i, u_j there exists a finite path u_i, u_{i+1}, . . . , u_j in g such that each user (except u_i) in this path is a retweeter of the tweet by u_i. however, due to account privacy, etc., one or more nodes within this path may not be visible. moreover, content promoted by twitter, trending topics, and content searched for by users independently may diffuse alongside the organic diffusion path. searching for such retweeters is impossible without explicit knowledge of these phenomena. hence, we primarily restrict our retweet prediction to organic diffusion, though we experiment with retweeters not in the visibly organic diffusion cascade to see how our models handle such cases. to realize eq. , we signify topics as individual hashtags. we rely purely on manually engineered features for this task so that a rigorous ablation study and analysis produce explainable knowledge regarding this novel problem. the extracted features instantiate different input components of f in eq. . we formulate this task in a static manner, i.e., assuming that we are predicting at an instance t_0, we want to predict the probability of the user posting a hateful tweet within [t_0, ∞). while training and evaluating, we set t_0 to be right before the actual tweeting time of the user. the activity history of user u_i, signified by h_{i,t}, is substantiated by the following features (a sketch of how some of them could be computed follows this list): • we use unigram and bigram features weighted by tf-idf values from the most recent tweets posted by u_i to capture their recent topical interest. to reduce the dimensionality of the feature space, we keep the top features sorted by their idf values. • to capture the history of hate generation by u_i, we compute two different features from her most recent tweets: (i) the ratio of hateful vs. non-hate tweets, and (ii) a hate lexicon vector hl = {h_i | h_i is a non-negative integer, i = 1, . . . , |h|}, where h is a dictionary of hate words, and h_i is the frequency of the i-th lexicon entry from h in the tweet history. • users who receive more attention from fellow users for hate propagation are more likely to generate hate. therefore, we take the ratio of retweets of previous hateful tweets to non-hateful ones by u_i. we also take the ratio of the total number of retweets on hateful and non-hateful tweets of u_i. • follower count and date of account creation of u_i. • number of topics (hashtags) u_i has tweeted on up to t_0. we compute doc2vec [ ] representations of the tweets, along with the hashtags present in them as individual tokens. we then compute the average cosine similarity between the user's recent tweets and the word vector representation of the hashtag; this serves as the topical relatedness of the user towards the given hashtag.
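before moving on to the endogenous and exogenous features below, here is a minimal sketch of two of the history-based features just listed; the helper names are hypothetical, and the tweet-level hate labels are assumed to come from the fine-tuned classifier described later.

```python
from collections import Counter

def hate_lexicon_vector(recent_tweets, lexicon):
    """hl vector: frequency of each lexicon entry across the user's recent
    tweet history (single-token matching for brevity; the real lexicon
    also contains multi-word phrases)."""
    counts = Counter()
    for text in recent_tweets:
        for token in text.lower().split():
            counts[token] += 1
    return [counts[term] for term in lexicon]

def hate_ratio(recent_tweets, is_hateful):
    """ratio of hateful to non-hateful tweets in the recent history;
    `is_hateful` is any tweet-level hate classifier (assumed given)."""
    hateful = sum(1 for t in recent_tweets if is_hateful(t))
    return hateful / max(len(recent_tweets) - hateful, 1)  # avoid zero division

lexicon = ["harami", "jhalla", "haathi"]  # toy subset of the curated lexicon
tweets = ["harami comment example", "an ordinary tweet"]
print(hate_lexicon_vector(tweets, lexicon))  # -> [1, 0, 0]
```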
to incorporate the information of trending topics over twitter, we supply the model with a binary vector representing the top trending hashtags for the day the tweet is posted. we compute the average tf-idf vector for the most recent news headlines from our corpus posted before the time of the tweet. again, we select the top features. using the above features, we implement six different classification models (and their variants). details of the models are provided in section vi-c. v. retweet prediction. while realizing eq. for retweeter prediction, we formulate the task in two different settings: the static retweeter prediction task, where t_0 is fixed and ∆t is ∞ (i.e., all the retweeters irrespective of their retweet time), and the dynamic retweeter prediction task, where we predict on successive time intervals. for these tasks, we rely on features both designed manually and extracted in an unsupervised/self-supervised manner. for the task of retweet prediction, we extract features representing the root tweet itself, as well as the signals of eq. corresponding to each user u_i (for which we predict the possibility of retweeting). henceforth, we indicate the root user by u_0. here, we incorporate s_i^p using two different features: the shortest path length from u_0 to u_i in g, and the number of times u_i has retweeted tweets by u_0. all the features representing h_{i,t} and s^en remain the same as described in section iv. we incorporate two sets of features representing the root tweet τ: the hate lexicon vector similar to section iv-a, and the top features. we varied the size of features from to , and the best combination was found to be . for the retweet prediction task, we incorporate the exogenous signal in two different ways. to implement the attention mechanism of retina, we use doc2vec representations of the news articles as well as the root tweet. for the rest of the models, we use the same feature set as in section iv-d. guided by eq. , retina exploits the features described in section v-a for both static and dynamic prediction of retweeters. exogenous attention. to incorporate external information as an assisting signal to model diffusion, we use a variation of scaled dot product attention [ ] in retina (see figure ; its panels depict static prediction of retweeters, where, to predict whether u_j will retweet, the input feature x_{u_j} is normalized and passed through a feed-forward layer, concatenated with x_{t,n}, and another feed-forward layer is applied to predict the retweeting probability p_{u_j}, and dynamic retweet prediction, where retina predicts the user retweet probability for consecutive time intervals and, instead of the last feed-forward layer used in the static prediction, uses a gru layer). given the feature representation of the tweet x_t and the news feature sequence x_n = {x_n^1, x_n^2, . . . , x_n^k}, we compute three tensors q_t, k_n, and v_n as q_t = x_t ·|(−1,0) w_q, k_n = x_n ·|(−1,0) w_k, and v_n = x_n ·|(−1,0) w_v, where w_q, w_k, and w_v are learnable parameter kernels (we denote them as belonging to the query, key, and value dense layers, respectively, in figure ). the operation (·)|(−1,0)(·) signifies tensor contraction according to the einstein summation convention along the specified axes; here, (−1, 0) signifies the last and first axis of the first and second tensor, respectively. each of w_q, w_k, and w_v is a two-dimensional tensor with hdim columns (last axis). next, we compute the attention weight tensor a between the tweet and news sequence as a = softmax(q_t k_n^⊤), where softmax(x[. . . , i, j]) = e^(x[...,i,j]) / Σ_j e^(x[...,i,j]). further, to avoid saturation of the softmax activation, we scale each element of a by hdim^(−0.5) [ ] . the attention weight is then used to produce the final encoder feature representation x_{t,n} by computing the weighted average of v_n, i.e., x_{t,n} = a · v_n. retina is expected to aggregate the exogenous signal exposed by the sequence of news inputs, according to the feature representation of the tweet, into x_{t,n}, using the operations mentioned in eqs. - , via tuning the parameter kernels.
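the attention encoder just derived, together with the weighted binary cross-entropy used for training (see the cost/loss discussion below), can be sketched as follows. this is our illustrative reconstruction under stated assumptions (doc2vec inputs, dense query/key/value kernels of width hdim, scaled dot-product attention), not the authors' released implementation; the value of hdim here is a placeholder.

```python
import tensorflow as tf

HDIM = 32  # placeholder width; the paper sets hdim and hidden sizes to one value

class ExogenousAttention(tf.keras.layers.Layer):
    """scaled dot-product attention between one tweet vector and a news sequence."""
    def __init__(self, hdim=HDIM):
        super().__init__()
        self.wq = tf.keras.layers.Dense(hdim)  # query kernel w_q (applied to tweet)
        self.wk = tf.keras.layers.Dense(hdim)  # key kernel w_k (applied to news)
        self.wv = tf.keras.layers.Dense(hdim)  # value kernel w_v (applied to news)

    def call(self, x_tweet, x_news):
        q = self.wq(x_tweet)                        # (batch, 1, hdim)
        k = self.wk(x_news)                         # (batch, n_news, hdim)
        v = self.wv(x_news)                         # (batch, n_news, hdim)
        scores = tf.matmul(q, k, transpose_b=True)  # q k^T: (batch, 1, n_news)
        a = tf.nn.softmax(scores * HDIM ** -0.5)    # scaled attention weights
        return tf.matmul(a, v)                      # x_{t,n}: weighted average of v

def weighted_bce(w_pos):
    """binary cross-entropy with weight w on the positive (retweet) class."""
    def loss(t, p):
        p = tf.clip_by_value(p, 1e-7, 1.0 - 1e-7)
        return -tf.reduce_mean(w_pos * t * tf.math.log(p)
                               + (1.0 - t) * tf.math.log(1.0 - p))
    return loss
```

in the static head, the normalized user feature vector would be passed through a dense layer, concatenated with this x_{t,n}, and mapped to a sigmoid output; the dynamic head would replace the final dense layer with a gru.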
final prediction. with s^ex being represented by the output of the attention framework, we incorporate the features discussed in section v-a in retina to subsume the rest of the signals (see eq. ). for the two separate modes of retweeter prediction (i.e., static and dynamic), we implement two different variations of retina. for the static prediction of retweeters, retina predicts the probability of each of the users u_1, u_2, . . . , u_n to retweet the given tweet with no temporal ordering (see figure (b)). the feature vector x_{u_i} corresponding to user u_i is first normalized and mapped to an intermediate representation using a feed-forward layer. it is then concatenated with the output of the exogenous attention component, x_{t,n}, and finally, another feed-forward layer with sigmoid nonlinearity is applied to compute the probability p_{u_i}. as opposed to the static case, in the dynamic setting retina predicts the probability of every user u_i to retweet within a time interval [t_0 + ∆t_i, t_0 + ∆t_{i+1}), with t_0 being the time the tweet was published and ∆t_0 = 0. to capture the temporal dependency between predictions in successive intervals, we replace the last feed-forward layer with a gated recurrent unit (gru), as shown in figure (c). we experimented with other recurrent architectures as well; performance degraded with a simple rnn, and there was no gain with an lstm. cost/loss function. in both settings, the task translates to a binary classification problem of deciding whether a given user will retweet or not. therefore, we use the standard binary cross-entropy loss l to train retina: l = −(w · t log p + (1 − t) log(1 − p)), where t is the ground truth, p is the predicted probability (p_{u_i} in the static and p_{u_i}^j in the dynamic setting), and w is the weight given to the positive samples to deal with class imbalance. we initially started collecting data based on topics, which led to a tweet corpus spanning multiple years. to narrow down our time frame and ease the mapping of tweets to news, we restricted our time span from - - to - - and made use of trending hashtags. using twitter's official api, we tracked and crawled trending hashtags each day within this duration. overall, we obtained , tweets from , users. we also crawled the retweeters for each tweet along with the timestamps. table ii describes the hashtag-wise detailed statistics of the data. to build the information network, we collected the followers of each user up to a depth of , resulting in a total of , , unique users in our dataset. we also collected the activity history of the users, resulting in a total of , , tweets in our dataset. one should note that the lack of a wholesome dataset (containing textual, temporal, and network signals all in one) is the primary reason why we decided to collect our own dataset in the first place. we also crawled the online news articles published within this span using the news-please crawler [ ] . we managed to collect a total of , news articles for this period. after filtering for language, title, and date, we were left with , processed items.
these headlines were used as the source of the exogenous signal. we employ three professional annotators who have experience in analyzing online hate speech to annotate the tweets manually. all of these annotators belong to an age group of - years and are active on twitter. as contextual knowledge of real-world events plays a crucial role in identifying hate speech, we ensure that the annotators are well aware of the events related to the hashtags and topics. annotators were asked to follow twitter's policy as a guideline for identifying hateful behavior. we annotated a total of , tweets with an inter-annotator agreement of . krippendorff's α. the low inter-annotator agreement is on par with most hate speech annotation efforts to date, pointing out the hardness of the task even for human subjects. this further strengthens the need for contextual knowledge as well as for exploiting beyond-the-text dynamics. we select the final tags based on majority voting. based on this gold-standard annotated data, we train three different hate speech classifiers based on the designs given by davidson et al. [ ] (dubbed the davidson model), waseem and hovy [ ] , and pinkesh et al. [ ] . with an auc score of . and macro-f of . , the davidson model emerges as the best performing one. when the existing pre-trained davidson model was tested on our annotated dataset, it achieved . auc and . macro-f . this highlights both the limitations of existing hate detection models in capturing newer contexts, and the importance of manual annotation and fine-tuning. we use the fine-tuned model to annotate the rest of the tweets in our dataset (the % of hateful tweets for each hashtag is reported in table ii). we use the machine-annotated tags for the features and training labels in our proposed models only, while the hate generation models are tested solely on gold-standard data. along with the manual annotation and the trained hate detection model, we use a dictionary of hate lexicons proposed in [ ] . it contains a total of words/phrases signaling the possible existence of hatefulness in a tweet. examples of slur terms used in the lexicon include words such as harami (bastard), jhalla (faggot), and haathi (elephant/fat). using these terms is derogatory and a direct offense. in addition, the lexicon has some colloquial terms such as mulla (muslim), bakar (gossip), aktakvadi (terrorist), and jamai (son-in-law), which may carry a hateful sentiment depending on the context in which they are used. to experiment on our hate generation prediction task, we use a total of , tweets (which have at least news items mapping to them from the time of posting) coming from , users to construct the ground truth. with an : train-test split, there are hateful tweets among , in the training data, and out of , in the testing data. to deal with the severe class imbalance of the dataset, we use both upsampling of positive samples and downsampling of negative samples. with all the features discussed in section iv, the full size of the feature vector is , . we experimented with all our proposed models on this full set of features, as well as with dimensionality reduction techniques applied to it. we use principal component analysis (pca) with the number of components set to . also, we conduct experiments selecting the k best features (k = ) using mutual information. we implement a total of six different classifiers using support vector machines (with linear and rbf kernels), logistic regression, decision tree, adaboost, and xgboost [ ] .
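a sketch of the feature-reduction and classifier grid just described, using scikit-learn; the component count, k, and classifier parameters are placeholders, since the exact settings are those reported in table iii.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "svm_linear": SVC(kernel="linear"),
    "svm_rbf": SVC(kernel="rbf"),
    "logreg": LogisticRegression(max_iter=1000),
    "dtree": DecisionTreeClassifier(),
    "adaboost": AdaBoostClassifier(),
    # xgboost.XGBClassifier() would complete the six models
}

def make_pipeline(reducer, clf_name):
    """compose one (feature reduction, classifier) combination to fit/score."""
    steps = []
    if reducer == "pca":
        steps.append(("pca", PCA(n_components=50)))  # placeholder component count
    elif reducer == "kbest":
        steps.append(("kbest", SelectKBest(mutual_info_classif, k=100)))  # placeholder k
    steps.append(("clf", classifiers[clf_name]))
    return Pipeline(steps)

model = make_pipeline("pca", "dtree")  # e.g., pca + decision tree
```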
parameter settings for each of these are reported in table iii. all of the models, pca, and feature selection are implemented using scikit-learn. the activity of retweeting, too, shows a skewed pattern similar to hate speech generation. while the maximum number of retweets for a single tweet is in our dataset, the average remains . . we use only those tweets which have more than one retweet and at least news items mapping to them from the time of posting. with an : train-test split, this results in a total of , and samples for training and testing, respectively. for all the doc2vec-generated feature vectors related to tweets and news headlines, we set the dimensionality to and , respectively. for retina, we set the parameter hdim and all the intermediate hidden sizes for the rest of the feed-forward (except the last one generating logits) and recurrent layers to (see section v-b). hyperparameter tuning of retina. for both settings (i.e., static and dynamic prediction of retweeters), we used mini-batch training of retina, with both the adam and sgd optimizers. we varied the batch size within , and , with the best results for a batch size of for the static mode and for the dynamic mode. we also varied the learning rates within a range of − to − , and chose the best one, with learning rate − , using the sgd optimizer for the dynamic model. the static counterpart produced the best results with the adam optimizer [ ] using default parameters. to deal with the class imbalance, we set the parameter w in eq. as w = λ(log c − log c_+), where c and c_+ are the counts of total and positive samples, respectively, in the training dataset, and λ is a balancing constant which we vary from to with steps of . . we found the best configurations with λ = . and λ = . for the static and dynamic modes, respectively. https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam in the absence of external baselines for predicting hate generation probability due to the problem's novelty, we explicitly rely on ablation analyses of the models proposed for this task. for retweet dynamics prediction, we implement external baselines and two ablation variants of retina. since information diffusion is a vast subject, we approach it from two perspectives: one is the set of rudimentary baselines (sir, general threshold), and the other is the set of recently proposed neural models. sir [ ]: the susceptible-infectious-recovered (removed) model is one of the earliest predictive models for contagion spread. two parameters govern the model, the transmission rate and the recovery rate, which dictate the spread of the contagion (retweeting in our case) along a social/information network. threshold model [ ]: this model assumes that each node has a threshold inertia chosen uniformly at random from the interval [0, 1]. a node becomes active if the weighted sum of its active neighbors exceeds this threshold. using the same feature set as described in section v-a, we employ four classifiers: logistic regression, decision tree, linear svc, and random forest (with estimators). all of these models are used for the static mode of retweet prediction only. features representing exogenous signals are engineered in the same way as described in section iv-d. to overcome the feature engineering step involving combinations of topical, contextual, network, and user-level features, neural methods for information diffusion have gained popularity.
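before turning to the neural baselines, here are minimal sketches of the two rudimentary baselines just described, run on the follower graph; the transmission/recovery rates and thresholds are illustrative values, not the tuned ones.

```python
import random
import networkx as nx

def sir_step(g, state, beta=0.1, gamma=0.05):
    """one sir step: susceptible nodes ('s') become infected ('i', i.e.,
    retweet) with probability beta per infected in-neighbor; infected nodes
    recover ('r') with probability gamma and stop spreading."""
    nxt = dict(state)
    for node in g:
        if state[node] == "s":
            exposed = sum(state[n] == "i" for n in g.predecessors(node))
            if exposed and random.random() < 1 - (1 - beta) ** exposed:
                nxt[node] = "i"
        elif state[node] == "i" and random.random() < gamma:
            nxt[node] = "r"
    return nxt

def threshold_step(g, active, thresholds):
    """general threshold model: a node activates once the weighted sum of its
    active in-neighbors exceeds its threshold drawn uniformly from [0, 1]."""
    newly = {n for n in g if n not in active and
             sum(g[u][n].get("weight", 1.0)
                 for u in g.predecessors(n) if u in active) >= thresholds[n]}
    return active | newly
```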
while these methods are all focused on determining only the next set of users, they are still important for measuring the diffusion performance of retina. topolstm [ ]: this is one of the initial works to consider recurrent models for generating the next-user prediction probabilities. the model converts the cascades into dynamic dags (capturing the temporal signals via node ordering). the sender-receiver based rnn model captures a combination of an active node's static score (based on the history of the cascade) and a dynamic score (capturing future propagation tendencies). forest [ ]: it aims to be a unified model, performing the microscopic and the macroscopic cascade predictions by combining reinforcement learning (for macroscopic) with the recurrent model (for microscopic). by considering the complete global graph, it performs graph sampling to obtain the structural context of a node as an aggregate of the structural context of its one- or two-hop neighbors. in addition, it factors in the temporal information via the last m seen nodes in the cascade. hidan [ ]: it does not explicitly consider a global graph as input. any information loss due to the absence of a global graph is compensated by temporal information, utilized in the form of ordered time differences of node infection. since hidan does not employ a global graph, it too (like topolstm) uses the set of all seen nodes in the cascade as candidate nodes for prediction. we exercise extensive feature ablation to examine the relative importance of the different feature sets. among the six different algorithms we implement for this task, along with the different sampling and feature reduction methods, we choose the best performing model for this ablation study. following eq. , we remove the feature sets representing h_{i,t}, s^ex, s^en, and t (see section iv for the corresponding features) in each trial and evaluate the performance. to investigate the effectiveness of the exogenous attention mechanism for predicting potential retweeters, we remove this component and experiment in both the static and the dynamic settings of retina. evaluation of classification models on highly imbalanced data needs careful precautions to avoid classification bias. we use multiple evaluation metrics for both tasks: macro-averaged f score (macro-f ), area under the receiver operating characteristic curve (auc), and binary accuracy (acc). as the neural baselines tackle the problem of retweet prediction as a ranking task, we adapt the evaluation of retina to make it comparable with these baselines. we rank the predicted probability scores (p_{u_i} and p_{u_i}^j in the static and dynamic settings, respectively) and compute mean average precision at the top-k positions (map@k) and binary hits at the top-k positions (hits@k). table iv presents the performances of all the models in predicting the probability of a given user posting a hateful tweet using a given hashtag. it is evident from the results that all six models suffer from the sharp bias in the data; without any class-specific sampling, they tend to lean towards the dominant class (non-hate in this case) and result in low macro-f and auc compared to very high binary accuracy. svm with the rbf kernel outperforms the rest when no upsampling or downsampling is done, with a macro-f of . (auc . ). effects of sampling. downsampling the dominant class results in a substantial leap in the performance of all the models. the effect is almost uniform over all the classifiers except xgboost.
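for reference, the ranking metrics defined above (map@k and hits@k) can be computed as follows; this simple implementation assumes one ranked candidate list and one ground-truth retweeter set per test cascade.

```python
def hits_at_k(ranked_users, true_retweeters, k=10):
    """1 if any true retweeter appears among the top-k ranked candidates."""
    return int(any(u in true_retweeters for u in ranked_users[:k]))

def avg_precision_at_k(ranked_users, true_retweeters, k=10):
    """average precision over the top-k positions of a single ranking."""
    hits, score = 0, 0.0
    for rank, user in enumerate(ranked_users[:k], start=1):
        if user in true_retweeters:
            hits += 1
            score += hits / rank
    denom = min(len(true_retweeters), k)
    return score / denom if denom else 0.0

def map_at_k(rankings, truths, k=10):
    """mean average precision at k over all test cascades."""
    return sum(avg_precision_at_k(r, t, k)
               for r, t in zip(rankings, truths)) / len(rankings)
```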
in terms of macro-f , the decision tree sets the best overall performance for this task at . . however, the rest of the models lie in a very close range of . - . macro-f . while the downsampling performance gains are explicitly evident, the effects of upsampling the dominated (minority) class are less intuitive. for all the models, upsampling degrades macro-f to a large extent, with values in the range . - . . however, the auc scores improve by a significant margin for all the models with upsampling, except the decision tree. adaboost achieves the highest auc of . with upsampling. dimensionality reduction of the feature space. our experiments with pca and k-best feature selection by mutual information show a heterogeneous effect on different models. while only the svm with linear kernel shows some improvement with pca over the original feature set, the rest of the models observe considerable degradation of macro-f . however, the svm with rbf kernel achieves the best auc of . with pca. with the top-k best features, the overall gain in performance is not very significant, except for the decision tree. we also experiment with combinations of different sampling and feature reduction methods, but none of them achieve a significant gain in performance. ablation analysis. we choose the decision tree with downsampling of the dominant class as our best performing model (in terms of macro-f score) and perform an ablation analysis. table v presents the performance of the model with each feature group removed in isolation, along with the full model. evidently, for predicting hate generation, the features representing exogenous signals and user activity history are the most important. removal of the feature vector signifying trending hashtags, which represents the endogenous signal in our case, also worsens the performance to a significant degree. table vi summarizes the performances of the competing models for the retweet prediction task. here again, binary accuracy presents a very skewed picture of the performance due to class imbalance. while retina in the dynamic setting outperforms the rest of the models by a significant margin on all the evaluation metrics, topolstm emerges as the best baseline in terms of both map@ and hits@ . in figure , we compare retina in the static and dynamic settings with topolstm in terms of hits@k for different values of k. for smaller values of k, retina largely outperforms topolstm in both the dynamic and static settings. however, with increasing k values, the three models converge to very similar performances. figure provides an important insight regarding the retweet diffusion modeling power of our proposed framework retina. our best performing baseline, topolstm, largely fails to capture the different diffusion dynamics of hate speech in contrast to non-hate (map@ . for non-hate vs. . for hate). on the other hand, retina achieves map@ scores of . and . in the dynamic setting ( . and . in the static setting) when predicting the retweet dynamics for hate and non-hate contents, respectively. one can readily infer that our well-curated feature design, incorporating hate signals along with the endogenous, exogenous, and topic-oriented influences, empowers retina with this superior expressive power. among the traditional baselines, logistic regression gives a macro f -score comparable to the best static model; however, owing to memory limitations, it could not be trained on a news set larger than items per tweet. similarly, svm-based models could not incorporate even news items per tweet (memory limitations).
meanwhile, an ablation on the news size gave the best results at items for both the static and dynamic models. we find that the contribution of the exogenous signal (i.e., the news items) plays a vital role in retweet prediction, much like our findings in table v for predicting hate generation. with the exogenous attention component removed in the static as well as the dynamic setting (retina-s† and retina-d†, respectively, in table vi), performance drops by a significant margin. however, the performance drop is more significant in retina-d† for ranking users according to retweet probability (map@k and hits@k). the impact of exogenous signals on macro-f is more visible in the traditional models. to observe the performance of retina more closely in the dynamic setting, we analyse its performance over successive prediction intervals. figure shows the ratio between the predicted and the actual number of retweets arriving at different intervals. as is clearly evident, the model tends to be nearly perfect at predicting new growth with increasing time. the high error rate at the initial stage is possibly due to the fact that the retweet dynamics remain uncertain at first and become more predictable as an increasing number of people participate over time. a similar trend is observed when we compare the performance of retina in the static setting with varying sizes of the actual retweet cascades. figure shows that retina-s performs better with increasing size of the cascade. in addition, we also vary the number of tweets posted by a user. figure shows that the performance of retina in both the static and dynamic settings increases as the history size varies from to tweets; afterward, it either drops or remains the same. our attempt to model the genesis and propagation of hate on twitter brings forth various limitations posed by the problem itself as well as by our modeling approaches. we explicitly cover these areas to lay the grounds for future developments. we have considered the propagation of hateful behavior via retweet cascades only. in practice, there are multiple other forms of diffusion present, and retweets only constitute a subset of the full spectrum. users susceptible to hateful information often propagate it via new tweets. hateful tweets are often counteracted with hate speech via reply cascades. even if not retweeted, replied to, or immediately influencing the generation of newer tweets, a specific hateful tweet can readily set the audience into a hateful state, which may later develop repercussions. identification of such influences would need intricate natural language processing techniques, adaptable to the noisy nature of twitter data. as already discussed, online hate speech is vastly dynamic in nature, making it difficult to identify. depending on the topic, time, cultural demography, target group, etc., the signals of hate speech change. thus, models like retina, which explicitly use hate-based features to predict popularity, need an updated signaling strategy. however, this drawback is only evident if one intends to perceive such endeavors as a simple task of retweet prediction only. we, on the other hand, focus on the retweet dynamics of hateful vs. non-hateful contents, which presumes the signals of hateful behavior to be well defined. the majority of the existing studies on online hate speech have focused on hate speech detection, with very few seeking to analyze the diffusion dynamics of hate on large-scale information networks.
we bring forth the very first attempt to predict the initiation and spread of hate speech on twitter. analyzing a large twitter dataset that we crawled and manually annotated for hate speech, we identified multiple key factors (exogenous information, topic-affinity of the user, etc.) that govern the dissemination of hate. based on these empirical observations, we developed multiple supervised models powered by a rich feature representation to predict the probability of any given user tweeting something hateful. we proposed retina, a neural framework exploiting extra-twitter information (in terms of news) with an attention mechanism for predicting potential retweeters for any given tweet. comparison with multiple state-of-the-art models for retweeter prediction revealed the superiority of retina in general, as well as for predicting the spread of hateful content in particular. with the specific focus of our work being the generation and diffusion of hateful content, our proposed models rely on some general textual/network-based features as well as on features signaling hate speech. a possible future direction is to replace hate speech with any other targeted phenomenon such as fraudulent or abusive behavior, or specific categories of hate speech. however, these hate signals require manual intervention when updating the lexicons or adding topical hate tweets to retrain the hate detection model. while the features of the end-to-end model appear to be highly engineered, individual modules take care of their respective preprocessing. in this study, the mode of hate speech spread we primarily focused on is retweeting, and therefore we restrict ourselves to textual hate. however, spreading hateful content packaged as an image, a meme, or some invented slang is a new normal of this age and leaves space for future studies.
references:
report of the independent international fact-finding mission on myanmar
fanning the flames of hate: social media and hate crime
a survey on automatic detection of hate speech in text
measuring the reliability of hate speech annotations: the case of the european refugee crisis
spread of hate speech in online social media
deep exogenous and endogenous influence combination for social chatter intensity prediction
information diffusion and external influence in networks
hateful symbols or hateful people? predictive features for hate speech detection on twitter
automated hate speech detection and the problem of offensive language
deep learning for hate speech detection in tweets
a hierarchically-labeled portuguese hate speech dataset
arhnet - leveraging community interaction for detection of religious hate speech in arabic
the effects of user features on twitter hate speech detection
handling imbalance issue in hate speech classification using sampling-based methods
on analyzing antisocial behaviors amid covid- pandemic
racism is a virus: anti-asian hate and counterhate in social media during the covid- crisis
mind your language: abuse and offense detection for code-switched languages
chassis: conformity meets online information diffusion
containing papers of a mathematical and physical character
deepcas: an end-to-end predictor of information cascades
deephawkes: bridging the gap between prediction and understanding of information cascades
what makes an image popular?
representation learning for information diffusion through social networks: an embedded cascade model
a novel embedding method for information diffusion prediction in social network big data
neural diffusion model for microscopic cascade prediction
topological recurrent neural network for diffusion prediction
multi-scale information diffusion prediction with reinforced recurrent networks
hierarchical diffusion attention network
a survey of information cascade analysis: models, predictions and recent advances
predicting user engagement on twitter with real-world events
ccnet: extracting high quality monolingual datasets from web crawl data
event triggered social media chatter: a new modeling framework
demarcating endogenous and exogenous opinion diffusion process on social networks
a deterministic model for gonorrhea in a nonhomogeneous population
distributed representations of sentences and documents
attention is all you need
news-please: a generic news crawler and extractor
xgboost: a scalable tree boosting system
adam: a method for stochastic optimization
maximizing the spread of influence through a social network
key: cord- -pxzsph v authors: tekumalla, ramya; banda, juan m. title: social media mining toolkit (smmt) date: - - journal: genomics inform doi: . /gi. . . .e sha: doc_id: cord_uid: pxzsph v there has been a dramatic increase in the popularity of utilizing social media data for research purposes within the biomedical community. in pubmed alone, there have been nearly , publication entries since that deal with analyzing social media data from twitter and reddit. however, the vast majority of those works do not share their code or data for replicating their studies. with minimal exceptions, the few that do place the burden on the researcher to figure out how to fetch the data, how to best format their data, and how to create automatic and manual annotations on the acquired data. in order to address this pressing issue, we introduce the social media mining toolkit (smmt), a suite of tools aimed to encapsulate the cumbersome details of acquiring, preprocessing, annotating, and standardizing social media data. the purpose of our toolkit is for researchers to focus on answering research questions, and not the technical aspects of using social media data. by using a standard toolkit, researchers will be able to acquire, use, and release data in a consistent way that is transparent for everybody using the toolkit, hence simplifying research reproducibility and accessibility in the social media domain. keywords: data mining, information storage and retrieval, machine learning, social media. availability: all code described in this paper is fully available at: https://github.com/thepanacealab/smmt. only in the last six years, there has been a great influx of research works using twitter and reddit data; nearly , papers are found in pubmed [ ] . these works encompass countless applications, such as the usage of opioids [ ] , the flu [ ] , eating disorder network analyses [ ] , depression symptom detection [ ] , and diabetes interventions [ ] , etc. while all the listed studies use data from twitter and reddit, we can only find code available for one of them. additionally, the data acquisition methodology differs in each study and is seldom reported, although it is a crucial step towards the reproducibility of any of their analyses. when it comes to using twitter data for drug identification and pharmacovigilance tasks, authors of works like [ ] [ ] [ ] have been consistently releasing publicly available datasets, software tools, and complete natural language processing (nlp) systems with their works.
in an attempt to shift the biomedical community towards better practices for research transparency and reproducibility, we introduce the social media mining toolkit (smmt), a suite of tools aimed to encapsulate the cumbersome details of acquiring, preprocessing, annotating, and standardizing social media data. the need for a toolkit like smmt arose from our work using twitter data for the characterization of disease transmission during natural disasters [ ] and for mining large-scale repositories for drug-usage-related tweets for pharmacovigilance purposes [ ] . we originally wanted to use other researchers' tools and, surprisingly, we found very little code available, with the majority outdated and non-functioning.
once approved, users will be provided a set of api credential keys, more information can be found in [ ] . fig. shows all current components of smmt. in the following sections we provide additional details of each category of available tools. the tools in this category are used to gather data from social media sites, namely twitter for this initial release of smmt. the most common way of acquiring tweets is to use the twitter streaming api [ ] . our toolkit provides two separate utilities to capture streaming data, one will gather all available tweets and will continue running until terminated (streaming.py), and the other will take a list of search keywords and number of desired tweets and will pull those from the current tweet stream (search_generic.py). details on how to use these utilities can be found on the read.me file. the most common and permitted way of sharing twitter data publicly is by only sharing the tweet id number. this number then needs to be 'hydrated' , which means that the twitter api needs to be used to fetch all the complete tweet and additional meta-data fields. this is a vital step for most users trying to replicate other studies or analyses. we provide a utility called get_metadata.py which reads a list of tweet ids and hydrates them automatically. one of the major drawbacks of the twitter api is the fact that unless having paid access to it, researchers cannot extract all historical tweets for any given twitter user. also, extracting all tweets from a given time range is not always easily and efficiently possible with the api. for these purposes we provide a utility called scrape. py which, once given a list of twitter handles and corresponding date ranges, will automatically scrape the twitter page and pull the tweet ids of the desired user and date range. these tweet ids then need to be 'hydrated' to be able to fully use them. after having acquired enough data for research purposes from the twitter stream, or identified and 'hydrated' a publicly available dataset, there is a need to subset the tweets and process the tweets json files to extract the fields of interest. while seemingly trivial, most biomedical researchers do not want to work with json objects, and since around % of the json fields are not populated, precise preprocessing steps need to be carried out to clean the data and render it useful in friendlier formats. smmt contains the parse_json_lite.py tool which takes a relatively small file (less than gigabyte in size) of twitter json objects and separates these objects into a tab delimited file with each json field converted to a column and each tweet into a data row. with over fields of meta-data, researchers are usually not interested in the vast majority of them. this tool can be configured to select which fields are of interest to be parsed and only process those into the tab delimited format. if the size of the tweets json objects file is larger than gigabyte, we provide an additional tool, parse_json_heavy.py, which can handle terabyte sized files sequen- terminologies as dictionaries ner annotation standardized output tially rather than reading them all in memory for speed. once all the tweets are processed into the cleaner tab delimited format, which can even be read in excel, there might be a need to further subset the tweets based on a given list of terms, or dictionary. 
after having acquired enough data for research purposes from the twitter stream, or identified and 'hydrated' a publicly available dataset, there is a need to subset the tweets and process the tweet json files to extract the fields of interest. while seemingly trivial, most biomedical researchers do not want to work with json objects, and since around % of the json fields are not populated, precise preprocessing steps need to be carried out to clean the data and render it useful in friendlier formats. smmt contains the parse_json_lite.py tool, which takes a relatively small file (less than gigabyte in size) of twitter json objects and separates these objects into a tab-delimited file with each json field converted to a column and each tweet into a data row. with over fields of metadata, researchers are usually not interested in the vast majority of them. this tool can be configured to select which fields are of interest to be parsed and to only process those into the tab-delimited format. if the size of the tweet json objects file is larger than gigabyte, we provide an additional tool, parse_json_heavy.py, which can handle terabyte-sized files sequentially, rather than reading them all in memory, for speed. once all the tweets are processed into the cleaner tab-delimited format, which can even be read in excel, there might be a need to further subset the tweets based on a given list of terms, or dictionary. for this purpose, we have included the separate_tweet_tsv.py file, which takes a term list in a format specified in the readme file of smmt and will return only the tweets that contain the provided terms. after preprocessing the acquired social media data, researchers have the capability of standardizing their tweets' text with our set of tools. taking advantage of the ontogene bio term hub [ ] and its harmonization of biomedical terminologies into a unified format, we provide a tool, create_dictionary.py, that converts their downloads into smmt-compatible dictionaries. to avoid creating a complicated and cumbersome format for our tool, we opted for simplicity and only rely on having a tab-delimited file with an identifier column and a term name column. other dictionaries that we have made available will standardize any annotations using the observational health data sciences and informatics (ohdsi) vocabulary [ ] . we are testing functionality to also convert our dictionaries to the pubdictionaries [ ] format for the next release, allowing researchers to use their functionality and online rest services. one of the most important tools of smmt is the spacy [ ] ner annotator, smmt_ner_basic.py; this tool takes the tab-delimited tweets, a dictionary file, and the name of the output file for the annotations. in order to extend the usability of our tool, we provide the resulting annotations in a traditional document/span/term format, as well as in pre-formatted outputs compatible with the brat annotation tool [ ] and with pubannotation and its viewer textae [ ] , as shown in fig. . while all the tools have their own documentation, in order to ease the adoption of the tools available in smmt, we have included an end-to-end example in the examples folder that performs the following tasks: ( ) download tweets from the twitter api stream for each of the following keywords: donald trump, coronavirus, cricket. ( ) preprocess those tweets to extract each tweet id and its text into tab-delimited files. ( ) using a google colab notebook, apply the tf-idf vectorizer to the text of the tweets to create a train and a test set, and build a multinomial naive bayes classifier to separate tweets based on their label; all details and steps of this process are outlined in the colab notebook. ( ) test the trained model on the test set, generate a confusion matrix heat-map (fig. ) of the classification task, and show the model performance metrics. the whole process of this example takes less than minutes to complete and is heavily documented so that smmt users can overcome the learning curve of acquiring and preprocessing tweets. while our example is simple in nature, users can build upon it and modify it to better suit their needs. the tools that are part of smmt allow users to simplify their research workflows and to focus on determining which data they want to use and which analyses they want to perform, rather than deciphering how to acquire the data. while most cutting-edge and near real-time research will be done by pulling tweets from the twitter api stream, there are countless datasets available for historical research, from large general-purpose databases like the internet archive's twitter stream grab dataset [ ] , which consists of data from to , to more specialized and pre-curated datasets for uses like pharmacovigilance [ ] , among others.
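dictionary-based annotation in the spirit of smmt_ner_basic.py can be sketched with spacy's phrasematcher (spacy v3 api); this is an illustration only: the real tool also emits brat- and pubannotation-compatible output, and the concept entries below are hypothetical examples.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only; no statistical model required
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching

# smmt dictionaries are tab-delimited: identifier <tab> term name
dictionary = {"C0011849": "diabetes", "C0004096": "asthma"}
for concept_id, term in dictionary.items():
    matcher.add(concept_id, [nlp.make_doc(term)])

def annotate(text):
    """return (document, span start, span end, term, concept id) tuples."""
    doc = nlp(text)
    results = []
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        results.append((text, span.start_char, span.end_char,
                        span.text, nlp.vocab.strings[match_id]))
    return results

print(annotate("new diabetes medication trial announced"))
```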
this initial version release of smmt will continue growing, with additional tools being developed for platforms like reddit, dark web forums, and other social media data sources.
references:
pubmed search: social media. bethesda: national library of medicine
opioids on twitter: a content analysis of conversations regarding prescription drugs on social media and implications for message design
social media and flu: media twitter accounts as agenda setters
analyzing big data in social media: text and network analyses of an eating disorder forum
association between social media use (twitter, instagram, facebook) and depressive symptoms: are twitter users at higher risk?
social media for health promotion in diabetes: study protocol for a participatory public health intervention design
pharmacovigilance on twitter? mining tweets for adverse drug reactions
utilizing social media data for pharmacovigilance: a review
pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features
assessing the potential impact of vector-borne disease transmission following heavy rainfall events: a mathematical framework
mining archive.org's twitter stream grab for pharmacovigilance research gold
mining social media for prescription medication abuse monitoring: a review and proposal for a data-centric framework
building machine learning and deep learning models on google cloud platform: a comprehensive guide for beginners
twitter developers standard stream parameters
a combined resource of biomedical terminology and its statistics
characterizing treatment pathways at scale using the ohdsi network
industrial-strength natural language processing in python. explosion ai
brat: a web-based tool for nlp-assisted text annotation
pubannotation: a persistent and sharable corpus and annotation repository
archive team: the twitter stream grab
we would like to acknowledge jin-dong kim, dbcls, and rois for making our participation in blah (biomedical linked annotation hackathon) possible. we would also like to thank javad rafiei asl, kevin bretonnel cohen, núria queralt rosinach, yue wang and atsuko yamaguchi for their help and input during the biomedical linked annotation hackathon. no potential conflict of interest relevant to this article was reported.
key: cord- -kr uljop authors: thelwall, mike; thelwall, saheeda title: covid- tweeting in english: gender differences date: - - journal: nan doi: nan sha: doc_id: cord_uid: kr uljop at the start of , covid- became the most urgent threat to global public health. uniquely in recent times, governments have imposed partly voluntary, partly compulsory restrictions on the population to slow the spread of the virus. in this context, public attitudes and behaviors are vitally important for reducing the death rate. analyzing tweets about the disease may therefore give insights into public reactions that may help guide public information campaigns. this article analyses , , english tweets about covid- from march to , . it focuses on one relevant aspect of public reaction: gender differences. the results show that females are more likely to tweet about the virus in the context of family, social distancing and healthcare whereas males are more likely to tweet about sports cancellations, the global spread of the virus and political reactions. thus, women seem to be taking a disproportionate share of the responsibility for directly keeping the population safe.
the detailed results may be useful to inform public information announcements and to help understand the spread of the virus. for example, failure to impose sporting bans whilst encouraging social distancing may send mixed messages to males. covid- is, at the time of writing, a major global threat to public health (e.g., lipsitch, swerdlow, & finelli, ) . public actions are critically important in slowing the spread of the virus and therefore reducing the death rate caused by the volume of critically ill patients needing simultaneous care, such as when hospitals run out of ventilators. governments around the world have reacted by announcing mandatory actions, such as shutting restaurants and suspending the normal functioning of schools, and by giving strongly recommended or mandatory advice to the public for personal hygiene and social distancing to slow the spread of the virus. the extent to which the population follows expert health advice is expected to have a substantial impact on the death rate from the virus. if social distancing is widely ignored or misunderstood, for example, then national healthcare facilities will not be able to give all critically ill patients the care that they need to survive. it is therefore vitally important to assess how the public is reacting to the crisis, and one way (amongst many) of investigating this is through social media posts, including tweets (e.g., cinelli, quattrociocchi, galeazzi, et al., ) , and one important potential arena of difference (amongst many) is gender. twitter is a natural platform for public information sharing in many countries, including all large english-speaking nations. although less popular than facebook, its advantage for research is that it is typically fully public and researchers can therefore access its contents. moreover, twitter gives free use of an applications programming interface (api) for automatically harvesting recent (up to a week old) tweets matching keyword searches, making it a practical source of data about public reactions. a disadvantage is that twitter users' demographics do not match those of the population. in the usa, about % of adults use the site, behind facebook ( %) and instagram ( %), but ahead of whatsapp ( %) and reddit ( %) (schaeffer, ) . moreover, older people (and more at risk from covid- ) are less likely to use twitter; men are slightly more likely to use it ( % female within a % female population), but adopters tend to be richer and more educated in the usa (smith & wojcik, ) . there are also finer-grained differences, such as political variations between users and non-users (smith, hughes, remy, & shah, ) . nevertheless, analyzing tweets may give some quick large-scale insights into public reactions to the crisis. this study focuses on gender differences in reactions to covid- on twitter. since public safety measures must be adhered to by the entire population to be maximally effective, any gender differences in responses may point to weaknesses in public communications about the seriousness of the outbreak. this information may help with the creation of new messages targeting males or females more effectively. in addition, understanding gender differences may help modelling epidemiologists create more accurate models of the spread of the disease. the current paper therefore analyses two weeks (march - , ) of tweeting in english about covid- from the perspective of gender differences in responses.
although the virus is a global pandemic, the focus on english is for pragmatic methodology reasons and similar research in other languages is encouraged (and supported by the free software at http://mozdeh.wlv.ac.uk). the research design was to collect english-language tweets matching a set of queries related to covid- over two weeks and to identify words used more by males than females, using these to point to aspects of gender difference in tweeting about the virus. a word frequency method is useful for gender comparisons because it gives statistically significant evidence in a transparent fashion. in contrast, content analysis or thematic analysis are unlikely to discover fine-grained gender differences and cluster-based methods, such as topic modelling, can be changed by small alterations in the data, and so are not robust. topic modelling is also not able to give as fine-grained gender difference information as word frequency comparisons. word frequency analysis therefore fills a gap in comparison to other methods. the following queries were used to identify different common ways of referring to the disease: coronavirus; "corona virus"; covid- ; covid . these were submitted to twitter at the maximum speed allowed by the free twitter api from to march , obtaining , , tweets after eliminating duplicates (including multiple retweets) and near duplicates (tweets identical apart from @usernames and #hashtags). the tweets were collected and analyzed with the free software mozdeh (http://mozdeh.wlv.ac.uk). twitter does not record user genders, but it is possible to guess male and female genders (only) from their display name if it starts with a first name. a list of gendered first names was used to match the first part of twitter display names. this list was us-based, since the usa is the major english-language user of twitter and its population has international ethnic origins, so its names probably reflect to some extent the names in other anglophone countries. the list was derived from the us census (top names) and supplemented by genderapi.com (names with at least us records). names were included as female (respectively, male) from either source if at least % of people with the name were female (respectively, male). twitter names (display names, rather than usernames) were split at the first space or non-alphanumeric character, first digit, or first camel case transition from lowercase to uppercase (e.g., mikethelwall). the % threshold was chosen to give a high degree of certainty that the user was male. the method is imperfect because twitter usernames may be informal or not reflect a person's name (e.g., cricketfan ), or based on a relatively gender-neutral name (e.g., sam, pat) or a rare name, including names from small ethnic minorities in the usa. nevertheless, the first name procedure splits a set of tweets into three groups: (a) likely to be male-authored; (b) likely to be female-authored; (c) unknown. comparing (a) with (b) gives an indication of likely gender differences overall. visual inspection of the most active users in the data suggests that most bot and corporate tweets are assigned to the unknown gender set. gender differences in topics were identified by a word frequency comparison method to identify words more used by either males or females, using the following procedure. 
gender differences in topics were identified by a word frequency comparison method to identify words more used by either males or females, using the following procedure. for each word, the proportion of female-authored tweets containing the word was compared to the proportion of male-authored tweets containing the word using a 2x2 chi-square test; a significant result gives evidence to reject the null hypothesis of no gender difference in use of the word. because the test is repeated for every word, and there are millions of words, this procedure would almost certainly produce tens of thousands of false positives due to the number of tests. the benjamini-hochberg procedure (benjamini & hochberg, 1995) was used to correct for this. it is a false discovery rate correction procedure that keeps the expected proportion of incorrectly rejected null hypotheses among all rejections below a threshold value. for extra power, words that were too rare to trigger a statistically significant result, even if they were only used by males (or females), were not tested. this chi-squared/benjamini-hochberg approach for detecting gender differences in term frequencies has previously been used for academic abstracts (thelwall, bailey, makita, sud, & madalli; thelwall, bailey, tobin, & bradshaw), reddit posts (thelwall & stuart) and youtube comments (thelwall). the procedure was repeated three times, for p=0.05, p=0.01, and p=0.001, recording the highest significance level for each word. the above procedure was also applied to each day separately to determine the statistically significantly gendered terms for each day (i.e., fourteen additional sets of tests). this extra step was taken because a word that is gendered on a single day seems likely to be less relevant to covid-19 than a word that is gendered on multiple days. for example, a one-day gendered term might relate to a news event that was affected by covid-19 (e.g., a sporting event cancellation), but this might not be important to the ongoing discussion of the virus. the threshold for including a term was set at (the equivalent of) more than two highly statistically significant days. allocating one star to significance at p=0.05, two for p=0.01 and three for p=0.001, the threshold requirement was a total of at least seven stars over the fourteen days. this threshold retained a subset of the terms that were statistically significantly gendered on at least one day. each word judged statistically significantly gendered (either overall, or on multiple days) reflects one or more underlying gender differences in motivations for tweeting or a gender difference in language styles. each term's underlying causes can be inferred by reading a random sample of tweets containing the term, known as the key word in context (kwic) method (luhn). for example, the term league was associated with tweets discussing the full or partial closure of various sporting competitions or facilities. gender differences in this word therefore suggest that males were more likely to tweet that league-based sport was affected by covid-19 restrictions. the word contexts varied from obvious (e.g., #jantacurfew) to obscure (e.g., it). in particular, many pronouns were female-associated, reflecting a people focus rather than a topic, and definite and indefinite articles were male-associated, reflecting an information focus rather than a specific topic. in cases where the context of a term was unclear from reading ten randomly selected tweets (using the random sort option in mozdeh), a word association analysis was run on the term to identify top associating terms to give additional insights into its main use contexts.
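the per-word comparison described above can be sketched as follows; the counts and corpus sizes are toy numbers, and scipy/statsmodels stand in for whatever mozdeh computes internally.

```python
# a sketch of the per-word 2x2 chi-square comparison with benjamini-hochberg
# correction; all counts are hypothetical.
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

# word -> (female-authored tweets containing it, male-authored tweets containing it)
word_counts = {"league": (1200, 2600), "family": (3100, 2000), "virus": (9000, 9100)}
n_female, n_male = 400_000, 380_000   # hypothetical corpus sizes

words, p_values = [], []
for word, (f_with, m_with) in word_counts.items():
    table = [[f_with, n_female - f_with],
             [m_with, n_male - m_with]]
    _, p, _, _ = chi2_contingency(table)
    words.append(word)
    p_values.append(p)

# benjamini-hochberg bounds the expected share of false discoveries among
# the rejected nulls (a false discovery rate, not a familywise, correction)
reject, p_adj, _, _ = multipletests(p_values, alpha=0.001, method="fdr_bh")
for w, p, r in zip(words, p_adj, reject):
    print(w, f"{p:.2g}", "gendered" if r else "not significant")
```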
the words were manually grouped into themes for each gender to highlight the main types of gender difference. the main themes identified in the tweets are summarized below by gender. the complete list of terms and the raw tweet counts associated with them are available on figshare. male-authored tweets about covid-19 were about twice as likely as female-authored tweets to discuss sports, typically in the context of speculation about, or announcements of, events or competitions being cancelled (figure). whilst this is relatively peripheral to the disease, males were also substantially more likely to mention, or take issue with, political figures or government, particularly within india (figure). males were also more likely to tweet about the economy (terms: economy, market; not graphed). female-oriented themes seemed to focus on the first and second lines of defense against the virus. the key theme of social distancing is moderately female-oriented (figure), in the sense that females were more likely to use the #socialdistancing hashtag and to mention the need to stay at home as far as possible. partly related to social distancing but also to lockdowns, females were more likely to mention family members (figure) and to use all pronouns (figure). pronouns were typically used for a mix of purposes, but tweets with pronouns or family members seemed more likely to discuss concrete actions or practical implications for the tweeter and the people that they know. thus, all three themes have a practical and personal orientation. females were also more likely to tweet about education (terms: school, student, teacher; not graphed), presumably due to its impact on themselves or their family. females were also more likely to discuss healthcare issues (figure). these tweets were less focused on immediate practical issues but on the main line of defense against the virus, should the practical steps fail. related to this, females were also more likely to express gratitude to healthcare workers and others (terms not graphed) and anxiety (see below). two broad themes were mixed gender in the sense of some terms being male-associated and others being female-associated. males were more likely to discuss the virus as a war, whereas females were more likely to mention their anxiety about its effects (figure). the war metaphor is a way of generalizing the situation as well as perhaps, for males, glamorizing actions against it, or emphasizing the seriousness of the issue. thus, war metaphors could be an indirect way of expressing anxiety. there were mixed gender differences in discussions of curfews (figure). whilst males were more likely to announce the existence of a curfew, females were more likely to discuss its practical impacts. this quick analysis of gender differences in english tweeting about covid-19 has several limitations. in addition to the issues discussed above, another important aspect is that twitter does not report the geographic location of the tweets, and so the data has unknown origins. in particular, if some countries have an unusually high proportion of active tweeters of one gender, then this could translate into tweets about that country statistically significantly associating with that gender with the tests used above. the results are broadly consistent with previous research into gender differences in language use, including on social media, and gender differences in interests.
the primary contribution here is therefore to show which gender differences translate to covid-19 on twitter, rather than finding new gender differences. the greater male interest in sport in many countries is widely known (e.g., plaza, boiché, brunel, & ruchaud), and males also seem to discuss politics more (or at least more directly: bode). the greater female focus on caring (most family caregivers in the usa are female: family caregiver alliance) and on family (parker, horowitz, & rohal) has also been found before. in terms of language use, females have often been found to use personal pronouns more in some types of text (argamon, koppel, fine, & shimoni). although these conclusions are drawn from statistical tests on big data from twitter, inferences from the results are tentative due to the processing limitations above that could not be addressed and the lack of evidence connecting offline actions to the content of tweets. thus, for example, the greater female tendency to tweet about families does not prove that females were more concerned about the welfare of their families due to covid-19, although this is a plausible explanation. thus, the conclusions should be treated similarly to those of purely qualitative research: as evidence-based ideas but not proof of those ideas. the substantially greater focus of males on sport in tweets about covid-19 might be taken as evidence that males were less serious about the disease in the initial stages. irrespective of whether this is true, sport was an important factor in the reaction to covid-19 for many males. a policy-related suggestion from this is that cancelling sporting events may be particularly effective in communicating to males the seriousness of a situation. for example, if the population is told to socially distance but allowed to attend mass sporting events on the basis that an alternative (watching the event in crowded pubs or bars) is more dangerous, then this may send mixed messages, since crowded sporting events clearly involve close proximity with large numbers of strangers. thus, any relaxation of bans on sporting events should be considered very carefully in the future, in countries where they are in place, and sporting bans should be considered in other countries as an important component of social distancing strategies, both for the spreading risk and the message sent to (mainly) males. the results are consistent with, but do not prove, that women are at the forefront of actions to prevent the spread of covid-19. public health messages might therefore need to be particularly careful that core messages are transmitted effectively to women in media that they consume, so that social distancing is fully understood by as many as possible and can be carried out as effectively as possible.
gender, genre, and writing style in formal written texts
controlling the false discovery rate: a practical and powerful approach to multiple testing
closing the gap: gender parity in political engagement on social media
the covid-19 social media infodemic
defining the epidemiology of covid-19: studies needed
key word-in-context index for technical literature (kwic index)
parenting in america: outlook, worries, aspirations are strongly linked to financial situation
sport = male… but not all sports: investigating the gender stereotypes of sport activities at the explicit and implicit levels
u.s. has changed in key ways in the past decade, from tech use to demographics
democrats on twitter more liberal, less focused on compromise than those not on the platform
facts about americans and twitter
gender and research publishing in india: uniformly high inequality
gender differences in research areas, methods and topics: can people and thing orientations explain the results
she's reddit: a source of statistically significant gendered interest information? information processing & management
can museums find male or female audiences online with youtube?
key: cord-c hafan authors: tang, lu; bie, bijie; zhi, degui title: tweeting about measles during stages of an outbreak: a semantic network approach to the framing of an emerging infectious disease journal: am j infect control cord_uid: c hafan
background: the public increasingly uses social media not only to look for information about emerging infectious diseases (eids), but also to share opinions, emotions, and coping strategies. identifying the frames used in social media discussion about eids will allow public health agencies to assess public opinions and sentiments. method: this study examined how the public discussed measles during the measles outbreak in the united states during early 2015 that originated in disneyland park in anaheim, ca, through a semantic network analysis of the content of around one million tweets using kh coder. results: four frames were identified based on word frequencies and co-occurrence: news update, public health, vaccination, and political. the prominence of each individual frame changed over the course of the pre-crisis, initial, maintenance, and resolution stages of the outbreak. conclusions: this study proposed and tested a method for assessing the frames used in social media discussions about eids based on the creation, interpretation, and quantification of semantic networks. public health agencies could use social media outlets, such as twitter, to assess how the public makes sense of an eid outbreak and to create adaptive messages in communicating with the public during different stages of the crisis. emerging infectious diseases (eids) present novel and unfamiliar risks to the public. as a new tool for strategic communication during eid outbreaks, social media allows government agencies such as the centers for disease control and prevention (cdc) to reach wider audiences. as a platform based on user-generated content, social media enables the public to share opinions and sentiments during these outbreaks. one such eid is measles. measles is a highly contagious and acute illness that can lead to pneumonia, encephalitis, and death. it was declared eliminated in the united states in 2000 as the result of the successful nationwide administration of a two-dose vaccination (ie, measles, mumps, and rubella vaccine). however, recent years have witnessed the re-emergence of measles outbreaks. most of these outbreaks were associated with imported cases; at the same time, a decrease in the domestic vaccination rate has made the country increasingly vulnerable during such outbreaks. presented here is a semantic network analysis of twitter content about measles during the measles outbreak that first appeared in california during early 2015. examining the most frequently used key words and their co-occurrences allows researchers to induce a semantic network that represents the major frames used in large amounts of text.
frames represent the cognitive structure people use in understanding and communicating issues. through framing, media and individuals choose to highlight certain aspects of a crisis while downplaying other aspects. this study adds to the research on crisis and emergency risk communication by demonstrating that social media users applied different frames to understand the public health crisis associated with a measles outbreak: a news update frame, a public health frame, a vaccination frame, and a political frame. in terms of research methodology, this study demonstrates the feasibility of identifying frames through semantic network analysis. practically, the findings of the study allow public health professionals to understand how social media users make sense of an eid during different stages of an outbreak so that they can develop more effective crisis communication strategies. social media plays an essential role in the dissemination of information on eids. what social media users post, share, like, and comment on reflects not only what information is available, but also what they consider important. researchers have used social media data to assess public perceptions, sentiments, and responses toward eid outbreaks such as the h1n1 outbreak, the european escherichia coli outbreak, the ebola virus disease outbreak, and the measles outbreak in the netherlands. with the exception of lazard et al, all these studies examined social media contents deductively, based on pre-existing categories, through either manual coding or text mining based on a manually coded training dataset. the semantic network of social media contents represents a new lens through which researchers can inductively investigate how the public thinks and feels during an eid outbreak without the need for a training dataset. semantic networks represent the semantic relationships among a set of words. semantic network analysis is both a theoretical framework and a quantitative text analysis method that uncovers the structure of relationships among words. in a semantic network, word-use frequencies and the co-occurrence of the most frequently occurring words represent shared meanings and common perceptions in people's minds. semantic networks can be used to infer the frames used in texts. framing is the process by which organizations and individuals choose to report or discuss an event (such as a public health crisis) by selectively highlighting certain aspects and downplaying other aspects of the event. researchers have explored the frames used by different stakeholders facing a crisis through the examination of semantic networks. david et al examined pre-established frames about population issues in news articles in the philippines by looking at weighted semantic networks. however, a strength of semantic network analysis is that it allows new and unfamiliar frames to emerge from the data inductively. for instance, schultz et al studied the associative frames used by public relations professionals and news media in the united states and united kingdom after the bp oil spill in the gulf of mexico and found that bp framed the oil spill as caused by external factors and downplayed internal factors (such as the company's behaviors), whereas the news media adopted more complicated frames.
van der meer et al compared the frames used by public relations professionals, news media, and the public in times of crises such as explosions and volcano eruptions by examining the semantic networks, and found that the frames used by these groups converged over time. tian and stewart compared the semantic networks based on word co-occurrence in cnn and bbc online news reports about severe acute respiratory syndrome and found that although both news outlets used the public health frame, cnn used the economic impact frame and bbc used the outbreak impact frame. exploring the frames used during a measles outbreak, as indicated by the semantic networks of twitter users, can provide insights into how social media users make sense of the crisis and what issues concern them. hence, the first research question (rq) was proposed. rq1: what are the frames used in the twitter discussion about the measles outbreak as indicated by semantic networks? crisis communication takes different forms in different stages of a crisis. reynolds and seeger identified a stage-based model of crisis and emergency risk communication, which includes five stages: pre-crisis, initial event, maintenance, resolution, and evaluation. each of these stages is associated with different tasks for crisis communication. although this model was proposed as a strategic tool for guiding the crisis communication of government agencies such as the cdc, it nevertheless shows that the public goes through different stages in information processing and sensemaking about public health crises such as a measles outbreak. as a result, twitter users are likely to use different frames to discuss measles during each stage of the crisis. hence, we asked the next rq. rq2: how does the use of different frames change over different stages of the outbreak? because the most recent measles outbreak in the united states occurred between december 2014 and april 2015, tweets including the word measles posted in this window were purchased from discovertext.com (over one million tweets). using this time frame allowed us to look at the twitter discussion before, during, and after the outbreak. first, raw texts were cleaned and transformed into an appropriate format to be mined, and non-english tweets were excluded. for the purpose of semantic network analysis, urls in tweet texts were deleted. special characters such as \, ^, and $ as well as user names mentioned after @ were also deleted from tweet texts. next, stop words were excluded, including conjunctions, auxiliary verbs, and transitive verbs, among others. plural forms of words were replaced by singular forms, and high-frequency words sharing the same root were combined into single words to facilitate the analysis. similarly, multiword phrases with the same meaning were combined (a sketch of this cleaning step is given at the end of this paragraph). semantic network analysis was conducted to answer the rqs proposed. first, the content of tweets was analyzed to calculate the frequency of words and determine the most frequently co-occurring word pairs using kh coder, a free software for analyzing text and identifying co-occurrence networks (available for download at https://sourceforge.net/projects/khc/). when the data were loaded into the program, a word frequency table and a word co-occurrence network were generated. each individual tweet was a unit of analysis, and word pair co-occurrence was defined as the appearance of two words in the same tweet.
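a minimal sketch of the cleaning steps just described (url, @username and special-character removal plus stop-word filtering); the stop-word list is a toy placeholder, and the real pipeline also merged plurals, shared roots and multiword phrases.

```python
# a sketch of the tweet-cleaning step; the stop-word list is hypothetical.
import re

STOP_WORDS = {"the", "a", "of", "and", "is", "to", "more"}   # toy list

def clean_tweet(text: str) -> list[str]:
    text = re.sub(r"https?://\S+", " ", text)   # delete urls
    text = re.sub(r"@\w+", " ", text)           # delete @usernames
    text = re.sub(r"[\\^$#]", " ", text)        # delete special characters
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_tweet("California reports more #measles cases http://t.co/x @cnn"))
# -> ['california', 'reports', 'measles', 'cases']
```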
to explore the dynamic nature of the twitter discussion during an eid outbreak, the duration of the measles outbreak was divided into four stages. the pre-crisis stage (stage 1) ran from december 2014 until the first cases of this outbreak were reported in early january 2015, which marked the beginning of the outbreak and thus the initial stage (stage 2). the maintenance stage (stage 3) started in late january, when the number of new cases started to decline, and ended in march, when the last case in this outbreak was reported. the resolution stage (stage 4) lasted until april 2015, when the cdc announced this outbreak to be officially over. five days were selected for each stage. to be included in the final semantic network, a word or multiword phrase must have appeared in a minimum percentage of tweets and be involved in the top edges (connections) of each stage, filtered based on the jaccard coefficient. the jaccard coefficient is a statistical measure widely used for assessing similarity between objects (jaccard); its values vary between 0 and 1. in kh coder, words that appear frequently in the same tweet are considered to be closely associated, and their jaccard coefficient becomes closer to 1 (a sketch of this computation is given below). to facilitate a more nuanced understanding of the network structure during different stages of outbreak progression, an inductive approach was adopted to identify potential frames based on the high-frequency words and the co-occurrence links among them. the visualization of the semantic networks was accomplished using graphviz (http://graphviz.org). to explore the longitudinal changes in the use of the frames identified, dictionaries containing key words associated with each frame were created based on the semantic networks identified and on the authors' further reading of tweets adopting these frames. tweets were labeled as containing a frame based on the presence of tags (ie, frame-relevant terms). in other words, each tweet was labeled by a unique set of key words. for example, the group of words classified as the news update frame included california, disney, utah, january, visitor, official, and today. the list of words associated with the public health frame included cdc, patient, disease, contract, infectious, and virus, among others. the vaccine frame, which included any mention of vaccine-related issues, was associated with key terms such as vaccine, antivaccine, unvaccinate, unvaccinated, inhaled measles vaccine, safety, immunity, and jenny mccarthy (an actress often associated with the anti-vaccine movement). finally, the political frame was associated with terms such as obama, republicans, democrats, immigration, illegal, lawmaker, debate, governor christie, and politics. a single tweet can be labeled as using one frame, multiple frames, or none of the frames. next, percentages of tweets using the different frames were calculated for the four stages identified, and chi-square tests were run to see whether the differences among stages were statistically significant. bonferroni correction was applied to account for the effects of multiple testing. overall, original tweets accounted for about half of all the tweets collected. as shown in the first figure, although the numbers of tweets and retweets were similar during the most active part of the outbreak, retweets outnumbered original tweets mostly in early february, during the maintenance stage. the second figure represents the semantic network of twitter content about measles during the entire outbreak.
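the jaccard-weighted co-occurrence network that kh coder builds can be sketched as follows: for two words a and b, j(a, b) = (tweets containing both) / (tweets containing a or b); the tweets and the top-edge cutoff below are toy values.

```python
# a sketch of building a word co-occurrence network with jaccard weights.
from itertools import combinations
from collections import defaultdict

tweets = [{"measles", "vaccine", "outbreak"},
          {"measles", "california", "outbreak"},
          {"vaccine", "safety"}]

occur = defaultdict(int)    # tweets containing each word
pair = defaultdict(int)     # tweets containing each word pair
for words in tweets:
    for w in words:
        occur[w] += 1
    for a, b in combinations(sorted(words), 2):
        pair[(a, b)] += 1

edges = {}
for (a, b), both in pair.items():
    union = occur[a] + occur[b] - both
    edges[(a, b)] = both / union        # jaccard coefficient, in [0, 1]

# keep only the strongest connections, as done for each stage's network
top_edges = sorted(edges.items(), key=lambda kv: kv[1], reverse=True)[:60]
print(top_edges[:3])
```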
four distinct frames were identified inductively based on the reading of the semantic networks and of the tweets containing the key words included in these semantic networks: the news update frame, the public health frame, the vaccine frame, and the political frame. the titles of these frames were coined by the authors based on the typical message carried in each frame, similar to the practice described in odlum and toon. the news update frame provided news and updates about suspected and confirmed cases of measles before, during, and after the outbreak. typically, a tweet adopting the news update frame included the number of cases in a geographic location; for example, "california reports more measles cases." tweets using the news update frame did not typically contain opinions. tweets adopting the public health frame conveyed medical information such as symptoms of measles, methods of prevention, and treatment. for instance, the following tweet introduced a new complication of measles: "eye complications possible with measles warn ophthalmologists." tweets using the public health frame educated the public about measles and sometimes included behavior recommendations. the vaccine frame referred to the discussion and debates about the safety and necessity of vaccination. tweets using this frame were often emotionally charged. an example of a pro-vaccine tweet was: "if i get measles because some nitwit talked my parents into not vaccinating me, somebody's getting their ass kicked." an example of an anti-vaccine tweet was: "measles vaccine kills kids - media blackout." the political frame was used in those tweets that presented the causes of and solutions for the measles outbreak in political terms. some twitter users blamed the outbreak on the influx of illegal immigrants and called for tighter immigration law and border control. the political frame was also used in tweets debating governmental policies on vaccination and disease prevention. an example of a tweet using the political frame was: "am i the only one wondering if the surge of illegals has anything to do with the measles outbreaks we see now?" across the four stages of the outbreak, the pre-crisis stage had the fewest original tweets, whereas the maintenance stage included the most, followed by the initial stage and the resolution stage. the use of the frames showed different patterns across the stages of the outbreak (see figure). the news update frame appeared to be the most dominant frame during the initial and resolution stages. the public health frame was one of the most dominant frames in the pre-crisis stage; however, its use decreased during the initial stage and was lowest during the maintenance stage. the use of the vaccine frame increased from the pre-crisis stage to the initial stage, and the vaccine frame became the most dominant frame during the maintenance stage. the political frame was the least often used frame in all stages of the outbreak and appeared most frequently during the maintenance stage. a series of chi-square tests showed that the use of these frames was significantly different across the stages of the outbreak (see table). pairwise chi-square tests were conducted to further explore the differences among these stages in the use of frames (see table). specifically, the following pairs of comparisons were performed based on the chronological order of the stages: the pre-crisis stage versus the initial stage, the initial stage versus the maintenance stage, and the maintenance stage versus the resolution stage.
all of the pairwise comparisons were significant at the adjusted significance level, except for the use of the political frame in the pre-crisis and initial stages. social media allows the assessment of public opinions, sentiments, and responses during an eid outbreak. our study examines the frames used in the twitter discussion during the measles outbreak through a semantic network analysis. this study finds that around half of the tweets are retweets. furthermore, retweets outnumbered original tweets during the early days of the maintenance stage. this stands in contrast with the findings of previous studies. for instance, liu, kliman-silver, and mislove found a substantially lower overall share of retweets. radzikowski et al studied the tweets about vaccines during the later stages of the measles outbreak (the maintenance stage in our study) and found that a large share of the tweets in their data corpus were retweets. our data also suggest the highest rate of retweets during this stage, although our data were about measles instead of vaccines. it is possible that the maintenance stage sees more reflection and activism once the immediate threat of the outbreak has been contained, and is thus associated with more retweets. the current study identified four major frames that emerged organically from the semantic network based on word frequencies and co-occurrence. each frame highlighted an important dimension of measles-related discussion on twitter. furthermore, different stages of the outbreak witnessed fluctuations in each frame's popularity. during the pre-crisis stage, the news update frame and public health frame were the most dominant frames used, whereas the vaccine frame was rarely used and the political frame was almost never used. during the initial stage, use of the news update and vaccine frames increased, although the use of the public health frame actually went down. during the maintenance stage, the vaccine frame became the most frequently used frame, followed by the news update frame, public health frame, and political frame. as the outbreak drew to an end, the use of frames reverted to the pattern observed in the pre-crisis stage, with news update being the most used frame, followed by the public health frame, vaccine frame, and political frame. there are two possible explanations for the changes in how measles is framed on twitter. it is possible that the frames on twitter are influenced by the frames set by the mainstream media. for instance, lee and basnyat studied media framing of the h1n1 outbreak in singapore and suggested that although traditional media predominantly used the update frame and prevention frame throughout the outbreak, their use of frames diversified to some extent in the later stages of the crisis to include more personal frames and social frames. our semantic network shows that traditional news media (eg, cnn, cbs, reuters, and ap) featured prominently in the discussion about the measles outbreak. future research could examine the intermedia agenda setting process between traditional news media and social media in covering and framing public health crises associated with eids, to see if twitter frames are influenced by the frames used in the traditional news media or if the public may influence the frames used by traditional news media by voicing their concerns and opinions on twitter. the other possible explanation for the changes in the use of frames is psychological.
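a minimal sketch of the dictionary-based frame tagging and the stage-by-stage chi-square comparison described above; the key-word lists are abridged from the examples in the text and the per-stage counts are hypothetical.

```python
# a sketch of keyword-based frame tagging plus a cross-stage chi-square test.
from scipy.stats import chi2_contingency

FRAMES = {
    "news_update": {"california", "disney", "official", "today"},
    "public_health": {"cdc", "patient", "infectious", "virus"},
    "vaccine": {"vaccine", "antivaccine", "unvaccinated", "immunity"},
    "political": {"obama", "republicans", "democrats", "immigration"},
}

def label_frames(tokens: set[str]) -> set[str]:
    # a tweet can carry one frame, several frames, or none
    return {frame for frame, tags in FRAMES.items() if tokens & tags}

print(label_frames({"cdc", "vaccine", "safety"}))   # -> two frames

# hypothetical counts of tweets using the vaccine frame vs not, per stage
vaccine_by_stage = [[150, 4850],    # pre-crisis
                    [900, 9100],    # initial
                    [2600, 9400],   # maintenance
                    [400, 5600]]    # resolution
chi2, p, dof, _ = chi2_contingency(vaccine_by_stage)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.2g}")
```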
according to the psychometric paradigm, people's response to a risk is decided by the dread, catastrophic potential, controllability, and familiarity of said risk. it is possible that during the early stages of a measles outbreak, the public is unsure about the seriousness of the risk and the controllability of the outbreak and therefore prefers the update frame and the public health frame. these frames provide information that can help the public assess the scale and severity of the crisis. however, as the outbreak develops, people perceive the risks associated with measles as more familiar and more controllable. as a result, they start to try to make sense of the crisis by deciding whom they should blame for the outbreak, thereby leading to the increased use of the vaccine frame, which blames parents who refuse to vaccinate their children, as well as the political frame, which blames illegal immigrants or the government for the outbreak. through surveys, future research could establish the relationship between social media users' risk perception and the frames they apply in discussing eids. the high frequency of the political frame during the maintenance stage, including posts stating that illegal immigrants were responsible for the measles outbreak, also indicates the spread of fake news on twitter. recently, vosoughi, roy, and aral found that false news stories spread further, faster, deeper, and more broadly on twitter than truthful stories. compared with the truth, false news stories took only one-sixth of the time to reach 1,500 people and were 70% more likely to be retweeted. these findings have particularly important implications for health care professionals during infectious disease emergencies: when rumor or misinformation gains high virality, informing and educating the public becomes very challenging. radzikowski et al studied the measles vaccination narrative on twitter by examining the semantic networks of popular hashtags and retweeting patterns in the aftermath of the outbreak. they found that although official public health agencies such as the cdc and the world health organization have all entered the social media arena, mainstream media coverage of key health issues still has the power to lead online public participation. our study confirmed some of the findings of radzikowski et al: mainstream news coverage of health topics played a very important role in leading online audience attention and shaping public debate on social media, and the political frame peaked during the maintenance stage, when the vaccine frame also came to its peak. our study also extended radzikowski et al by extending the observation time, looking at the twitter discussion of this outbreak as a whole from a risk communication perspective, and comparing the content patterns in different stages of the outbreak. this study proposes and tests a method for assessing the frames used in social media discussions about eids based on the creation, interpretation, and quantification of semantic networks. this method can be used to study the public response to different eid outbreaks as well as other ongoing public health crises, such as heart disease and obesity. instead of looking for known frames used in social media, this method allows the identification of unique frames associated with different public health crises.
in terms of this study's practical implications, our study shows that agencies could use social media content on platforms such as twitter to assess how the public makes sense of an eid outbreak and to create adaptive messages in communicating with the public during different stages of the crisis. for instance, if it is detected that the public tends to use the vaccine frame during a certain stage of the crisis, public health agencies could design and disseminate specific social media messages to spread useful information and combat misconceptions. this study only represents an early-stage effort at mining the contents about eids on social media. future studies can take a number of directions. it is possible that individuals, government agencies, nonprofit organizations, and news media might use different frames in tweeting about measles before, during, and after an eid outbreak. further research should compare these different stakeholders' tweets. in addition, it would be of interest to study the retweeting network about eids (as done in radzikowski et al). studying the structure of retweeting networks can shed light on how information and opinion are diffused among twitter users. it would be useful to compare the retweeting networks during different stages of the outbreak to see if information flows differently during these stages, and to explore the roles played by different stakeholders (eg, traditional news media, nonprofit organizations, and corporations) across the stages.
emerging infectious disease (eid) communication during the h1n1 influenza outbreak: literature review of the methodology used for eid communication analysis
measles - united states
framing as a theory of media effects
pandemics in the age of twitter: content analysis of tweets during the h1n1 outbreak
tweeting during food crises: a psychosocial analysis of threat coping expressions in spain during the european ehec outbreak
detecting themes of public concern: a text mining analysis of the centers for disease control and prevention's ebola live twitter chat
disease detection or public opinion reflection? content analysis of tweets, other social media, and online newspapers during the measles outbreak in the netherlands
theories of communication networks
what constitutes semantic network analysis? a comparison of research and methodologies
news frames of the population issue in the philippines
strategic framing in the bp crisis: a semantic network analysis of associative frames
when frames align: the interplay between pr, news media, and the public in times of crisis
framing the sars crisis: a computer-assisted text analysis of cnn and bbc online news reports of sars
crisis and emergency risk communication as an integrative model
network analysis of message content
the distribution of the flora in the alpine zone
lexical shifts, substantive changes, and continuity in state of the union discourse
what can we learn about the ebola outbreak from tweets?
the tweets they are a-changin': evolution of twitter users and behavior (aaai)
the measles vaccination narrative in twitter: a quantitative analysis
from press release to news: mapping the framing of the h1n1 a influenza pandemic
the perception of risk
the spread of true and false news online
key: cord-b f wtfn authors: caldarelli, guido; nicola, rocco de; petrocchi, marinella; pratelli, manuel; saracco, fabio title: analysis of online misinformation during the peak of the covid-19 pandemics in italy cord_uid: b f wtfn
during the covid-19 pandemics, we also experienced another dangerous pandemics, based on misinformation. narratives disconnected from fact-checking on the origin and cure of the disease intertwined with pre-existing political fights. we collect a database of twitter posts and analyse the topology of the networks of retweeters (users broadcasting again the same elementary piece of information, or tweet) and validate its structure with methods from the statistical physics of networks. furthermore, by using commonly available fact-checking software, we assess the reputation of the pieces of news exchanged. by using a combination of theoretical and practical weapons, we are able to track down the flow of misinformation in a snapshot of the twitter ecosystem. thanks to the presence of verified users, we can also assign a polarization to the network nodes (users) and see the impact of low-quality information producers and spreaders in the twitter ecosystem. propaganda and disinformation have a history as long as mankind, and the phenomenon becomes particularly strong in difficult times, such as wars and natural disasters. the advent of the internet and social media has amplified and made faster the spread of biased and false news, and made targeting specific segments of the population possible [ ].
in the exchange of messages on online platforms, a great number of interactions do not carry any relevant information for the understanding of the phenomenon: as an example, randomly retweeting viral posts does not contribute insights on the sharing activity of the account. for determining dis/misinformation propagation, two main weapons can be used: the analysis of the content (semantic approach) and the analysis of the communities sharing the same piece of information (topological approach). while the content of a message can be analysed on its own, the presence of some troublesome structure in the pattern of news producers and spreaders (i.e., in the topology of contacts) can be detected only through dedicated instruments. indeed, for real in-depth analyses, the properties of the real system should be compared with a proper null model. recently, entropy-based null models have been successfully employed to filter out random noise from complex networks and focus the attention on non-trivial contributions [ , ]. essentially, the method consists in defining a 'network benchmark' that has some of the (topological) properties of the real system, but is completely random for all the rest. then, every observation that does not agree with the model, i.e., cannot be explained by the topological properties of the benchmark, carries non-trivial information. notably, being based on the shannon entropy, the benchmark is unbiased by definition (a simplified sketch of this validation step is given below). in the present paper, using entropy-based null models, we analyse a tweet corpus related to the italian debate on covid-19 during the two months of maximum crisis in italy. after cleaning the system from the random noise, by using the entropy-based null model as a filter, we have been able to highlight different communities. interestingly enough, these groups, besides including several official accounts of ministries, health institutions, and online and offline newspapers and newscasts, encompass four main political groups. while at first sight this may sound surprising (the pandemic debate was more on a scientific than on a political ground, at least in the very first phase of its abrupt diffusion), it might be due to pre-existing echo chambers [ ]. the four political groups are found to perform completely different activities on the platform, to interact differently with each other, and to post and share reputable and non-reputable sources of information with great differences in the number of their occurrences. in particular, the accounts from the right wing community interact, mainly in terms of retweets, with the same accounts that interact with the mainstream media. this is probably due to the strong visibility given by the mainstream media to the leaders of that community. moreover, the right wing community is more numerous and more active, even relative to the number of accounts involved, than the other communities. interestingly enough, newly formed political parties, such as the one of the former italian prime minister matteo renzi, quickly imposed their presence on twitter and on the online political debate, with a strong activity. furthermore, the different political parties use different sources for getting information on the spreading of the pandemics. to detect the impact of dis/misinformation in the debate, we consider the news sources shared among the accounts of the various groups.
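the statistical validation can be illustrated with a simplified stand-in: the paper relies on the bipartite configuration model, while the sketch below tests each pair of users against a hypergeometric null (do they share significantly more retweeted posts than chance would predict?) and keeps only the surviving links; the post sets and the significance threshold are toy assumptions.

```python
# a simplified stand-in for the entropy-based validation of the projection:
# pairs of users whose co-retweeting cannot be explained by chance survive.
from itertools import combinations
from scipy.stats import hypergeom
from statsmodels.stats.multitest import multipletests

n_posts = 1000                     # hypothetical number of distinct posts
user_posts = {"u1": set(range(0, 120)),
              "u2": set(range(60, 200)),
              "u3": set(range(500, 520))}

pairs, pvals = [], []
for a, b in combinations(user_posts, 2):
    shared = len(user_posts[a] & user_posts[b])
    # probability of at least `shared` common posts under random overlap
    p = hypergeom.sf(shared - 1, n_posts, len(user_posts[a]), len(user_posts[b]))
    pairs.append((a, b))
    pvals.append(p)

keep, _, _, _ = multipletests(pvals, alpha=0.01, method="fdr_bh")
validated = [pair for pair, k in zip(pairs, keep) if k]
print(validated)   # only statistically significant co-retweeting links survive
```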
with a hybrid annotation approach, based on independent fact-checking organisations and human annotation, we categorised such sources as reputable or non-reputable (in terms of the credibility of the published news and the transparency of the sources). notably, we found that a group of accounts spread information from non-reputable sources with a frequency several times higher than that of the other political groups. and we are afraid that, due to the extent of the online activity of the members of this community, the spreading of such a volume of non-reputable news could deceive public opinion. we collected well over a million tweets in italian language, posted between late february and the end of april 2020 [ ]. details about the evolution of the epidemic in italy during the period of data collection can be found in the supplementary material, section 'evolution of the covid-19 pandemics in italy'. the data collection was keyword-based, with keywords related to the covid-19 pandemics. twitter's streaming api returns any tweet containing the keyword(s) in the text of the tweet, as well as in its metadata. it is worth noting that it is not always necessary to have each permutation of a specific keyword in the tracking list. for example, the keyword 'covid' will return tweets that contain both 'covid19' and 'covid-19'. the table lists a subset of the considered keywords and hashtags. some hashtags overlap, because one keyword can be a substring of another, but we included both for completeness. the left panel of the figure shows the network obtained by following the projection procedure described in the methods. the network resulting from the projection procedure will be called, in the rest of the paper, the validated network. the term validated should not be confused with the term verified, which instead denotes a twitter user who has passed the formal authentication procedure of the social platform. in order to get the communities of verified twitter users, we applied the louvain algorithm [ ] to the data in the validated network. such an algorithm, despite being one of the most popular, is also known to be order dependent [ ]. to get rid of this bias, we apply it iteratively n times (n being the number of nodes) after reshuffling the order of the nodes, and finally select the partition with the highest modularity (a sketch of this procedure is given below). the network presents a strong community structure, composed of four main subgraphs. when analysing the emerging communities, we find that they correspond to:
right wing parties and media (steel blue);
center-left wing (dark red);
5 stars movement (m5s) (dark orange);
institutional accounts (sky blue).
details about the political situation in italy during the period of data collection can be found in the supplementary material, section 'italian political situation during the covid-19 pandemics'. this partition into four subgroups, once examined in more detail, presents a richer substructure, described in the right panel of the figure. starting from the center-left wing, we can find a darker red community, including various ngos and various left-oriented journalists, vips, and pundits. a slightly lighter red sub-community turns out to be composed of the main politicians of the italian democratic party (pd), as well as representatives of the european parliament (italian and others) and some eu commissioners. the violet red group is mostly composed of the representatives of italia viva, a new party founded by the former italian prime minister matteo renzi (in office from february 2014 to december 2016).
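a minimal sketch of the order-reshuffled louvain procedure, using the implementation shipped with networkx (version 2.8 onwards); varying the seed stands in for explicitly reshuffling the node order, and the karate club graph is a toy placeholder for the validated network.

```python
# a sketch of repeated louvain runs, keeping the highest-modularity partition.
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

g = nx.karate_club_graph()                   # toy stand-in for the network

best_partition, best_q = None, float("-inf")
for seed in range(g.number_of_nodes()):      # n runs with different orderings
    partition = louvain_communities(g, seed=seed)
    q = modularity(g, partition)
    if q > best_q:
        best_partition, best_q = partition, q

print(f"best modularity: {best_q:.3f} with {len(best_partition)} communities")
```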
in golden red we can find the sub-community of catholic and vatican groups. finally, the dark violet red and light tomato sub-communities consist mainly of journalists. in turn, the orange (m5s) community also shows a clear partition into substructures. in particular, the dark orange sub-community contains the accounts of politicians, parliament representatives and ministers of the m5s, and journalists. in aquamarine, we can find the official accounts of some private and public, national and international, health institutes. finally, in the light slate blue sub-community we can find various italian ministers as well as the italian police and army forces. similar considerations apply to the steel blue community. in steel blue, there is the sub-community of center-right and right wing parties (such as forza italia, lega and fratelli d'italia). in the following, this sub-community will be referred to as fi-l-fdi, recalling the initials of the political parties contributing to this group. the sky blue sub-community includes the national federations of various sports, the official accounts of athletes and sport players (mostly soccer) and their teams. the teal sub-community contains the main italian news agencies; in this sub-community there are also the accounts of many universities. the firebrick sub-community contains accounts related to the as roma football club; analogously, in dark red we find the official accounts of ac milan and its players. the slate blue sub-community is mainly composed of the official accounts of radio and tv programs of mediaset, the main private italian broadcasting company. finally, another sky blue community is mainly composed of italian embassies around the world. for the sake of completeness, a more detailed description of the composition of the sub-communities in the right panel of the figure is reported in the supplementary material, section 'composition of the subcommunities in the validated network of verified twitter users'. here, we report a series of analyses related to the domain names, hereafter simply called domains, that most often appear in all the tweets of the validated network of verified users. the domains have been tagged according to their degree of credibility and transparency, as indicated by the independent software toolkit newsguard (https://www.newsguardtech.com/). the details of this procedure are reported below. as a first step, we considered the network of verified accounts, whose communities and sub-communities are shown in the figure. on this topology, we labelled all domains that had been shared a minimum number of times (between tweets and retweets). the table (tags used for labeling the domains) shows the tags associated with the domains. in the rest of the paper, we shall be interested in quantifying the reliability of news sources publishing during the period of interest. thus, for our analysis, we will not consider those sources corresponding to social networks, marketplaces, search engines, institutional sites, etc. the tags r, ∼r and nr are used only for news sites, be them newspapers, magazines, tv or radio social channels, and they stand for reputable, quasi reputable, and not reputable, respectively. the label unc is assigned to those domains with too few occurrences in our tweet and retweet dataset to be classified. in fact, the labeling procedure is a hybrid one. as mentioned above, we relied on newsguard, a plugin resulting from the joint effort of journalists and software developers, aiming at evaluating news sites according to nine criteria concerning credibility and transparency.
for evaluating the credibility level, the metrics consider whether the news source regularly publishes false news, does not distinguish between facts and opinions, or does not correct wrongly reported news. for transparency, instead, the tool takes into account whether the owners, founders or authors of the news source are publicly known, and whether advertisements are easily recognizable (newsguard rating process: https://www.newsguardtech.com/ratings/rating-process-criteria/). after combining the individual scores obtained from the nine criteria, the plugin associates to a news source a score from 0 to 100, where 60 is the minimum score for the source to be considered reliable. when reporting the results, the plugin provides details about the criteria which passed the test and those that did not. in order to have a sort of no-man's land and not to be too abrupt in the transition between reputability and non-reputability, when the score was slightly below the reliability threshold, we considered the source to be quasi reputable, ∼r (a sketch of this labelling rule is given below). it is worth noting that not all the domains in the dataset under investigation were evaluated by newsguard at the time of our analysis. for those not evaluated automatically, the annotation was made by three tech-savvy researchers, who assessed the domains by using the same criteria as newsguard. the table gives statistics about the number and kind of tweets (tw = pure tweet; rt = retweet), the number of urls and distinct urls (dist url), and the number of domains and users in the validated network of verified users. we clarify what we mean by these terms with an example: a domain for us corresponds to the so-called 'second-level domain' name (https://en.wikipedia.org/wiki/domain_name), i.e., the name directly to the left of .com, .net, and any other top-level domain. for instance, repubblica.it, corriere.it and nytimes.com are considered domains by us. instead, a url maintains here its standard definition, an example being http://www.example.com/index.html. the table (annotation results over all the domains in the whole dataset, validated network of verified users) shows the outcome of the domain annotation, according to the scores of newsguard or to those assigned by the three annotators when scores were not available from newsguard. at a first glance, the majority of the news domains belong to the reputable category. the second highest percentage is the one of the untagged domains, unc. in fact, in our dataset there are many domains that occur only a few times; many domains appear in the dataset only once. the figure shows the trend of the number of tweets and retweets containing urls posted by the verified users of the validated projection during the period of data collection. going on with the analysis, the table shows the percentage of the different types of domains for the communities identified in the left plot of the figure. it is worth observing that the steel blue community (both politicians and media) is the most active one, even if it is not the most represented: its number of users is lower than the one of the center-left community (the biggest one in terms of numbers), but the number of its posts containing a valid url is almost double that of the second most active community. interestingly, the activity of the verified users of the steel blue community is more focused on content production (see the only-tweets sub-table) than on sharing (see the only-retweets sub-table). in fact, retweets represent a clearly smaller share of all the posts from the media and right wing community than of those from the center-left community.
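the labelling rule can be sketched as follows; the 0-100 scale and the 60-point reliability threshold are newsguard's, while the width of the quasi-reputable band and the minimum-occurrence cutoff are illustrative assumptions.

```python
# a sketch of the domain-labelling rule; band and cutoff are hypothetical.
from typing import Optional

def label_domain(score: Optional[float], occurrences: int,
                 min_occurrences: int = 10, band: float = 10.0) -> str:
    if occurrences < min_occurrences:
        return "UNC"            # too rare to be classified
    if score is None:
        return "MANUAL"         # not rated: fall back to human annotation
    if score >= 60:
        return "R"
    if score >= 60 - band:
        return "~R"             # just below the reliability threshold
    return "NR"

print(label_domain(92.5, 340))  # -> R
print(label_domain(55.0, 340))  # -> ~R
print(label_domain(20.0, 340))  # -> NR
```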
this effect is observable also in the average number of original (only-tweets) posts per verified user: right wing and media users produce, on average, more original posts than center-left wing users. these numbers are probably due to the presence in the former community of the most accessed italian media, which tend to spread their (original) pieces of news on the twitter platform. interestingly, the presence of urls from non-reputable sources in the original tweets of the steel blue community is several times higher than the second highest score in the same field (only tweets). it is worth noting that, for the dark orange and sky blue communities, which are smaller both in terms of users and number of posts, the presence of non-classified sources is quite strong (it represents a sizeable fraction of retweeted posts for both communities), as is the frequency of posts linking to social network contents. interestingly enough, the verified users of both groups seem to focus slightly more on the same domains: on average, there are more posts per url domain for the dark orange and sky blue communities than for the steel blue and dark red communities. the right plot in the figure reports a finer-grained division of the communities: the four largest communities have been further divided into sub-communities, as mentioned above. here, we focus on the urls shared in the purely political sub-communities (see table). broadly speaking, we examine the contribution of the different political parties, as represented on twitter, to the spread of mis/disinformation and propaganda. the table clearly shows how the vast majority of the news coming from sources considered scarcely or non-reputable is tweeted and retweeted by the steel blue political sub-community (fi-l-fdi). notably, the percentage of non-reputable sources shared by the fi-l-fdi accounts is several times the percentage of their parent community (the steel blue one), and several times that of the second community in the nr-ratio ranking. for all the political sub-communities, the incidence of social network links is much higher than in their original communities. looking at the table, even if the number of users in each political sub-community is much smaller, some peculiar behaviours can still be observed. again, the center-right and right wing parties, while being the least represented in terms of users, are much more active than the other groups: each (verified) user is responsible, on average, for more messages than the average user of m5s, iv or pd. it is worth noticing that italia viva, while being a recently founded party, is very active; moreover, for its members the frequency of quasi-reputable sources is quite high, especially in the case of only-tweets posts. the impact of uncategorized sources is almost constant for all communities in the retweeting activity, while it is particularly strong for the m5s. finally, the posts by the center-left communities (i.e., italia viva and the democratic party) tend to have more than one url: every post containing at least a url has, on average, more urls in these communities than in the movimento 5 stelle and in the center-right and right wing parties (a sketch of this per-community breakdown is given below).
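a minimal sketch of the per-community source-quality breakdown reported in the tables, using pandas on toy rows; the real computation runs over the full corpus of posts containing at least one url.

```python
# a sketch of computing per-community shares of source-quality labels.
import pandas as pd

posts = pd.DataFrame({
    "community": ["fi-l-fdi", "fi-l-fdi", "pd", "iv", "m5s", "pd"],
    "label":     ["NR",       "R",        "R",  "~R", "UNC", "R"],
})

shares = (posts.groupby("community")["label"]
               .value_counts(normalize=True)
               .rename("share")
               .reset_index())
print(shares)   # share of R / ~R / NR / UNC posts within each community
```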
to conclude the analysis on the validated network of verified users, we report statistics about the most diffused hashtags in the political sub-communities. fig. focuses on wordclouds, while fig. reports the data in histogram form. from the various hashtags we can derive important information regarding the communications of the various political discursive communities and their positions towards the management of the pandemic. first, it has to be noticed that the m s is the greatest user of hashtags: its two most used hashtags have been used almost twice as much as the most used hashtags of the pd, for instance. this heavy usage is probably due to the presence in this community of journalists and of the official account of il fatto quotidiano, a newspaper explicitly supporting the m s: indeed, the first two hashtags are "#ilfattoquotidiano" and "#edicola" (kiosk, in italian). it is interesting to see the relative importance of hashtags intended to encourage the population during the lockdown: this is the case of "#celafaremo" (we will make it), "#iorestoacasa" (i am staying home) and "#fermiamoloinsieme" (let's stop it together): "#iorestoacasa" is present in every community, but it ranks th in the m s verified user community, th in the fi-l-fdi community, nd in the italia viva community and th in the pd one. remarkably, "#celafaremo" is present only in the m s group, while "#fermiamoloinsieme" can be found in the top hashtags only in the center-right and right wing cluster. the pd, being present in various european institutions, mentions more european related hashtags ("#europeicontrocovid ", europeans against covid- ), in order to ask for a common reaction of the eu. the center-right and right wing community has other hashtags, such as "#forzalombardia" (go, lombardy!), ranking nd, and "#fermiamoloinsieme", ranking th. what is nevertheless astonishing is the presence, among the most used hashtags of all communities, of the names of politicians from the same group (interestingly, "#salvini" is the first used hashtag in the center-right and right wing community, even though he did not hold any position in the government) and of tv programs ("#mattino ", "#lavitaindiretta", "#ctcf", "#dimartedì"), as if the main usage of hashtags were to promote the appearance of politicians in tv programs. finally, the hashtags used by fi-l-fdi are mainly used to criticise the actions of the government, e.g., "#contedimettiti" (conte, resign!). fig. shows the structure of the directed validated projection of the retweet activity network, as the outcome of the procedure recalled in section of the supplementary material. as mentioned in section of the supplementary material, the affiliation of unverified users has been determined using the tags obtained from the validated projected network of the verified users as immutable labels for the label propagation of [ ] (a minimal sketch of this propagation step is given below). after label propagation, the representation of the political communities in the validated retweet network changes dramatically with respect to the case of the network of verified users: the center-right and right wing community is the most represented community in the whole network, with users (representing . % of all the users in the validated network), followed by italia viva with accounts ( . % of all the accounts in the validated network). the impact of m s and pd is much more limited, with, respectively, and accounts. it is worth noting that this result is unexpected, given the recent formation of italia viva.
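the label propagation step mentioned above can be illustrated with a short python sketch. it follows the behaviour described later in the supplementary material (immutable seed labels from the verified users, majority vote among labelled neighbours, ties left unassigned until a later sweep); the data structures and function names are ours, and this is a simplified sketch rather than the exact algorithm of the cited reference.

import random
from collections import Counter

def propagate(adj, seed_labels, max_iter=100):
    # adj: dict node -> set of neighbours (undirected validated retweet graph)
    # seed_labels: dict verified node -> community tag, kept immutable
    labels = dict(seed_labels)
    free = [v for v in adj if v not in seed_labels]
    for _ in range(max_iter):
        random.shuffle(free)
        changed = False
        for v in free:
            votes = Counter(labels[u] for u in adj[v] if u in labels)
            if not votes:
                continue
            top = max(votes.values())
            winners = [lab for lab, c in votes.items() if c == top]
            if len(winners) == 1:
                new = winners[0]
            elif labels.get(v) in winners:
                new = labels[v]
            else:
                continue  # tie: leave the label unassigned until a later sweep
            if labels.get(v) != new:
                labels[v] = new
                changed = True
        if not changed:
            break
    return labels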
as in our previous study targeting online propaganda [ ] , we observe that the most effective users in terms of hub score [ ] are almost exclusively from the center-right and right wing party: considering the first hubs, only are not from this group (a small worked example of the hub-score computation follows this paragraph). interestingly, out of these are verified users: roberto burioni, one of the most famous italian virologists, ranking nd; agenzia ansa, a popular italian news agency, ranking st; and tgcom , the popular newscast of a private tv channel, ranking rd. the fourth account is an online news website, ranking th: this is a non-verified account which belongs to a non-political community. remarkably, in the top hubs we find of the top hubs already found when considering the online debate on migrations from northern africa to italy [ ] : in particular, a journalist of a neo-fascist online newspaper (non-verified user), an extreme right activist (non-verified user) and the leader of fratelli d'italia, giorgia meloni (verified user), who ranks rd in the hub score. matteo salvini (verified user), who was the first hub in [ ] , ranks th, surpassed by his party colleague claudio borghi, ranking th. the first hub in the present network is an extreme right activist, posting videos against african migrants to italy and accusing them of being responsible for the contagion and of violating lockdown measures.
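here is the small hub-score example referenced above. it uses the hits algorithm as implemented in networkx (assumed to be available); the toy edges are invented for illustration, with a directed edge u -> v meaning that v retweeted u, so that widely retweeted producers of content obtain high hub scores.

import networkx as nx

g = nx.DiGraph()
# a directed edge u -> v means that v (significantly) retweeted u
g.add_edges_from([('activist', 'f1'), ('activist', 'f2'), ('activist', 'f3'),
                  ('virologist', 'f2'), ('virologist', 'f4'), ('f1', 'f2')])
hubs, authorities = nx.hits(g, max_iter=1000, normalized=True)
for user, score in sorted(hubs.items(), key=lambda kv: -kv[1])[:3]:
    print(user, round(score, 3))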
table shows the annotation results for all the domains tweeted and retweeted by users in the directed validated network. the numbers are much higher than those shown in table , but the trend confirms the previous results: the majority of urls traceable to news sources are considered reputable. the number of unclassified domains is higher too; in fact, in this case, the annotation was made considering the domains occurring at least times.
table : annotation results over all the domains -directed validated network.
table reports statistics about posts, urls, distinct urls, users and verified users in the directed validated network. noticeably, by comparing these numbers with those of table , which reports statistics about the validated network of verified users, we can see that here the number of retweets is much higher, and the trend is the opposite: verified users tend to tweet more than retweet ( vs ), while users in the directed validated network, which also includes non-verified users, have a number of retweets times higher than the number of their tweets. fig. shows the trend of the number of tweets containing urls over the period of data collection. since we are analysing a bigger network than the one considered in section . , the numbers are one order of magnitude greater than those shown in fig. ; the highest peak, after the discovery of the first cases in lombardy, corresponds to more than , posts containing urls, whereas the analogous peak in fig. corresponds to , posts. apart from the order of magnitude, the two plots feature similar trends: higher traffic before the beginning of the italian lockdown, and a settling down as the quarantine went on [ ] . table shows the core of our analysis, that is, the distribution of reputable and non reputable news sources in the directed validated network, consisting of both verified and non-verified users. again, we focus directly on the political sub-communities identified in the previous subsection. two of the sub-communities are part of the center-left wing community, one is associated with the stars movement, and the remaining one represents the center-right and right wing communities. in line with the previous results on the validated network of verified users, the table clearly shows how the vast majority of the news coming from sources considered scarcely reputable or non reputable is tweeted and retweeted by the center-right and right wing communities; % of the domains tagged as nr are shared by them. as shown in table , the activity of fi-l-fdi users is again extremely high: on average there are . retweets per account in this community, against the . of m s, the . of iv and the . of pd. the right wing contribution to the debate is extremely high even in absolute numbers, due to the large number of users in this community. it is worth mentioning that the frequency of non reputable sources in this community is really high (at about % of the urls in the only tweets) and comparable with that of the reputable ones (see table , only tweets) [ ] .
[ ] the low peaks for february and march are due to an interruption in the data collection, caused by a connection breakdown.
table : domains annotation per political sub-communities -directed validated network.
in the other sub-communities, pd users are more focused on un-categorised sources, while users from both italia viva and movimento stelle are mostly tweeting and retweeting reputable news sources. this holds not only in terms of percentages and users, but also in absolute numbers: out of the over m tweets, more than k tweets refer to an nr url. actually, the political competition still shines through the hashtag usage even for the other communities: it is the case, for instance, of italia viva. in its top hashtags we can find '#salvini', '#lega', but also '#papeete' [ ] , '#salvinisciacallo' (salvini jackal) and '#salvinimmmerda' (salvini asshole). on the other hand, in italia viva, hashtags supporting the population during the lockdown are also used: '#iorestoacasa', '#restoacasa' (i am staying home), '#restiamoacasa' (let's stay home). criticism towards the management of the lombardy health system during the pandemic can be deduced from the hashtags '#commissariatelalombardia' (put lombardy under receivership) and '#fontana' (the lega administrator of the lombardy region). movimento stelle has the name of the main leader of the opposition, '#salvini', as its first hashtag, and supports criticism of the lombardy administration with the hashtags '#fontanadimettiti' (fontana, resign!) and '#gallera', the health and welfare minister of the lombardy region, considered the main responsible for the bad management of the pandemic. nevertheless, it is possible to highlight some hashtags encouraging the population during the lockdown, such as the above mentioned '#iorestoacasa', '#restoacasa' and '#restiamoacasa'. it is worth mentioning that the government measures, and the corresponding m s campaigns, are accompanied by specific hashtags: '#curaitalia' is the name of one of the decrees of the prime minister to inject liquidity into the italian economy, while '#acquistaitaliano' (buy italian products!) advertises italian products to support the national economy. as a final task, over the whole set of tweets produced or shared by the users in the directed validated network, we counted the number of times a message containing a url was shared by users belonging to different political communities, although without considering the semantics of the tweets. namely, we ignored whether the urls were shared to support or to oppose the presented arguments.
table shows the most tweeted (and retweeted) nr domains shared by the political communities presented in table ; the number of occurrences is reported next to each domain. the first nr domains for fi-l-fdi in table are related to right, extreme right and neo-fascist propaganda, as is the case of imolaoggi.it, ilprimatonazionale.it and voxnews.info, recognised as disinformation websites by newsguard and by the two main italian debunking websites, bufale.net and butac.it. as shown in the table, some domains, although with different numbers of occurrences, are present under more than one column, thus shared by users close to different political communities. this could mean, for some subgroups of the community, a retweet with the aim of supporting the opinions expressed in the original tweets. however, since the semantics of the posts in which these domains are present was not investigated, the retweeting of the links by more than one political community could be intended to contrast, and not to support, the opinions present in the original posts. despite the fact that the results were achieved for a specific country, we believe that the applied methodology is of general interest, being able to show trends and peculiarities whenever information is exchanged on social networks. in particular, when analysing the outcome of our investigation, some features attracted our attention. persistence of clusters wrt different discussion topics: in caldarelli et al. [ ] , we focused on tweets concerned with immigration, an issue that has been central in the italian political debate for years. here, we discovered that the clusters and the echo chambers that were detected when analysing tweets about immigration are almost the same as those singled out when considering discussions concerned with covid- . this may seem surprising, because a discussion about covid- may not be exclusively political, but also medical, social, economic, etc. from this we can argue that the clusters are political in nature and, even when the topic of discussion changes, users remain in their cluster on twitter. (indeed, journalists and politicians use twitter for information and political propaganda, respectively.) why political polarisation and political visions of the world so strongly affect the analysis of what should be an objective phenomenon is still an intriguing question. persistence of online behavioural characteristics of clusters: we found that the most active, lively and penetrating online communities in the online debate on covid- are the same found in [ ] , formed in an almost purely political debate such as the one represented by the right of migrants to land on the italian territory. (dis)similarities between offline and online behaviours of members and voters of parties: maybe less surprisingly, political habits are also reflected in the degree of participation in the online discussions. in particular, among the parties on the centre-left-wing side, a small party (italia viva) shows a much more effective social presence than the larger party of the italian centre-left wing (partito democratico), which has many more active members and more parliamentary representation. more generally, there is a significant difference in social presence among the different political parties, and the amount of activity is not at all proportional to the size of the parties in terms of members and voters.
spread of non reputable news sources: in the online debate about covid- , many links to non reputable news sources (defined as such by newsguard, a toolkit ranking news websites based on criteria of transparency and credibility, led by veteran journalists and news entrepreneurs) are posted and shared. the kind and occurrences of the urls vary with respect to the corresponding political community. furthermore, some of the communities are characterised by a small number of verified users corresponding to a very large number of acolytes which are, in turn, very active, three times as much as the acolytes of the opposite communities in the partition. in particular, when considering the amount of retweets from poorly reputable news sites, one of the communities is by far (one order of magnitude) more active than the others. as noted already in our previous publication [ ] , this extra activity could be explained by a more skilled use of the systems of propaganda -in that case, a massive use of bot accounts and a targeted activity against migrants (as resulted from the analysis of the hub list). our work could help in steering the online political discussion around covid- towards an investigation of reputable information, while providing a clear indication of the political inclination of those participating in the debates. more generally, we hope that our work will contribute to finding appropriate strategies to fight online misinformation. while not completely unexpected, it is striking to see how political polarisation affects also the covid- debate, giving rise to online communities of users that, in number and structure, closely correspond to their political affiliations. this section recaps the methodology through which we have obtained the communities of verified users (see section . ). this methodology was designed in saracco et al. [ ] and applied in the field of social networks for the first time in [ , ] . for the sake of completeness, the supplementary material, section , recaps the methodology through which we have obtained the validated retweet activity network shown in section . . in section of the supplementary material, the detection of the affiliation of unverified users is described. in the supplementary material, the interested reader will also find additional details about ) the definition of the null models (section ); ) a comparison among various label propagations for the political affiliation of unverified users (section ); and ) a brief state of the art on fact-checking organizations and the literature on false news detection (section ). many results in the analysis of online social networks (osn) show that users are highly clustered in groups of opinions [ , - , , , ] ; indeed, those groups display some peculiar behaviours, such as the echo chamber effect [ , ] . following the example of references [ , ] , we make use of this users' clustering in order to detect discursive communities, i.e. groups of users interacting among themselves by retweeting on the same (covid-related) subjects. remarkably, our procedure does not rely on the analysis of the text shared by the various users, but simply on the retweeting activity among users. in the present subsection we will examine how the discursive communities of verified twitter users can be extracted. on twitter there are two distinct categories of accounts: verified and unverified users.
verified users have a tick close to the screen name: the platform itself, upon request from the user, has a procedure to check the authenticity of the account. verified accounts are owned by politicians, journalists or vips in general, as well as the official accounts of ministers, newspapers, newscasts, companies and so on; for those kinds of users, the verification procedure guarantees the identity of their account and reduces the risk of malicious accounts tweeting in their name. non verified accounts are for standard users: in this second case, we cannot trust any information provided by the users. the information carried by verified users has been studied extensively in order to have a sort of anchor for the related discussion [ , , , , ] . to detect the political orientation, we consider the bipartite network represented by verified (on one layer) and unverified (on the other layer) accounts: a link connects the verified user v with the unverified one u if at least once v was retweeted by u, or vice versa. to extract the similarity of users, we compare the commonalities with a bipartite entropy-based null model, the bipartite configuration model (bicm [ ] ). the rationale is that two verified users that share many links to the same unverified accounts probably have similar visions, as perceived by the audience of unverified accounts. we then apply the method of [ ] , graphically depicted in fig. , in order to get a statistically validated projection of the bipartite network of verified and unverified users. in a nutshell, the idea is to compare the amount of common linkage measured on the real network with the expectations of an entropy-based null model fixing (on average) the degree sequence: if the associated p-value is so low that the overlaps cannot be explained by the model, i.e. such that it is not compatible with the degree sequence expectations, they carry non trivial information, and we project the related information onto the (monopartite) projection of verified users. the interested reader can find the technical details about this validated projection in [ ] and in the supplementary information. the data that support the findings of this study are available from twitter, but restrictions apply to the availability of these data, which were used under license.
italian socio-political situation during the period of data collection. in the present subsection we present some crucial facts for the understanding of the social context in which our analysis is set. this subsection is divided into two parts: the contagion evolution and the political situation. these two aspects are closely related. a first covid- outbreak was detected in codogno, lodi, lombardy region, on february th [ ] . on the very next day, two cases were detected in vò, padua, veneto region. on february th, in order to contain the contagions, the national government decided to put in quarantine municipalities in the area around lodi and vò, near padua [ ] . nevertheless, the number of contagions rose to , hitting different regions; one of the infected persons in vò died, representing the first registered italian covid- victim [ ] . on february th there were already confirmed cases in italy.
the first lockdown should have lasted until the th of march but, due to the still increasing number of contagions in northern italy, the italian prime minister giuseppe conte intended to extend the quarantine zone to almost all of northern italy on sunday, march th [ ] : travel to and from the quarantine zone was limited to cases of extreme urgency. a draft of the decree announcing the expansion of the quarantine area appeared on the website of the italian newspaper corriere della sera on the late evening of saturday th, causing some panic in the interested areas [ ] : around people living in milan, but coming from southern regions, took trains and planes to reach their places of origin [ ] [ ] . in any case, the new quarantine zone covered the entire lombardy and, partially, other regions. remarkably, close to bergamo, lombardy region, a new outbreak was discovered, and the possibility of defining a new quarantine area on march th was considered: this opportunity was later abandoned, due to the new northern italy quarantine zone of the following days. this delay seems to have caused a strong increase in the number of contagions, making the bergamo area the most affected one, in percentage, of the entire country [ ] ; at the time of writing, there are investigations regarding the responsibility for this choice. on march th, the lockdown was extended to the whole country, making italy the first country in the world to decide on a national quarantine [ ] . travel was restricted to emergency reasons or to work; all business activities that were not considered essential, such as pharmacies and supermarkets, had to close. until the st of march, lockdown measures became progressively stricter all over the country. starting from the th of april, some retail activities, such as children's clothing shops, reopened. a first fall in the number of deaths was observed on the th of april [ ] . a limited reopening started with the so-called "fase " (phase ) on the th of may [ ] . from the very first days of march, the limited capacity of the intensive care departments to take care of covid-infected patients led to the necessity of a re-organization of italian hospitals, leading, e.g., to the opening of new intensive care departments [ ] . moreover, new forms of communication with the relatives of the patients were proposed, new criteria for intubating patients were developed and, in the extreme crisis, in the most infected areas, the emergency management gave priority in hospitalisation to patients with a higher probability of recovery [ ] . outbreaks were mainly present in hospitals [ ] . unfortunately, healthcare workers were infected by the virus [ ] . this contagion resulted in a relatively high number of fatalities: by the nd of april, covid deaths had been registered among doctors. due to the pressure on the intensive care capacity, the healthcare personnel was also subject to extreme stress, especially in the most affected zones [ ] .
[ ] prima lodi, ""paziente ", il merito della diagnosi va diviso... per due", th june . [ ] italian gazzetta ufficiale, "decreto-legge febbraio , n. "; the date is intended to be the very first day of validity of the decree. [ ] il fatto quotidiano, "coronavirus,è morto il enne ricoverato nel padovano. contagiati in lombardia, un altro in veneto", nd february . [ ] bbc news, "coronavirus: northern italy quarantines million people", th march . [ ] the guardian, "leaked coronavirus plan to quarantine m sparks chaos in italy", th march .
on august th, , the leader of lega, the main italian right wing party, announced the withdrawal of his support for the government of giuseppe conte, which had been formed after a post-election coalition between the m s and the lega. a new government was then formed, supported by a coalition between the m s and the pd; shortly afterwards, matteo renzi formed a new center-left party, italia viva (italy alive, iv), due to some discord with the pd. despite the scission, italia viva continued to support the new government, having some of its representatives among the ministers and undersecretaries, but often marking its distance with respect to both pd and m s. due to the great impact that matteo salvini and giorgia meloni -leader of fratelli d'italia, a right wing party- have on social media, they started a massive campaign against the government the day after its inauguration. the regions of lombardy, veneto, piedmont and emilia-romagna experienced the highest numbers of contagions during the pandemic; among those, the first three are administrated by right and center-right wing parties, the fourth one by the pd. the disagreement on the management of the pandemic between the regions and the central government was the occasion to exacerbate the political debate (in italy, regions have a quite wide autonomy for healthcare). the regions administrated by the right wing parties criticised the centrality of the decisions regarding the lockdown, while the national government criticised the health management (in lombardy the healthcare system has a peculiar organisation, in which the private sector is supported by public funding) and its ineffective measures to reduce the number of contagions. the debate raged even at the national level: the opposition criticized the financial origin of the support to the various economic sectors. moreover, the role of the european union in providing funding to help the italian economy recover after the pandemic was debated. here, we detail the composition of the communities shown in figure of the main text. we remind the reader that, after applying the louvain algorithm to the validated network of verified twitter users, we could observe main communities, corresponding to: right wing parties and media (in steel blue); center-left wing (dark red); stars movement (m s), in dark orange; institutional accounts (in sky blue). starting from the center-left wing, we can find a darker red community, including various ngos (the italian chapters of unicef, medecins sans frontieres, action aid, emergency, save the children, etc.) and various left oriented journalists, vips and pundits [ ] . we can also find in this group political movements ('sardine') and politicians to the left of the pd (such as beppe civati, pietro grasso, ignazio marino) or on the left current of the pd (laura boldrini, michele emiliano, stefano bonaccini). a slightly lighter red sub-community turns out to be composed of the main politicians of the italian democratic party (pd), as well as of representatives in the european parliament (italian and others) and some eu commissioners. the violet red group is mostly composed of the representatives of the newly founded italia viva and of the former italian prime minister matteo renzi (december -february ), former secretary of the pd. in golden red we can find the subcommunity of catholic and vatican groups. finally, the dark violet red and light tomato subcommunities are composed mainly of journalists. interestingly enough, the dark violet red one also contains accounts related to the city of milan (the mayor, the municipality, the public services account) and the spokesperson of the chinese ministry of foreign affairs.
in turn, also the orange (m s) community shows a clear partition into substructures. in particular, the dark orange subcommunity contains the accounts of politicians, parliament representatives and ministers of the m s, journalists, and the official account of il fatto quotidiano, a newspaper supporting the stars movement. interestingly, since one of the main leaders of the movement, luigi di maio, is also the italian minister of foreign affairs, we can find in this subcommunity the accounts of several italian embassies around the world, as well as the accounts of the italian representatives at nato, ocse and oas. in aquamarine, we can find the official accounts of some private and public, national and international, health institutes (such as the italian istituto superiore di sanità, literally the italian national institute of health, the world health organization, the fondazione veronesi), the minister of health roberto speranza, and some foreign embassies in italy. finally, in the light slate blue subcommunity we can find various italian ministers as well as the italian police and army forces. similar considerations apply to the steel blue community. in steel blue we find the subcommunity of center right and right wing parties (such as forza italia, lega and fratelli d'italia). the presidents of the regions of lombardy, veneto and liguria, administrated by center right and right wing parties, can be found here. (in the following, this subcommunity is going to be called fi-l-fdi, recalling the initials of the political parties contributing to this group.) the sky blue subcommunity includes the national federations of various sports, the official accounts of athletes and sport players (mostly soccer) and their teams, as well as sport journals, newscasts and journalists. the teal subcommunity contains the main italian news agencies, some of the main national and local newspapers, newscasts and their journalists [ ] . in this subcommunity there are also the accounts of many universities; interestingly enough, it also includes all the local public service newscasts. the firebrick subcommunity contains accounts related to the as roma football club; analogously, the dark red one contains the official accounts of ac milan and its players. the slate blue subcommunity is mainly composed of the official accounts of radio and tv programs of mediaset, the main private italian broadcasting company, together with singers and musicians. other smaller subcommunities include other sport federations and sports pundits. finally, the sky blue community is mainly composed of italian embassies around the world. the navy subpartition also contains the official accounts of the president of the republic, the italian minister of defense and that of the commissioner for economy at the eu and former prime minister, paolo gentiloni.
[ ] such as the cartoonists makkox and vauro, the singers marracash, frankie hi-nrg and ligabue, the il volo vocal band, and journalists from repubblica (ezio mauro, carlo verdelli, massimo giannini) and from the la tv channel (ricardo formigli, diego bianchi).
in the study of every phenomenon, it is of utmost importance to distinguish the relevant information from the noise. here, we recall a framework to obtain a validated monopartite retweet network of users: the validation accounts for the information carried not only by the activity of the users, but also by the virality of their messages. we represented the method pictorially in fig. .
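before turning to the formal definitions, a minimal python sketch of how the two biadjacency blocks of this user-tweet network can be built from raw records; the record format (tweet id, author, retweeter) is an assumption of ours for illustration only.

import numpy as np

# toy records: (tweet_id, author, retweeter); retweeter is None for original posts
records = [(0, 'a', None), (0, 'a', 'b'), (1, 'c', None), (1, 'c', 'b'), (2, 'a', None)]

users = sorted({u for _, a, r in records for u in (a, r) if u})
tweets = sorted({t for t, _, _ in records})
ui = {u: i for i, u in enumerate(users)}
ti = {t: i for i, t in enumerate(tweets)}

m = np.zeros((len(users), len(tweets)), dtype=int)  # m[i, a] = 1 if user i wrote post a
n = np.zeros((len(users), len(tweets)), dtype=int)  # n[i, a] = 1 if user i retweeted post a
for t, a, r in records:
    m[ui[a], ti[t]] = 1
    if r is not None:
        n[ui[r], ti[t]] = 1
print(m.sum(axis=1))  # out-degrees of the authorship block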
we define a directed bipartite network in which one layer is composed of accounts and the other one of tweets. an arrow connecting a user u to a tweet t represents u writing the message t; the arrow in the opposite direction means that the user u is retweeting the message t. to filter random noise out of this network, we make use of the directed version of the bicm, i.e. the bipartite directed configuration model (bidcm [ ] ). the projection procedure is then analogous to the one presented in the previous subsection; it is pictorially displayed in fig. . briefly, consider the couple of users u and u′ and the number of messages written by u and shared by u′. then, calculate the distribution of the same measure according to the bidcm: if the related p-value is statistically significant, i.e. if the number of u's tweets shared by u′ is much higher than expected under the bidcm, we project a (directed) link from u to u′. summarising, the comparison of the observations on the real network with the bidcm permits us to uncover all contributions that cannot originate from the constraints of the null model. using the technique described in subsection . of the main text, we are able to assign a community to almost all verified users, based on the perception of the unverified users. due to the fact that the identities of verified users are checked by twitter, we have the possibility of controlling our groups. indeed, as we will show in the following, the network obtained via the bipartite projection provides a reliable description of the closeness of opinions and roles in the social debate. how can we use this information in order to infer the orientation of non verified users? in reference [ ] we used the tags obtained for both verified and unverified users in the bipartite network described in subsection . of the main text and propagated those labels across the network.
figure : schematic representation of the projection procedure for a bipartite directed network. a) an example of a real directed bipartite network; for the actual application, the two layers represent twitter accounts (turquoise) and posts (gray). a link from a turquoise node to a gray one represents that the post has been written by the user; a link in the opposite direction represents a retweet by the considered account. b) the bipartite directed configuration model (bidcm) ensemble is defined; the ensemble includes all the link realisations, once the number of nodes per layer has been fixed. c) we focus our attention on nodes i and j and count the number of directed common neighbours (in magenta both the nodes and the links to their common neighbours), i.e., the number of posts written by i and retweeted by j. subsequently, d) we compare this measure on the real network with the one on the ensemble: if this overlap is statistically significant with respect to the bidcm, e) we have a link from i to j in the projected network.
in a recent analysis, we observed that other approaches are more stable [ ] : in the present manuscript we make use of the most stable algorithm, i.e. the label propagation proposed in [ ] , applied to the directed validated network. in the present appendix we recall the main steps of the definition of an entropy-based null model; the interested reader can refer to the review [ ] . we start by revising the bipartite configuration model [ ] , which has been used for detecting the network of similarities of verified users.
we are then going to examine the extension of this model to bipartite directed networks [ ] . finally, we present the general methodology to project the information contained in a -directed or undirected- bipartite network, as developed in [ ] . let us consider a bipartite network $G^*_{bi}$, in which the two layers are $L$ and $\Gamma$. define $\mathcal{G}_{bi}$ as the ensemble of all possible graphs with the same number of nodes per layer as in $G^*_{bi}$. it is possible to define the entropy related to the ensemble as [ ]

$$S = -\sum_{G_{bi}\in\mathcal{G}_{bi}} P(G_{bi}) \ln P(G_{bi}),$$

where $P(G_{bi})$ is the probability associated to the instance $G_{bi}$. now we want to obtain the maximum entropy configuration, constraining some relevant topological information regarding the system. for the bipartite representation of verified and unverified users, a crucial ingredient is the degree sequence, since it is a proxy of the number of interactions (i.e. tweets and retweets) with the other class of accounts; thus, in the present manuscript, we focus on the degree sequence. let us then maximise the entropy above, constraining the average over the ensemble of the degree sequence. it can be shown [ ] that the probability distribution over the ensemble is

$$P(G_{bi}) = \prod_{i,\alpha} p_{i\alpha}^{m_{i\alpha}} \left(1-p_{i\alpha}\right)^{1-m_{i\alpha}},$$

where $m_{i\alpha}$ represents the entries of the biadjacency matrix describing the bipartite network under consideration and $p_{i\alpha}$ is the probability of observing a link between the nodes $i \in L$ and $\alpha \in \Gamma$. the probability $p_{i\alpha}$ can be expressed in terms of the lagrangian multipliers $x$ and $y$ for nodes on the $L$ and $\Gamma$ layers, respectively, as

$$p_{i\alpha} = \frac{x_i y_\alpha}{1 + x_i y_\alpha}.$$

in order to obtain the values of $x$ and $y$ that maximise the likelihood of observing the real network, we need to impose the following conditions [ , ]

$$\begin{cases} k_i^* = \sum_{\alpha\in\Gamma} p_{i\alpha} & \forall\, i \in L, \\ k_\alpha^* = \sum_{i\in L} p_{i\alpha} & \forall\, \alpha \in \Gamma, \end{cases}$$

where the * indicates quantities measured on the real network (a minimal numerical sketch of solving these conditions is given at the end of this subsection). actually, the real network is sparse: the bipartite network of verified and unverified users has a connectance $\rho \simeq . \times 10^{- }$. in this case the formula above can be safely approximated with the chung-lu configuration model, i.e.

$$p_{i\alpha} \simeq \frac{k_i k_\alpha}{m},$$

where $m$ is the total number of links in the bipartite network. in the present subsection we will consider the extension of the bicm to directed bipartite networks and highlight the peculiarities of the network under analysis in this representation. the adjacency matrix describing a directed bipartite network with layers $L$ and $\Gamma$ has a peculiar block structure, once nodes are ordered by layer membership (here the nodes on the $L$ layer first):

$$A = \begin{pmatrix} O & M \\ N^T & O \end{pmatrix},$$

where the $O$ blocks represent null matrices (indeed they describe links connecting nodes inside the same layer: by construction they are exactly zero) and $M$ and $N$ are non-zero blocks, describing links connecting nodes on layer $L$ with those on layer $\Gamma$ and vice versa. in general $M \neq N$, otherwise the network is not distinguishable from an undirected one. we can perform the same machinery as in the section above, but extending the degree sequence to a directed degree sequence, i.e. considering the in- and out-degrees $k_i^{out} = \sum_\alpha m_{i\alpha}$ and $k_i^{in} = \sum_\alpha n_{i\alpha}$ for nodes on the layer $L$ (here $m_{i\alpha}$ and $n_{i\alpha}$ represent respectively the entries of matrices $M$ and $N$), and $k_\alpha^{in} = \sum_i m_{i\alpha}$ and $k_\alpha^{out} = \sum_i n_{i\alpha}$ for nodes on the layer $\Gamma$. the definition of the bipartite directed configuration model (bidcm, [ ] ), i.e. the extension of the bicm above, follows closely the same steps described in the previous subsection. interestingly enough, the probabilities relative to the presence of links from $L$ to $\Gamma$ are independent of the probabilities relative to the presence of links from $\Gamma$ to $L$.
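before moving to the directed-case probabilities, here is the minimal numerical sketch, referenced above, of solving the bicm likelihood conditions by naive fixed-point iteration. dedicated packages for the bicm exist, but we avoid assuming their api; convergence is not guaranteed for this simple scheme, and the initialisation is a heuristic.

import numpy as np

def bicm_probabilities(k_rows, k_cols, n_iter=5000):
    # solve k_i = sum_a x_i y_a / (1 + x_i y_a), and the symmetric condition,
    # by fixed-point iteration (an unoptimised sketch, not a production solver)
    k_rows = np.asarray(k_rows, dtype=float)
    k_cols = np.asarray(k_cols, dtype=float)
    x = k_rows.copy() + 1e-9
    y = k_cols.copy() + 1e-9
    for _ in range(n_iter):
        denom = 1.0 + np.outer(x, y)
        x = k_rows / (y / denom).sum(axis=1)
        denom = 1.0 + np.outer(x, y)
        y = k_cols / (x[:, None] / denom).sum(axis=0)
    xy = np.outer(x, y)
    return xy / (1.0 + xy)   # p[i, a]: link probability under the bicm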
if $q_{i\alpha}$ is the probability of observing a link from node $i$ to node $\alpha$ and $\tilde{q}_{i\alpha}$ the probability of observing a link in the opposite direction, we have

$$q_{i\alpha} = \frac{x_i^{out} y_\alpha^{in}}{1 + x_i^{out} y_\alpha^{in}}, \qquad \tilde{q}_{i\alpha} = \frac{x_i^{in} y_\alpha^{out}}{1 + x_i^{in} y_\alpha^{out}},$$

where $x_i^{out}$ and $x_i^{in}$ are the lagrangian multipliers relative to the node $i \in L$, respectively for the out- and the in-degrees, and $y_\alpha^{out}$ and $y_\alpha^{in}$ are the analogous quantities for $\alpha \in \Gamma$. in the present application we have some simplifications: the bipartite directed network representation describes users (on one layer) writing and retweeting posts (on the other layer). if users are on the layer $L$ and posts on the opposite layer, and $m_{i\alpha}$ represents the user $i$ writing the post $\alpha$, then $k_\alpha^{in} = 1 \;\forall\, \alpha \in \Gamma$, since each message cannot have more than one author. notice that, since our constraints are conserved on average, we are considering, in the ensemble of all possible realisations, even instances in which $k_\alpha^{in} > 1$ or $k_\alpha^{in} = 0$, or, otherwise stated, non-physical ones; nevertheless, the average is constrained to the right value, i.e. $1$. the fact that $k_\alpha^{in}$ is the same for every $\alpha$ allows for a great simplification of the probability per link on $M$:

$$q_{i\alpha} = \frac{k_i^{out}}{N_\Gamma},$$

where $N_\Gamma$ is the total number of nodes on the $\Gamma$ layer. this simplification is extremely helpful in the projected validation of the bipartite directed network [ ] . the information contained in a bipartite -directed or undirected- network can be projected onto one of the two layers. the rationale is to obtain a monopartite network encoding the non-trivial interactions among the two layers of the original bipartite network. the method is pretty general, once we have a null model in which the probabilities per link are independent, as is the case for both the bicm and the bidcm [ ] . the first step is represented by the definition of a bipartite motif that may capture the non-trivial similarity (in the case of an undirected bipartite network) or flux of information (in the case of a directed bipartite network). this quantity can be captured by the number of v-motifs between users $i$ and $j$, $V_{ij} = \sum_\alpha m_{i\alpha} m_{j\alpha}$ [ , ] , or by its directed extension $V_{i\to j} = \sum_\alpha m_{i\alpha} n_{j\alpha}$ (note that, in the directed case, $V_{i\to j} \neq V_{j\to i}$). we compare the abundance of these motifs with the null models defined above: all motifs that cannot be explained by the null model, i.e. whose p-values are statistically significant, are validated into the projection on one of the layers [ ] . in order to assess the statistical significance of the observed motifs, we calculate the distribution associated with the various motifs. for instance, the expected value for the number of v-motifs connecting $i$ and $j$ in an undirected bipartite network is

$$\langle V_{ij} \rangle = \sum_\alpha p_{i\alpha}\, p_{j\alpha},$$

where the $p_{i\alpha}$s are the probabilities of the bicm. analogously,

$$\langle V_{i\to j} \rangle = \sum_\alpha q_{i\alpha}\, \tilde{q}_{j\alpha} = \frac{k_i^{out}}{N_\Gamma} \sum_\alpha \tilde{q}_{j\alpha},$$

where in the last step we use the simplification above. in both the directed and the undirected case, the distribution of the v-motifs, or of their directed extensions, is a poisson-binomial one, i.e. a binomial distribution in which each event has a different probability. in the present case, due to the sparsity of the analysed networks, we can safely approximate the poisson-binomial distribution with a poisson one [ ] . in order to state the statistical significance of the observed value, we calculate the related p-values according to the relative null models. once we have a p-value for every detected v-motif, the related statistical significance can be established through the false discovery rate (fdr) procedure [ ] . with respect to other multiple-hypothesis tests, fdr controls the number of false positives.
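putting the pieces of this subsection together, the following sketch validates an undirected projection: it computes the observed and expected v-motifs, the poisson p-values, and applies the fdr correction. statsmodels is assumed to be available, and the function name and significance level are our choices.

import numpy as np
from scipy.stats import poisson
from statsmodels.stats.multitest import multipletests

def validated_projection(biadj, p, alpha=0.05):
    # biadj: observed binary biadjacency matrix (rows: users, cols: posts)
    # p: bicm link probabilities, same shape as biadj
    obs = biadj @ biadj.T                 # V_ij: observed common neighbours
    mu = p @ p.T                          # <V_ij> = sum_a p_ia p_ja
    iu = np.triu_indices(biadj.shape[0], k=1)
    # poisson approximation of the poisson-binomial (sparse regime)
    pvals = poisson.sf(obs[iu] - 1, mu[iu])   # P(V >= observed value)
    reject = multipletests(pvals, alpha=alpha, method='fdr_bh')[0]
    return [(i, j) for i, j, r in zip(iu[0], iu[1], reject) if r]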
in our case, all rejected hypotheses identify the v-motifs that cannot be explained only by the ingredients of the null model and thus carry non-trivial information regarding the system. in this sense, the validated projected network includes a link for every rejected hypothesis, connecting the nodes involved in the related motifs. in the main text, we solved the problem of assigning the orientation to all relevant users in the validated retweet network via a label propagation. the approach is similar to, but different from, the one proposed in [ ] , the differences lying in the starting labels, in the label propagation algorithm and in the network used. in this section we will review the method employed in the present article, compare it to the one in [ ] , and evaluate the deviations from other approaches. the first step of our methodology is to extract the polarisation of verified users from the bipartite network, as described in section . of the main text, in order to use it as seed labels in the label propagation. in reference [ ] , a measure of the "adherence" of the unverified users towards the various communities of verified users was used in order to infer their orientation, following the approach in [ ] , in turn based on the polarisation index defined in [ ] . this approach performed extremely well when practically all unverified users interacted at least once with a verified one, as in [ ] . while still having good performance on a different dataset such as the one studied in [ ] , we observed isolated deviations: it was the case of users with frequent interactions with other unverified accounts of the same (political) orientation, who randomly retweeted a verified user of a different discursive community. in this case, by focusing just on the interactions with verified accounts, those nodes were assigned a wrong orientation. the labels for the polarisation of the unverified users defined in [ ] were subsequently used as seed labels in the label propagation. due to the possibility, described above, of wrongly assigning labels to unverified accounts, in the present paper we consider only the tags of verified users, since they pass a strict validation procedure and are more stable. in order to compare the results obtained with the various approaches, we calculated the variation of information (vi, [ ] ). vi measures exactly the difference in information content captured by two different partitions, in terms of the shannon entropy (a compact implementation is sketched below). results are reported in the matrix in figure for the th of february (results are similar for other days). even when using the weighted retweet network as the "exact" result, the partition found by the label propagation of our approach shows a small loss of information, comparable with that of an unweighted approach. indeed, the results found by the various community detection algorithms show little agreement with the label propagation ones. nevertheless, we still prefer the label propagation procedure, since the validated projection on the layer of verified users is theoretically sound and has a non-trivial interpretation.
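as referenced above, a compact sketch of the variation of information between two partitions, under its standard definition vi(x, y) = h(x|y) + h(y|x); the dictionary-based interface is our choice for the sketch.

import math
from collections import Counter

def variation_of_information(part_x, part_y):
    # part_x, part_y: dicts mapping the same set of nodes to community labels
    n = len(part_x)
    joint = Counter((part_x[v], part_y[v]) for v in part_x)
    px = Counter(part_x.values())
    py = Counter(part_y.values())
    vi = 0.0
    for (cx, cy), nxy in joint.items():
        pxy = nxy / n
        vi -= pxy * (math.log(pxy * n / px[cx]) + math.log(pxy * n / py[cy]))
    return vi   # vi >= 0, and vi = 0 for identical partitions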
the main result of this work quantifies the level of diffusion on twitter of news published by sources considered scarcely reputable. academia, governments and news agencies are working hard to classify information sources according to criteria of credibility and transparency of the published news. this is the case, for example, of newsguard, which we used for tagging the most frequent domains in the directed validated network obtained according to the methodology presented in the previous sections. as introduced in subsection . of the main text, the newsguard browser extension and mobile app (https://www.newsguardtech.com/) [ ] offer a reliability result for the most popular newspapers in the world, summarizing with a numerical score the level of credibility and journalistic transparency of the newspaper. with the same philosophy, but oriented towards us politics, the fact-checking site politifact.com reports with a 'truth meter' the degree of truthfulness of original claims made by politicians, candidates, their staffs and, more in general, protagonists of us politics. one of the oldest fact-checking websites dates back to : snopes.com, in addition to political figures, is a fact-checker for hoaxes and urban legends. generally speaking, a fact-checking site has behind it a multitude of editors and journalists who, with a great deal of energy, manually check the reliability of a news item, or of the publisher of that news, by evaluating criteria such as the tendency to correct errors, the nature of the newspaper's finances, and whether there is a clear differentiation between opinions and facts. it is worth noting that recent attempts have tried to automatically find articles worthy of being fact-checked. for example, the work in [ ] uses a supervised classifier, based on an ensemble of neural networks and support vector machines, to figure out which politicians' claims need to be debunked and which have already been debunked. despite the tremendous effort of stakeholders to keep the fact-checking sites up to date and functioning, disinformation resists debunking due to a combination of factors. there are psychological aspects, like the quest for belonging to a community and getting reassuring answers, the adherence to one's viewpoint, and a native reluctance to change opinion [ , ] , as well as the formation of echo chambers [ ] , where people polarize their opinions as they are insulated from contrary perspectives: these are key factors for people to contribute to the success of disinformation spreading [ , ] . moreover, researchers have demonstrated how the spreading of false news is strategically supported by the massive and organized use of trolls and bots [ ] . despite the need to educate users in a conscious fruition of online information, also through means different from technological solutions, there is a series of promising works that exploit classifiers based on machine learning or on deep learning to tag a news item as credible or not. one interesting approach is based on the analysis of spreading patterns on social platforms. monti et al. recently provided a deep learning framework for the detection of fake news cascades [ ] . a ground truth is acquired by following the example of vosoughi et al. [ ] , collecting twitter cascades of verified false and true rumors. employing a novel deep learning paradigm for graph-based structures, cascades are classified based on user profile, user activity, network and spreading, and content. the main result of the work is that 'a few hours of propagation are sufficient to distinguish false news from true news with high accuracy'. this result has been confirmed by other studies too.
the work in [ ] by zhao et al. examines diffusion cascades on weibo and twitter: focusing on topological properties, such as the number of hops from the source and the heterogeneity of the network, the authors demonstrate that the networks in which fake news is diffused feature characteristics very different from those diffusing genuine information. the investigation of diffusion networks thus appears to be a definitive path to follow for fake news detection. this is also confirmed by pierri et al. [ ] : here too, the goal is to classify news articles pertaining to bad and genuine information 'by solely inspecting their diffusion mechanisms on twitter'. even in this case, the results are impressive: a simple logistic regression model is able to correctly classify news articles with high accuracy (auroc up to %).
analysis of online misinformation during the peak of the covid- pandemics in italy: supplementary material. guido caldarelli, rocco de nicola, marinella petrocchi, manuel pratelli and fabio saracco.
there is another difference in the label propagation used here with respect to the one in [ ] : in the present paper we used the label propagation of [ ] , while the one in [ ] was quite home-made. as in reference [ ] , the seed labels of [ ] are fixed, i.e. they are not allowed to change [ ] . the main difference is that, in case of a draw among the labels of the first neighbours, in [ ] a tie is removed randomly, while in the algorithm of [ ] the label is not assigned and goes into a new run, with the newly assigned labels. moreover, the update of labels in [ ] is asynchronous, while it is synchronous in [ ] . we opted for the one in [ ] as it is actually a standard among label propagation algorithms, being stable, more studied, and faster [ ] . finally, differently from the procedure in [ ] , we applied the label propagation not to the entire (undirected version of the) retweet network, but to the (undirected version of the) validated one. (the intent in choosing the undirected version is that, in both cases in which a generic account is significantly retweeting or being retweeted by another one, the two probably share some vision of the phenomena under analysis; thus we are not interested in the direction of the links in this situation.)
the rationale for using the validated network is to reduce the calculation time (due to the dimensions of the dataset), while obtaining an accurate result. while the previous differences from the procedure of [ ] are dictated by conservativeness (the choice of the seed labels) or by adherence to a standard (the choice of [ ] ), this last one may be debatable: why should choosing the validated network return "better" results than the ones calculated on the entire retweet network? we considered the case of a single day (in order to reduce the calculation time) and studied different approaches: a louvain community detection [ ] on the undirected version of the validated network of retweets; a louvain community detection on the undirected version of the unweighted retweet network; a louvain community detection on the undirected version of the weighted retweet network, in which the weights are the number of retweets from user to user; a label propagation a la raghavan et al. [ ] on the directed validated network of retweets; a label propagation a la raghavan et al. on the (unweighted) retweet network; and a label propagation a la raghavan et al. on the weighted retweet network, the weights being the number of retweets from user to user. actually, due to the order dependence of louvain [ ] , we ran the louvain algorithm several times after reshuffling the order of the nodes, taking the partition into communities that maximises the modularity. similarly, the label propagation of [ ] has a certain level of randomness: we ran it several times and chose the most frequent label assignment for every node.
key: cord- - o mbut authors: yu, jingyuan title: open access institutional and news media tweet dataset for covid- social science research date: - - journal: nan doi: nan sha: doc_id: cord_uid: o mbut
as covid- quickly became one of the most concerning global crises, the demand for data in academic research is also increasing. currently, there are several open access twitter datasets, but none of them is dedicated to institutional and news media twitter data collection. to fill this gap, we retrieved data from institutional/news media twitter accounts; of them were related to governments and international organizations, and of them were news media across north america, europe and asia. we believe our open access data can provide researchers more opportunities to conduct social science research. covid- was announced as a pandemic by the who in march [ ] . being the only pandemic in the past years (the last one was the swine flu), it was first detected as an unknown pneumonia in wuhan (hubei, china). on april st, , the johns hopkins university coronavirus resource center [ ] reported that the ongoing epidemic had infected , people, and the death toll had already reached , . given the highly contagious nature of the virus as well as the significant mortality rate, numerous governments have announced national lockdowns. the impact of covid- on the world economy and politics is unprecedented in modern times. during the past ebola epidemic crisis, scholars found the importance of using twitter data to do social science research [ ] , [ ] ; many of them used this microblog data as social indicators to analyze the effect of the epidemic outbreak on public concerns [ ] , health information needs and health seeking behavior [ ] , and the public response to policy makers [ ] .
current open access covid-19 twitter data were mainly collected by keywords such as coronavirus, covid-19, etc. [ ], [ ]; none of them is dedicated to government/news media tweet collection. given that our retrieval targets are policy makers and news sources, we believe our dataset can provide scholars more valuable data to conduct social science research in related fields, such as crisis communication, public relations, etc. we used the twitter rest api to retrieve twitter data from march , . there are collection categories: "gov tweet" (governments, international organizations, etc.), "us news tweet", "uk news tweet", "spain news tweet", "germany news tweet", "france news tweet", "china news tweet" and "additional news tweet" (news sources added later); each of them contains various collection targets (twitter account names; see the next section for details). initially, we collected the most recent tweets of every collection target before march , ; we did not set a time limit, which implies that the date of the first tweet from each of the sources may vary and even be relatively old. we update our dataset every week (last update on april , ); after removing duplicated data by matching tweet ids, we obtain the clean version of the data. on the other hand, due to our data collection strategy, there may be tweets both related and unrelated to covid-19, which opens up new possibilities for academic analysis. the gov tweet category contains the following accounts (table ). the dataset is available on github at the following address: https://github.com/narcisoyu/institional-and-news-media-tweet-dataset-for-covid-19-social-science-research. our data was collected in compliance with twitter's official developer agreement and policy [ ]. the dataset will be updated weekly; interested researchers will need to agree to the terms of usage dictated by the chosen license. following twitter's official policies, we released and stored only tweet ids. as far as we know, two tools can be used to hydrate the full information: hydrator (https://github.com/docnow/hydrator) and twarc (https://github.com/docnow/twarc). interested researchers should follow the usage instructions of the aforementioned tools.
who director-general's opening remarks at the media briefing on covid-19
an interactive web-based dashboard to track covid-19 in real time
ebola and the social media
the medium and the message of ebola
detecting themes of public concern: a text mining analysis of the centers for disease control and prevention's ebola live twitter chat
health information needs and health seeking behavior during the - ebola outbreak: a twitter content analysis
content analysis of a live cdc twitter chat during the ebola outbreak
covid-19: the first public coronavirus twitter dataset
a twitter dataset of + million tweets related to covid-19
developer agreement and policy - twitter developers
key: cord- - vu vce authors: beskow, david m.; carley, kathleen m. title: social cybersecurity chapter: case study with the covid-19 pandemic date: - - journal: nan doi: nan sha: doc_id: cord_uid: vu vce
the purpose of this case study is to leverage the concepts and tools presented in the preceding chapters and apply them in a real-world social cybersecurity context. with the covid-19 pandemic emerging as a defining event of the 21st century and a magnet for disinformation maneuver, we have selected the pandemic and its related social media conversation to focus our efforts on.
this chapter therefore applies the tools of information operation maneuver, bot detection and characterization, meme detection and characterization, and information mapping to the covid-19 related conversation on twitter. this chapter uses these tools to analyze a stream containing million tweets from million unique users from march to april. our results shed light on elaborate information operations that leverage the full breadth of the bend maneuvers and use bots for important shaping operations. the covid-19 pandemic is a defining event of the modern era, and there are few events more appropriate for applying social cybersecurity tools and concepts. at the time of this writing, the pandemic has reached almost every society in the world, with massive impact not only on the lives of those who contract it but also on the social, economic, and institutional fabric of these societies. the pandemic, and differing opinions on how to react to it, have created a virtual battle of ideas across social media. actors ranging from soccer moms to well-resourced nation-states have entered the virtual marketplace of beliefs and ideas, trying to sway the beliefs, actions, and decisions of both leaders and followers. with the pandemic as the backdrop of life as we write this book, it seemed appropriate to use the social cybersecurity tools that we discussed in the previous chapters to identify and understand information operations related to covid-19. there are still many questions, as well as competing narratives, about the origins and nature of the covid-19 coronavirus disease. the disease is caused by the severe acute respiratory syndrome coronavirus 2 (sars-cov-2). the virus and resulting pandemic have radically altered the world landscape and daily life for many people. with a large portion of the world's consumers sheltering in or near their homes, the world's markets have slowed, inducing the greatest global recession since the great depression [ ]. it has led to the cancellation or deferment of most travel activities, sporting and entertainment events, and religious and political gatherings. as of april, schools and universities had closed or otherwise been disrupted in countries, affecting . billion students (pre-primary, primary, lower-secondary, and upper-secondary levels of education), or . % of the total enrolled students for these categories [ ]. the covid-19 pandemic has induced information warfare at multiple levels, both within and between nations. at the individual level, much of the information concerns safety and health during the pandemic. social media often doesn't have structural filters for good ideas, and at times poor ideas regarding health and safety have been promoted by segments of the crowd. within countries, the pandemic has created two warring factions: 1) those who think policy should prioritize the health and safety of citizens, and 2) those who think policy should prioritize the national economy and maintaining the jobs and livelihoods of citizens. these factions often seem to fall along existing political party lines, with conservatives emphasizing the economy and liberals emphasizing health and safety. this policy-oriented friction has contributed a large portion of the information conflict within nations. finally, the origin and nature of the covid-19 coronavirus have aggravated existing geopolitical fault lines between nations, particularly between the united states and china, with the european union, russia, brazil, and other nations participating in the conflict of narratives.
as we look at information warfare and social cybersecurity in the covid-19 pandemic conversation, we attempt to illuminate information conflict across this spectrum. each country's response has varied based on its society, form of government, and health care system. the covid-19 pandemic has served as a worldwide test of the resiliency of these systems. nations are therefore turning to information warfare to strengthen their test results while attacking the response and results of other nations, particularly those whose form of government differs from theirs. this is particularly true in the information conflict between china and the united states. this chapter will showcase the use of social cybersecurity tools and theory to identify and characterize information operations in the covid-19 related twitter stream. to do this, we will start by discussing data collection and initial exploratory data analysis. then we will conduct bot detection and characterization, merging bot classification with other quantitative methods. we will then look at meme classification and analysis, highlighting the use of multi-modal data for information warfare. finally, we will briefly illustrate the use of sketch-io. throughout, this chapter will highlight the importance of bend as a foundational construct for understanding modern information warfare. like most event-oriented social cybersecurity data collection on twitter, our team started by establishing a twitter stream with select covid-19 related terms, periodically expanding the terms as appropriate. this gives us the main foundation of data for assessment. once we began analysis of the stream, we turned to the twitter rest api to collect other data as necessary. this additional data is often account-oriented and includes timelines, friends, followers, and at times account id "re-hydration". we will discuss each of these below. for covid-19, our team began collecting data on january with a limited list of keywords and expanded this list on march. for the data that we will focus on in this chapter, the list of keywords is: "coronaravirus", "coronavirus", "wuhan virus", "wuhanvirus", "2019ncov", "ncov", "ncov2019", "covid-19", "covid19", "covid 19". the resulting stream provided the primary data on which we will focus our study. in this chapter, we will focus on the data from march to april. the summary statistics of this data are provided in table . this is not the entire covid-19 conversation. many tweets may not contain these keywords, or may be in other languages. additionally, our stream is limited to tweets per second, or approximately . million tweets per day. the temporal analysis below will highlight the . million daily limit on our stream. as we go through our analysis, we will return at times to the twitter rest api to collect additional data that is not found in the stream. for example, the stream only contains an account's covid-19 content, but not all of its other content and topics. to get access to all content shared by an account, we will at times collect its account timeline (aka account history). there are also times we need to identify an account's friends and followers, which will also require us to call the twitter rest api. finally, at the end of our exploratory data analysis, we will try to find out if any accounts have been suspended by twitter since contributing content to our stream. we will do this by "re-hydrating" account ids in order to see if the account still exists; if the account doesn't exist, we will then test whether it was suspended by twitter or deleted by the user.
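before moving on to exploration, a minimal sketch of the keyword-filtered collection that anchors this workflow is shown below, using tweepy against the (since-retired) v1.1 statuses/filter endpoint; all credentials, the output file name, and the keyword list are illustrative assumptions, not the chapter's actual configuration.

```python
import tweepy

# Keyword-filtered collection via the v1.1-era statuses/filter endpoint.
# Credentials and file path are placeholders.
class FileListener(tweepy.Stream):
    def on_data(self, raw_data):
        # Append each raw JSON tweet, one per line, to an archive file.
        with open("covid_stream.json", "a") as f:
            f.write(raw_data.decode("utf-8").strip() + "\n")

stream = FileListener("API_KEY", "API_SECRET",
                      "ACCESS_TOKEN", "ACCESS_SECRET")
stream.filter(track=["coronavirus", "covid-19", "covid19"])  # expanded over time
```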
all data analysis begins with a thorough exploration of the data. the analyst should explore all of the distributions and possible relationships in the data. in our case, this means exploring the temporal, categorical, geospatial, and quantitative distributions of various fields in the twitter data. we collected and parsed the data using the twitter_col python package that we created to assist with collecting and manipulating twitter data. the primary statistics for the data are provided in table . we have million tweets produced by million users. we see that the majority of the tweets are actually retweets. in fact, only million tweets ( % of the stream) are original content produced by the respective account. it is also interesting that quotes significantly outnumber replies. a quote is produced when a user references another tweet and comments on it, starting a new thread. replies, by contrast, are addressed to the original author and don't start a new thread. we see a modest number of state-sponsored tweets, with substantially more amplification of state-sponsored accounts. finally, for reference, this primary stream is gigabytes in compressed gzip format or . terabytes in uncompressed format. next we analyze the temporal distribution of our stream. this is primarily done to study the peaks and valleys, to understand when the conversation surges and ebbs. this analysis will also ensure that our tweets fall within the scope of our study (from march to april). the temporal distribution of our stream is provided in figure . this is not what we would expect to see, and is definitely not what one would observe in smaller streams. given the size of our stream, we are artificially limited by twitter rate limiting. twitter limits basic streaming apis to no more than tweets per second, or . million tweets per day. it is unknown what portion of the total conversation we are getting, but given that several days fall below the . million, we believe we are collecting the majority. twitter did open up an unlimited covid-19 stream in mid-may [ ], but to our knowledge this was not available during the time frame we were interested in. in order to measure the geospatial density of the data, we used a country-level geo-inference algorithm created by huang and carley [ ]. this model infers the country-level location of tweets with . % accuracy. we ran this on data from april and plotted it on a choropleth map in figure with a logarithmic scale. this shows that the vast majority of the data is from the united states, with notable contributions from canada, south america, western europe, nigeria, and south and southeast asia. this geospatial inference and visualization provides yet another facet of our exploratory data analysis. given both the keywords that we used and the distribution of twitter users by country, this spatial distribution is essentially what one would expect. it's important to explore categorical data distributions in twitter data. in addition to the retweet/reply/quote/original content distribution that we explored in table , we also want to look at top languages, hashtags, mentions, and url domains. the importance of hashtags ebbs and flows with time and events. we've developed the visualization found in figure to understand this changing dynamic for the top hashtags found in the stream.
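computing the daily counts behind such a hashtag trend plot is mechanical once the stream is parsed; the sketch below assumes a pandas dataframe named `tweets` with a datetime column `created_at` and a column `hashtags` holding the list of hashtags extracted from each tweet's json.

```python
import pandas as pd

# One row per (day, hashtag) with its count; hashtags are lower-cased so
# variants such as COVID19 and covid19 collapse together.
daily = (
    tweets.explode("hashtags")
          .dropna(subset=["hashtags"])
          .assign(day=lambda d: d["created_at"].dt.floor("D"),
                  hashtags=lambda d: d["hashtags"].str.lower())
          .groupby(["day", "hashtags"]).size()
          .rename("count").reset_index()
)

# Keep only the overall top hashtags, then plot one line per hashtag.
top = daily.groupby("hashtags")["count"].sum().nlargest(10).index
trend = (daily[daily["hashtags"].isin(top)]
         .pivot(index="day", columns="hashtags", values="count"))
trend.plot()  # requires matplotlib
```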
we observe decreasing use of "corona" and increasing use of "covid19" as the hashtag used to identify pandemic-related content and conversation. this reflects both the naming of the virus and the increasing knowledge in the world about the virus. the full counts for languages and domains are provided in table . we see that the majority of the content is in english, followed by many of the prominent world languages. the only major world language that is underrepresented is chinese, and that's because twitter is blocked by the "great firewall of china." we note that much of the data from china in this data set is actually from state-sponsored chinese media. the domains include several link shorteners and multimedia companies, as well as a smattering of news media companies, including both traditional and alternative news, independent and state-owned media, and both conservative- and liberal-leaning news sites. by its very nature, social media creates links between people, and these links combine to create various types of networks. most social media platforms have a friend or follow functionality, which creates the most obvious social network on these platforms. additionally, the online conversation itself will create links between accounts whenever an account retweets, mentions, replies to, or quotes another. with twitter data, collecting friend/follower links is limited by strict rate limiting on the part of twitter. most developer accounts are only allowed to scrape , friends/followers for one account every minute. the twitter rest api will only return , friends/followers per request. for example, at the time of this writing the primary twitter account for the united states centers for disease control (@cdcgov) has . million followers. to get all of the followers for @cdcgov would take approximately minutes; in other words, it would take hours to scrape the followers for this single account. scraping links for all of the accounts in the covid-19 conversation (and most conversations) is therefore not realistic. for this reason we often visualize parts of the conversation network, for which we already have the data and which are more useful in understanding the conversation. in our visualization of the network, we visualized the mention network, though at other times the retweet or reply network may be appropriate. the twitter json structure has all of the information we need to create mention, retweet, or reply networks. as discussed at other times in this book, we created the public-facing twitter_col python package to collect and manipulate twitter data. the twitter_col package has several functions that make it easy to parse networks from raw twitter json data. the ora software [ , ], in both its ora-pro and ora-lite versions, has functions for parsing twitter data into networks. these functions were used on the covid-19 stream. visualizing large networks can be difficult. the covid-19 mention network contains , , nodes and , , edges (density = . ). very few software packages can visualize a network of this size, and none of the software solutions that our team commonly uses can visualize this network. for this reason, we chose to visualize just the core of the network. to do this, we found the k-core of the network. the k-core of a graph g is the maximal connected subgraph of g in which all nodes (vertices) have a degree of at least k. while experimenting with k on our mention network, we found that k = was adequate.
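extracting a k-core and sparsifying it for visualization is a few lines with networkx; in the sketch below the value of k is illustrative (the chapter's chosen value is elided in this copy), and `G` is assumed to be the undirected mention network already parsed from the stream.

```python
import random
import networkx as nx

# k_core requires a graph without self-loops.
G.remove_edges_from(list(nx.selfloop_edges(G)))
core = nx.k_core(G, k=4)  # k = 4 is illustrative, chosen by inspection

# The core is usually still too dense to draw, so sample edges to plot.
n_sample = min(1_000_000, core.number_of_edges())
sampled_edges = random.sample(list(core.edges()), n_sample)
viz = nx.Graph(sampled_edges)
```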
this means we will visualize the core of the mention network in which all nodes have a degree of at least k. this core network is dense (density = . ), which means there are more edges than we are able to visualize. to sparsify the graph, we sampled one million edges to visualize. this final core network contains , nodes and , , edges, with density = . . in figure we visualize the core of the mention network colored by language using the graphistry software. we could also visualize larger networks with ora-pro or the sigmanet package in r. in figure we see the inter-connections between the english, french, spanish, and portuguese conversations. we can also see the community groups that are clearly evident in the larger conversations, particularly the english and spanish conversations. given that we now have networks parsed from our covid-19 stream, there are a number of network science techniques that we can use to help understand this network. for now we will focus on measuring which accounts (or nodes in the network) are the most influential (or central) to the network. measuring centrality is an important step in network science, and many different methods have been published on how to do this, with each technique measuring a slightly different definition of influence and importance. for example, degree centrality measures influence by number of connections, whereas betweenness centrality measures influence by the degree to which accounts bridge communities. for our analysis, we chose to measure centrality and influence by eigenvector centrality [ ]. eigenvector centrality measures influence by finding accounts that have the most connections to influential accounts. in other words, not all links are created equal, and an account is more influential if it is connected to many nodes that themselves have high scores. we believe that this mirrors the way that many people view influence in the physical world, and we therefore use eigenvector centrality for our analysis. we also find that eigenvector centrality, unlike measures like betweenness, is computationally practical on large networks. the top influencers as measured by eigenvector centrality for both the mention and the retweet networks are shown below in table . as we will discuss later, the central accounts in the mention network are the politicians and celebrities that we would expect. the central accounts in the retweet network, however, include many bots, which we will elaborate on later. it is often important to evaluate the quantity and nature of suspended accounts in any event-oriented stream. twitter and most social media companies suspend accounts that frequently violate their terms of service. these violations could include frequently posting violent, racist, or other unauthorized content. it could also be that the accounts display unauthorized automated activity (i.e. they are bots). by identifying and evaluating suspended accounts, we get a sense of how the social media company has been "cleaning" this particular stream. to identify suspended accounts, we "re-hydrate" account ids in order to see if the account still exists. if the account doesn't exist, we then test whether the account was suspended by twitter or deleted by the user. the workflow begins by identifying all unique account ids in the stream. we then "rehydrate" them in batch mode using the twitter rest api. batch mode is a fast method to "re-hydrate", but it does not provide any feedback for missing accounts.
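the full workflow, a batch lookup followed by per-id error inspection, might look like the following tweepy sketch; `auth` and `user_ids` are assumed placeholders, and the error codes in the comments are those the v1.1 api returned (50 for a deleted account, 63 for a suspended one).

```python
import tweepy

api = tweepy.API(auth, wait_on_rate_limit=True)

# Batch "re-hydration": users/lookup accepts up to 100 ids per call and
# silently omits accounts that no longer exist.
found = set()
for i in range(0, len(user_ids), 100):
    for user in api.lookup_users(user_id=user_ids[i:i + 100]):
        found.add(user.id)
missing = [uid for uid in user_ids if uid not in found]

# Individual lookups raise an error whose code distinguishes the cases:
# 50 = "User not found" (deleted), 63 = "User has been suspended".
status = {}
for uid in missing:
    try:
        api.get_user(user_id=uid)
    except tweepy.TweepyException as e:
        codes = getattr(e, "api_codes", None) or []
        status[uid] = "suspended" if 63 in codes else "deleted"
```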
having rehydrated the account ids, we then identify those that are missing as missing = total − rehydrated. there are several reasons why an account could be missing. the two most common are that the account was deleted by the user or was suspended by twitter. to determine which of these is the case, we individually attempt to rehydrate the missing ids (not in batch mode). this provides a detailed response that indicates whether the account was deleted or suspended. given the size of the covid-19 stream, we randomly sampled million users from the available million unique users in the stream. after "re-hydrating" in batch mode, we determined that , were missing. using this number, we estimate that . ± . % of the accounts in the stream have been deleted or suspended (estimated using a % confidence interval). next we want to estimate the number that have been suspended. to do this we individually attempt to "rehydrate" the missing ids. when we do this, we find that , accounts had been suspended by twitter. using this number, we estimate that twitter has suspended . ± . % of accounts (estimated using a % confidence interval). using these ids, we made another pass through our stream, and determined that these , accounts produced , tweets containing the terms we were filtering for the stream. running the bot-hunter tier algorithm on this data, we find that . % of the suspended users have strong bot-like characteristics. it is often helpful to sample several of the suspended accounts and view their tweets to determine whether they were participating in information operations, and if so, what the message was and who the target audience was. for example, the tweets of suspended account @scopatumanigga are provided in table . our first observation is that nearly all statuses are retweets, which is highly indicative of bot behavior. given the account screen name and content, it appears that this likely bot account is attempting to infiltrate and influence african american virtual communities on twitter. the likely intent is to amplify racial divides in order to create instability in the united states.
table : tweets from suspended account @scopatumanigga
rt @tonyhawk: ive been sick lately (not sick af just sick) with symptoms other than covid-19. but i know two friends in the u.s. with co
rt @mvrlyns: so after the coronavirus blows over, will yall continue to practice good hygiene and sanitation? ... or will yall go back to
rt @kenichial: joe biden: the tests for the coronavirus should be free bernie sanders: the vaccines and treatment for the coronavirus shou
rt @workerism: haven't been able to stop thinking about this. a us pharma company with a potential covid-19 vaccine is in court trying to p
rt @crypticnoone: movies would never do this
rt @carnage : black athletes give back during every crisis. i'm not saying it doesn't happen but i don't be seeing tom brady, mike trout
rt @shanalala : the elite getting tested without any symptoms and commoners with all the symptoms are denied tests.
rt @elliecampbbell: i know its necessary to stop the spread of covid-19, but self isolation, no school etc and everything being shut/cance
rt @erichaywood: there are people in this photograph
rt @eugenegu: @realdonaldtrump there it is. ive been deathly afraid of this exact moment where trump turns to racism and xenophobia and ca
rt @baeonda: im years old and i tested positive for covid-19. ive been debating on posting, but i want to share my experience especi
rt @socialistwitch: coronavirus is not mother nature's 'cure' for 'evil humans'. the earth doesn't suffer from humanity. it suffers from
rt @raeoflite: hi. yes, it originated in china, but the technical term is covid-19 your mom originated from the back of a buick skylark
@swevenpjm @ashleytwo @lydiakahill how's this covid treating you bernie sanders on his way to the hospital after he sees this tweet url
rt @chapatipapa: so they cashed out. hoarded supplies. moved money to companies they believe could make the vaccine, and testing kits all t
rt @claireific: hey remember that time i said that tell me about tuskegee should be a required interview question for medical applicants
rt @nigensei: empty hotels all over the city of las vegas and theyre putting the homeless in a fucking parking lot.
rt @decaturdane: second wave??? url
rt @hoodcuiture: i swear christians and colonizers never stop!! how an isolated group get infected
rt @breenewsome: whyyyyyyyyyyy do we accept such a lower quality of life in this country in exchange for nothing but slogans & confetti
rt @ iamtiredlord: this baby was shot times in the head by a grown ass man because he wanted her girlfriend who rejected him. they are
rt @thespinsterymc: the family members of the johnsons have made it clear that medical neglect killed them. stop romanticizing this antibla
rt @lackingsaint: got to be honest, it's starting to feel like we're just doing things we feel like doing and saying it's in support of hea
bot detection is a critical step in social cybersecurity workflows. as discussed in chapter , bot detection often helps to delineate an information warfare campaign, as well as illuminate lines of effort (topics) and target audiences. it also sheds light on the scale, level of sophistication, and at times attribution of the operation. in this section we will discuss how to appropriately deploy bot detection algorithms for social cybersecurity. we will start by estimating the accuracy of our algorithms on the given data stream as well as selecting an appropriate threshold for the data at hand (in our case the covid-19 stream). we will discuss where each of the bot-hunter tiers should be used in the workflow. having run bot-hunter on all million accounts, we will use it to find influential bots, lines of effort, target audiences, and foreign influence. before using any bot detection tool on a given event- or topic-oriented data stream, the analyst should verify its accuracy on the stream as well as determine an appropriate threshold. as we discussed in chapter , training data matters for bot detection. we need to verify that the model we are trying to use, and its respective training data, are appropriate and predictive for our event stream. in our case, that means verifying that the bot detection model works on our covid-19 stream. to make this evaluation, we need a small labeled dataset from our event stream. for the covid-19 data, we created a list of all unique user ids found in the data, and then randomly sampled accounts from this list. we then manually labeled the accounts using a custom workflow that we've developed. after manually labeling the accounts, we evaluated the proposed bot-detection algorithms on the data. with labels and bot-detection scores in hand, we can measure performance with various metrics (accuracy, precision, recall, f1 score, roc-auc, etc.). these scores will tell us how well our models generalize to covid-19 data. the scores are shown in table for the default setting of threshold = . for all models.
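computing such a score table from the hand labels and model outputs is a one-liner per metric with scikit-learn; the sketch below assumes `y_true` holds the manual labels (1 = bot, 0 = human), `y_prob` holds the model's bot probabilities, and a 0.5 cut-off stands in for the elided default threshold.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_pred = (np.asarray(y_prob) >= 0.5).astype(int)  # 0.5 assumed as default

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("roc-auc  :", roc_auc_score(y_true, y_prob))  # threshold-free
```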
from table we first and foremost determine that bot-hunter tier should be our primary bot detection model, with a higher f1 score and a good balance of precision and recall. from table we can also determine that the botometer model as well as the bot-hunter tier model don't seem to generalize as well to the covid-19 twitter stream. the bot-hunter tier model, while appearing to perform well, will only be used on specific tasks, because it is only able to predict english-speaking accounts and because it tends to have a higher false positive rate (in this test the false-positive rate is three times larger than that of bot-hunter tier ). these scores, however, are sensitive to the threshold that we choose. in order to choose an appropriate threshold, we use our labels and bot detection scores and plot precision-recall curves, as shown in figure . remember that precision and recall often have an inverse relationship: as precision increases, recall decreases, and vice versa. recall monotonically decreases, whereas precision does not monotonically increase. the exact choice of the threshold will depend on the context and any related policy decisions. if the policy decision requires a low false-positive rate, then choose a threshold with high precision. if the policy decision or analytical goal requires a low false-negative rate, then choose a threshold with higher recall. for most tasks it is best to have a balance of precision and recall, which is why we often use the f1 score to measure the performance of bot detection algorithms. in choosing a bot detection threshold, our goal for the covid-19 stream is to characterize the entire conversation. we want a robust characterization of the entire forest, not necessarily a precise analysis of individual trees. this goal requires a good balance between precision and recall. as indicated above, our bot-hunter tier text model tends to produce a high false positive rate. for this reason, we will use a threshold of . for this model in our covid-19 stream, which cuts the false positive rate by a third. for bot-hunter tier we will retain the . default threshold, since it provides a good balance of precision and recall. for botometer and bot-hunter tier , we will use a threshold of . in order to increase recall. the adjusted performance is provided in table . since the bot-hunter tier algorithm is our primary algorithm, we've visualized the probability distribution for all covid-19 accounts in figure a with threshold = . and threshold = . . here we see a large number of human accounts, as well as a large number of accounts in the middle, with decreasing numbers of high-probability bots. the threshold = . is our default, while we can at times use threshold = . for increased precision. note that each actor will have a bot score for each bot detection tool. these algorithmically identified bots can more accurately be described as actors with bot-like characteristics. depending on which tool and which threshold are used, the number of "bots" identified will vary. for example, on any given day, depending on what is used, the number of "bots" may vary from approximately % to %. now that we have bot prediction thresholds, we can begin to use the models to understand the covid-19 stream and the conversation and actors involved in it.
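as an aside, the threshold selection just described can be automated when the goal is an f1-optimal balance; a hedged sketch with scikit-learn's precision-recall utilities (the actual chapter chose thresholds by inspecting the curves, not by this rule):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Trace precision and recall over all candidate thresholds, then pick the
# threshold that maximizes F1, balancing the two error types.
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = thresholds[np.argmax(f1[:-1])]  # f1 has one more entry than thresholds
print("F1-optimal threshold:", best)
```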
in figure we visualize the sparsified core of the mention network (k-core = ), colored by bot prediction with bot-hunter tier and threshold = . . we see that bot-like accounts are highly embedded in the core of the conversation, and connect to and mention influential accounts in an effort to manipulate these personalities and their followers. in this visualization, we have also zoomed in to get a better feel for the structure of the network and to highlight the location of prominent english-speaking accounts. the most surprising bot detection analysis that we found was in regards to account creation dates. anytime we analyze a list of accounts, it is often enlightening to visualize a histogram of their account creation dates. the twitter json contains a field in the user object that records the date, hour, minute, and second that the account was originally created. this date will be some date between march , (the day twitter started) and the current date. we've found it best to bin this by day. each bar in the resulting histogram will contain all accounts that were created on that day. any large spike of accounts created around the same time should cause us to dig deeper to check for the presence of a bot "army". in figure we visualize the account density plot for covid-19, colored by bot percentage. the coloring indicates what portion of the accounts in each bar have a bot-hunter tier score greater than . . green indicates bars that have few bots, while red indicates bars that have a higher proportion of bots. from figure we see that a large number of bot-like accounts have been created since the pandemic began and then immediately deployed into the conversation. in fact, of the million accounts participating in the conversation, . million have been created since february and have a bot-hunter tier score greater than . . undoubtedly part of this surge in accounts is created by individuals who are stuck at home and decided to create a twitter account. a significant portion, however, appears to be bot armies. these bot armies are produced by a number of actors with a variety of agendas, but likely all involve manipulation of the marketplace of beliefs, ideas, and collective action. next we want to intersect our measure of influence (eigenvector centrality) with the bot prediction scores in order to identify influential bots. in table we list the top most influential accounts in the mention network and retweet network as measured by eigenvector centrality. table also provides the bot-hunter tier bot score, with red text indicating accounts that have a score greater than . . looking at the macro-level comparison, we see that more bots are involved in the retweet network than in the mention network. this makes sense, given that bots are often used to scale amplification, and the easiest way to amplify is with retweets. we also see many more verified politician, news, and government accounts in the mention network. many news accounts have "bot-like" behavior, with the @foxnews, @msnbc, and @cbsnews accounts surpassing the . threshold. it has been documented that many news and celebrity accounts can exhibit bot-like behavior [ ], which is supported in our analysis. we also see in the retweet network that several accounts that are not classified as bots nonetheless have bot-like scores. we also note that it is possible for users to employ software or a bot on occasion from the same account. this hybrid form, part human and part bot, we refer to as a cyborg. cyborgs will also tend to exhibit bot-like characteristics; however, their scores are likely to be lower than those of a totally automated account.
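the influence-by-bot-score intersection described above reduces to a join between a centrality dictionary and the bot scores; a minimal networkx sketch, in which `G` is the directed mention or retweet network, `bot_score` is assumed to map account ids to probabilities, and the 0.70 cut-off stands in for the chapter's elided threshold:

```python
import networkx as nx

# For directed graphs, networkx eigenvector centrality is based on
# in-edges, so high scores go to accounts that are mentioned or retweeted
# by other well-connected accounts.
centrality = nx.eigenvector_centrality(G, max_iter=1000)

top = sorted(centrality, key=centrality.get, reverse=True)[:50]
influential_bots = [(u, centrality[u], bot_score.get(u, 0.0))
                    for u in top if bot_score.get(u, 0.0) >= 0.70]
```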
next we want to try to separate the stream into topic groups. we will do this with the latent dirichlet allocation (lda) model [ ]. to do this, we concatenated all english hashtags by account and then performed lda with k = 5. we chose to concatenate hashtags rather than use raw text for computational tractability and because hashtags provide tokens that capture the essence of topic and meaning. word clouds of the resulting five topic groups are shown in figure . we see that the topic groups are differentiated in some ways by geography and in other ways by politics. these topic groups allow us to segment the conversation and focus on a topic of interest, for example a certain geography (the nigerian conversation) or a specific political affiliation (the conservative political conversation). the choice of k is just as much an art as a science. if you want to get a view of the macro topics, use a smaller k, as we did here with k = 5. if you want to extract a very specific conversation (e.g. the liberal conversation in canada), you will have to increase k in order to sufficiently isolate the topic of interest. similar to topic analysis, we also want to look at community groups. while topic analysis looks at semantic topics that seem to cluster together, community groups look at accounts that tend to cluster together regardless of topic. we ran louvain community detection [ ] on both the retweet and mention networks, and then looked at influential accounts and top hashtags for each of the top communities. a summarization of the community groups in the retweet network is provided in table . here we see once again that some community groups are geographically and/or linguistically oriented, while others are politically affiliated. now that we've explored bots in the data, we will begin to look at other ways to characterize accounts. these include analysis of biased or questionable sources, abusive language, and the use of national flags. next we will look at the political bias found in the urls. to do this, we will use the dictionary approach presented in chapter . using this approach, . % of the total url domains were found in the dictionary lookup that we built in chapter . having estimated the bias and factual content of the urls, we visualize this distribution in figure , where the bar plots are colored by the proportion of bot involvement. from this we first see that the highest number of urls come from the center and center-left political biases, and generally have high factual content. we do see the presence of fake, satire, and conspiracy-theory sources in this stream, which are often correlated with the low factual content seen in figure b. we also see that bots have a higher degree of correlation with urls from the far-right and fake-news biases, and to a lesser extent from the far-left. we also see high bot correlation with urls containing low factual content. these discoveries largely confirm our assumptions going into the analysis. we used the multi-lingual dictionary-based algorithm presented in chapter to identify tweets that contain abusive language. the daily volume of abusive tweets is presented in figure . we did not normalize this visualization, since our total daily count of tweets was held constant at . million tweets. if this were not the case, it would be appropriate to normalize (plot the proportion instead of the raw count). this allows us to identify events that seem to aggravate the population of active twitter users.
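as a rough sketch of this dictionary-based daily count (with `abusive_terms` standing in for the multi-lingual lexicon, which is not reproduced here, and `tweets` being the parsed dataframe used earlier):

```python
import re

# Build one alternation regex over the lexicon; word boundaries avoid
# matching terms embedded inside longer words.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, abusive_terms)) + r")\b",
                     re.IGNORECASE)
tweets["abusive"] = tweets["text"].str.contains(pattern, na=False)

daily_abusive = tweets.set_index("created_at").resample("D")["abusive"].sum()
# With a constant daily total the raw count is fine; otherwise normalize:
daily_rate = tweets.set_index("created_at").resample("D")["abusive"].mean()
```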
the two prominent spikes are tied to political events and voices in the united states, with the first spike tied to actions by the us congress and the second spike tied to comments by the us executive branch. we found that . % of the accounts that share abusive content had bot-like characteristics. this is significantly higher than the . % of non-abusive accounts that have bot-like characteristics. this means that within the covid-19 stream, bot-like accounts are used to produce or promote abusive content. as discussed in chapter , flags in the user description can at times indicate suspicious accounts. this is especially true with multiple flags. to explore this in the covid-19 stream, we extracted all flags in the account descriptions of all million accounts. we've plotted the distribution of these in figure . reviewing the distributions in figure , nothing in the one- and two-flag distributions is unexpected or necessarily cause for further exploration. once we get to the three-, four-, five-, and six-flag distributions, however, these are likely suspicious accounts. in particular, many of these accounts list multiple western nations (the us, canada, and european nations), and may be used to manipulate multiple western nations while appearing to be an expatriate.
image/meme analysis
as discussed in chapter , memes are a powerful way to connect a message to a target audience. memes evolve as they propagate through a society. given the size of the covid-19 stream, we sampled million images (approximately % of the total) and conducted meme classification on these. the meme-hunter model classified , images as memes. a collage of these memes is found in figure . given the massive impact of covid-19 on society and daily life, many memes were innocent humor designed to help folks get through some very tough circumstances. we did find a number of political memes, however, some targeting domestic pandemic policy discussions and others targeting geopolitical competition. the domestic policy memes were trying to use image and text to argue for one of the competing priorities: namely, the safety of society or the economic foundation of that society. the geopolitical memes were likely created by nation-states or nation-state proxies, with many memes created by russia, china, and iran, which will be discussed in more detail below. as discussed in chapter , the meme-hunter suite of tools includes a special meme optical character recognition (ocr) pipeline for meme images. this was also presented in detail in [ ]. we ran meme ocr on the memes extracted from the million images, and conducted word-cloud visual analysis of the results. these results are found in figure . from these results we see a number of general coronavirus memes. we also see that a number of memes target government leaders, ministers, and agencies. next we used open-source facial recognition software to identify prominent politicians and world leaders in the memes. facial recognition software simply identifies personalities, but does not indicate whether the meme is supporting or attacking the specific personality found. a distribution of memes about prominent us politicians and other world leaders is given in figure . here we see that prominent world leaders are the target of most of the memes in our sample, with xi jinping and donald trump in the first and second place positions, respectively. this also highlights that much of the covid-19 discussion and information conflict is between the us, china, and europe.
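the chapter does not name the facial recognition software it used; one open-source option that fits the description is the python face_recognition package, and a hedged sketch of counting leader appearances with it might look as follows, where `reference_faces` (one labelled encoding per leader) and `meme_paths` are assumed inputs.

```python
import face_recognition

# reference_faces: {"xi jinping": encoding, "donald trump": encoding, ...},
# each built from one labelled photo; meme_paths lists the sampled memes.
names = list(reference_faces)
known = [reference_faces[n] for n in names]

counts = {n: 0 for n in names}
for path in meme_paths:
    image = face_recognition.load_image_file(path)
    for enc in face_recognition.face_encodings(image):
        # compare_faces returns one boolean per known face.
        matches = face_recognition.compare_faces(known, enc, tolerance=0.6)
        for name, hit in zip(names, matches):
            counts[name] += int(hit)
```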
we next calculated the evolution of the , memes that meme-hunter classified in our sample. the network was created by using a vgg deep learning model [ ] and extracting the last layer before the softmax. using this , -dimensional vector to represent each image, we then conducted radius nearest neighbor graph learning with distance = . in figure we see that many of the darker image memes are clustered together in the center, with other prominent memes evolving in clusters that form separate components. we've highlighted the evolution of, and links between, two of the memes. we were able to use the covid-19 data to test the sketch-io prototype application for sketching and analyzing information operation campaigns. given that our prototype application was not able to ingest the entire stream due to its size, we instead ingested all data produced by or propagating state-sponsored media. the sketch-io application proved effective and responsive at quickly performing a number of battle drills to analyze this data. example usage and screenshots are provided in figure . in this section we will identify overt and seemingly overt propaganda operations by nation-state actors. covert and black propaganda is much harder to detect and attribute to a nation-state actor (black propaganda is designed to make the victim appear to be the perpetrator). we will use the bot-labeler function for finding state-sponsored media, and state-sponsored media amplification, as a proxy for finding propaganda. in figure we show the number of retweets of state-sponsored media accounts as measured by bot-labeler. this figure is colored by the percentage of bots that are retweeting these accounts. as indicated in chapter , these state-sponsored media vary drastically in purpose and independence (the purpose of russia's rt is different from that of the us voice of america). that being said, we clearly see that russia and china are investing heavily in producing and promoting state-sponsored media and messages. these messages are amplified by both bots and legitimate accounts. to some extent, the covid-19 pandemic seems to be a turning point for the chinese in regards to information operations. traditionally, chinese information operations focused on positive narratives largely pushed by the "50 cent army", a low-paid or otherwise co-opted group of online netizens. traditionally, they did not conduct "higher risk, higher reward" operations that relied on negative, antagonizing and controversial statements. this slowly started to change with the hong kong protests, and fully changed with the covid-19 pandemic, with china seeming to adopt more aggressive io practices and operations. in february, lijian zhao was promoted to deputy director general of the information department of the foreign ministry. zhao has a history of aggressive information operations, and seems to be implementing this in chinese information operations as well as personally on social media. on march, zhao posted a tweet that suggested the us army had brought the coronavirus to wuhan, china (see figure ). zhao seems to be a key personality leading and directing china's more aggressive approach to information operations. fig.: foreign ministry official lijian zhao's rise to a directing role in chinese information operations seems to correlate with a more aggressive approach than historical chinese information operations (tweet id: ). chinese propaganda is amplified by chinese government representatives.
chinese government officials around the world are able to amplify state-sponsored media without questioning it, knowing that only approved messages are published. in the covid-19 stream, the chinese ambassador to venezuela has the nd highest number of state-sponsored mentions/retweets. he retweeted or mentioned chinese state-sponsored spanish- and english-language covid-19 content times. the increasing penetration of chinese state-sponsored media around the world is pushed by these legitimate accounts. it also appears that china often uses trolls instead of bots. with easy access to human capital, the chinese seem to prefer the control and nuance that trolls allow compared to a bot army. many of the suspicious chinese accounts contain enough nuance and temporal patterning to consider them trolls rather than bots. we see increased use of meme warfare by china in the covid-19 stream. historically, china has seemed hesitant to use memes, potentially because memes propagate through evolution, and this evolution is outside the control of the state [ ]. china has been especially concerned with the evolution of memes within its own population, and has banned some memes [ ]. in covid-19, however, we see them developing and deploying memes in a way that is more akin to russian information operations. an example of a chinese meme is provided in figure a. here we see a message tied to an image that has clear cultural relevance and traction within the target audience. additional examples of chinese memes that were found using the bot-match methodology are provided in figure below. we also found evidence that china is starting to interlace adult-oriented and humorous content with their information in order to increase traffic, particularly from certain demographics. examples of this are found in figure . this has historically been a key part of russian io, and it seems that china is increasingly adopting similar practices. explicit adult content is often designed to attract and manipulate the minds of young, impressionable men. in the chinese propaganda, we see some propaganda using chinese-language content while other propaganda uses english-language content. the english text is often accompanied by memes that connect the message with american audiences. the english-language propaganda is arguably targeting american audiences, attacking leaders and institutions in america. it is also designed to strengthen americans' view of china, chinese leadership and the chinese communist party (ccp). the propaganda that uses chinese-language text is arguably targeting chinese audiences within china's borders as well as in other asian countries. this propaganda is designed to increase nationalistic fervor within china's borders. within the chinese propaganda that we viewed, the vast majority was singularly focused on the united states. this differs from russian information operations, which focus more broadly on the west, adding european actors to their list of targets. within the covid-19 stream, russia appears to be staying with its historical information playbook. with extensive experience in manipulating world opinion with its "active measures" throughout the cold war, russia has long had one of the most aggressive and well-resourced information operations capabilities among nation-states. russia's information operations are tightly coupled with its other cyber operations. russian information operations rely on increasing the penetration of its state-sponsored media around the world.
rt, sputnik, and other state-sponsored media outlets offer news in many languages around the world. these media outlets offer news stories that support russian information narratives, as seen in figure . in this figure, the narrative is that russia provided more help to the italian population than did the european union. as seen in figure b, russian operations still use large and sophisticated bot "armies" to push their content. this observation is supported by the quantitative analysis seen in figure , which shows that the amplification of russian state-sponsored messages is % bots. as indicated above, russian operations target the west in general, with european targets receiving almost as much emphasis as the united states. this differs from chinese operations, which seem to primarily target the united states, with smaller efforts directed at europe. in figure we see examples of russian state media trolling the west. while russian state-sponsored media advertise themselves as news organizations, this content is arguably well beyond reporting unbiased news. on april, iran initiated a concerted attack on the united states, trying to encourage california to exit the union. most of this effort was tagged with #calexit. by the time the dust settled, approximately k tweets had been launched by a several-thousand-strong bot/troll army. this is not the first time that #calexit (or other similar messages, like #texit) has become a trending hashtag on twitter largely due to foreign information operations. the april surge appears to be largely an iranian influence operation that was triggered by domestic political tension in the united states. in the days and weeks preceding april, tensions between us national leadership and california leadership intensified, and the governor of california referred to california as a "nation-state" [ ]. the iranian government and/or its proxies appeared to be monitoring these political tensions in the united states, and timed their #calexit campaign to capitalize on them. this -hour information operation had some creative content, with many human/troll accounts and limited automation. the creative content can be seen in the meme collage in figure . as discussed in chapter , bot-match can be a very powerful tool for finding similar accounts given a seed account. bot-match allows you to find similar accounts, where similarity can be defined by network proximity, semantic proximity, or a combination of both. the size of our network and tweet corpus limits the number of models available for measuring similarity. we first concatenate all user text, thereby aggregating content to the account level. because of the size of our corpus, we then used cosine similarity on a document-term matrix (also known as a bag-of-words) with the top words. this was created for english tweets. we used this to identify accounts that were propagating chinese propaganda in english. we used @safisina as our seed account (this account was discovered above in figure b). we illustrate in figure how we recursively build out the chinese propaganda network with bot-match. notice that we did not have to conduct any elaborate labeling or training process; we just needed our seed node and the document-term matrix. all accounts seen in figure are amplifying chinese state-sponsored media, and each is embedded in a slightly different network. using the bot-match methods illustrated in figure , we were able to identify approximately additional accounts that appeared to be propagating chinese propaganda targeting america.
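the semantic-proximity version of this search needs nothing beyond a bag-of-words matrix and cosine similarity; a minimal sketch with scikit-learn, in which `docs` (one concatenated string of tweets per account), `ids` (the matching account names), and the vocabulary cutoff are assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One document per account; the "top words" cutoff is an assumed value.
vec = CountVectorizer(max_features=10_000)
X = vec.fit_transform(docs)

seed = ids.index("safisina")  # the @safisina seed account from the text
sims = cosine_similarity(X[seed], X).ravel()
# Position 0 after sorting is the seed itself (similarity 1), so skip it.
nearest = [ids[i] for i in sims.argsort()[::-1][1:11]]
# Repeating the query with each newly found account as the seed grows the
# propaganda network recursively, as described above.
```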
we then used the twitter rest api to scrape the timeline histories of these accounts, extract image links, and run meme-hunter on the images shared by these accounts. we found that all of these accounts were indeed conducting very targeted information operations against the united states around the covid-19 pandemic as well as the american protests that followed the death of george floyd at the hands of a minneapolis police officer. a collage of a sample of these targeted memes is provided in figure . throughout our analysis of foreign influence, we observed their use of the bend forms of maneuver. we continue to observe that russia closely intertwines narrative and network maneuvers. they conduct long and protracted efforts to infiltrate target audiences before interjecting narrative. china, while working hard to control and manipulate the narrative, does not appear to infiltrate targeted subcultures. they do appear to tie their narrative to the target audience's culture, as seen above with memes based on the friends sitcom as well as memes that use the george floyd protests to support pro-ccp policies. with this limited network maneuver, it remains to be seen whether their information operations gain traction or simply become a "shot in the dark." iran also appeared to launch large attributed information campaigns focused on a specific narrative, such as #calexit, without first preparing target networks. once again, these operations may become information warfare "chaff" with limited effects. the primary goal of this case study was to illustrate social cybersecurity workflows in a relevant event. using the apt covid-19 twitter stream from march to april, we demonstrated how to collect the data, conduct initial exploratory data analysis, conduct and use bot and meme classification and exploration, conduct account characterization, and demonstrate the role of sketch-io and the bend framework. this chapter therefore serves as an example for social cybersecurity researchers of how to leverage these tools to identify and characterize offensive information operations targeting their society, institutions and culture. in regards to the covid-19 pandemic, we found large information operations trying to manipulate domestic and international perceptions, beliefs, and actions. at the domestic level, the information conflict was largely over pandemic policy, particularly whether public safety or the economy was more important. at the international level, we identified attempts to manipulate international perceptions of the origins of the disease as well as perceptions of each country's handling of the disease. we also see nation-state efforts to amplify tensions and drive wedges into existing fissures in rival nations. throughout the data we see bots and trolls used to scale and spread narratives, thereby acting like a "force-multiplier" in information operations. we see russia continue to use memes, and china massively increase its use of memes in information warfare. both nations study their target audience and choose relevant cultural artifacts to connect their memes and messages to the target audience. even as the covid-19 coronavirus moved much of business and society onto virtual platforms for social interaction as well as business and collaboration, it also allowed many nation-states to increasingly use virtual platforms for competition in the information space, with ramifications for geopolitics.
while the effects of these campaigns are hard to measure, their scale and persistence require social cybersecurity policy and process.
characterization and comparison of russian and chinese disinformation campaigns
the evolution of political memes: detecting and characterizing internet memes with multi-modal deep learning
latent dirichlet allocation
fast unfolding of communities in large networks
ora: a toolkit for dynamic network analysis and visualization
ora user's guide
the rise of social bots
classification of twitter accounts into automated agents and human users, ieee/acm international conference on advances in social networks analysis and mining
the great lockdown: worst economic downturn since the great depression, imf blog
on predicting geolocation of tweets using convolutional neural networks
a new status index derived from sociometric analysis
coronavirus in california: gavin newsom's response, the atlantic
very deep convolutional networks for large-scale image recognition
overview, twitter developers
this work was supported in part by the office of naval research (onr) multidisciplinary university research initiative award n , award n , onr award n , and the center for computational analysis of social and organization systems (casos). the views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the onr or the u.s. government.
key: cord- - qys j u authors: zogan, hamad; wang, xianzhi; jameel, shoaib; xu, guandong title: depression detection with multi-modalities using a hybrid deep learning model on social media date: - - journal: nan doi: nan sha: doc_id: cord_uid: qys j u
social networks enable people to interact with one another by sharing information, sending messages, making friends, and having discussions, which generates massive amounts of data every day, popularly called user-generated content. this data is present in various forms such as images, text, videos, links, and others, and reflects user behaviours, including their mental states. it is challenging yet promising to automatically detect mental health problems from such data, which is short, sparse and sometimes poorly phrased. however, there are efforts to automatically learn patterns using computational models on such user-generated content. while many previous works have largely studied the problem on a small scale by assuming uni-modality of data, which may not give us faithful results, we propose a novel scalable hybrid model that combines bidirectional gated recurrent units (bigrus) and convolutional neural networks to detect depressed users on social media such as twitter, based on multi-modal features. specifically, we encode words in user posts using pre-trained word embeddings and bigrus to capture latent behavioural patterns, long-term dependencies, and correlations across the modalities, including semantic sequence features from the user timelines (posts). the cnn model then helps learn useful features. our experiments show that our model outperforms several popular and strong baseline methods, demonstrating the effectiveness of combining deep learning with multi-modal features. we also show that our model helps improve predictive performance when detecting depression in users who are posting messages publicly on social media.
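as a rough illustration of the kind of bigru-cnn hybrid this abstract describes, a minimal keras sketch follows; every size here (vocabulary, embedding width, units, filters) is an illustrative assumption, not the authors' configuration, and in practice the embedding layer would be initialized from pre-trained word vectors as the abstract states.

```python
from tensorflow.keras import layers, models

# Pre-trained embeddings -> BiGRU sequence encoder -> CNN feature learner.
model = models.Sequential([
    layers.Embedding(input_dim=50_000, output_dim=300),
    layers.Bidirectional(layers.GRU(128, return_sequences=True)),
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # depressed vs. not depressed
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```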
in the united states (us) alone, every year, a significant percentage of the adult population is affected by different mental disorders, which include depression mental illness ( . %), anorexia and bulimia nervosa ( . %), and bipolar mental illness ( . %) [ ] . sometimes mental illness has been attributed to the mass shooting in the us [ ] , which has taken numerous innocent lives. one of the common mental health problems is depression that is more dominant than other mental illness conditions worldwide [ ] . the fatality risk of suicides in depressed people is times higher than the general population [ ] . diagnosis of depression is usually a difficult task because depression detection needs a thorough and detailed psychological testing by experienced psychiatrists at an early stage [ ] . moreover, it is very common among people who suffer from depression that they do not visit clinics to ask help from doctors in the early stages of the problem [ ] . however, it is common for people who suffer from mental health problems to often "implicitly" (and sometimes even "explicitly") disclose their feelings and their daily struggles with mental health issues on social media as a way of relief [ , ] . therefore, social media is an excellent resource to automatically help discover people who are under depression. while it would take a considerable amount of time to manually sift through individual social media posts and profiles to locate people going through depression, automatic scalable computational methods could provide timely and mass detection of depressed people which could help prevent many major fatalities in the future and help people who genuinely need it at the right moment. the daily activities of users on social media could be a gold-mine for data miners because this data helps provide rich insights on user-generated content. it not only helps give them a new platform to study user behaviour but also helps with interesting data analysis, which might not be possible otherwise. mining users' behavioural patterns for psychologists and scientists through examining their online posting activities on multiple social networks such as facebook, weibo [ , ] , twitter, and others could help target the right people at right time and provide urgent crucial care [ ] . there are existing startup companies such as neotas with offices in london and elsewhere which mines publicly available user data on social media to help other companies automatically do the background check including understanding the mental states of prospective employees. this suggests that studying the mental health conditions of users online using automated means not only helps government or health organisations but it also has a huge commercial scope. the behavioural and social characteristics underlying the social media information attract many researchers' interests from different domains such as social scientists, marketing researchers, data mining experts and others to analyze social media information as a source to examine human moods, emotions and behaviours. usually, depression diagnosis could be difficult to be achieved on a large-scale because most traditional ways of diagnosis are based on interviews, questionnaires, self-reports or testimony from friends and relatives. such methods are hardly scalable which could help cover a larger population. 
individuals and health organizations have thus shifted away from their traditional interactions, and now meeting online by building online communities for sharing information, seeking and giving the advice to help scale their approach to some extent so that they could cover more affected population in less time. besides sharing their mood and actions, recent studies indicate that many people on social media tend to share or give advice on health-related information [ , , , ] . these sources provide the potential pathway to discover the mental health knowledge for tasks such as diagnosis, medications and claims. detecting depression through online social media is very challenging requiring to overcome various hurdles ranging from acquiring data to learning the parameters of the model using sparse and complex data. concretely, one of the challenges is the availability of the relevant and right amount of data for mental illness detection. the reason why more data is ideal is primarily that it helps give the computational model more statistical and contextual information during training leading to faithful parameter estimation. while there are approaches which have tried to learn a model on a small-scale data, the performance of these methods is still sub-optimal. for instance, in [ ] , the authors tried crawling tweets that contain depression-related keywords as ground truth from twitter. however, they could collect only a limited amount of relevant data which is mainly because it is difficult to obtain relevant data on a large-scale quickly given the underlying search intricacies associated with the twitter application programming interface (api) and the daily data download limit. despite using the right keywords the service might return several false-positives. as a result, their model suffered from the unsatisfactory quantitative performance due to poor parameter estimation on small unreliable data. the authors in [ ] also faced a similar issue where they used a small number of data samples to train their classifier. as a result, their study suffered from the problem of unreliable model training using insufficient data leading to poor quantitative performance. in [ ] the authors propose a model to detect anxious depression of users. they have proposed an ensemble classification model that combines results from three popular models including studying the performance of each model in the ensemble individually. to obtain the relevant data, the authors introduced a method to collect their data set quickly by choosing the first randomly sampled users who are followers of ms india student forum for one month. a very common problem faced by the researchers in detecting depression on social media is the diversity in the user's behaviours on social media, making extremely difficult to define depressionrelated features to cope with mental health issues. for example, it was evidenced that although social media could help us to gather enough data through which useful feature engineering could be effectively done and several user interactions could be captured and thus studied, it was noticed in [ , ] that one could only obtain a few crucial features to detect people with eating disorders. in [ ] the authors also suffered from the issue of inadequate features including the amount of relevant data set leading to poor results. 
different from the above works, we have proposed a novel model that is trained on a relatively large dataset, showcasing that the method scales and produces better and more reliable quantitative performance than existing popular and strong comparative methods. we have also proposed a novel hybrid deep learning approach which can capture crucial features automatically based on data characteristics, making the approach reliable. our results show that our model outperforms several state-of-the-art comparative methods. depressed users behave differently when they interact on social media, producing rich behavioural data, which is often used to extract various features. however, not all of them are related to depression characteristics. many existing studies have either neglected important features or selected less relevant features, which mostly are noise. on the other hand, some studies have considered a variety of user behaviour. for example, [ ] is one such work that has collected a large-scale dataset with reliable ground truth labels. they then extracted various features representing user behaviour in social media and grouped these features into several modalities. finally, they proposed a new model called the multimodal dictionary learning model (mdl) to detect depressed users from tweets, based on dictionary learning. however, given the high-dimensional, sparse, figurative and ambiguous nature of tweet language use, dictionary learning cannot capture the semantic meaning of tweets. instead, word embedding is a newer technique that can solve the above difficulties through neural network paradigms. hence, due to the capability of word embeddings to hold the semantic relationship between tweets and to capture the similarity between terms, we combine multi-modal features with word embeddings, to build a comprehensive spectrum of behavioural, lexical, and semantic representations of users. recently, using deep learning to gain insightful and actionable knowledge from complex and heterogeneous data has become mainstream in ai applications for healthcare; e.g., medical image processing and diagnosis has gained great success. the advantage of deep learning sits in its outstanding capability of iterative learning and automated optimization of latent representations from a multi-layer network structure [ ] . this motivates us to leverage the superior neural network learning capability with the rich and heterogeneous behavioural patterns of social media users. to be specific, this work aims to develop a novel deep learning-based solution for improving depression detection by utilizing multi-modal features from the diverse behaviour of depressed users in social media. apart from the latent features derived from lexical attributes, we notice that the dynamics of tweets, i.e. the tweet timeline, provide a crucial hint reflecting depressed-user emotion change over time. to this end, we propose a hybrid model comprising a bidirectional gated recurrent unit (bigru) and a convolutional neural network (cnn) model to boost the classification of depressed users using multi-modal features and word embedding features. the model can derive new deterministic feature representations from training data and produce superior results for detecting the depression level of twitter users. our proposed model uses a bigru, which is a network that can capture distinct and latent features, as well as long-term dependencies and correlations across the feature matrix.
bigru is designed to use backward and forward contextual information in text, which helps obtain a user latent feature from their various behaviours by using a reset and update gates in a hidden layer in a more robust way. in general, gru-based models have shown better effectiveness and efficiency than the other recurrent neural networks (rnn) such as long short term memory (lstm) model [ ] . by capturing the contextual patterns bidirectionally helps obtain a representation of a word based on its context which means under different contexts, a word could have different representation. this indeed is very powerful than other techniques such as traditional unidirectional gru where one word is represented by only one representation. motivated by this we add a bidirectional network for gru that can effectively learn from multi-modal features and provide a better understanding of context, which helps reduce ambiguity. besides, bigru can extract more discrete features and helps improve the performance of our model. the bigru model could capture contextual patterns very well, but lacks in automatically learning the right features suitable for the model which would play a crucial role in predictive performance. to this end, we introduce a one-dimensional cnn as a new feature extractor method to classify user timeline posts. our full model can be regarded as a hybrid deep learning model where there is an interplay between a bigru and a cnn model during model training. while there are some existing models which have combined cnn and birnn models, for instance, in [ ] the authors combine bilstm or bigru and cnn to learn better features for text classification using an attention mechanism for feature fusion, which is a different modelling paradigm than what is introduced in this work, which captures the multi-modalities inherent in data. in [ ] , the authors proposed a hybrid bigru and cnn model which later constrains the semantic space of sentences with a gaussian. while the modelling paradigms may be closely related with the combinations of a bigru and a cnn model, their model is designed to handle sentence sentiment classification rather than depression detection which is a much more challenging task as tweets in our problem domain are short sentences, largely noisy and ambiguous. in [ ] , the authors propose a combination of bigru and cnn model for salary detection but do not exploit multi-modal and temporal features. finally, we also studied the performance of our model when we used the two attributes word embedding and multi-modalities separately. we found that model performance deteriorated when we used only multi-modal features. we further show when we combined the two attributes, our model led to better performance. to summarize, our study makes the following contributions: ( ) we propose a novel depression detection framework by deep learning the textual, behavioural, temporal, and semantic modalities from social media. ( ) a gated recurrent unit to detect depression using several features extracted from user behaviours. ( ) we built a cnn network to classify user timeline posts concatenated with bigru network to identify social media users who suffer from depression. to the best of our knowledge, this is the first work of using multi-modalities of topical, temporal and semantic features jointly with word embeddings in deep learning for depression detection. 
( ) the experiment results obtained on a real-world tweet dataset have shown the superiority of our proposed method when compared to baseline methods. the rest of our paper is organized as follows. section reviews the related work to our paper. section presents the dataset that used in this work, and different pre-processing we applied on data. section describes the two different attributes that we extracted for our model. in section , we present our model for detection depression. section reports experiments and results. finally, section concludes this paper. in this section, we will discuss closely related literature and mention how they are different from our proposed method. in general, just like our work, most existing studies focus on user behaviour to detect whether a user suffers from depression or any mental illness. we will also discuss other relevant literature covering word embeddings and hybrid deep learning methods which have been proposed for detecting mental health from online social networks and other resources including public discussion forums. since we also introduce the notion of latent topics in our work, we have also covered relevant related literature covering topic modelling for depression detection, which has been widely studied in the literature. data present in social media is usually in the form of information that user shares for public consumption which also includes related metadata such as user location, language, age, among others [ ] . in the existing literature, there are generally two steps to analyzing social data. the first step is collecting the data generated by users on networking sites, and the second step is to analyze the collected data using, for instance, a computational model or manually. in any data analysis, feature extraction is an important task because using only a relevant small set of features, one can learn a high-quality model. understanding depression on online social networks could be carried out using two complementary approaches which are widely discussed in the literature, and they are: • post-level behavioural analysis • user-level behavioural analysis methods that use this kind of analysis mainly target at the textual features of the user post that is extracted in the form of statistical knowledge such as those based on count-based methods [ ] . these features describe the linguistic content of the post which are discussed in [ , ] . for instance, in [ ] the authors propose classifier to understand the risk of depression. concretely, the goal of the paper is to estimate that there is a risk of user depression from their social media posts. to this end, the authors collect data from social media for a year preceding the onset of depression from user-profiles and distil behavioural attributes to be measured relating to social engagement, emotion, language and linguistic styles, ego network, and mentions of antidepressant medications. the authors collect their data using crowd-sourcing task, which is not a scalable strategy, on amazon mechanical turk. in their study, the crowd workers were asked to undertake a standardized clinical depression survey, followed by various questions on their depression history and demographics. while the authors have conducted thorough quantitative and qualitative studies, they are disadvantageous in that it does not scale to a large set of users and does not consider the notion of text-level semantics such as latent topics and semantic analysis using word embeddings. 
our work is both scalable and considers various features which are jointly trained using a novel hybrid deep learning model using a multi-modal learning approach. it harnesses high-performance graphics processing units (gpus) and as a result, has the potential to scale to large sets of instances. in hu et al., [ ] the authors also consider various linguistic and behavioural features on data obtained from social media. their underlying model relies on both classification and regression techniques for predicting depression while our method performs classification, but on a large-scale using a varied set of crucial features relevant to this task. to analyze whether the post contains positive or negative words and/or emotions, or the degree of adverbs [ ] used cues from the text, for example, i feel a little depressed and i feel so depressed, where they capture the usage of the word "depressed" in the sentences that express two different feelings. the authors also analyzed the posts' interaction (i.e., on twitter (retweet, liked, commented)). some researchers studied post-level behaviours to predict mental problems by analysing tweets on twitter to find out the depression-related language. in [ ] , the authors have developed a model to uncover meaningful and useful latent structure in a tweet. similarly, in [ ] , the authors monitored different symptoms of depression that are mentioned in a user's tweet. in [ ] , they study users' behaviour on both twitter and weibo. to analyze users' posts, they have used linguistic features. they used a chinese language psychological analysis system called textmind in sentiment analysis. one of the interesting post-level behavioural studies was done by [ ] on twitter by finding depression relevant words, antidepressant, and depression symptoms. in [ ] the authors used postlevel behaviour for detecting anorexia; they analyze domain-related vocabulary such as anorexia, eating disorder, food, meals and exercises. there are various features to model users in social media as it reflects overall behaviour over several posts. different from post-level features extracted from a single post, user-level features extract from several tweets during different times [ ] . it also extracts the user's social engagement presented on twitter from many tweets, retweets and/or user interactions with others. generally, posts' linguistic style could be considered to extract features [ , , ] . the authors in [ ] extracted six depression-oriented feature groups for a comprehensive description of each user from the collected data set. the authors used the number of tweets and social interaction as social network features. for user profile features, they have used user shared personal information in a social network. analysing user behaviour looks useful for detecting eating disorder. in wang et al., [ ] they extracted user engagement and activities features on social media. they have extracted linguistic features of the users for psychometric properties which resembles the settings described in [ , , ] where the authors have extracted features from two different social networks (twitter and weibo). they extracted features from a user profile, posting time and user interaction feature such as several followers and followee. this is one interesting work [ ] where the authors combine user-level and post-level semantics and cast their problem as a multiple instance learning setup. 
the advantage that this method has is that it can learn from user-level labels to identify post-level labels. there is an extensive literature which has used deep learning for detecting depression on the internet in general ranging from tweets to traditional document collection and user studies. while some of these works could also fall in one of the categories above, we are separately presenting these latest findings which use modern deep learning methods. the most closely related recent work to ours is [ ] where the authors propose a cnn-based deep learning model to classify twitter users based on depression using multi-modal features. the framework proposed by the authors has two parts. in the first part, the authors train their model in an offline mode where they exploit features from bidirectional encoder representations from transformers (bert) [ ] and visual features from images using a cnn model. the two features are then combined, just as in our model, for joint feature learning. there is then an online depression detection phase that considers user tweets and images jointly where there is a feature fusion at a later stage. in another recently proposed work [ ] , the authors use visual and textual features to detect depressed users on instagram posts than twitter. their model also uses multi-modalities in data, but keep themselves confined to instagram only. while the model in [ ] showed promising results, it still has certain disadvantage. for instance, bert vectors for masked tokens are computationally demanding to obtain even during the fine-tuning stage, unlike our model which does not have to train the word embeddings from scratch. another limitation of their work is that they obtain sentence representations from bert, for instance, bert imposes a token length limit where longer sequences are simply truncated resulting in some information loss, where our model has a much longer sequence length which we can tune easily because our model is computationally cheaper to train. we have proposed a hybrid model that considers a variety of features unlike these works. while we have not specifically used visual features in our work, using a diverse set of crucial relevant textual features is indeed reasonable than just visual features. of course, our model has the flexibility to incorporate a variety of other features including visual features. multi-modal features from the text, audio, images have also been used in [ ] , where a new graph attention-based model embedded with multi-modal knowledge for depression detection. while they have used temporal cnn model, their overall architecture has experimented on small-scale questionnaire data. for instance, their dataset contains sessions of interactions ranging between - min (with an average of min). while they have not experimented their method with short and noisy data from social media, it remains to be seen how their method scales to such large collections. xezonaki et al., [ ] propose an attention-based model for detecting depression from transcribed clinical interviews than from online social networks. their main conclusion was that individuals diagnosed with depression use affective language to a greater extent than those who are not going through depression. in another recent work [ ] , the authors discuss depression among users during the covid- pandemic using lstm and fasttext [ ] embeddings. 
in [ ] , the authors also propose a multi-model rnn-based model for depression prediction but apply their model on online user forum datasets. trotzek et al., [ ] study the problem of early detection of depression from social media using deep learning where the leverage different word embeddings in an ensemble-based learning setup. the authors even train a new word embedding on their dataset to obtain task-specific embeddings. while the authors have used the cnn model to learn high-quality features, their method does not consider temporal dynamics coupled with latent topics, which we show to play a crucial role in overall quantitative performance. the general motivation of word embeddings is to find a low-dimensional representation of a word in the vocabulary that signifies its meaning in the latent semantic space. while word embeddings have been popularly applied in various domains in natural language processing [ ] and information retrieval [ ] , it has also been applied in the domain of mental health issues such as depression. for instance, in [ ] , the authors study on reddit (reddit is also used in [ ] ) a few communities which contain discussions on mental health struggles such as depression and suicidal thoughts. to better model the individuals who may have these thoughts, the authors proposed to exploit the representations obtained from word embeddings where they group related concepts close to each other in the embeddings space. the authors then compute the distance between a list of manually generated concepts to discover how related concepts align in the semantic space and how users perceive those concepts. however, they do not exploit various multi-modal features including topical features in their space. farruque et al., [ ] study the problem of creating word embeddings in cases where the data is scarce, for instance, depressive language detection from user tweets. the underlying motivation of their work is to simulate a retrofitting-based word embedding approach [ ] where they begin with a pre-trained model and fine-tune the model on domain-specific data. gong et al., [ ] proposed a topic modelling approach to depression detection using multi-modal analysis. they propose a novel topic model which is context-aware with temporal features. while the model produced satisfactory results on audio/visual emotion challenge (avec), the method does not use a variety of rich features and could face scalability issues because simple posterior inference algorithms such as those based on gibbs or collapsed gibbs sampling do not parallelize unlike deep learning methods, or one need sophisticated engineering to parallelize such models. twitter has been popularly regarded as one online social media resource that provides free data for data mining on tweets. this is the reason for its popularity among researchers who have widely used data from twitter. one can freely and easily download tweet data through their apis. however, in the past, researchers have generally followed two methods for using twitter data, which are: • using an already existing dataset shared freely and publicly by others. the downside of such datasets is that they might be old to learn anything useful in the current context. recency may be crucial in some studies such as understanding current trends of a recently trending topic [ ] . 
• crawling data using vocabulary from a social media network, though slow, helps get fresh, relevant and reliable data which would help learn patterns that are currently being discussed on online social networks. this method takes time to collect and then process the relevant data, given that resources such as twitter which provide data freely impose tweet download restrictions per user per day, as a result of a fair usage policy applied to all users. developing and validating the terms used in the vocabulary by users with mental illness is time-consuming but helps obtain a reliable list of words, by which reliable tweets could be crawled, reducing the amount of false-positives. recent research conducted by the authors of [ ] is one such work that has collected large-scale data with reliable ground truth labels, which we aim to reuse. we present the statistics of the data in table . to exemplify the dataset further, the authors collected three complementary data sets, which are: • depression data set: each user is labelled as depressed, based on their tweet content between and . this includes , depressed users and , tweets. • non-depression data set: each user is labelled as non-depressed and the tweets were collected in december . this includes over million active users and billion tweets. • depression-candidate data set: the users collected are labelled as depression-candidate, where a tweet was collected if it contained the word "depress". this includes , depression-candidate users and over million tweets. data collection mechanisms are often loosely controlled, producing impossible data combinations (for instance, users labelled as depressed who have provided no posts), missing values, among other issues. table . statistics of the large dataset collected by the authors in [ ] which is used in this study: no. of users — depressed: ; non-depressed: million. no. of tweets — depressed: , ; non-depressed: billion. after data has been crawled, it is still not ready to be used directly by the machine learning model due to various noise still present in the data, which is called the "raw data". the problem is even more exacerbated when data has been downloaded from online social media such as twitter because tweets may contain spelling and grammar mistakes, smileys, and other undesirable characters. therefore, a pre-processing strategy is needed to ensure satisfactory data quality for a computational model to achieve reliable predictive analysis. the raw data used in this study has labels of "depressed" and "non-depressed". this data is organised as follows: users: this data is packaged as a json file for each user account describing details about the user such as user id, number of followers, number of tweets etc. note that json is a standard, popular data-interchange format which is easy for humans to read and write. timeline: this data package contains files containing several tweets along with corresponding metadata, again in json format. to further clean the data we used the natural language toolkit (nltk). this package has been widely used for text pre-processing [ ] and various other works. it has also been widely used for removing common words such as stop words from text [ , , ] . we have removed the common words from users' tweets (such as "the", "an", etc.) as these are not discriminative or useful enough for our model; a sketch of this cleaning step is given below. these common words sometimes also increase the dimensionality of the problem, which could lead to the "curse-of-dimensionality" problem and may have an impact on the overall model efficiency.
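as referenced above, here is a minimal sketch of the nltk cleaning step, assuming tweets have already been parsed out of the json timeline files: lowercase, tokenize, keep alphabetic tokens, and drop english stop words.

```python
# a minimal sketch of the stop-word removal step with nltk.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")
stop_words = set(stopwords.words("english"))

def clean_tweet(text):
    """return the discriminative tokens of one tweet."""
    tokens = word_tokenize(text.lower())
    # isalpha() also discards numbers, urls and stray punctuation
    return [t for t in tokens if t.isalpha() and t not in stop_words]
```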
to further improve the text quality, we have also removed non-ascii characters which have also been widely used in literature [ ] . pre-processing and removal of noisy content from the data helped get rid of plenty of noisy content from the dataset. we then obtained a high-quality reliable data which we could use in this study. besides, this distillation helped reduce the computational complexity of the model because we are only dealing with informative data which eventually would be used in modelling. we present the statistics of this distilled data below: to further mitigate the issue of sparsity in data, we excluded those users who have posted less than ten posts and users who have less than followers, therefore we ended up with positive users and negative users. social media data conveys all user contents, insights and emotion reflected from individual's behaviours in the social network. this data shows how users interact with their connections. in this work, we collect information from each user and categorize it into two types of attributes, namely multi-modal attribute and word embedding, as follows: we introduce this attribute type where the goal is to calculate the attribute value corresponding to each modality for each user. we estimate that the dimensionality for all modalities of interest is ; and we mainly consider four major modalities as listed below and ignore two modalities due to missing values. these features are extracted respectively for each user as follows: . . social information and interaction. from this attribute, we extracted several features embedded in each user profile. these are features related to each user account as specified by each feature name. most of the features are directly available in the user data, such as the number of users following and friends, favourites, etc. moreover, the extracted features relate to user behaviour on their profile. for each user, we calculate their total number of tweets, their total length of all tweets and the number retweets. we further calculate posting time distribution for each user, by counting how many tweets the user published during each of the hours a day. hence it is a -dimensional integer array. to get posting time distribution for each tweet, we extract two digits as hour information, then go through all tweets of each user and track the count of tweets posted in each hour of the day. emojis allow users to express their emotions through simple icons and non-verbal elements. it is useful to get the attention of the reader. emojis could give us a glance for the sentiment of any text or tweets, and it is essential to differentiate between positive and negative sentiment text [ ] . user tweets contain a large number of emojis which can be classified into positive, negative and neutral. for each positive, neutral, and negative type, we count their frequency in each tweet. then we sum up the numbers from each user's tweets to get the sum for each user. so the final output is three values corresponding to positive, neutral and negative emojis by the user. we also consider voice activity detection (vad) features. these features contain valance, arousal and dominance scores. for that, we count first person singular and first person plural. using affective norms for english words, a vad score for words are obtained. we create a dictionary with each word as a key and a tuple of its (valance, arousal, dominance) score as value. next, we parse each tweet and calculate vad score for each tweet using this dictionary. 
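to make the feature definitions above concrete, the following is a small sketch of two of the hand-crafted features: the 24-bin posting-time distribution and the per-tweet vad score. the timestamp format follows twitter's json 'created_at' field, and vad_lexicon is a placeholder dictionary mapping each word to its (valence, arousal, dominance) tuple.

```python
# a sketch of two hand-crafted features; vad_lexicon is a placeholder.
def posting_time_distribution(timestamps):
    """count tweets per hour of day, e.g. 'wed oct 10 20:19:24 +0000 2018'."""
    hist = [0] * 24
    for ts in timestamps:
        hour = int(ts.split()[3][:2])  # hh from the hh:mm:ss field
        hist[hour] += 1
    return hist

def tweet_vad(tokens, vad_lexicon):
    """sum valence, arousal and dominance scores over one tweet's tokens."""
    v = a = d = 0.0
    for word in tokens:
        if word in vad_lexicon:
            dv, da, dd = vad_lexicon[word]
            v, a, d = v + dv, a + da, d + dd
    return v, a, d
```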
finally, for each user, we add up the vad scores of tweets by that user, to calculate the vad score for each user. topic modelling belongs to the class of statistical modelling frameworks which helps in the discovery of abstract topics in a collection of text documents. it gives us a way of organizing, understanding and summarizing collections of textual information. it helps find hidden topical patterns throughout the process, where the number of topics is specified by the user a priori. it can be defined as a method of finding a group of words (i.e. topics) from a collection of documents that best represent the latent topical information in the collection. in our work, we applied unsupervised latent dirichlet allocation (lda) [ ] to extract the latent topic distribution from user tweets. to calculate topic-level features, we first consider the corpus of all tweets of all depressed users. next, we split each tweet into a list of words and assemble all words in decreasing order of their frequency of occurrence, and common english words (stopwords) are removed from the list. finally, we apply lda to extract the latent k = topics distribution, where k is the number of topics; we have found experimentally k = to be a suitable value (a sketch of this step is given at the end of this subsection). while there are tuning strategies and strategies based on bayesian non-parametrics [ ] , we have opted to use a simple, popular, and computationally efficient approach which helps give us the desired results. the next feature is the count of depression symptoms occurring in tweets, as specified in nine groups in the dsm-iv criteria for a depression diagnosis. the symptoms are listed in appendix a. we count how many times the nine depression symptoms are mentioned by the user in their tweets. the symptoms are specified as a list of nine categories, each containing various synonyms for the particular symptom. we created a set of seed keywords for all these nine categories, and with the help of the pre-trained word embedding, we extracted words similar to these symptoms to extend the list of keywords for each depression symptom. furthermore, we scan through all tweets, counting how many times a particular symptom is mentioned in each tweet. we also focused on antidepressants, and we created a lexicon of antidepressants from the "antidepressant" wikipedia page, which contains an exhaustive list of items and is updated regularly, in which we counted the number of names listed for antidepressants. the medicine names are listed in appendix b. word embeddings are a class of representation learning models which find the underlying meaning of words in the vocabulary in some low-dimensional semantic space. their underlying principle is based on optimising an objective function which helps bring words that repeatedly occur together under a certain contextual window close to each other in the semantic space. the usual window size that works well in many settings is [ ] . a remarkable ability of these models is that they can effectively capture various lexical properties in natural language such as the similarity between words, analogies among words, and others. these models have become increasingly popular in the natural language processing domain and have been used as input to deep learning models. among the various word embedding models proposed in the literature, word2vec [ ] is one of the most popular techniques that use shallow neural networks to learn word embeddings.
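as referenced above, a minimal sketch of the topic-level feature extraction with gensim's lda implementation; the token lists, the pass count, and the placeholder value of k are illustrative assumptions, since the paper's exact settings are elided in this copy.

```python
# a minimal sketch of the lda step with gensim; k is a placeholder.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

k = 20  # placeholder; the paper's exact value is elided in this copy
docs = all_depressed_user_token_lists  # assumed: one token list per user

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, passes=10)

# per-user topic distribution, used as a k-dimensional feature vector
topic_features = [
    [p for _, p in lda.get_document_topics(bow, minimum_probability=0.0)]
    for bow in corpus
]
```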
word2vec is a predictive model for learning word embeddings from raw text that is also computationally efficient. word2vec takes a large corpus of text as its input and generates a vector space with a corresponding vector allocated to each specific word; words that share common meanings in the corpus are located near each other in this space. to learn the semantic meaning between the words that were posted by depressed users, we add a new attribute to extract more meaningful features. count features in the multi-modalities attribute are useful and effective for extracting features from normal text. however, they cannot effectively capture the underlying semantics, structure, sequence and meaning in tweets. while count features are based on the independent occurrence of words in a text corpus, they cannot capture the contextual meaning of words in the text, which is effectively captured by word embeddings. motivated by this, we apply word embedding techniques to extract more meaningful features from every user's tweets and capture the semantic relationship within word sequences. we used the popular word2vec model [ ] with a -dimensional set of word embeddings pre-trained on the google news corpus to produce a matrix of word vectors (a sketch of assembling this embedding matrix is given at the end of this subsection). the skip-gram model is used to learn word vector representations, which are characterised by low-dimensional real-valued representations for each word. this is usually done as a pre-processing stage, after which the learned vectors are fed into a model. in this section, we describe our hybrid model that learns from multi-modal features. while there are various hybrid deep learning models proposed in the literature, our method is novel in that it learns multi-modal features which include topical features, as shown in figure . the joint learning mechanism learns the model parameters in a consolidated parameter space where different model parameters are shared during the training phase, leading to more reliable results. note that simple cascade-based approaches incorporate error propagation from one stage to the next [ ] . at the end of the feature extraction step, we obtain the training data in the form of an embedding matrix for each user representing the user timeline posts attribute. we also have a -dimensional vector of integers for each user representing the multi-modalities attribute. due to the complexity of user posts and the diversity of their behaviour on social media, we propose a hybrid model based on a cnn that combines with a bigru to detect depression through social media, as depicted in figure . for each user, the model takes two inputs for the two attributes. first, the four-modality feature input that represents the user behaviour vector runs into the bigru, capturing distinct and latent features, as well as long-term dependencies and correlation across the feature matrix. the second input represents each user input tweet, which will be replaced with its embedding and fed to the convolution layer to learn representation features from the sequential data. the intermediate outputs of both attributes are concatenated into a single feature vector that is fed into a sigmoid activation layer for prediction. in the following sections, we will discuss the two existing separate architectures which will be combined, leading to a novel computational model for modelling spatial structures and multi-modalities.
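as referenced above, a sketch of loading the 300-dimensional google news word2vec vectors and assembling the embedding matrix; vocab is an assumed word-to-index map built from the cleaned tweets, with index 0 reserved for padding.

```python
# a sketch of building the embedding matrix from pre-trained word2vec.
import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def build_embedding_matrix(vocab, dim=300):
    matrix = np.zeros((len(vocab) + 1, dim))
    for word, idx in vocab.items():
        if word in w2v:
            matrix[idx] = w2v[word]  # out-of-vocabulary words stay zero
    return matrix

emb_matrix = build_embedding_matrix(vocab)
```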
in particular, the model comprises a cnn network to learn the spatial structure from user tweets and a framework to extract latent features from the multi-modalities attribute followed by the application of a bigru. an individual user's timeline comprises semantic information and local features. recent studies show that cnns have been successfully used for learning strong, suitable and effective feature representations [ ] . the effective feature learning capabilities of cnns make them an ideal choice to extract semantic features from a user post. in this work, we propose to apply a cnn network to extract semantic information features from user tweets. the input to our cnn network is the embedded matrix layer with a sentence matrix, and the sentence will be treated as a sequence of words $s = [w_1, w_2, w_3, \ldots, w_i]$. each word $w \in \mathbb{R}^{1 \times d}$ is one vector of the embedding matrix $\mathbb{R}^{w \times d}$, where $d$ represents the dimension of each word in the matrix and $w$ represents the length or number of words of each user's posts. we set the size of each user sentence between and words and describe the average of only ten tweets for each user. note that this size is much larger than what has been used in other recent closely-related models which are based on bert. also, we could train our model on the dataset, which helps create specific representations for our dataset in a computationally less demanding way, unlike those based on bert, which is both computationally and financially expensive to train and then fine-tune. the input layer is attached to the convolution layer by three convolutional layers to learn n-gram features capturing word order, thereby capturing crucial text semantics which usually cannot be captured by a bag-of-words-based model [ ] . we use a convolution operation to extract features between words as follows: $c_n = f(W \cdot x_{n:n+h-1} + b)$, where $f$ is a nonlinear function, $b$ denotes a bias term, $W$ the filter weights, and $x_{n:n+h-1}$ a window of $h$ words. here the convolution is applied to the window of word vectors, where the window size is $h$. the network now creates a feature map according to the following equation: $c = [c_1, c_2, \ldots, c_{w-h+1}]$. the feature-map output of the convolution layer will be the input for the pooling layer, which is an important step to reduce the dimension of the space by selecting appropriate features. we used the max pooling layer to calculate the maximum value for every feature-map patch. the output of the pooling operation is generated as follows: $\hat{c} = \max\{c\}$ (a code sketch of this convolutional branch is given at the end of this subsection). we add the lstm layer to create a stack of deep learning algorithms to optimize the results. the recurrent neural network (rnn) is a powerful network when the input is fixed vectors to process in sequence, even if the data is non-sequential. models such as the bigru, gru, and lstm fall in the class of rnns. the static attributes are usually inputted to the bigru. the gru is an alternative to the lstm and links the forget gate and the input gate into a single update gate, which is more computationally efficient than an lstm network due to the reduction of gates. the gru can effectively and efficiently capture long-distance information between features, but a one-way or unidirectional gru can only partly capture the historical information in the features. moreover, for our static attributes, we would like to get information about the behavioural semantics of each user. to this end, we have applied a bigru to combine the forward and backward directions for every input feature to capture the behavioural semantics in both directions.
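as referenced above, a sketch of the convolutional branch with the tf.keras functional api; max_len, vocab_size, the filter count and the window size h=3 are illustrative assumptions rather than the paper's (elided) settings, and emb_matrix comes from the earlier word2vec sketch.

```python
# a sketch of the convolutional branch; sizes are assumptions.
from tensorflow.keras import Input, layers

post_input = Input(shape=(max_len,), name="timeline_posts")
embedded = layers.Embedding(vocab_size + 1, 300,
                            weights=[emb_matrix],
                            trainable=False)(post_input)  # kept frozen
conv = layers.Conv1D(filters=100, kernel_size=3,
                     activation="relu")(embedded)  # c_n = f(W·x + b)
pooled = layers.GlobalMaxPooling1D()(conv)         # c_hat = max{c}
```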
bidirectional models, in general, capture information of the past and the future, where information is captured considering both past and future contexts, which makes them more powerful than unidirectional models [ ] . suppose the input, which resembles a user behaviour, is represented as $x_1, x_2, \ldots, x_n$. when we apply the traditional unidirectional gru, we have the following form: $h_s = \mathrm{GRU}(x_s, h_{s-1})$. a bidirectional gru actually consists of two layers of gru, as in figure , introduced to obtain the forward and the backward information. the hidden layer has two values for the output, one for the backward output and the other for the forward output, and the algorithm can be described as follows: $\overrightarrow{h}_s = \mathrm{GRU}(x_s, \overrightarrow{h}_{s-1})$, $\overleftarrow{h}_s = \mathrm{GRU}(x_s, \overleftarrow{h}_{s+1})$, $h_s = [\overrightarrow{h}_s ; \overleftarrow{h}_s]$, where $x_s$ represents the input at step $s$, while $\overrightarrow{h}_s$ and $\overleftarrow{h}_s$ represent the hidden states of the forward and the backward gru at step $s$. each gru network is defined as follows: $z_s = \sigma(W_z x_s + U_z h_{s-1} + b_z)$, $r_s = \sigma(W_r x_s + U_r h_{s-1} + b_r)$, $\tilde{h}_s = \tanh(W_h x_s + U_h (r_s \odot h_{s-1}) + b_h)$, $h_s = (1 - z_s) \odot h_{s-1} + z_s \odot \tilde{h}_s$. the gru network calculates the update gate $z_s$ at time step $s$; this gate helps the model decide how much information obtained from the previous steps should be passed to the next step. the reset gate $r_s$ is used to determine how much information from past steps needs to be forgotten, and the gru model uses it to retain related information from the past in the candidate state $\tilde{h}_s$. lastly, the model calculates $h_s$, which holds all the information and passes it down the network. after we obtain the latent features from each model, we integrate these features and concatenate them into a single feature vector to be input into an activation function for classification, as mentioned below. experiments and results. we compare our model with the following classification methods: • ∼mdl: the multimodal dictionary learning model (mdl) detects depressed users on twitter [ ] . it uses dictionary learning to extract latent data features and a sparse representation of a user. since we cannot get access to all of [ ] 's attributes, we implement mdl in our own way. • svm: support vector machines are a class of machine learning models for text classification that optimise a loss function to draw a maximum-margin separating hyperplane between two sets of labelled data, e.g., between positively and negatively labelled data [ ] . this is a very popular classification algorithm. • nb: naive bayes is a family of probabilistic algorithms based on applying bayes' theorem with the "naive" assumption of conditional independence between features [ ] . while the suitability of the conditional independence assumption has been questioned by various researchers, these models surprisingly give superior performance when compared with many sophisticated models [ ] . for our experiments, we have used the datasets mentioned in section ( ). they provide a large scale of data, especially for the labelled negative and candidate positive sets. after pre-processing and extracting information from the raw data, we filter out the datasets below to perform our experiments: • number of users labelled positive: . • number of tweets from positive users: . • number of users labelled negative: . • number of tweets from negative users: . then, further excluding users who posted less than ten posts and users who have more than followers, we end up with a final dataset consisting of positive users and negative users. we adopt the ratio : to split our data into training and test sets. we used pre-trained word2vec trained on the google news corpus, which comprises billion words. we used python . . and tensorflow . . to develop our implementation.
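building on the convolutional-branch sketch above, the following assembles the full hybrid model: the multi-modal feature vector feeds a bidirectional gru, its output is concatenated with the pooled cnn output, and a sigmoid layer makes the final prediction. this uses the modern tf.keras functional api as an illustrative translation of the paper's tensorflow 1.x implementation; mm_dim, the gru width, and the reuse of post_input and pooled from the previous sketch are assumptions.

```python
# a sketch of the hybrid bigru+cnn assembly; layer sizes are assumptions.
from tensorflow.keras import Input, Model, layers

mm_input = Input(shape=(mm_dim, 1), name="multi_modal_features")
bigru_out = layers.Bidirectional(layers.GRU(64))(mm_input)

merged = layers.Concatenate()([bigru_out, pooled])
output = layers.Dense(1, activation="sigmoid")(merged)

model = Model(inputs=[mm_input, post_input], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```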
we set the embedding layer to be non-trainable so that we keep the feature representations, e.g., word vectors and topic vectors, in their original form. we used one hidden layer, and a max-pooling layer of size , which gave better performance in our setting. for the optimization of both the bigru and cnn networks, we used the adam optimization algorithm. finally, we trained our model for iterations, with a batch size of . the number of iterations was sufficient for the model to converge, and our experimental results further cement this claim, where we outperform existing strong baseline methods. we employ traditional information retrieval metrics such as precision, recall, f1, and accuracy based on the confusion matrix to evaluate our model. a confusion matrix is a summary matrix used for evaluating classification performance, which is also called an error matrix because it shows the number of wrong predictions versus the number of right predictions in a tabulated manner. some important terminologies associated with computing the confusion matrix are the following: • p: the actual positive case, which is depressed in our task. • n: the actual negative case, which is not depressed in our task. • tn: the actual case is not depressed, and the prediction is not depressed as well. • fn: the actual case is depressed, but the prediction is not depressed. • fp: the actual case is not depressed, but the prediction is depressed. • tp: the actual case is depressed, and the prediction is depressed as well. based on the confusion matrix, we can compute the accuracy, precision, recall and f1 score as follows: $\mathrm{accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$, $\mathrm{precision} = \frac{tp}{tp + fp}$, $\mathrm{recall} = \frac{tp}{tp + fn}$, $\mathrm{f1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$. in our experiments, we study our model attributes, including the quantitative performance of our hybrid model. for the multi-modalities attribute and the user's timeline semantic feature attribute, we use both attributes jointly. after grouping user behaviour in social media into the multi-modalities attribute (mm), we evaluate the performance of the model. first, we examine the effectiveness of using the multi-modalities attribute (mm) only with different classifiers. second, we show how the model performance increased when we combined word embedding with mm. we summarise the results in table and figure as follows: • naive bayes obtains the lowest f1 score, which demonstrates that this model has less capability to classify tweets when compared with other existing models to detect depression. the reason for its poor performance could be that the model is not robust enough to sparse and noisy data. • the ∼mdl model outperforms svm and nb and obtains better accuracy than these two methods. since this is a recent model especially designed to discover depressed users, it has captured the intricacies of the dataset well and learned its parameters faithfully, leading to better results. • we can see that our proposed model improved depression detection by up to % on f1-score, compared to the ∼mdl model. this suggests that our model outperforms a strong model. the reason why our model performs well is primarily that it leverages a rich set of features which are jointly learned in the consolidated parameter estimation, resulting in a robust model. • we can also deduce from the table that our model consistently outperforms all existing and strong baselines. • furthermore, our model achieved the best performance with % in f1, indicating that combining a bigru for the multi-modal attribute with a cnn for the user timeline semantic features is sufficient to detect depression on twitter.
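the four formulas above map directly onto scikit-learn's confusion-matrix output, which the next paragraph also uses for visualization; in this sketch, y_true and y_pred are assumed to be the binary label and prediction arrays for the test split.

```python
# computing the four reported metrics from the confusion matrix.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
```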
to get a better look at our model's performance and how it classifies the samples, we have used the confusion matrix. for this, we import the confusion matrix module from sklearn, which helps us to generate the confusion matrix. we visualize the confusion matrix, which demonstrates how classes are correlated and indicates the percentage of the samples. we can observe from figure that our model effectively predicts non-depressed users (tn) and depressed users (tp). we have also compared the effectiveness of each of the two attributes of our model. to test the performance of the model with different attributes, we build the model to feed it with each attribute separately and compare how the model performs. first, we test the model using only the multi-modalities attribute; we can observe in figure that the model performs less optimally when we use the bigru only. in contrast, the model performs better when we use only the cnn with the word embedding attribute. this signifies that extracting semantic information features from user tweets is crucial for depression detection. although the model with only the word embedding attribute outperforms the multi-modalities attribute, the true positive rate (sensitivity) for both attributes is close, as we can see from the precision scores of the bigru and the cnn. finally, we can see that the model performance increased when combining both the cnn and the bigru, outperforming each attribute used independently. after depressed users are classified, we examined the most common depression symptoms among depressed users. in figure , we can see that symptom one (feeling depressed) is the most common symptom posted by depressed users. this shows how depressed users expose and post their depressive mood on social media more than any other symptom. besides that, other symptoms such as energy loss, insomnia, a sense of worthlessness, and suicidal thoughts have appeared in more than % of depressed users. to further investigate the five most influential symptoms among depressed users, we collected all the tweets associated with these symptoms. then we created a tag cloud [ ] for each of these five symptoms, to determine the frequent and important words related to each symptom, as shown in figure , where larger-font words are relatively more important than the rest in the same cloud representation. this cloud gives us an overview of all the words that occur most frequently within each of these five symptoms. in this paper, we propose a new model for detecting depressed users through social media analysis by extracting features from user behaviour and the user's online timeline (posts). we have used a real-world data set of depressed and non-depressed users and applied it in our model. we have proposed a hybrid model which is characterised by an interplay between the bigru and cnn models. we assign the multi-modalities attribute, which represents the user behaviour, to the bigru, and the user timeline posts to the cnn to extract the semantic features. our model shows that training this hybrid network improves classification performance and identifies depressed users, outperforming other strong methods. this work has great potential to be further explored in the future; for instance, we can enhance the multi-modalities feature by using short-text topic modelling, e.g., proposing a new variant of the biterm topic model (btm) [ ] capable of generating depression-associated topics, as a feature extractor to detect depression.
besides, recently proposed pre-trained language models, such as deep contextualized word representations (elmo) [ ] and bidirectional encoder representations from transformers (bert) [ ] , could be trained on a large corpus of depression-related tweets instead of using a pre-trained word embedding model. such pre-trained language models introduce challenges because of the restriction they impose on sequence length; nevertheless, studying these models on this task helps to unearth their pros and cons. eventually, our future work aims to detect other mental illnesses in conjunction with depression to capture the complex mental issues that pervade an individual's life.
references:
diagnostic and statistical manual of mental disorders (dsm- ®)
towards using word embedding vector space for better cohort analysis
depressed individuals express more distorted thinking on social media
latent dirichlet allocation
methods in predictive techniques for mental health status on social media: a critical review
libsvm: a library for support vector machines
multimodal depression detection on instagram considering time interval of posts
empirical evaluation of gated recurrent neural networks on sequence modeling
predicting depression via social media
depression detection using emotion artificial intelligence
bert: pre-training of deep bidirectional transformers for language understanding
a depression recognition method for college students using deep integrated support vector algorithm
augmenting semantic representation of depressive language: from forums to microblogs
retrofitting word vectors to semantic lexicons
analysis of user-generated content from online social communities to characterise and predict depression degree
topic modeling based multi-modal depression detection
take two aspirin and tweet me in the morning: how twitter, facebook, and other social media are reshaping health care
natural language processing methods used for automatic prediction mechanism of related phenomenon
predicting depression of social media user on different observation windows
anxious depression prediction in real-time social data
rehabilitation of count-based models for word vector representations
text-based detection and understanding of changes in mental health
sensemood: depression detection on social media
supervised deep feature extraction for hyperspectral image classification
using social media content to identify mental health problems: the case of #depression in sina weibo
mental illness, mass shootings, and the politics of american firearms
advances in pretraining distributed word representations
rethinking communication in the e-health era
- borut sluban, and igor mozetič
- deep learning for depression detection of twitter users
- depressive moods of users portrayed in twitter
- glove: global vectors for word representation
- deep contextualized word representations
- identifying health-related topics on twitter
- early risk detection of anorexia on social media
- beyond lda: exploring supervised topic modeling for depression-related language in twitter
- beyond modelling: understanding mental disorders in online social media
- dissemination of health information through social networks: twitter and antibiotics
- depression detection via harvesting social media: a multimodal dictionary learning solution
- cross-domain depression detection via harvesting social media
- multi-modal social and psycho-linguistic embedding via recurrent neural networks to identify depressed users in online forums
- detecting cognitive distortions through machine learning text analytics
- a comparison of supervised classification methods for the prediction of substrate type using multibeam acoustic and legacy grain-size data
- sharing clusters among related groups: hierarchical dirichlet processes
- understanding depression from psycholinguistic patterns in social media texts
- utilizing neural networks and linguistic metadata for early detection of depression indications in text sequences
- recognizing depression from twitter activity timelines
- tag clouds and the case for vernacular visualization
- detecting and characterizing eating-disorder communities on social media
- topical n-grams: phrase and topic discovery, with an application to information retrieval
- salary prediction using bidirectional-gru-cnn model
- world health organization
- estimating the effect of covid- on mental health: linguistic indicators of depression during a global pandemic
- modeling depression symptoms from social network data through multiple instance learning
- georgios paraskevopoulos, alexandros potamianos, and shrikanth narayanan. affective conditioning on hierarchical networks applied to depression detection from transcribed clinical interviews
- a biterm topic model for short texts
- semi-supervised approach to monitoring clinical depressive symptoms in social media
- survey of depression detection using social networking sites via data mining
- relevance-based word embedding
- combining convolution neural network and bidirectional gated recurrent unit for sentence semantic classification
- feature fusion text classification model combining cnn and bigru with multi-attention mechanism
- graph attention model embedded with multi-modal knowledge for depression detection
- medlda: maximum margin supervised topic models
- the depression and disclosure behavior via social media: a study of university students in china
- list of depression symptoms as per dsm-iv: ( ) depressed mood. ( ) diminished interest.

key: cord- -x taxwkx authors: singh, amandeep; halgamuge, malka n.; moses, beulah title: an analysis of demographic and behavior trends using social media: facebook, twitter, and instagram date: - - journal: social network analytics doi: . /b - - - - . - sha: doc_id: cord_uid: x taxwkx

abstract: personality and character have major effects on certain behavioral outcomes. as technology advances, more and more people are using social media such as facebook, twitter, and instagram.
due to the increase in social media's popularity, types of behavior are now easier to group and study. knowing the behavior of users on social networks makes it possible to analyze similarities between certain behavior types, which in turn can be used to predict what users post as well as what they comment on, share, and like on social networking sites. however, very few review studies have grouped studies according to their similarities and differences in order to predict the personality and behavior of individuals with the help of social networking sites such as facebook, twitter, and instagram. therefore, the purpose of this research is to collect data from previous research and to analyze the methods used. this chapter reviewed research studies on the topic of behavioral analysis using social media from to . it is based on the methods of previous publications and analyzes their results, limitations, and numbers of users to draw conclusions. our results indicate that, of the completed research on facebook, twitter, and instagram, % of the studies were done on twitter, % on facebook, and % on instagram. twitter seems to be more popular and more recent than the other two spheres, as there are more studies on it. further, we grouped the studies by year; the graphs indicate that more research was initially done on facebook to analyze user behavior, with the trend decreasing in the following year, whereas in more studies were done on twitter than on any other social medium. the results also show classifications based on the different methods used to analyze individual behavior. most of the studies were done on twitter, as it is more popular and newer than facebook and instagram, particularly from to , and more research needs to be done on other social media spheres in order to analyze trending user behaviors. this study should be useful for obtaining knowledge about the methods used to analyze user behavior, with descriptions, limitations, and results. although some researchers collect demographic information such as users' gender on facebook, others working on twitter do not. this lack of demographic data, which is typically available in more traditional sources such as surveys, has created a new focus on developing methods to infer these traits as a means of expanding big data research.

the methodology section covers the data collection, criteria, and data analysis [ ] . the next section, the results, provides the statistical analysis and the percentage of research completed on the different social media; it includes a table presenting the research paper analysis by year, along with pie chart figures, the data collection and behavior analysis methods, and classifications based on the different methods with line graphs [ ] . this is followed by a discussion of the topic, and the final section concludes this research work.

data were collected from different conference papers published in the ieee. from these papers, the different methods of analyzing user behavior [ ] were assessed. this report is based on a review of the published articles and analyzes the methods they used. the data are given in tabular form. data were collected from various journal papers in the ieee library regarding the analysis of user behavior using social media from to . the collected data were related to facebook, twitter, and instagram in different countries [ ] .
the attributes used for data collection were: applications, methods used, description of the method, number of users, limitations, and results. this raw data is presented in table [ ] . the data attributes used to analyze the papers are given in table and include the following: author name, applications, methods used, details of methods, number of users, limitations, and results. data were gathered relating to the different social networking sites [ ] . in our analysis, the different methods that researchers have used to analyze user behavior are explored. in this research, three different social media datasets were collected, representing the methods and technologies used to understand the behavior of the users. the raw data presented in table specify the attributes used to conduct this research. we pooled and analyzed studies based on the impact of the variables used in those studies. the descriptive details of each study, based on publication year, were then analyzed to observe the behavior of social media users from to , followed by a comparison of the methods used to investigate user behavior. this research included papers from the last years, from to . all papers used data from facebook, instagram, and twitter.

[table excerpt: example limitations noted across studies include privacy being a major concern, the news feed algorithm being unknown, and filtering not being done properly [ ] ; example results include the observation that individuals who are friends with each other have similar interests; two evaluation metrics, roc and pr curves, were used to judge classifier performance.]

the aim of this research is to identify the methods used by researchers to predict the behavior of social media users. data were collected based on the use of three different social networking sites: facebook, instagram, and twitter. a random user list was used to analyze behavior. in our final analysis, we pooled the data, which showed statistically significant differences in various parameters (publication year, methods, results, and limitations) across the different social media sites. the results section includes the percentage of research on the three social networking sites, the research papers by year with bar graph representations, the data collection and behavior analysis methods, and a classification based on the different methods with line graph representations. we performed statistical analysis to organize the data and predict trends. this showed the different social media sites used, based on the data given in table . as shown in fig. and table , % of the data was based on facebook users, % on instagram users, and % on twitter users. as such, it is clear that twitter is used more than the other two social media sites for the analysis of user behavior. the data collection techniques and behavior analysis methods used by the different studies are shown in table . user behavior can be analyzed using the different methods shown in table . fig. classifies the papers by the different methods used; it is clear that researchers have used analysis techniques more than other methods and have rarely used coding rules. in this analysis, we observed that the number of studies on facebook and instagram in the period from to was low, so there is a need for more research in these important areas.
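to illustrate the pooling and tabulation described above, a short pandas sketch can compute the per-platform percentages and the platform-by-year counts behind the pie and bar charts. the csv file name and column names here are hypothetical stand-ins for the attributes listed in the tables, not the actual data files used in the review.

```python
import pandas as pd

# hypothetical file with one row per reviewed study
studies = pd.read_csv("reviewed_studies.csv")   # columns: author, application, method, year, ...

# share of studies per platform (facebook / twitter / instagram), as in the pie chart
platform_pct = studies["application"].value_counts(normalize=True) * 100
print(platform_pct.round(1))

# cross-tabulation of platform by publication year, as in the bar graphs
by_year = pd.crosstab(studies["year"], studies["application"])
print(by_year)
by_year.plot(kind="bar")   # plotting requires matplotlib
```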
this review study will help readers to understand the different methods that authors have used in their research studies on behavior analysis in social media. an examination of the different methods of behavior analysis carried out with the help of social media is the main aim of this research. thirty research studies were collected and analyzed to understand the personality of individuals who use social media such as facebook, twitter, and instagram; only three types of social network sites were included in this research. this analysis of the reported studies gives an overview of the methods used to predict the personality of social media users. as seen from fig. , % of the research was done on twitter from to , whereas the other two social networking sites, facebook and instagram, had only % and %, respectively. moreover, some studies [ , ] proposed more than one method to analyze individuals' behavior. a major issue in this area is the security and privacy of the information that users put on social media; however, some of the studies included in this review provided suggestions and methods to help secure users' personal information. many authors also discussed machine learning techniques for observing the personality of social networking site users.

the results showed that most of the research completed in was on twitter rather than facebook and instagram. in , most research was done on facebook and the least on instagram. on the other hand, in twitter had the highest number of research papers and facebook the lowest, and in , twitter and instagram had the highest numbers of research papers while facebook had none at all. the data collection and behavior analysis methods provided by the authors were collected as raw data and analyzed, and a classification based on the methods used by the authors was created. previous review studies did not include the limitations and number-of-users attributes in their analysis; we have included these two attributes in table to make the research more specific and easier to understand for readers [ ] . the analysis of the papers indicated that twitter has been used the most to predict the personality of social media users. considering table , there is a need for more variety in research methods on instagram to understand the behavior of its users. a cut-based classification method was used by bhagat et al. [ ] to analyze the behavior of twitter users; from their analysis, they concluded that the cut-based classification method can be extended in the future to provide a gui for users for polarity and subjectivity classifications, and that real-time user messaging can also be analyzed in the future [ ] .

this review study is based on the analysis of the behavior of individuals who use social networks in their daily life. it benefits readers by helping to identify the methods used by different researchers and the number of researchers who applied those methods, and it provides a clear description of the methods, limitations, and results reported by previous research during - . more than % of the world's population use social media; however, the ways in which social media users interact with each other vary greatly. there are demographic and behavioral trends on facebook, twitter, and instagram, which are discussed in the table below.

table: demographic and behaviour trends from the different social media
- according to age: the age group between and uses facebook more than twitter and instagram; more than % of this age group use facebook.
- according to current trends in smartphone use: another reason social media use has increased in the past year is smartphones, which offer more visual interaction and let people access social media easily; advances in mobile phones play a very important role in the growing number of social media users.
- according to location: more people use social media when they go out for dinner with family and friends; other locations where people like to use social media are the gym, the cinema, and home, especially the lounge room more than other rooms.
- according to time: more than % of people use the internet in the evening and % use it first thing in the morning; there is minimal use of social media during breakfast, lunch, work, and commuting.
- frequency of using social networking sites: more than % of people use social media more than five times a day, compared with % who never use a social networking site in a day; only % use it once a week.
- apps: more than % use apps to access social media, and fewer people use websites to access it.
in this review paper, we have reviewed and analyzed data collected from different published articles from to on the topic of behavior analysis using social media. it was found that the researchers used different methods to analyze their data. from these methods, the most common approach to analyzing the behavior of individuals was analysis techniques. from this study, it is clear that more research is needed to predict the personality and behavior of individuals on instagram. this study found that % of the research was done on twitter, and that different analysis techniques were used. while reviewing the research articles, it was clear that researchers used more than one method for data collection and behavior analysis. table contains all the data analysis of the papers reviewed in the study. furthermore, unlike past research papers, this chapter includes the number of users and the limitations of the work done as attributes; these studies mostly focused on twitter, with some research on facebook and instagram, and in this research paper we have attempted to fill the gap by including the number-of-users and limitation attributes. there are still challenges in finding solutions to the issues that have been discussed, and these require urgent attention. this study should be useful as a reference for researchers interested in the analysis of the behavior of social media users.
references:
- facebook user's like behavior can reveal personality
- social media user personality classification using computational linguistic, international conference on information technology and electrical engineering (icitee)
- trendminer: large-scale analysis of political attitudes in public facebook messages
- cut-based classification for user behavioral analysis on social websites, green computing and internet of things (icgciot)
- analyzing emotions in twitter during a crisis: a case study of the middle east respiratory syndrome outbreak in korea, big data and smart computing (bigcomp)
- prediction of cyberbullying incidents in a media-based social network, international conference on advances in social networks analysis and mining (asonam)
- media use of nursing students in thailand, international symposium on emerging trends and technologies in libraries and information services
- human behaviour in different social medias: a case study of twitter and disqus, international conference on advances in social network analysis and mining
- do private and sexual pictures receive more likes on instagram?
- demographic analysis of twitter users, advances in computing, communications and informatics (icacci)
- spotting suspicious behaviors in multimodal data: a general metric and algorithms
- characterizing behavior of topical authorities in twitter, international conference on innovative mechanisms for industry applications
- back to # d: predicting venezuelan states political election results through twitter, edemocracy & egovernment (icedeg)
- quad motif-based influence analyse of posts in instagram, advanced information and communication technologies (aict)
- the analysis of instagram technology adoption as marketing tools by small medium enterprise, information technology
- investigating link inference in partially observable networks: friendship ties and interaction
- consumer acceptance and use of instagram
- analyzing deviant behaviors on social media using cyber forensics-based methodologies
- analysis of the behavior of customers in the social networks using data mining techniques, international conference on advances in social networks analysis and mining (asonam)
- hiding in plain sight: characterizing and detecting malicious facebook pages
- fuzzy sentiment classification in social network facebook' statuses mining, international conference on sciences of electronics, technologies of information and telecommunications (setit)
- can visualization techniques help journalists to deepen analysis of twitter data? exploring the "germany x brazil" case
exploring the "germany x brazil" case measuring the controversy level of arabic trending topics on twitter, international conference on information and communication systems (icics) predicting temperament from twitter data, international congress on advanced applied informatics can twitter posts predict stock behavior, cloud computing and big data analysis (icccbda) probabilistic inference on twitter data to discover suspicious users and malicious content correlation of weather and moods of the italy residents through an analysis of their tweets, international conference on future internet of things and cloud workshop predicting personality traits of chinese users based on facebook wall posts, wireless and optical communication conference (wocc) monitoring adolescent alcohol use via multimodal analysis in social multimedia, big data (big data) personal preferences analysis of user interaction based on social networks, computing, communication and security (icccs) empirical analysis of user behavior in social media, international conference on developments of e-systems engineering fbmapping, an automated system for monitoring facebook data age distribution of active social media users worldwide as of rd quarter key: cord- -z dje authors: dev, jayati title: discussing privacy and surveillance on twitter: a case study of covid- date: - - journal: nan doi: . /rg. . . . sha: doc_id: cord_uid: z dje technology is uniquely positioned to help us analyze large amounts of information to provide valuable insight during widespread public health concerns, like the ongoing covid- pandemic. in fact, information technology companies like apple and google have recently launched tools for contact tracing-the ability to process location data to determine the people who have been in contact with a possible patient, in order to contain the spread of the virus. while china and singapore have successfully led the effort, more and more countries are now implementing such surveillance systems, raising potential privacy concerns about this long term surveillance. for example, it is not clear what happens to the information post-pandemic because people are more likely to share their information during a global crisis without governments having to elaborate on their data policies. digital ethnography on twitter, which has over million users worldwide, with a majority in the united states where the pandemic has the worst effects provides a unique opportunity to learn about real-time opinions of the general public about current affairs in a rather naturalistic setting. consequently, it might be useful to highlight the privacy concerns of users, should they exist, through analysis of twitter data and information sharing policies during unprecedented public health outbreaks. this will allow governments to protect their citizens both during and after health emergencies. technology is uniquely positioned to help us analyze large amounts of information to provide valuable insight during widespread public health concerns, like the ongoing covid- pandemic. in fact, information technology companies like apple and google have recently launched tools for contact tracing -the ability to process location data to determine the people who have been in contact with a possible patient, in order to contain the spread of the virus [ ] . 
while china and singapore have successfully led the effort [ ] , more and more countries are now implementing such surveillance systems, raising potential privacy concerns about this long-term surveillance. for example, it is not clear what happens to the information post-pandemic, because people are more likely to share their information during a global crisis without governments having to elaborate on their data policies [ ] . digital ethnography on twitter, which has over million users worldwide, with a majority in the united states where the pandemic has had the worst effects [ ] , provides a unique opportunity to learn about real-time opinions of the general public on current affairs in a rather naturalistic setting. consequently, it might be useful to highlight the privacy concerns of users [ ] , should they exist, through analysis of twitter data and information sharing policies during unprecedented public health outbreaks. this will allow governments to protect their citizens both during and after health emergencies.

the specific research questions in this study ask how the discussion around privacy and surveillance has evolved over the duration of the covid- pandemic. they are as follows:

1. what are the various discussion topics involving covid- surveillance, and how frequently do they occur?
2. what are users' sentiments about surveillance during the covid- outbreak?

using python libraries for topic modelling with latent dirichlet allocation (lda) and sentiment analysis with the natural language toolkit (nltk), i report the discussions around privacy and people's sentiments towards surveillance at large. i also observe the discussion over time and user engagement on twitter with these topics, which reveals that users engage in privacy discussions around covid- , possibly propelled by popular media articles, with a rising negative sentiment towards government surveillance (and other privacy concerns). the findings indicate a need for better, privacy-preserving planning of data collection and analysis by governments and companies, while still providing an important information source for governments in containing public health emergencies like covid- .

the current covid- pandemic has raised important questions about the way we deal with privacy and security concerns around personal information in the wake of a public health emergency. a number of countries have implemented widespread surveillance of their citizens by using location information for contact-tracing. this helps them understand whether people with covid- symptoms have been in contact with other people who in turn might get infected. while some articles claim [ ] that surveillance has not been particularly effective in controlling the outbreak, many countries argue otherwise [ ] . this has raised concerns among privacy think-tanks about what will be done after the outbreak with the user information that users readily provide in efforts to contain the pandemic [ ] . with google and apple combining efforts for contact-based tracing using granular location information [ ] , people who are vulnerable during the outbreak must not be affected by the consequences of big data collection afterwards. furthermore, the extent of remote work and school during the novel coronavirus outbreak has also enabled people to connect with their workplace and academic work from home via the internet while maintaining social distancing norms.
video conferencing software such as zoom has been found to collect user information and has had multiple security bugs that led to 'zoom-bombing' [ ] , with several efforts made by zoom to fix such bugs. this also creates a need to study the privacy concerns and sentiment around technological problems (or misuse) of products that are instrumental in keeping people connected during extended periods of social isolation and government lockdowns.

though twitter is often studied for user sentiment in socio-political contexts like the spread of false information [ ] , it is rarely used for evaluating privacy concerns among people. privacy concerns are usually studied through traditional mixed-method research tools like surveys and interviews, and to the best of my knowledge there is limited research on privacy concerns during a large-scale emergency of this unprecedented kind. however, despite its limitations, twitter provides a naturalistic setting in which to understand the popular conversation about privacy and security that emerges as an indirect effect of the novel coronavirus outbreak. the variety of discussion on twitter provides a starting point for a more nuanced analysis of privacy concerns and has been the focus of this study. while twitter users tend to be younger, more educated, and more liberal than the general population [ ] , the platform nevertheless provides an opportunity for insight into individuals' privacy concerns.

the data was collected using the get old tweets api (https://github.com/jefferson-henrique/getoldtweets-python), which maintains an archive of old tweets. newer tweets (the last tweets in the dataset) were collected using the tweepy twitter api. i conducted a time-series analysis from march , (the first week of government lockdown [ ] ) to april , (the date of data collection) to measure the number of tweets by users over time. this was followed by a measure of occurrences of retweets and favorites for the specific tweets collected, to study how users engaged with privacy-specific content regarding covid- on twitter. i used a predetermined set of keywords that included "coronavirus" and "privacy".

the first research question required an in-depth analysis of tweets, since this is a novel phenomenon with limited prior theory. i used latent dirichlet allocation for topic modelling in order to highlight the different privacy themes and opinions that emerge. this was done using the lda model available through the gensim package in python together with the natural language toolkit (nltk). nltk was used to tokenize words from tweets, remove existing stop words (like prepositions), and reduce the words to their stem forms. the number of topics in lda was set to ten for each dataset (tweets from march and those from april respectively). the resulting topics were then tagged with a description capturing the emerging theme.

in order to answer the second research question, i performed a sentiment analysis of tweets in python, using a naive bayes classification approach (previously known as automatic indexing [ ] ). i used pandas for string handling and for converting the dataset from csv into a pandas dataframe for easy manipulation. i also used sklearn to access the bayes classification algorithm and nltk packages for data pre-processing. sklearn was also used to create datasets for both training and testing the sentiment analysis model (i used the same dataset for both). the results from the analysis are described below. please refer to the appendix for code and datasets.
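a minimal sketch of the topic-modelling pipeline just described, using nltk for tokenising, stop-word removal and stemming, and gensim for the lda model. the sample tweets and most parameters are assumptions for illustration; only the choice of ten topics follows the text above.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from gensim import corpora
from gensim.models import LdaModel

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

# tiny illustrative corpus; the study used the collected tweet dataset instead
tweets = [
    "government surveillance during the pandemic raises privacy concerns",
    "contact tracing apps may help contain the outbreak",
]

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # tokenise, drop stop words and non-alphabetic tokens, reduce to stems
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

docs = [preprocess(t) for t in tweets]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# ten topics per dataset, matching the description above; other settings assumed
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
               passes=10, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```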
ethics: i chose not to collect twitter user information, in order to keep the dataset de-identified and protect user privacy. the collection of publicly available tweets does not require an institutional review board (irb) application.

the resulting dataset from mining twitter for march-april generated a total of tweets ( april - ; march - ). i used time-series analysis of tweets, retweets, and favorites to see the number of discussions over time and the impact per discussion (tweet) during the pandemic. this is followed by topic modelling of the topics that people discuss regarding privacy specific to surveillance during covid- , and a sentiment analysis of such tweets (positive, negative, or neutral). the findings are explained in detail below.

figure shows the number of tweets over time since march , that express surveillance concerns. the graph peaks on march , , the date of publication of the first electronic frontier foundation article on government surveillance during covid- and the resulting privacy concerns. the graph first spikes during the first week of lockdown (which started on march , in the united states). a polynomial trendline (r = . ) shows that tweets expressing privacy concerns peaked during the first week of april and subsequently started to decrease in frequency.

analysis of the retweets and favorites for each tweet presented an interesting result. as can be seen in figure and figure , there is a disproportionate amount of engagement with specific tweets, while most tweets have numbers of retweets and favorites under the numeric baseline. table shows the mean, median, maximum, and minimum values of retweets and favorites respectively for all the tweets. as mentioned in the table, i had to remove an outlier tweet (with retweets = and favorites = as of april , ) to adjust the scatter plots in figures and . the average number of retweets was . and the average number of favorites was . , with the median for both being zero. this shows that most tweets had very little engagement from users, while certain tweets received a great deal of attention (given by the maximum value of the outlier tweet in table ).

users seem to discuss both positive and negative aspects of surveillance and the loss of privacy during covid- . while some users are concerned about the government and private companies tracking individuals, others note the possible benefits of contact-tracing through smartphones in containing the spread. the use of devices like smartphones to collect health information, and long-term surveillance in the face of european privacy laws, were discussed. similarly, users also discussed the need for surveillance to protect people. a third category of topics simply reports on the situation without taking any side, such as cross-country efforts in surveillance. another common theme that emerged through topic modelling was privacy concern about the technology in use (topics , , and ). there seemed to be concerns regarding the data security of information gathered by mobile applications. there was also discussion around the impact of data collection by companies like google, whose products are increasingly used for long-distance connectivity for both personal and professional reasons. table shows the ten topics modelled using lda for both march and april, along with their descriptions.

however, the majority of the tweets were neutral in emotion ( % in march to % in april). this indicates that tweets were mostly used to report facts and information rather than to express an opinion on privacy concerns. this is also supported by the fact that the tweet with the maximum engagement (retweets, favorites) is a short description of an article about widespread surveillance and not an expression of sentiment. figure and figure show the results of the sentiment analysis for the months of march and april respectively (expressed as percentages).
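the naive bayes sentiment classification described in the methods could look roughly like the sketch below, using pandas and sklearn. the file and column names are hypothetical, and unlike the study (which reused the same dataset for training and testing), the sketch uses a held-out split for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# hypothetical csv of labelled tweets; columns: text, sentiment
df = pd.read_csv("privacy_tweets.csv")

x_train, x_test, y_train, y_test = train_test_split(
    df["text"], df["sentiment"], test_size=0.2, random_state=0)

# bag-of-words features feeding a multinomial naive bayes classifier
vectoriser = CountVectorizer(stop_words="english")
x_train_vec = vectoriser.fit_transform(x_train)
x_test_vec = vectoriser.transform(x_test)

clf = MultinomialNB()
clf.fit(x_train_vec, y_train)
pred = clf.predict(x_test_vec)

print(accuracy_score(y_test, pred))
print(pred[:10])   # e.g. predicted labels such as 'neutral', 'negative', ...
```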
this study provides insight into the privacy expectations of users on twitter during pandemic situations in which technology is used to control outbreaks. the lower privacy concerns expressed in the initial lockdown period are replaced over time by a more negative sentiment towards governmental and organizational efforts to maintain privacy in health information disclosure. more positive views towards surveillance for controlling the spread of covid- , with country-specific discussion topics on public- and private-sector efforts at technological intervention in monitoring the spread of the virus, are replaced by post-pandemic privacy concerns. topic modelling indicates that the discussion about privacy is usually geared towards surveillance, probably driven by the discussion started by the electronic frontier foundation with a covid- surveillance article [ ] , followed by a new york times opinion article [ ] . this is supported by the spike in the number of tweets around march , , the date of the published article [ ] . further research on the privacy concerns of contact-tracing to address a public health emergency like the novel coronavirus outbreak would help form policy around long-term continuous location tracking after the pandemic, so that such information is deleted after use while remaining instrumental in protecting public safety during the outbreak.

references:
- the biggest questions about apple and google's new coronavirus tracker. the verge
- singapore says it will make its contact tracing tech freely available to developers
- governments haven't shown location surveillance would help contain covid- . electronic frontier foundation
- twitter statistics every marketer should know in
- sizing up twitter users
- privacy in mining crime data from social media: a south african perspective
- a comprehensive timeline of the new coronavirus pandemic, from china's first covid- case to the present
- automatic indexing: an experimental inquiry
- the spread of true and false news online
- privacy cannot be a casualty of the coronavirus
- evaluation of the effectiveness of surveillance and containment measures for the first patients with covid- in singapore
- zoom releases security updates in response to 'zoom-bombings'

appendix: the data and code are available here (source acknowledgement inline): https://iu.box.com/s/tzn ak ymyuna cvgh w pyo e pr . python scripts may require interpreter debugging due to the various libraries. please use pip install to install the required libraries (especially for nltk).

key: cord- -flzqm wh authors: buchanan, tom title: why do people spread false information online? the effects of message and viewer characteristics on self-reported likelihood of sharing social media disinformation date: - - journal: plos one doi: . /journal.pone. sha: doc_id: cord_uid: flzqm wh

individuals who encounter false information on social media may actively spread it further, by sharing or otherwise engaging with it. much of the spread of disinformation can thus be attributed to human action.
four studies (total n = , ) explored the effect of message attributes (authoritativeness of source, consensus indicators), viewer characteristics (digital literacy, personality, and demographic variables) and their interaction (consistency between message and recipient beliefs) on self-reported likelihood of spreading examples of disinformation. participants also reported whether they had shared real-world disinformation in the past. reported likelihood of sharing was not influenced by the authoritativeness of the source of the material, nor by indicators of how many other people had previously engaged with it. participants' level of digital literacy had little effect on their responses. the people reporting the greatest likelihood of sharing disinformation were those who thought it likely to be true, or who had pre-existing attitudes consistent with it; they were also likely to have previous familiarity with the materials. across the four studies, personality (lower agreeableness and conscientiousness, higher extraversion and neuroticism) and demographic variables (male gender, lower age and lower education) were weakly and inconsistently associated with self-reported likelihood of sharing. these findings have implications for which strategies are more or less likely to work in countering disinformation on social media.

disinformation is currently a critically important problem in social media and beyond. typically defined as "the deliberate creation and sharing of false and/or manipulated information that is intended to deceive and mislead audiences, either for the purposes of causing harm, or for political, personal or financial gain", political disinformation has been characterized as a significant threat to democracy [ , p. ] . it forms part of a wider landscape of information operations conducted by governments and other entities [ , ] . its intended effects include political influence, increasing group polarisation, reducing trust, and generally undermining civil society [ ] . effects are not limited to online processes; they regularly spill over into other parts of our lives. experimental work has shown that exposure to disinformation can lead to attitude change [ ] , and there are many real-world examples of behaviours that have been directly attributed to disinformation, such as people attacking telecommunications masts in response to fake stories about ' g causing coronavirus' [ , ] . social media disinformation is very widely used as a tool of influence: computational propaganda has been described as a pervasive and ubiquitous part of modern everyday life [ ] .

once disinformation has initially been seeded online by its creators, one of the ways in which it spreads is through the actions of individual social media users. ordinary people may propagate the material to their own social networks through deliberate sharing, a core function of platforms such as facebook and twitter. other interactions with it, such as 'liking', also trigger the algorithms of social media platforms to display it to other users. this is a phenomenon known as 'organic reach' [ ] . it can lead to false information spreading exponentially. as an example, analysis of the activity of the russian 'internet research agency' (ira) disinformation group in the usa between and concluded that over million users shared and otherwise interacted with the ira's facebook and instagram posts, propagating them to their families and friends [ ] .
there is evidence that false material is spread widely and rapidly through social media due to such human behaviour [ ] . when individuals share or interact with disinformation they see online, they have essentially been persuaded to do so by its originators. influential models of social information processing suggest there are different routes to persuasion [e.g. ] . under some circumstances, we may carefully consider the information available. at other times, we make rapid decisions based on heuristics and peripheral cues. when sharing of information on social media occurs, it is likely to be spontaneous and rapid, rather than a considered action that people spend time deliberating over. for example, there are indications of people using the interaction features of facebook in a relatively unthinking and automatic manner [ ] . in such situations, a peripheral route to persuasion is likely to be important [ ] . individuals' choices to share, like and so on will thus be guided primarily by heuristics or contextual cues [ ] . three potentially important heuristics in this context are consistency, consensus and authority [ ] . these are not the only heuristics that might possibly influence whether we share false material. however, in each case there is suggestive empirical evidence, and apparent real-world attempts to leverage these phenomena, that make them worth considering.

consistency. consistency is the extent to which sharing would be consistent with past behaviours or beliefs of the individual. for example, in the usa people with a history of voting republican might be more likely to endorse and disseminate right-wing messaging [ ] . there is a large body of work based on the idea that people prefer to behave in ways consistent with their attitudes [ ] . research has indicated that social media users consider headlines consistent with their pre-existing beliefs as more credible, even when explicitly flagged as being false [ ] . in the context of disinformation, this could make it desirable to target audiences sympathetic to the message content.

consensus. consensus is the extent to which people think their behaviour would be consistent with that of most other people. in the current context, it is possible that seeing a message has already been shared widely might make people more likely to forward it on themselves. in marketing, this influence tactic is known as 'social proof' [ ] . it is widely used in online commerce in attempts to persuade consumers to purchase goods or services (e.g. by displaying reviews or sales rankings). the feedback mechanisms of social networks can be manipulated to create an illusion of such social support, and this tactic seems to have been used in the aftermath of terror attacks in the uk [ ] . bot networks are used to spread low-credibility information on twitter through automated means. bots have been shown to be involved in the rapid spread of information, tweeting and retweeting messages many times [ ] . among humans who see the messages, the high retweet counts achieved through the bot networks might be interpreted as indicating that many other people agree with them. there is evidence which suggests that "each amount of sharing activity by likely bots tends to trigger a disproportionate amount of human engagement" [ , p. ] . such bot activity could be an attempt to exploit the consensus effect. it is relatively easy to manipulate the degree of consensus or social proof associated with an online post.
work by the nato strategic communications centre of excellence [ ] indicated that it was very easy to purchase high levels of false engagement for social media posts (e.g. sharing of posts by networks of fake accounts) and that there was a significant black market for social media manipulation. thus, if boosting consensus effectively influences organic reach, then it could be a useful tool for both those seeding disinformation and those seeking to spread counter-messages.

authority. authority is the extent to which the communication appears to come from a credible, trustworthy source [ ] . research participants have been found to report a greater likelihood of propagating a social media message if it came from a trustworthy source [ ] . there is evidence of real-world attempts to exploit this effect. in , twitter identified fraudulent accounts that simulated those of us local newspapers [ ] , which may be trusted more than national media [ ] . these may have been sleeper accounts established specifically for the purpose of building trust prior to later active use.

factors influencing the spread of disinformation. while there are likely to be a number of other variables that also influence the spread of disinformation, there are grounds for believing that consistency, consensus and authority may be important. constructing or targeting disinformation messages in such a way as to maximise these three characteristics may be a way to increase their organic reach. there is real-world evidence of activity consistent with attempts to exploit them. if these effects do exist, they could also be exploited by initiatives to counter disinformation.

not all individuals who encounter untrue material online spread it further. in fact, the great majority do not. research linking behavioural and survey data [ ] found that less than % of participants shared articles from 'fake news' domains during the us presidential election campaign (though of course, when extrapolated to the huge user base of social network platforms like facebook, this is still a very large number of people). the fact that only a minority of people actually propagate disinformation makes it important to consider what sets them apart from people who don't spread untrue material further. this will help to inform interventions aimed at countering disinformation. for example, those most likely to be misled by disinformation, or to spread it further, could be targeted with counter-messaging. it is known that the originators of disinformation have already targeted specific demographic groups, in the same way as political campaigns micro-target messaging at those audience segments deemed most likely to be persuadable [ ] . for example, it is believed that the 'internet research agency' sought to segment facebook and instagram users based on race, ethnicity and identity by targeting their messaging to people recorded by the platforms as having certain interests for marketing purposes [ ] . they targeted specific communications tailored to those segments (e.g. trying to undermine african americans' faith in political processes and suppress their voting in the us presidential election).

research has found that older adults, especially those aged over , were by far the most likely to spread material originally published by 'fake news' domains [ ] . a key hypothesis advanced to explain this is that older adults have lower levels of digital media literacy, and are thus less likely to be able to distinguish between true and false information online.
while definitions may vary, digital media literacy can be thought of as including ". . . the ability to interact with textual, sound, image, video and social medias . . . finding, manipulating and using such information" [ , p. ] and being a "multidimensional concept that comprised technical, cognitive, motoric, sociological, and emotional aspects" [ , p. ] . digital media literacy is widely regarded as an important variable mediating the spread and impact of disinformation [e.g. ] . it is argued that many people lack the sophistication to detect a message as being untruthful, particularly when it appears to come from an authoritative or trusted source. furthermore, people higher in digital media literacy may be more likely to engage in elaborated, rather than heuristic-driven, processing (cf. work on phishing susceptibility [ ] ), and thus be less susceptible to biases such as consistency, consensus and authority.

educating people in digital media literacy is the foundation of many anti-disinformation initiatives. examples include the 'news hero' facebook game developed by the nato strategic communications centre of excellence (https://www.stratcomcoe.org/news-hero), government initiatives in croatia and france [ ] , and the work of numerous fact-checking organisations. the effectiveness of such initiatives relies on two assumptions being met. the first is that lower digital media literacy really does reduce our capacity to identify disinformation. there is currently limited empirical evidence on this point, complicated by the fact that definitions of 'digital literacy' are varied and contested, and there are currently no widely accepted measurement tools [ ] . the second is that the people sharing disinformation are doing so unwittingly, having been tricked into spreading it. however, it is possible that at least some people know the material is untrue and spread it anyway. survey research [ ] has found that believing a story was false was not necessarily a barrier to sharing it. people may act like this because they are sympathetic to a story's intentions or message, or because they are explicitly signalling their social identity or allegiance to some political group or movement. if people are deliberately forwarding information that they know is untrue, then raising their digital media literacy would be ineffective as a stratagem to counter disinformation. this makes it important to simultaneously consider users' beliefs about the veracity of disinformation stories, to inform the design of countermeasures.

personality. it is also known that personality influences how people use social media [e.g. ] . this makes it possible that personality variables will also influence interactions with disinformation. indeed, previous research [ ] found that people low on agreeableness reported themselves as more likely to propagate a message. this is an important possibility to consider, because it raises the prospect that individuals could be targeted on the basis of their personality traits with either disinformation or counter-messaging. in a social media context, personality-based targeting of communications is feasible because personality characteristics can be detected from individuals' social media footprints [ , ] . large-scale field experiments have shown that personality-targeted advertising on social media can influence user behaviour [ ] . the question of which personality traits might be important is an open one.
in the current study, personality was approached on an exploratory basis, with no specific hypotheses about effects or their directions. this is because there are a number of different and potentially rival effects that might operate. for example, higher levels of conscientiousness may be associated with a greater likelihood of posting political material in social media [ ] , leading to a higher level of political disinformation being shared. however, people higher in conscientiousness are likely to be more cautious [ ] and to pay more attention to details [ ] ; they might therefore also be more likely to check the veracity of the material they share, leading to a lower level of political disinformation being shared.

the overall aim of this project was to establish whether contextual factors in the presentation of disinformation, or characteristics of the people seeing it, make it more likely that they extend its reach. the methodology adopted was scenario-based, with individuals being asked to rate their likelihood of sharing exemplar disinformation messages. a series of four studies was conducted, all using the same methodology. multiple studies were used to establish whether the same effects were found across different social media platforms (facebook in study 1, twitter in study 2, instagram in study 3) and countries (facebook with a uk sample in study 1, facebook with a us sample in study 4). data were also collected on whether participants had shared disinformation in the past. a number of distinct hypotheses were advanced:

h1: individuals will report themselves as more likely to propagate messages from more authoritative compared to less authoritative sources.
h2: individuals will report themselves as more likely to propagate messages showing a higher degree of consensus compared to those showing a lower degree of consensus.
h3: individuals will report themselves as more likely to propagate messages consistent with their pre-existing beliefs compared to inconsistent messages.
h4: individuals lower in digital literacy will report a higher likelihood of sharing false messages than individuals higher in digital literacy.

other variables were included in the analysis on an exploratory basis, with no specific hypotheses being advanced. in summary, this project asks why ordinary social media users share political disinformation messages they see online. it tests whether specific characteristics of messages or their recipients influence the likelihood of disinformation being further shared online. understanding any such mechanisms will both increase our understanding of the phenomenon and inform the design of interventions seeking to reduce its impact.

study 1 tested hypotheses 1-4 with a uk sample, using stimuli relevant to the uk. the study was completed online. participants were members of research panels sourced through the research company qualtrics. participants were asked to rate their likelihood of sharing three simulated facebook posts. the study used an experimental design, manipulating the levels of authoritativeness and consensus apparent in the stimuli. all manipulations were between, not within, participants. consistency with pre-existing beliefs was not manipulated; instead, the political orientation of the stimuli was held constant, and participants' scores on conservative political orientation were used as an index of consistency between messages and participant beliefs.
the effects of these variables on self-rated likelihood of sharing the stimuli, along with those of a number of other predictors, were assessed using multiple regression. the primary goal of the analysis was to identify variables that statistically significantly explained variance in the likelihood of sharing disinformation. the planned analysis was followed by supplementary and exploratory analyses. all analyses were conducted using spss v. for mac. for all studies reported in this paper, ethical approval came from both the university of westminster research ethics committee (eth - ) and the lancaster university security research ethics committee (buchanan ). consent was obtained, via an electronic form, from anonymous participants.

a short questionnaire was used to capture demographic information (gender; country of residence; education; age; occupational status; political orientation expressed as right, left or centre; frequency of facebook use). individual differences in personality, political orientation, and digital/new media literacy were measured using established validated questionnaires. ecologically valid stimuli were used, with their presentation modified across conditions to vary authoritativeness and consensus markers.

personality was measured using a -item five-factor personality questionnaire [ ] derived from the international personality item pool [ ] . the measure provides indices of extraversion, neuroticism, openness to experience, agreeableness and conscientiousness that correlate well with the domains of costa and mccrae's [ ] five factor model. conservatism was measured using the -item social and economic conservatism scale (secs) [ ] , which is designed to measure political orientation along a left-right, liberal-conservative continuum. it was developed and validated using a us sample. in pilot work for the current study, mean scores for individuals who reported voting for the labour and conservative parties in the uk general election were found to differ in the expected manner (t ( ) = - . , p = . , d = . ), providing evidence of its appropriateness for use in uk samples. while the measure provides indices of different aspects of conservatism, it also provides an overall conservatism score, which was used in this study.
rather than full articles, excerpts (screenshots) were used that had the size and general appearance of what respondents might expect to see on social media sites. the excerpts were edited to remove any indicators of the source, metrics such as the numbers of shares, date, and author. all had a right-wing orientation (so that participant conservatism could be used as a proxy for consistency between the messages and existing beliefs). this was established in pilot work in which participants rated the political orientation of the stories and their likelihood of being shared. the three stories were among seven rated by a uk sample (n = ) on an -point scale asking "to what extent do you think this post was designed to appeal to people with right wing (politically conservative) views?", anchored at "very left wing oriented" and "very right wing oriented". all seven were rated statistically significantly above the politically-neutral midpoint of the scale. for the least right-wing of the three stimuli selected for use in this study, a one-sample t-test showed that its mean rating was statistically significantly higher than the midpoint (t ( ) = . , p < . , d = . ). one of the stimuli was a picture of masked and hooded men titled "censored video: watch muslims attack men, women & children in england". one was a picture of many people walking down a road, titled "revealed: un plan to flood america with million migrants", with accompanying text describing a plan to "flood america and europe with hundreds of millions of migrants to maintain population levels". the third was a picture of the swedish flag titled "'child refugee' with flagship samsung phone and gold watch complains about swedish benefits rules", allegedly describing a year-old refugee's complaints. the authoritativeness manipulation was achieved by pairing the stimuli with sources regarded as relatively high or low in authoritativeness. the source was shown above the stimulus being rated, in the same way as the avatar and username of someone who had posted a message would be on facebook. the sources in the lower authoritativeness group were slight variants on real usernames that had previously retweeted either stories from infowars.com or another story known to be untrue. the original avatars were used. the exemplars used in this study were named 'tigre' (with an avatar of an indistinct picture of a female face), 'jelly beans' (a picture of some jelly beans) and 'chucke' (an indistinct picture of a male face). the higher authoritativeness group comprised actual fake accounts set up by the internet research agency (ira) group to resemble local news sources, selected from a list of suspended ira accounts released by twitter. the exemplars used in this study were 'los angeles daily', 'chicago daily news' and 'el paso top news'. pilot work was conducted with a sample of uk participants (n = ) who each rated a selection of usernames, including these, for the extent to which each was "likely to be an authoritative source, that is, likely to be a credible and reliable source of information". a within-subjects t-test indicated that mean authoritativeness ratings for the 'higher' group were statistically significantly higher than for the 'lower' group (t ( ) = - . , p < . , d z = . ). the consensus manipulation was achieved by pairing the stimuli with indicators of the number of shares and likes the story had received. the indicators were shown below the stimulus being rated, in the same way as they normally would be on facebook. in the low consensus conditions, low numbers of likes ( , , ) and shares ( , , ) were displayed.
in the high consensus conditions, higher (but not unrealistic) numbers of likes ( k, k, k) and shares ( k, k, k) were displayed. the information was presented using the same graphical indicators as would be the case on facebook, accompanied by the (inactive) icons for interacting with the post, in order to maximise ecological validity. procedure. the study was conducted completely online, using materials hosted on the qualtrics research platform. participants initially saw an information page about the study, and on indicating their consent proceeded to the demographic items. they then completed the personality, conservatism and new media literacy scales. each of these was presented on a separate page, except the nmls, which was split across three pages. participants were then asked to rate the three disinformation items. participants were randomized to different combinations of source and story within their assigned condition, so that across participants each story appeared with different sources drawn from the same authoritativeness set. each participant saw the same three stories paired with one combination of authoritativeness and consensus, yielding a number of distinct sets of stimuli. each participant saw an introductory paragraph stating "a friend of yours recently shared this on facebook, commenting that they thought it was important and asking all their friends to share it:". below this was the combination of source, story, and consensus indicators, presented together in the same way as a genuine facebook post would be. participants then rated the likelihood of their sharing the post to their own public timeline, on an -point scale anchored at 'very unlikely' and 'very likely'. this was repeated for the second and third stimuli, each on a separate page. having rated each one, participants were then shown all three stimuli again, this time on the same page. they were asked to rate each one for "how likely do you think it is that the message is accurate and truthful" and "how likely do you think it is that you have seen it before today", on -point scales anchored at 'not at all likely' and 'very likely'. after rating the stimuli, participants were asked two further questions: "have you ever shared a political news story online that you later found out was made up?", and "and have you ever shared a political news story online that you thought at the time was made up?", with 'yes' or 'no' response options. this question format directly replicated that used in pew research centre surveys dealing with disinformation [e.g. ]. finally, participants were given the opportunity once again to give or withdraw their consent for participation. they then proceeded to a debriefing page. it was only at the debriefing stage that they were told the stories they had seen were untrue: no information about whether the stimuli were true or false had been presented prior to that point. data screening and processing. prior to delivery of the sample, qualtrics performed a series of quality checks and 'data scrubbing' procedures to remove and replace participants with response patterns suggesting inauthentic or inattentive responding. these included speeding checks and examination of response patterns. on delivery of the initial sample (n = ), further screening procedures were performed.
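screening checks of this kind can be illustrated with a short sketch. this is a hypothetical reconstruction in python; the thresholds and column names are assumptions, not the study's actual code.

    import pandas as pd

    raw = pd.read_csv("study1_raw.csv")  # hypothetical raw export

    # 'straightlining': identical responses across all substantive items
    item_cols = [c for c in raw.columns if c.startswith("q_")]
    straightlining = raw[item_cols].nunique(axis=1) == 1

    # speeding: unrealistically short completion times (threshold illustrative)
    too_fast = raw["duration_seconds"] < 240

    clean = raw[~(straightlining | too_fast)]
    print(f"retained {len(clean)} of {len(raw)} respondents")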
sixteen respondents were identified who had responded with the same scores to substantive sections of the questionnaire ('straightlining'). these were removed, leaving n = . these checks and exclusions were carried out prior to any data analysis. where participants had missing data on any variables, they were omitted only from analyses including those variables. thus, ns vary slightly throughout the analyses. participants. the target sample size was planned to exceed n = , which would give % power to detect r = . (a benchmark for the minimum effect size likely to have real-world importance in social science research [ ]) in the planned multiple regression analysis with predictors. qualtrics was contracted to provide a sample of facebook users that was broadly representative of the uk census population in terms of gender; the split between those who had post-secondary-school education and those who had not; and age profile ( +). quotas were used to assemble a sample comprising approximately one third each self-describing as left-wing, centre and right-wing in their political orientation. participant demographics are shown in table , column . descriptive statistics for participant characteristics (personality, conservatism, new media literacy and age) and their reactions to the stimuli (likelihood of sharing, belief the stories were likely to be true, and rating of likelihood that they had seen them before) are summarised in table . all scales had acceptable reliability. the main dependent variable, likelihood of sharing, had a very skewed distribution with a strong floor effect: . % of the participants indicated they were 'very unlikely' to share any of the three stories they saw. this is consistent with findings on real-world sharing indicating that only a small proportion of social media users will actually share disinformation [e.g. ], though it gives a dependent variable with less than ideal distributional properties. to simultaneously test hypotheses 1-4, a multiple regression analysis was carried out. this evaluated the extent to which digital media literacy (nmls), authority of the message source, consensus, belief in the veracity of the messages, consistency with participant beliefs (operationalised as the total secs conservatism scale score), age and personality (extraversion, conscientiousness, agreeableness, openness to experience and neuroticism) predicted self-rated likelihood of sharing the posts. this analysis is summarised in table . checks were performed on whether the dataset met the assumptions required by the analysis, screening for collinearity, non-independence of residuals, heteroscedasticity and non-normal distribution of residuals. despite the skewed distribution of the dependent variable, no significant issues were detected. supplementary analyses indicated that likelihood of sharing was also associated with gender, education and level of facebook use, and with likelihood of having seen the stimuli before (r = . , n = , p = . ). self-reported belief that respondents had seen the stories before also correlated significantly with likelihood of sharing (r = . , n = , p < . ), and with a number of other predictor variables. accordingly, a further regression analysis was performed, including these additional predictors (gender, education, level of facebook use, and belief they had seen the stories before). given the inclusion of gender as a predictor variable, the two respondents who did not report their gender as either male or female were excluded from further analysis.
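a minimal sketch of this regression and the accompanying assumption checks is given below, reusing the hypothetical dataframe from the earlier scoring sketch. the analyses were actually run in spss, so this python version is only an illustration, and the predictor column names are assumed.

    import statsmodels.formula.api as smf
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.stats.stattools import durbin_watson
    from statsmodels.stats.diagnostic import het_breuschpagan

    # hypothetical column names: authority/consensus are condition codes,
    # belief_true is the rated veracity of the stories
    model = smf.ols(
        "likelihood_sharing ~ authority + consensus + nmls + conservatism"
        " + belief_true + age + extraversion + conscientiousness"
        " + agreeableness + openness + neuroticism",
        data=df,
    ).fit()
    print(model.summary())  # coefficients, p-values, r-squared

    X = model.model.exog  # design matrix, including the intercept
    vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]  # collinearity
    dw = durbin_watson(model.resid)        # values near 2 suggest independent residuals
    bp = het_breuschpagan(model.resid, X)  # small p-value suggests heteroscedasticity
    print(vifs, dw, bp)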
the analysis, summarised in table , indicated that the model explained % of the variance in self-reported likelihood of sharing the three disinformation items. neither the authoritativeness of the story source nor the consensus information associated with the stories was a significant predictor. consistency of the items with participant attitudes (conservatism) was important, with a positive and statistically significant relationship between conservatism and likelihood of sharing. the only personality variable predicting sharing was agreeableness, with less agreeable people giving higher ratings of likelihood of sharing. in terms of demographic characteristics, gender and education were statistically significant predictors, with men and less-educated people reporting a higher likelihood of sharing. finally, people reported a greater likelihood of sharing the items if they believed they were likely to be true, and if they thought they had seen them before. participants had also been asked about their historical sharing of untrue political stories, both unknowing and deliberate. out of participants ( . %) indicated that they had ever 'shared a political news story online that they later found out was made up', while out of indicated they had shared one that they 'thought at the time was made up' ( . %). predictors of whether or not people had shared untrue material under both sets of circumstances were examined using logistic regressions, with the same sets of participant-level predictors. having unknowingly shared untrue material (table ) was significantly predicted by lower conscientiousness, lower agreeableness, and lower age. having shared material known to be untrue at the time (table ) was significantly predicted by lower agreeableness and lower age. the main analysis in this study (table ) provided limited support for the hypotheses. contrary to hypotheses 1, 2 and 4, neither authoritativeness of source, consensus markers, nor new media literacy were associated with self-rated likelihood of sharing the disinformation stories. however, in line with hypothesis 3, higher levels of conservatism were associated with a higher likelihood of sharing disinformation. this finding supports the proposition that we are more likely to share things that are consistent with our pre-existing beliefs, as all the stimuli were right-wing in orientation. an alternative explanation might be that more conservative people are simply more likely to share disinformation. however, as well as lacking a solid rationale, this explanation is not supported by the fact that conservatism did not seem to be associated with self-reported historical sharing (tables and ). the strongest predictors of likelihood of sharing were belief that the stories were true, and likelihood of having seen them before. belief in the truth of the stories provides further evidence for the role of consistency (hypothesis 3), in that we are more likely to share things we believe are true. the association with likely previous exposure to the materials is consistent with other recent research [ , ] that found that prior exposure to 'fake news' headlines led to higher belief in their accuracy and reduced belief that it would be unethical to share them. of the personality variables, only agreeableness was a significant predictor, with less agreeable people rating themselves as more likely to share the stimuli. this is consistent with previous findings [ ] that less agreeable people reported they were more likely to share a critical political message.
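the logistic regressions for historical sharing described above could be sketched as follows. again, this is a hypothetical python illustration with assumed 0/1 outcome columns, not the authors' spss syntax.

    import statsmodels.formula.api as smf

    predictors = ("extraversion + conscientiousness + agreeableness + openness"
                  " + neuroticism + conservatism + nmls + age")

    # hypothetical binary outcomes: unknowing vs deliberate past sharing
    unknowing = smf.logit(f"shared_unknowing ~ {predictors}", data=df).fit()
    deliberate = smf.logit(f"shared_deliberate ~ {predictors}", data=df).fit()

    print(unknowing.summary())
    print(deliberate.summary())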
lower education levels were associated with a higher self-reported likelihood of sharing. it is possible that less educated people are more susceptible to online influence, given work finding that less educated people were more influenced by micro-targeted political advertising on facebook [ ]. finally, gender was found to be an important variable, with men reporting a higher likelihood of sharing the disinformation messages than women. this was unanticipated: while there are a number of gender-related characteristics (e.g. personality traits) that were thought might be important, there were no a priori grounds to expect that gender itself would be a predictor variable. study 1 also examined predictors of reported historical sharing of false political information. consistent with real-world data [ ] and past representative surveys [e.g. ], a minority of respondents reported such past sharing. unknowingly sharing false political stories was predicted by low conscientiousness, low agreeableness, and lower age, while knowingly sharing false material was predicted only by lower agreeableness and lower age. the effect of agreeableness is consistent with the findings from the main analysis and from [ ]. the finding that conscientiousness influenced accidental, but not deliberate, sharing is consistent with the idea that less conscientious people are less likely to check the details or veracity of a story before sharing it. clearly this tendency would not apply to the deliberate sharing of falsehoods. the age effect is harder to explain, especially given evidence [ ] that older people were more likely to share material from fake news sites. one possible explanation is that younger people are more active on social media, so would be more likely to share any kind of article. another possibility is that they are more likely to engage in sharing humorous political memes, which could often be classed as false political stories. study 2 set out to repeat study 1, but presented the materials as if they had been posted on twitter rather than facebook. the purpose of this was to test whether the observed effects applied across different platforms. research participants have reported using 'likes' on twitter in a more considered manner than on facebook [ ], raising the possibility that heuristics might be less important for this platform. the study was completed online, using paid respondents sourced from the prolific research panel (www.prolific.co). the methodology exactly replicated that of study 1, except for the details noted below. the planned analysis was revised to include the expanded set of predictors eventually used in study 1 (see table ). measures and materials were the same as used in study 1. the key difference from study 1 was in the presentation of the three stimuli, which were portrayed as having been posted to twitter rather than facebook. for the authoritativeness manipulation, the screen names of the sources were accompanied by @usernames, as is conventional on twitter. for the consensus manipulation, 'retweets' were displayed rather than 'shares', and the appropriate icons for twitter were used. participants also indicated their level of twitter, rather than facebook, use. procedure. the procedure replicated study 1, save that in this case the nmls was presented on a single page.
before participants saw each of the three disinformation items, the introductory paragraph stated "a friend of yours recently shared this on twitter, commenting that they thought it was important and asking all their friends to retweet it:", and they were asked to indicate the likelihood of them 'retweeting' rather than 'sharing' the post. data screening and processing. data submissions were initially obtained from participants. a series of checks were performed to ensure data quality, resulting in a number of responses being excluded. one individual declined consent. eleven were judged to have responded inauthentically, giving the same responses to all items in substantive sections of the questionnaire ('straightlining'). twenty were not active twitter users: three individuals visited twitter 'not at all' and seventeen 'less often' than every few weeks. three participants responded unrealistically quickly, with response durations shorter than four minutes (the same value used as a speeding check by qualtrics in study 1). all of these respondents were removed, leaving n = . these checks and exclusions were carried out prior to any data analysis. participants. the target sample size was planned to exceed n = , as in study 1. no attempt was made to recruit a demographically representative sample: instead, sampling quotas were used to ensure the sample was not homogenous with respect to education (pre-degree vs. undergraduate degree or above), age (under vs. over ) and political preference (left, centre or right wing orientation). additionally, participants had to be uk nationals resident in the uk; active twitter users; and not participants in prior studies related to this one. each participant received a reward of £ . . participant demographics are shown in table (column ). for the focal analysis in this study, the sample size conferred . % power to detect r = . in a multiple regression with predictors ( -tailed, alpha = . ). descriptive statistics are summarised in table . all scales had acceptable reliability. the main dependent variable, likelihood of sharing, again had a very skewed distribution with a strong floor effect. to simultaneously test hypotheses 1-4, a multiple regression analysis was carried out using the expanded predictor set from study 1. given the inclusion of gender as a predictor variable, the three respondents who did not report their gender as either male or female were excluded from further analysis. the analysis, summarised in table , indicated that the model explained % of the variance in self-reported likelihood of sharing the three disinformation items. neither the authoritativeness of the story source, nor consensus information associated with the stories, nor new media literacy, was a significant predictor. consistency of the items with participant attitudes (conservatism) was important, with a positive and statistically significant relationship between conservatism and likelihood of sharing. no personality variable predicted ratings of likelihood of sharing. in terms of demographic characteristics, gender and education were statistically significant predictors, with men and less-educated people reporting a higher likelihood of sharing. finally, people reported a greater likelihood of sharing the items if they believed they were likely to be true, and if they thought they had seen them before.
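the power figures quoted in the participants paragraphs can be approximated with a standard fisher z calculation for a correlation. the sketch below is a generic python illustration with a placeholder n, not the authors' own power analysis.

    import numpy as np
    from scipy import stats

    def power_for_r(r, n, alpha=0.05):
        # fisher z transform; the test statistic is ~normal with sd 1/sqrt(n - 3)
        z_r = np.arctanh(r)
        se = 1.0 / np.sqrt(n - 3)
        z_crit = stats.norm.ppf(1 - alpha / 2)
        return (1 - stats.norm.cdf(z_crit - z_r / se)
                + stats.norm.cdf(-z_crit - z_r / se))

    # e.g. detecting r = .10 with a hypothetical n of 1000 gives power of about .89
    print(round(power_for_r(0.10, 1000), 2))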
participants had also been asked about their historical sharing of untrue political stories, both unknowing and deliberate. out of participants ( . %) indicated that they had ever 'shared a political news story online that they later found out was made up', while out of indicated they had shared one that they 'thought at the time was made up' ( . %). predictors of whether or not people had shared untrue material under both sets of circumstances were examined using logistic regressions, with the same sets of participant-level predictors. having unknowingly shared untrue material (table ) was significantly predicted by higher extraversion and higher levels of twitter use. having shared material known to be untrue at the time (table ) was significantly predicted by higher neuroticism and being male. for the main analysis, study 2 replicated a number of key findings from study 1. in particular, hypotheses 1, 2 and 4 were again unsupported by the results: authoritativeness, consensus, and new media literacy were not associated with self-rated likelihood of retweeting the disinformation stories. evidence consistent with hypothesis 3 was again found, with higher levels of conservatism being associated with a higher likelihood of retweeting. again, the strongest predictor of likelihood of sharing was belief that the stories were true, while likelihood of having seen them before was again statistically significant. the only difference was in the role of personality: there was no association between agreeableness (or any other personality variable) and likelihood of retweeting the material. however, for self-reports of historical sharing of false political stories, the pattern of results was different. none of the previous results were replicated, and new predictors were observed for both unknowing and deliberate sharing. for unintentional sharing, the link with higher levels of twitter use makes sense, as higher usage confers more opportunities to accidentally share untruths. higher extraversion has also been found to correlate with higher levels of social media use [ ], so the same logic may apply for that variable. for intentional sharing, the finding that men were more likely to share false political information is similar to the findings from study 1. the link with higher neuroticism is less easy to explain: one possibility is that more neurotic people are more likely to share falsehoods that would reduce the chances of an event they worry about (for example, spreading untruths about a political candidate whose election one fears). given that these questions asked about past behaviour in general, and were not tied to the twitter stimuli used in this study, it is not clear why the pattern of results should have differed from those in study 1. one possibility is that the sample characteristics were different (this sample was younger, better educated, and drawn from a different source). another realistic possibility, especially given the typically low effect sizes and large samples tested, is that these are simply 'crud' correlations [ ] rather than useful findings. going forward, it is likely to be more informative to focus on results that replicate across multiple studies or conceptually similar analyses. study 3 set out to repeat study 1, but presented the materials as if they had been posted on instagram rather than facebook. instagram presents an interesting contrast, as the mechanisms of engagement with material are different (for example, there is no native sharing mechanism). nonetheless, it has been identified as an important theatre for disinformation operations [ ].
study 3 therefore sought to establish whether the same factors affecting sharing on facebook also affect engagement with false material on instagram. the study was completed online, using paid respondents sourced from the prolific research panel. the methodology exactly replicated that of study 1, except for the details noted below. the planned analysis was revised to include the expanded set of predictors eventually used in study 1 (see table ). materials. measures and materials were the same as used in study 1. the only difference from study 1 was in the presentation of the three stimuli, which were portrayed as having been posted to instagram rather than facebook. for the consensus manipulation, 'likes' were used as the sole consensus indicator, and the appropriate icons for instagram were used. procedure. the procedure replicated study 1, save that in this case the nmls was presented on a single page. before participants saw each of the three disinformation items, the introductory paragraph stated "imagine that you saw this post on your instagram feed:", and they were asked to indicate the probability of them 'liking' the post. data screening and processing. data submissions were initially obtained from participants. a series of checks were performed to ensure data quality, resulting in a number of responses being excluded. four individuals declined consent. twenty-one were judged to have responded inauthentically, giving the same scores to substantive sections of the questionnaire ('straightlining'). five did not indicate they were located in the uk. ten were not active instagram users: three individuals visited instagram 'not at all' and seven 'less often' than every few weeks. two participants responded unrealistically quickly, with response durations shorter than four minutes (the same value used as a speeding check by qualtrics in study 1). all of these respondents were removed, leaving n = . these checks and exclusions were carried out prior to any data analysis. participants. the target sample size was planned to exceed n = , as in study 1. no attempt was made to recruit a demographically representative sample: instead, sampling quotas were used to ensure the sample was not homogenous with respect to education (pre-degree vs. undergraduate degree or above) and political preference (left, centre or right-wing orientation). sampling was not stratified by age, given that instagram use is associated with younger ages and the number of older instagram users in the prolific pool was limited at the time the study was carried out. additionally, participants had to be uk nationals resident in the uk; active instagram users; and not participants in prior studies related to this one. each participant received a reward of £ . . participant demographics are shown in table (column ). for the focal analysis in this study, the sample size conferred . % power to detect r = . in a multiple regression with predictors ( -tailed, alpha = . ). descriptive statistics are summarised in table . all scales had acceptable reliability. the main dependent variable, probability of liking, again had a very skewed distribution with a strong floor effect. to simultaneously test hypotheses 1-4, a multiple regression analysis was carried out using the expanded predictor set from study 1. given the inclusion of gender as a predictor variable, the three respondents who did not report their gender as either male or female were excluded from further analysis.
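the skew and floor effect repeatedly noted for the dependent variables could be quantified along the following lines. this is a hypothetical python sketch reusing the column names assumed in the earlier scoring sketch (for this study the analogous dv would be the liking ratings), and it assumes the lowest scale point is coded 1.

    from scipy.stats import skew

    ratings = df["likelihood_sharing"].dropna()
    print("skewness:", skew(ratings))

    # proportion of participants at the floor on all three items
    # (assumes 'very unlikely' / 'not at all likely' is coded 1)
    floor = (df[["share_1", "share_2", "share_3"]] == 1).all(axis=1).mean()
    print("proportion at floor:", floor)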
the analysis, summarised in table , indicated that the model explained % of the variance in self-reported likelihood of sharing the three disinformation items. neither the authoritativeness of the story source, consensus information associated with the stories, nor consistency of the items with participant attitudes (conservatism) was a statistically significant predictor. extraversion positively, and conscientiousness negatively, predicted ratings of likelihood of sharing. in terms of demographic characteristics, men and younger participants reported a higher likelihood of sharing. finally, people reported a greater likelihood of sharing the items if they believed they were likely to be true, and if they thought they had seen them before. participants had also been asked about their historical sharing of untrue political stories, both unknowing and deliberate. eighty-five out of ( . %) participants who answered the question indicated that they had ever 'shared a political news story online that they later found out was made up', while out of indicated they had shared one that they 'thought at the time was made up' ( . %). predictors of whether or not people had shared untrue material under both sets of circumstances were examined using logistic regressions, with the same sets of participant-level predictors. having unknowingly shared untrue material (table ) was significantly predicted by higher extraversion, lower conscientiousness and male gender. having shared material known to be untrue at the time (table ) was significantly predicted by higher new media literacy, higher conservatism, and higher neuroticism. as in studies 1 and 2, results were not consistent with hypotheses 1, 2 and 4: authoritativeness, consensus, and new media literacy were not associated with self-rated probability of liking the disinformation stories. in contrast to studies 1 and 2, however, conservatism did not predict liking the stories. belief that the stories were true was again the strongest predictor, while likelihood of having seen them before was again statistically significant. among the personality variables, lower agreeableness returned as a predictor of likely engagement with the stories, consistent with study 1 but not study 2. lower age predicted likely engagement, a new finding, while being male predicted likely engagement, as found in both study 1 and study 2. unlike studies 1 and 2, education had no effect. with regard to historical accidental sharing, as in study 2, higher extraversion was a predictor, while, as in study 1, so was lower conscientiousness. men were more likely to have shared accidentally. deliberate historical sharing was predicted by higher levels of new media literacy. this is counter-intuitive, and undermines the argument that people share things because they know no better. in fact, in the context of deliberate deception, motivated individuals higher in digital literacy may actually be better equipped to spread untruths. conservatism was also a predictor here. this could again be a reflection of the consistency hypothesis, given the high levels of conservative-oriented disinformation in circulation. finally, as in study 2, higher neuroticism predicted deliberate historical sharing. study 4 set out to repeat study 1, but with a us sample and using us-centric materials. the purpose of this was to test whether the observed effects applied across different countries. the study was completed online, with participants being members of research panels sourced through the research company qualtrics.
the methodology exactly replicated that of study 1, except for the details noted below. the planned analysis was revised to include the expanded set of predictors eventually used in study 1 (see table ). measures and materials were the same as used in study 1. the only difference from study 1 was in the contents of the three disinformation exemplars, which were designed to be relevant to a us rather than a uk audience. two of the stimuli were sourced from the website infowars.com, while a third was a story described as untrue by the fact-checking website politifact.com. in the same way as in study 1, the right-wing focus of the stories was again established in pilot work in which a us sample (n = ) saw seven stories, including these, and rated their political orientation and likelihood of being shared. all were rated above the mid-point of an -point scale asking "to what extent do you think this post was designed to appeal to people with right wing (politically conservative) views?", anchored at "very left wing oriented" and "very right wing oriented". for the least right-wing of the three stories selected, a one-sample t-test comparing the mean rating with the midpoint of the scale showed it was statistically significantly higher (t ( ) = . , p < . , d = . ). one of the stimuli, also used in studies 1-3, was titled "revealed: un plan to flood america with million migrants". one was titled "flashback: obama's attack on internet freedom", subtitled 'globalists, deep state continually targeting america's internet dominance', featuring further anti-obama, anti-china and anti-'big tech' sentiment, and an image of barack obama apparently drinking wine with a person of east asian appearance. the third was text based and featured material titled "surgeon who exposed clinton foundation corruption in haiti found dead in apartment with stab wound to the chest". the materials used to manipulate authoritativeness (facebook usernames shown as sources of the stories) were the same as used in studies 1-3. these were retained because pilot work indicated that the higher and lower sets differed in authoritativeness for us audiences in the same way as for uk audiences. a sample of us participants again each rated a selection of usernames, including these, for the extent to which each was "likely to be an authoritative source, that is, likely to be a credible and reliable source of information". a within-subjects t-test indicated that mean authoritativeness ratings for the 'higher' group were statistically significantly higher than for the 'lower' group (t ( ) = - . , p < . , d z = . ). procedure. the procedure replicated study 1, save that in this case the nmls was presented across two pages. data screening and processing. prior to delivery of the sample, qualtrics performed a series of quality checks and 'data scrubbing' procedures to remove and replace participants with response patterns suggesting inauthentic or inattentive responding. these included speeding checks and examination of response patterns. on delivery of the initial sample (n = ), further screening procedures were performed. nine respondents were identified who had responded with the same scores to substantive sections of the questionnaire ('straightlining'), and one had not completed any of the personality items. twelve respondents were not active facebook users: six reported using facebook 'not at all' and a further six less often than 'every few weeks'. all of these were removed, leaving n = .
these checks and exclusions were carried out prior to any data analysis. participants. the target sample size was planned to exceed n = , as in study 1. qualtrics was contracted to provide a sample of active facebook users that was broadly representative of the us population in terms of gender, education level, and age profile ( +). sampling quotas were used to assemble a sample comprising approximately one third each self-describing as left-wing, centre and right-wing in their political orientation. sampling errors on the part of qualtrics led to over-recruitment of individuals aged years, who make up a disproportionate share of the individuals in the - age group. as a consequence, the - age group is itself over-represented in this sample compared to the broader us population. participant demographics are shown in table , column . for the focal analysis in this study, the sample size conferred . % power to detect r = . in a multiple regression with predictors ( -tailed, alpha = . ). descriptive statistics are summarised in table . all scales had acceptable reliability. the main dependent variable, likelihood of sharing, again had a very skewed distribution with a strong floor effect. to simultaneously test hypotheses 1-4, a multiple regression analysis was carried out using the expanded predictor set from study 1. given the inclusion of gender as a predictor variable, the one respondent who did not report their gender as either male or female was excluded from further analysis. the analysis, summarised in table , indicated that the model explained % of the variance in self-reported likelihood of sharing the three disinformation items. neither the authoritativeness of the story source, consensus information associated with the stories, nor consistency of the items with participant attitudes (conservatism) was a statistically significant predictor. extraversion positively predicted ratings of likelihood of sharing. in terms of demographic characteristics, age was a significant predictor, with younger people reporting a higher likelihood of sharing. finally, people reported a greater likelihood of sharing the items if they believed they were likely to be true, and if they thought they had seen them before. participants had also been asked about their historical sharing of untrue political stories, both unknowing and deliberate. of the participants, ( . %) indicated that they had ever 'shared a political news story online that they later found out was made up', while out of indicated they had shared one that they 'thought at the time was made up' ( . %). predictors of whether or not people had shared untrue material under both sets of circumstances were examined using logistic regressions, with the same sets of participant-level predictors. having unknowingly shared untrue material (table ) was significantly predicted by higher new media literacy, lower conscientiousness, higher education, and higher levels of facebook use. having shared material known to be untrue at the time (table ) was significantly predicted by higher extraversion, lower agreeableness, younger age, and higher levels of facebook use. again, the pattern of results emerging from study 4 had some similarities with, but also some differences from, studies 1-3. once again, hypotheses 1, 2 and 4 were unsupported by the results. similarly to study 3, but unlike studies 1 and 2, conservatism (the proxy for consistency) did not predict sharing the stories. belief that the stories were true, and likelihood of having seen them before, were the strongest predictors.
higher levels of extraversion (a new finding) and lower age (as in study 3) were associated with a higher reported likelihood of sharing the stimuli. for historical sharing, for the first time, and counterintuitively, new media literacy was associated with a higher likelihood of having shared false material unknowingly. as in studies 1 and 3, lower conscientiousness was also important. counterintuitively, higher education levels were associated with higher unintentional sharing, as were higher levels of facebook use. for intentional sharing, higher extraversion was a predictor, as were lower agreeableness, younger age and higher levels of facebook use. when interpreting the overall pattern of results from studies 1-4, given the weakness of most of the associations, it is likely to be most useful to focus on relationships that are replicated across studies and to disregard 'one off' findings. tables - provide a summary of the statistically significant predictors in each of the studies. it is clear that two variables consistently predicted self-rated likelihood of sharing disinformation exemplars: belief that the stories were likely to be true, and likely prior familiarity with the stories. it is also clear that three key variables did not: markers of authority, markers of consensus, and digital literacy. hypothesis 1 predicted that stories portrayed as coming from more authoritative sources would be more likely to be shared. however, this was not observed in any of the four studies. one interpretation of this is that the manipulation failed. however, pilot work (see studies 1 and 4) with comparable samples indicated that people did see the sources as differing in authoritativeness. the failure to find the predicted effect could also be due to the use of simulated scenarios (though care was taken to ensure they resembled reality) or to weaknesses in the methodology, such as the distributional properties of the dependent variables. however, consistent relationships between other predictors and the dependent variable were observed. thus, the current studies provide no evidence that the authoritativeness of a source influences sharing behaviour. hypothesis 2 predicted that stories portrayed as having a higher degree of consensus in audience reactions (i.e. that high numbers of people had previously shared them) would be more likely to be shared. in fact, consensus markers had no effect on self-reported probability of sharing or liking the stories. therefore, the current studies provide no evidence that indicators of 'social proof' influence participant reactions to the stimuli. hypothesis 3 was that people would be more likely to share materials consistent with their pre-existing beliefs. this was operationalised by measuring participants' political orientation (overall level of conservatism) and using stimuli that were right-wing in their orientation. in studies 1 and 2, more conservative people were more likely to share the materials. further evidence for hypothesis 3 comes from the finding, across all studies, that level of belief that the stories were "accurate and truthful" was the strongest predictor of likelihood of sharing. this is again in line with the consistency hypothesis: people are behaving in ways consistent with their beliefs. the finding from study 3 that more conservative people were more likely to have historically shared material they knew to be untrue could also be in line with this hypothesis, given that a great many of the untrue political stories circulated online are conservative-oriented.
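the 'focus on what replicates' reasoning above can be made concrete with a small tabulation. the sketch below is a hypothetical python illustration; the predictor sets shown are placeholders, not the actual contents of the summary tables.

    import pandas as pd

    # illustrative sets of significant predictors of likelihood of sharing
    significant = {
        "study1": {"belief_true", "seen_before", "conservatism", "agreeableness"},
        "study2": {"belief_true", "seen_before", "conservatism"},
        "study3": {"belief_true", "seen_before", "age"},
        "study4": {"belief_true", "seen_before", "extraversion"},
    }

    counts = pd.Series(
        [p for preds in significant.values() for p in preds]
    ).value_counts()

    # predictors that replicate across all four studies
    print(counts[counts == len(significant)])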
hypothesis 4, that people lower in digital literacy would be more likely to engage with disinformation, was again not supported. as noted earlier, the measurement of digital literacy is problematic. however, pilot work showed that the new media literacy scale did differentiate between people with higher and lower levels of social media use in the expected manner, so it is likely to have a degree of validity. in study 4, higher nmls scores were associated with having unwittingly shared false material in the past, which is counterintuitive. however, this may be because more digitally literate people are better able to discover, after the fact, that something they shared was false. higher nmls scores were also associated with deliberately sharing falsehoods in study 3. this could be attributable to the greater ease with which digitally literate individuals can do such things, if motivated to do so. a number of other variables were included on an exploratory basis, or for the purpose of controlling for possible confounds. of these, the most important was participants' rating of the likelihood that they had seen the stimuli before. this variable was originally included in the design so that any familiarity effects could be controlled for when evaluating the effect of other variables. in fact, rated likelihood of having seen the materials before was the second strongest predictor of likelihood of sharing. it was a predictor in all four studies, and for the facebook studies (1 and 4) it was the second most important variable. this is consistent with work on prior exposure to false material online, where prior exposure to fake news headlines increased participants' ratings of their accuracy [ ]. furthermore, it has been found that prior exposure to fake-news headlines reduced participants' ratings of how unethical it was to share or publish the material, even when it was clearly marked as false [ ]. thus, repeated exposure to false material may increase our likelihood of sharing it. it is known that repeated exposure to statements increases people's subjective ratings of their truth [ ]. however, there must be more going on here, because the regression analyses indicated that the familiarity effect was independent of the level of belief that the material was true. when considering work that found that amplification of content by bot networks led to greater levels of human sharing [ ], the implication is that repeated actual exposure to the materials is what prompts people to share them, not metrics of consensus such as the number of likes or shares displayed beside an article. of the five dimensions of personality measured, four (agreeableness, extraversion, neuroticism and conscientiousness) were predictors of either current or historical sharing in one or more studies. consistent with findings from [ ], studies 1 and 3 found that lower agreeableness was associated with a greater probability of sharing or liking the stories. it was also associated with accidental historical sharing in study 1, and with deliberate historical sharing in studies 1 and 4. in contrast to this, past research on personality and social media behaviour indicates that more agreeable people are more likely to share information on social media: [ ] reported that its role in this was mediated by trust, while [ ] found that higher agreeableness was associated with higher levels of social media use in general. given those findings, it is likely that the current results are specific to disinformation stimuli rather than to social sharing in general.
agreeableness could potentially interact with the source of the information: more agreeable people might conceivably be more eager to please those close to them. however, while it is possible that agreeableness interacted in some way with the framing of the material as having been shared by 'a friend' in studies 1, 2 and 4, study 3 had no such framing. more broadly, the nature of the stories may be important: disinformation items are normally critical or hostile in nature. this may mean they are more likely to be shared by disagreeable people, who may themselves be critical in their outlook and unconcerned about offending others. furthermore, agreeableness is associated with general trusting behaviour. it may be that disagreeable people are therefore more likely to endorse conspiracist material, or other items consistent with a lack of trust in politicians or other public figures. lower conscientiousness was associated with accidental historical sharing of false political stories in studies 1, 3 and 4. this is unsurprising, as less conscientious people would be less likely to check the veracity of a story before sharing it. the lack of an association with deliberate historical sharing reinforces this view. higher extraversion was associated with probability of sharing in study 4, with accidental historical sharing in studies 2 and 3, and with deliberate historical sharing in study 4. higher neuroticism was associated with deliberate historical sharing in studies 2 and 3. all these relationships may simply reflect a higher tendency on the part of extraverted and neurotic individuals to use social media more [ ]. there are clearly some links between personality and the sharing of disinformation. however, the relationships are weak and inconsistent across studies. it is possible that different traits affect different behaviours: for example, low conscientiousness is associated with accidental but not deliberate sharing, while high neuroticism is associated with deliberate but not accidental sharing. thus, links between some personality traits and the spread of disinformation may be context- and motivation-specific, rather than reflecting blanket associations. however, lower agreeableness, and to a lesser extent higher extraversion, may predict an overall tendency to spread this kind of material. demographic variables were also measured and included in the analyses. younger individuals rated themselves as more likely to engage with the disinformation stimuli in studies 3 and 4, and were more likely to have shared untrue political stories in the past either accidentally (study 1) or deliberately (studies 1 and 4). this runs counter to findings that older adults were much more likely to have spread material from 'fake news' domains [ ]. it is possible that the current findings simply reflect a tendency of younger people to be more active on social media. people with lower levels of education reported a greater likelihood of sharing the disinformation stories in studies 1 and 2. counterintuitively, more educated people were more likely to have accidentally shared false material in the past (study 4). one possible explanation is that more educated people are more likely to have realised that they had done this, so the effect in study 4 reflects an influence on reporting of the behaviour rather than on the behaviour itself. in each of studies 1, 2 and 3, men reported a greater likelihood of sharing or liking the stimuli. men were also more likely to have shared false material in the past unintentionally (study 3) or deliberately (study 2).
given its replicability, this would seem to be a genuine relationship, but one which is not easy to explain. finally, the level of use of particular platforms (facebook, twitter or instagram) did not predict likelihood of sharing the stimuli in any study. level of twitter use (study 2) predicted accidental sharing of falsehoods, while facebook use predicted both accidental and deliberate sharing (study 4). for historical sharing, this may be attributable to a volume effect: the more you use the platforms, the more likely you are to do these things. it should be noted that the level-of-use metric lacked granularity and had a strong ceiling effect, with most people reporting the highest use level in each case. in all four studies, a minority of respondents indicated that they had previously shared political disinformation they had encountered online, either by mistake or deliberately. the proportion who had done each varied across the four studies, likely as a function of the population sampled ( . %- . % accidentally; . %- . % deliberately), but the figures are of a similar magnitude to those reported elsewhere [ , ]. even if the proportion of social media users who deliberately share false information is just . %, the lowest figure found here, that is still a very large number of people who are actively and knowingly spreading untruths. the current results indicate that a number of variables predict the onward sharing of disinformation. however, most of these relationships are very small. it has been argued that the minimum effect size for a predictor that would have real-world importance in social science data is β = . [ ]. considering the effect sizes for the predictors in tables , , and , only belief that the stories are true exceeds this benchmark in every study, while probability of having seen the stories before exceeded it in studies 1 and 4. none of the other relationships reported exceeded the threshold. this has implications for the practical importance of these findings, in terms of informing interventions to counteract disinformation. some of the key conclusions in this set of studies arise from the failure to find evidence supporting an effect. proceeding from such findings to a firm conclusion is a logically dangerous endeavour: absence of evidence is not, of course, evidence of absence. however, given the evidence from pilot studies that the manipulations were appropriate, the associations of the dependent measures with other variables, and the high levels of power to detect the specified effects, it is possible to say with some confidence that hypotheses 1, 2 and 4 are not supported by the current data. this means that the current project does not provide any evidence that interventions based on these factors would be of value. this is particularly important for the findings around digital literacy. raising digital media literacy is a common and appealing policy position for bodies concerned with disinformation (e.g. [ ]). there is evidence from a number of trials that it can be effective in the populations studied. however, no support was found here for the idea that digital literacy has a role to play in the spread of disinformation. this could potentially be attributed to the methodology of this study. however, some participants, across all four studies, reported sharing false political stories that they knew at the time were made up. it is hard to see how raising digital literacy would reduce such deliberate deception.
trying to raise digital literacy across the population is therefore unlikely ever to be a complete solution. there is evidence that consistency with pre-existing beliefs can be an important factor, especially in relation to beliefs that disinformation stories are accurate and truthful. this implies that interventions are likely to be most effective when targeted at individuals who already hold an opinion or belief, rather than trying to change people's minds. while this would be more useful to those seeking to spread disinformation, it could also give insights into populations worth targeting with counter-messages. targeting on other variables, personality or demographic, is unlikely to be of value given the low effect sizes. while these variables (perhaps gender and agreeableness in particular) most likely do play a role, their relative importance seems so low that the information is unlikely to be useful in practice. alongside other recent work [ , ], the current findings suggest that repeated exposure to disinformation materials may increase our likelihood of sharing them, even if we do not believe them. the practical implication is that to get a message repeated online, one should repeat it many times (there is a clear parallel with the 'repeat the lie often enough' maxim regarding propaganda). social proof (markers of consensus) seems unimportant based on the current findings, so there is little point in trying to manipulate the numbers next to a post, as is sometimes done in online marketing. what might be more effective is to have the message posted many times (e.g. by bots) so that people have a greater chance of coming across it repeatedly. this would be true both for disinformation and for counter-messages. as a scenario-based study, the current work has a number of limitations. while it is ethically preferable to field experiments, it suffers from reduced ecological validity and reliance on self-reports rather than genuine behaviour. questions could be asked, for example, about whether the authoritativeness and consensus manipulations were sufficiently salient to participants (even though they closely mirrored the presentation of this information in real-life settings). beyond this, questions might be raised about the use of self-reported likelihood of sharing: does sharing intention reflect real sharing behaviour? in fact, there is evidence to suggest that it does, with recent work finding that self-reported willingness to share news headlines on social media paralleled the actual level of sharing of those materials on twitter [ ]. the scenarios presented were all selected to be right-wing in their orientation, whereas participants spanned the full range from left to right in their political attitudes. this means that consistency was only evaluated with respect to one pole of the left-right dimension. there are a number of other dimensions that have been used as wedge issues in real-world information operations: for example, support for the black lives matter movement; climate change; or positions for or against britain leaving the european union. the current research only evaluated consistency between attitudes and a single issue. a better test of the consistency hypothesis would be to extend the evaluation to consistency between attitudes and some of those other issues. a key issue is the distributions of the main outcome variables, which were heavily skewed with strong floor effects.
while the outcome measures still had sufficient sensitivity to make the regression analyses meaningful, the skew also meant that any effects found were likely to be attenuated. it may thus be that the current findings underestimate the strength of some of the associations reported. another measurement issue concerns the index of social media use (facebook, twitter, instagram). as table shows, in three of the studies over % of respondents fell into the highest use category. again, this weakens the sensitivity of evaluations of these variables as predictors of sharing disinformation. in order to identify variables associated with sharing disinformation, this research programme took the approach of presenting individuals with examples of disinformation, then testing which of the measured variables were associated with self-reported likelihood of sharing. a shortcoming of this approach is that it does not permit us to evaluate whether the same variables are associated with sharing true information. an alternative design would be to show participants either true or false information, and examine whether the same constructs predict sharing of both. this would enable identification of variables differentially impacting the sharing of disinformation but not true information. complexity arises, however, from the fact that whether a story can be considered disinformation, misinformation, or true information depends on the observer's perspective. false material deliberately placed online would be categorized as disinformation. a social media user sharing it in full knowledge that it was untrue would be sharing disinformation. however, if they shared it believing it was actually true, then from an observer's perspective this would technically be categorised as misinformation (defined as "the inadvertent sharing of false information" [ , p. ]). in fact, from the user's perspective, it would be true information (because they believe it), even though an omniscient observer would know it was actually false. this points to the importance of further research into user motivations for sharing, which are likely to differ depending on whether or not they believe the material is true. in three of the four studies (studies 1, 2 and 4), the stimulus material was introduced as having been posted by a friend who wanted them to share it. this is likely to have boosted the rates of self-reported likelihood of sharing in those studies. previous work has shown that people rate themselves as more likely to engage with potential disinformation stories posted by a friend, as opposed to a more distant acquaintance [ ]. to be clear, this does not compromise the testing of hypotheses in those studies (given that the framing was the same for all participants, in all conditions). it is also a realistic representation of how we may encounter material like this in our social media feeds. however, it does introduce an additional difference between studies 1, 2 and 4 when compared with study 3. it would be desirable for further work to check whether the same effects were found when messages were framed as having been posted by people other than friends. finally, the time spent reading and reacting to the disinformation stimuli was not measured. it is possible that faster response times would be indicative of greater use of heuristics rather than considered thought about the issues. this could profitably be examined, potentially in observational or simulation studies rather than using self-report methodology.
a number of priorities for future research arise from the current work. first, it is desirable to confirm these findings using real-world behavioural measures rather than simulations. while it is not ethically acceptable to run experimental studies posting false information on social media, it would be possible to do real-world observational work. for example, one could measure digital literacy in a sample of respondents, then analyse their past social media sharing behaviour. another priority revolves around those individuals who knowingly share false information. why do they do this? without understanding the motivations of this group, any interventions aimed at reducing the behaviour are unlikely to be successful. as well as being of academic interest, motivation for sharing false material has been flagged as a gap in our knowledge by key stakeholders [ ] . the current work found that men were more likely to spread disinformation than women. at present, it is not clear why this was the case. are there gender-linked individual differences that influence the behaviour? could it be that the subject matter of disinformation stories is stereotypically more interesting to men, or that men think their social networks are more likely to be interested in or sympathetic to them?

while the focus in this paper has been on factors influencing the spread of untruths, it should be remembered that 'fake news' is only one element in online information operations. other tactics and phenomena, such as selective or out-of-context presentation of true information, political memes, and deliberately polarising hyperpartisan communication, are also prevalent. work is required to establish whether the findings of this project related to disinformation also apply to those other forms of computational propaganda. related to this, it would be of value to establish whether the factors found here to influence sharing of untrue information also influence the sharing of true information. this would indicate whether there is anything different about disinformation, and also point to factors that might influence sharing of true information that is selectively presented in information operations.

the current work allows some conclusions to be drawn about the kind of people who are likely to further spread disinformation material they encounter on social media. typically, these will be people who think the material is likely to be true, or who have beliefs consistent with it. they are likely to have previous familiarity with the materials. they are likely to be younger, male, and less educated. with respect to personality, it is possible that they will tend to be lower in agreeableness and conscientiousness, and higher in extraversion and neuroticism. with the exception of consistency and prior exposure, all of these effects are weak and may be inconsistent across different populations, platforms, and behaviours (deliberate vs. innocuous sharing). the current findings do not suggest they are likely to be influenced by the source of the material they encounter, or by indicators of how many other people have previously engaged with it. no evidence was found that level of literacy regarding new digital media makes much difference to their behaviour. these findings have implications for how governments and other bodies should go about tackling the problem of disinformation in social media.

conceptualization: tom buchanan. formal analysis: tom buchanan. investigation: tom buchanan. methodology: tom buchanan. writing - review & editing: tom buchanan.
disinformation and 'fake news': final report
the global disinformation order: global inventory of organised social media manipulation
warring songs: information operations in the digital age
the ira, social media and political polarization in the united states
the disconcerting potential of online disinformation: persuasive effects of astroturfing comments and three strategies for inoculation against them
uk phone masts attacked amid g-coronavirus conspiracy theory. the guardian
misinformation in the covid- infodemic
government responses to malicious use of social media
what's the difference between organic, paid and post reach?
the spread of true and false news online
the elaboration likelihood model of persuasion
one click, many meanings: interpreting paralinguistic digital affordances in social media
individual differences in susceptibility to online influence: a theoretical review
neutralizing misinformation through inoculation: exposing misleading argumentation techniques reduces their influence
influence: the psychology of persuasion
less than you think: prevalence and predictors of fake news dissemination on facebook
a theory of cognitive dissonance
fake news on social media: people believe what they want to believe when it makes no sense at all
social influence tactics in e-commerce onboarding: the role of social proof and reciprocity in affecting user registrations. decision support systems
disinformation and digital influencing after terrorism: spoofing, truthing and social proofing
the spread of low-credibility content by social bots
how social media companies are failing to combat inauthentic behaviour online
social media and credibility indicators: the effect of influence cues
spreading disinformation on facebook: do trust in message source, risk propensity, or personality affect the organic reach
russian influence campaign sought to exploit americans' trust in local news
the modern news consumer: news attitudes and practices in the digital era
online political microtargeting: promises and threats for democracy
bridging the digital divide: measuring digital literacy
development and validation of new media literacy scale (nmls) for university students
why do people get phished? testing individual differences in phishing vulnerability within an integrated, information processing model
many americans believe fake news is sowing confusion
personality traits and social media use in countries: how personality relates to frequency of social media use, social media news use, and social media use for social interaction
predicting the big personality traits from digital footprints on social media: a meta-analysis
human and computer personality prediction from digital footprints. current directions in psychological science
psychological targeting as an effective approach to digital mass persuasion
impression management and formation on facebook: a lens model approach
a broad-bandwidth, public domain, personality inventory measuring the lower-level facets of several five-factor models
implementing a five-factor personality inventory for use on the internet
revised neo personality inventory (neo pi-r) and neo five-factor inventory (neo ffi): professional manual
the item social and economic conservatism scale (secs)
fake news on twitter during the u.s. presidential election
an effect size primer: a guide for clinicians and researchers
misinformation and morality: encountering fake-news headlines makes them seem less unethical to publish and share
prior exposure increases perceived accuracy of fake news
politics in the facebook era. evidence from the us presidential elections. cage working paper series ( )
why summaries of research on psychological theories are often uninterpretable
the tactics & tropes of the internet research agency
wänke m. the truth about the truth: a meta-analytic review of the truth effect
how do personality traits shape information-sharing behaviour in social media? exploring the mediating effect of generalized trust. information research: an international electronic journal
self-reported willingness to share political news articles in online surveys correlates with actual sharing on twitter

key: cord- - iqerh authors: gorrell, genevieve; farrell, tracie; bontcheva, kalina title: mp twitter abuse in the age of covid- : white paper date: - - journal: nan doi: nan sha: doc_id: cord_uid: iqerh

abstract: as covid- sweeps the globe, outcomes depend on effective relationships between the public and decision-makers. in the uk there were uncivil tweets to mps about perceived uk tardiness to go into lockdown. the pandemic has led to increased attention on ministers with a role in the crisis. however, generally this surge has been civil. prime minister boris johnson's severe illness with covid- resulted in an unusual peak of supportive responses on twitter. those who receive more covid- mentions in their replies tend to receive less abuse (significant negative correlation). following mr johnson's recovery, with rising economic concerns and anger about lockdown violations by influential figures, abuse levels began to rise in may. , replies to mps within the study period were found containing hashtags or terms that refute the existence of the virus (e.g. #coronahoax, #coronabollocks; . % of a total . million replies, or % of the number of mentions of "stay home save lives" and variants). these have tended to be more abusive. evidence of some members of the public believing in covid- conspiracy theories was also found. higher abuse levels were associated with hashtags blaming china for the pandemic.

a successful response to the covid- pandemic depends on effective relationships between the public and decision-makers. yet the pandemic arises in the midst of the age of misinformation [ , ] [ ] , creating a perfect storm. polarisation and echo chambers make it harder for the right information to reach people, and for them to trust it when it does [ ] , and the damage can be counted in lives. [ ] online verbal abuse is an intrinsic aspect of the misinformation picture, being both cause and consequence: the quality of information and debate is damaged as certain voices are silenced or driven out of the space [ ] , and the escalation of divisive "outrage" culture leads to angry and aggressive expressions [ ] . this white paper charts twitter abuse in replies to uk mps, and a number of other prominent/relevant accounts, from before the start of the pandemic in the uk until late may, in order to plot the health of relationships of uk citizens with their elected representatives through the unprecedented challenges of the covid- epidemic.

[ ] page "infodemic": https://www.who.int/docs/default-source/coronaviruse/situation-reports/ -sitrep- -ncov-v .pdf, https://www.bbc.co.uk/news/technology-
[ ] https://www.bbc.co.uk/news/stories- , https://www.independent.co.uk/news/world/middle-east/iran-coronavirus-methanol-drink-cure-deaths-fake-a .html, https://thehill.com/changing-america/well-being/prevention-cures/ -maryland-emergency-hotline-receives-more-than, https://www.mirror.co.uk/news/us-news/coronavirus-man-dies-after-drinking-
[ ] https://www.bbc.co.uk/news/election- -

we consider reactions to different individuals and members of different political parties, and how they interact with events relating to the virus. we review the dominant hashtags on twitter as the country moves through different phases. the six periods considered are as follows:

• february th to th inclusive: control; little attention to covid-
• march st to nd inclusive: growing awareness of covid- , culminating in the week in which we were advised but not yet obliged to begin social distancing
• march rd to march st: beginning of lockdown
• april st to april th: middle-lockdown
• april th to may th: emergence of global backlash against lockdown
• may th to may th: easing of lockdown

we show trends in abuse levels, for mps overall as well as for particular individuals and for parties. we conclude with a section that compares prevalence of conspiracy theories, and contextualises them against other popular topics/concerns on twitter. we begin with a brief summary of related work, before outlining the methodology used and then progressing to findings.

a raft of work has rallied to focus attention on covid- , as the scientific community recognises the extent to which outcomes depend on information and strategy. research has begun to address the role of the internet and social media in the development of attitudes, compliance, and adoption of effective responses, and the way misinformation can derail these [ , , ] . furthermore, as the pandemic increasingly puts pressure on the divisions in society, with mortality risk much greater for some communities than others [ ] , we are forced to recognise the pandemic as a highly political issue (e.g. motta et al [ ] ). pew [ ] repeatedly find attitudes toward the disease split along partisan lines. in the age of covid- , polarisation can be deadly.

twitter has become a favoured platform for politicians across the globe, providing a means by which the public can communicate directly with them. previous work [ , , , ] has shown rising levels of hostility towards uk politicians on twitter in the context of divisive issues, and we also see partisan operators seeking to gain influence in the space [ ] . women and minorities have been shown to have different and potentially more threatening online experiences, raising concerns about representation [ , , ] . moving into the covid- era, key questions raised are therefore about the impact of the pandemic on hostility levels towards decision-makers, the opportunities it presents for partisan operators to further damage social cohesion, and the impact on the effectiveness and experience of women and minority politicians. this work focuses on the first two of these subjects.

[ ] https://www.theguardian.com/world/ /may/ /black-people-four-times-more-likely-to-die-from-covid- -ons-finds
[ ] https://www.pewresearch.org/pathways- /covidcreate/main_source_of_election_news/us_adults

methodology
in this work we utilize a large tweet collection on which natural language processing has been performed in order to identify abusive language.
this methodology is presented in detail by gorrell et al [ ] and summarised here. a rule-based approach was used to detect abusive language. an extensive vocabulary list of slurs (e.g. "idiot"), offensive words such as the "f" word, and potentially sensitive identity markers, such as "lesbian" or "muslim", forms the basis of the approach. the slur list contained abusive terms or short phrases in british and american english, comprising mostly an extensive collection of insults, racist and homophobic slurs, as well as terms that denigrate a person's appearance or intelligence, gathered from sources that include http://hatebase.org and farrell et al [ ] . offensive words were used, along with sensitive words. "bleeped" versions such as "f**k" are also included. on top of these word lists, rules are layered, specifying how they may be combined to form an abusive utterance, and including further specifications such as how to mark quoted abuse, how to type abuse as sexist or racist (including more complex cases such as "stupid jew hater"), and what phrases to veto, for example "polish a turd" and "witch hunt". making the approach more precise as to target (whether the abuse is aimed at the politician being replied to or some third party) was achieved by rules based on pronoun co-occurrence. the approach is generally successful, but where people make a lot of derogatory comments about a third party in their replies to a politician, for example racist remarks about others, there may be a substantial number of false positives. the abuse detection method underestimates, possibly by as much as a factor of two, finding more obvious verbal abuse but missing linguistically subtler examples. it is useful for comparative findings, for tracking abuse trends, and for approximation of actual abuse levels.

the method for detecting covid- -related tweets is based on a list of related terms. this means that tweets that are implicitly about the epidemic but use no explicit covid terms, for example "@borisjohnson you need to act now", are not flagged. the methodology is useful for comparative findings such as who is receiving the most covid- -related tweets, but not for drawing conclusions about absolute quantities of tweets on that subject.
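the gazetteer-plus-rules design described above can be approximated in a few lines. the following is a minimal illustrative sketch only: the word lists and veto phrases are tiny stand-ins for the real lexicons, and the real system's rule layer is far richer.

```python
import re

# tiny stand-in lexicons; the real lists contain hundreds of entries
SLURS = {"idiot", "moron"}
VETO_PHRASES = {"polish a turd", "witch hunt"}
SECOND_PERSON = {"you", "your", "u"}  # crude proxy for pronoun co-occurrence

def is_abusive_reply(text: str) -> bool:
    lowered = text.lower()
    # veto phrases are idioms that contain offensive words but are not abuse
    if any(phrase in lowered for phrase in VETO_PHRASES):
        return False
    tokens = re.findall(r"[a-z*']+", lowered)
    has_slur = any(tok in SLURS for tok in tokens)
    # second-person pronouns as a rough check that the politician replied to
    # is the target, rather than some third party
    targets_addressee = any(tok in SECOND_PERSON for tok in tokens)
    return has_slur and targets_addressee

print(is_abusive_reply("you absolute idiot"))  # True
print(is_abusive_reply("he is an idiot"))      # False (third-party target)
```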
the corpus was created by collecting tweets in real time using twitter's streaming api. we used the api to follow the accounts of mps; this means we collected all the tweets sent by each candidate, any replies to those tweets, and any retweets either made by the candidate or of the candidate's own tweets. note that this approach does not collect all tweets which an individual would see in their timeline, as it does not include those in which they are just mentioned. however, "direct replies" are included. we took this approach as the analysis results are more reliable, due to the fact that replies are directed at the politician who authored the tweet, and thus any abusive language is more likely to be directed at them. data were of a low enough volume not to be constrained by twitter rate limits. the study spans february th until may th inclusive, and discusses twitter replies to currently serving mps that have active twitter accounts ( mps in total). table gives the overall statistics for the corpus, broken down by time period: original tweets, retweets, replies made, replies received, abusive replies, and % abuse. dates in the table indicate a period from midnight to midnight at the start of the given date. tweets from earlier in the study have had more time to gather replies. most replies occur in the day or two following the tweet being made, but some tweets continue to receive attention over time, and events may lead to a resurgence of focus on an earlier tweet. reply numbers are a snapshot at the time of the study.

we begin with a review of the time period studied, namely february th until may th inclusive. after that, findings are organised into time periods with distinct characteristics with regards to the course of the pandemic in the uk, as listed in the introduction. we then include a section on conspiracy theories.

fig shows the number of replies to prominent politicians since early february, and shows that for the most part, attention has focused on boris johnson. he received a large peak in twitter attention on march th: , replies were received in response to his tweet announcing that he had covid- . abuse was found in . % of these replies, which is low for a prominent minister. further spikes in attention arose as he was admitted to intensive care (april th), left hospital to recuperate at chequers (april th) and, more recently, began to ease the lockdown (may th). the late burst of attention on other politicians arises from several tweets by ministers in support of dominic cummings, the senior government advisor who chose to travel north to his parents' home in the early stages of his illness with covid- . the timeline in fig makes it easier to see abuse levels overall, toward all mps. it is on a per-week basis since mid-february, and shows a rise in abuse, back up to over % around the time of the introduction of social distancing, before dipping, and then gradually beginning to rise again in recent times. the dip may be explained, in part at least, by an unusual degree of positive attention focused on the prime minister as he faced personal adversity, depressing the abuse level as a percentage of all replies. we see that the macro-averaged abuse level (red line) remains relatively steady.

difference in responses to different parties
fig shows abuse received as a percentage of all replies received by mps, for each of the time periods studied in more detail below. we see that on the whole, response to the conservative party has been favourable, as indicated above. the exception is after may th, when the negative response to dominic cummings' decision to travel north with covid- symptoms came to the fore. responses to liberal democrat mps are more erratic due to their lower number. [ ] in previous studies, we have found conservatives receiving higher abuse levels, yet here we see labour politicians receiving more abuse in most periods. this was in evidence even in february, so precedes the pandemic. twitter has tended to be left-leaning in the uk [ ] , so it remains to be seen if this is the beginning of a swing to the right or if it is specific to the times, e.g. arising from a desire to trust authority during times of crisis [ ] .

there is a significant negative correlation between receiving a high level of covid-related attention and receiving abuse (- . , p < . , feb th to may th, spearman's rank correlation). we see this clearly in prominent government figures below, who are receiving the lion's share of the covid- attention and lower levels of abuse than we usually see for them. however, the correlation is significant across the sample of all mps, suggesting perhaps that an association with the crisis is generally good for the image.
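a minimal sketch of this correlation check, assuming a per-mp table of covid-related replies and abusive replies over the study window (the column names and values here are hypothetical):

```python
import pandas as pd
from scipy.stats import spearmanr

# hypothetical per-MP aggregates over the study window
mps = pd.DataFrame({
    "covid_replies": [1200, 300, 4500, 80, 950],
    "replies":       [9000, 2500, 30000, 1000, 7000],
    "abusive":       [250, 120, 600, 55, 200],
})

covid_share = mps["covid_replies"] / mps["replies"]
abuse_share = mps["abusive"] / mps["replies"]

# rank correlation between covid-related attention and abuse received
rho, p = spearmanr(covid_share, abuse_share)
print(f"spearman rho = {rho:.3f}, p = {p:.3f}")
```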
from the timeline in fig , we see that aside from a blip around the general election, abuse toward mps on twitter has been tending to rise from a minimum of % of replies in , peaking mid- at over %, with a smaller peak of around . % around the general election. after the election, however, abuse toward mps fell to around . %. [ ] a spike followed before the beginning of lockdown, before a low, and more recently a rise, as discussed above. however, the low was not as low as in , and the high is not as high as the brexit acrimony of .

[ ] the peak in early march arises with a now-deleted tweet from layla moran. the peak in late april/early may arises with the start of ramadan and a supportive tweet from ed davey.
[ ] due to the variable quality of the historical data, the time periods across which these data points are averaged vary, which may lead to shorter-term peaks being lost. we hope to improve on this in future work. where more datapoints are available, the graph appears more spiky, as we see with the richer recent data.

table : mps with greatest number of replies from february - inclusive; cell colours indicate party membership, blue for conservative and red for labour. in the abuse timeline figure, the blue line shows micro-averaged abuse as a percentage of replies, and the red line shows the macro-averaged abuse percentage (the percentage is calculated per person, then averaged). where the two lines differ, we can infer an unusual response at that time, particularly to high-profile politicians.

table provides some quick reference information for the top mps receiving the largest number of replies to their twitter account during the february period. the column "authored" refers to the number of tweets originally posted from that account that were not retweets or replies. "replyto" refers to all of the replies received by the individual's twitter account in that period. the next column, "covid", is the number of replies received by that account containing an explicit mention of covid- , and the following column gives the number of replies in which verbal abuse was found ("abusive"). the last three columns present the data in a comparative fashion: firstly, the percentage of replies that the individual received that were abusive; next, the percentage of replies that were covid-related; and lastly, the percentage of covid-related replies to that individual in comparison with all covid-related replies received by all mps. we have created a table and histogram for each period.

we see from table that boris johnson receives the most tweets, with controversial labour politicians not far behind. whilst he also receives the most abuse by volume, as a percentage of replies received, the . % shown here is unusually low for mr johnson compared with our findings in earlier work (e.g. . % [ ] in the first half of ). the histogram in fig shows the number of replies received related to covid- , in comparison with the number of replies received in general for that period. once again, this chart indicates that attention to covid- was limited at the start of the pandemic, but what attention there was to the subject was largely aimed at mr johnson. the hashtag cloud in fig above shows that in february, little attention was focused on covid- , and brexit remained the dominant topic in twitter political discourse. table gives the top ten hashtags in numeric terms. in february, covid- was not a major focus.
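as an aside on the micro- vs macro-averaged abuse lines described above, a small illustrative computation (hypothetical data) shows why the two can diverge when a high-profile account dominates the reply volume:

```python
import pandas as pd

# hypothetical replies: which MP was addressed, week of reply, abuse flag
replies = pd.DataFrame({
    "mp":      ["a", "a", "a", "a", "b", "b"],
    "week":    ["w1", "w1", "w1", "w1", "w1", "w1"],
    "abusive": [1, 1, 0, 0, 0, 0],
})

# micro-average: pool all replies, then take the abusive percentage
micro = replies.groupby("week")["abusive"].mean() * 100

# macro-average: percentage per MP first, then average across MPs,
# so high-profile accounts cannot dominate the figure
per_mp = replies.groupby(["week", "mp"])["abusive"].mean() * 100
macro = per_mp.groupby("week").mean()

print(round(micro.loc["w1"], 1), round(macro.loc["w1"], 1))  # 33.3 vs 25.0
```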
the most abused tweet of the month was this one, by then-leader of the opposition jeremy corbyn, attracting % of all abusive replies to mps. https://twitter.com/jeremycorbyn/status/ ( % abuse, % of all abusive replies to mps in february): if there was a case of a young white boy with blonde hair, who later dabbled in class a drugs and conspired with a friend to beat up a journalist, would they deport that boy? or is it one rule for young black boys born in the caribbean, and another for white boys born in the us?

in this section we review data from march st to march nd inclusive, when boris johnson made the first announcement that citizens were advised to avoid non-essential contact and journeys. in table and fig , one can see how attention on particular individuals changed in the first period in march. in the table you can see the number of replies they receive, the percentage of those replies that are related to covid- , and how this compares with other mp colleagues. health secretary matt hancock became more prominent on twitter at this time, though attention was not more abusive. attention on chancellor rishi sunak also increased and was not abusive. we see a high level of attention on boris johnson, but the abuse level is lower than was seen for him in previous years (we found . % in the first half of , as mentioned above; in , as foreign secretary, mr johnson received similarly high abuse levels in high volumes). negative attention on labour politicians is high, but note that this was also the case before the start of the epidemic in the uk. a focus on covid- is now in evidence (recall that counts for covid- tweets only include explicit mentions; it is likely that many more replies are about covid- ).

the word cloud in fig shows all hashtags in tweets to mps in the earlier part of march, and unsurprisingly shows a complete topic shift, to the subject of the epidemic, to the virtual exclusion of all else. the february word cloud shows a variety of non-covid subjects, such as brexit, ir (tax loophole legislation), the russia report, climate change, the pension age for women and the accusations against priti patel. now, almost all hashtags are covid- -related. table gives the counts for the top ten. [ ]

[ ] a non-standard em dash was used in hashtags referring to covid- for a time on twitter; in the tables we show a standard em dash.

as discussed above, feelings ran high around the beginning of lockdown, and examples of high-volume tweets with elevated abuse levels were given. here we give tweets receiving high abuse levels. https://twitter.com/borisjohnson/status/ ( % of replies were abusive; the tweet received % of all abuse to mps in the period; it also includes a video): this country will get through this epidemic, just as it has got through many tougher experiences before.

another tweet that gradually rose to become the second most abused tweet by volume of the period was a now-deleted tweet by layla moran defending china, apropos covid- ( % abuse, % of abuse for the period).

for the second period in march, from the rd to the st, attention on individual mps was reshuffled relative to the number of replies they received. from table and fig we can see that attention continues to focus on boris johnson, and is even less abusive than previously, largely due to a surge in non-abusive attention in conjunction with his being diagnosed with covid- . matt hancock becomes more prominent, though attracting less abuse than previously.
the rise of the hashtag "#stayhomesavelives" shows a shift toward comment on the practical details (see fig ). support for the lockdown appears to be high at this stage. table gives counts for the top ten. by volume, the most abused tweet was boris johnson's illness announcement, but as a percentage this was remarkably un-abusive, as discussed above, with . % abuse; the abuse count follows only from the very high level of attention this tweet drew. the most striking tweet, in terms of receiving a high percentage of abuse as well as a notable degree of attention, was one from richard burgon.

april st to th
this section reviews april up to and including april th, before trump issued his liberation tweets to virginia, michigan and minnesota and a backlash against lockdown measures became apparent. as indicated by the table and the histogram depicted in table and fig respectively, boris johnson's abuse level is extremely low as his illness takes a serious turn. keir starmer begins to attract attention in his new role as labour leader, and the attention is much less abusive than jeremy corbyn tended to receive in the same role ( % in the first half of [ ] , but the tables shown here also show consistently high abuse levels for mr corbyn). jeremy corbyn begins to attract less attention by volume of replies on twitter compared to others. nadine dorries attracts a higher abuse level than matt hancock and rishi sunak for a tweet given below. the high abuse level toward jack lopresti during this period relates to his controversial opinion that churches should open for easter. example tweets are given below. the hashtag cloud in fig shows that attention has begun to focus on the economic cost of the lockdown, as illustrated by the prominence of hashtags such as #newstarterfurlough and #wearethetaxpayers. table gives counts for the top ten.

examples of tweets that attracted particularly abusive responses:

https://twitter.com/jacklopresti/status/ ( % abuse, % of all abuse sent to mps between april st and th inclusive): open the churches for easter - and give people hope https://telegraph.co.uk/news/ / / /open-churches-easter-give-people-hope/?wt.mc_id=tmg_share_tw via @telegraphnews

https://twitter.com/jacklopresti/status/ ( % abuse, % of all abuse sent to mps between april st and th inclusive): today i wrote to the secretary of state @mhclg and also sent a copy of this letter to secretary of state @dcms to ask the government to consider opening church doors on easter sunday for private prayer.

https://twitter.com/nadinedorries/status/ ( % abuse, % of all abuse sent to mps between april st and th inclusive; given that many people were dying of a respiratory virus, the tweet struck many as tactless): the boss is in a better place. such a relief. the country can breathe again

https://twitter.com/richardburgon/status/ ( % abuse, % of all abuse sent to mps between april st and th inclusive; regarding his work as shadow justice secretary, regarded by some as mistimed considering the prime minister's health at the time).

april th to may th
the histogram and table in fig and table respectively reflect the unusual circumstances of this period. boris johnson did not return to work until april th, so the greater prominence of matt hancock on twitter during this period, while the prime minister recuperated at chequers, perhaps reflects this. dominic raab, in his role as acting prime minister, did not attract high attention levels on twitter.
keir starmer is now the most replied-to labour politician on twitter, but continues to attract low abuse levels. later in the month, the economic focus continues, as shown by the hashtag cloud in fig . the majority of hashtags now appear to be critical, often economically focused but also including accusations of lying against china, boris johnson and the conservatives, and references to the shortage of personal protective equipment for medical workers. the distinct change in tone echoes events in the usa. [ ] table gives counts for the top ten.

[ ] e.g. https://www.theguardian.com/global/video/ /apr/ /armed-protesters-demand-an-end-to-michigans-coronavirus-lockdown-orders-video

the tweet receiving the most abusive response by volume also received a striking level of abuse by percentage; this one by ed davey. https://twitter.com/edwardjdavey/status/ ( % abuse, % of all abuse toward mps for the period): a pre-dawn meal today preparing for my first ever fast in the holy month of ramadan for muslims doing ramadan in isolation, you are not alone! #ramadanmubarak #libdemiftar

two tweets by richard burgon also received a high level of abuse by volume and as a percentage.

a ramadan tweet by jeremy corbyn likewise drew heavily abusive responses, some of which read as evidence of anti-english sentiment, as in the following paraphrased replies, for example: "@jeremycorbyn so nothing about st george's day then? ah, that's because we are english, the country you wanted to run but hate with a vengeance. and you wonder why you suffered such a huge defeat at the election" and "@jeremycorbyn so no mention of st. george's day then? you utter cretin." the tweet read: ramadan mubarak to all muslims in islington north, all across the uk and all over the world.

https://twitter.com/simon ndorset/status/ ( % abusive replies, % of abuse for the period): i'm afraid @piersmorgan is not acting as a journalist. as a barrack room lawyer? yes. as a saloon bar bore? yes. as a bully? yes. as a show off? undoubtedly. he is not a seeker after truth: he's a male chicken

non-mp accounts
from this time period, we also began to collect data for a set of government accounts and other accounts relevant to the epidemic. in this period, as shown in table and fig , neil ferguson, a medical expert formerly included in the scientific advisory group for emergencies (sage), which advises the uk government, received more attention than he goes on to receive in the following period, and a high level of abuse. mr ferguson resigned from sage during this period, following publicity surrounding his lockdown violation. note that neil ferguson was also targeted by conspiracy theorists. [ ] chief medical officer of england chris whitty ("cmo england") and chief scientific advisor patrick vallance ("uksciencechief") both appear in the table, but receive only a fraction of the total and covid-related attention that downing street receives.

[ ] https://thefederalist.com/ / / /the-scientist-whose-doomsday-pandemic-model-predicted-armageddon-just-walked-back-the-apocalyptic-predictions/

may th to may th
in table and fig we see a return to a high level of focus on boris johnson, with other senior ministers also prominent. of labour politicians, only keir starmer attracts notable attention, with others much further down the list. higher levels of abuse are in evidence in this period. the four tweets to attract the highest volumes of abuse were all by cabinet members defending dominic cummings. abuse levels are high but not the highest we have seen. there was clearly a high level of attention on the issue, however.
https://twitter.com/matthancock/status/ ( % abuse, % of abuse for the period): i know how ill coronavirus makes you. it was entirely right for dom cummings to find childcare for his toddler, when both he and his wife were getting ill.

https://twitter.com/oliverdowden/status/ ( % abuse, % of abuse for the period): dom cummings followed the guidelines and looked after his family. end of story.

https://twitter.com/michaelgove/status/ ( % abuse, % of abuse for the period): caring for your wife and child is not a crime

a fourth tweet read: not that i should be surprised by the lazy left but interesting how workshy socialist and nationalist mps tried to keep the remote parliament going beyond june.

dominic cummings does not have a clearly labelled and verified twitter account, though the account "odysseanproject", rumoured to be his, does show elevated attention in this period, and some abuse, though not sufficient to appear in the top non-mp accounts shown in fig and table . it is unlikely that many twitter users are aware of this anonymous account (and indeed, our information may be incorrect!). however, the extent of the controversy around mr cummings' lockdown violation shows itself better in responses to mps defending his actions, and in the use of hashtags, as shown above.

this report [ ] from moonshot cve was used as a guide to the overall conspiracy landscape within covid- . the areas they highlight are anti-chinese feeling/conspiracy theory, theories that link the virus to a jewish plot, theories that link the virus to an american plot, generic "deep state" and g-based theories, and general theories that the virus is a plot or hoax. further search areas were then added as controls. the three controls appear at the bottom of table : "#endthelockdown" and close variants, "#newstarterfurlough" and variants, and "#stayhomesavelives" and variants, in order to contrast volumes with anti-lockdown feeling, the leading economy-related campaign, and pro-lockdown feeling respectively.

[ ] http://moonshotcve.com/covid- -conspiracy-theories-hate-speech-twitter/

the table shows substantial evidence of ill-feeling toward china. classic conspiracy theories are in evidence, but numbers of mentions are low (though note that most of the mentions of "nwo" ("new world order") are now covid- -related, suggesting opportunistic incorporation of covid- into existing mythologies). there is considerable evidence of some twitter users not believing in the virus, and the numbers of mentions to this effect are within one order of magnitude of the popular "stay home save lives". yet all are surpassed by the theme of economic support for those not in established employment.

the crisis has led to elevated engagement with uk politicians by the public, and we have seen that this may be more positive and less abusive than the dialogue at other times. the leading hashtag campaign of the period, "#newstarterfurlough", is associated with a remarkably low level of abuse (< . % of replies) despite being a complaint hashtag. the surge of attention on boris johnson during his illness was substantially lower in abuse than his previous levels. receiving more tweets mentioning the virus is associated with receiving lower levels of abuse. it may be that the crisis is leading to different people engaging with politicians than usually do, who are less inclined to verbally abuse them than those that usually occupy the space.
yet the usual, more uncivil contingent remains active on twitter, with politicians receiving abuse for particular topics, which may or may not be covid- -related, in much the same manner as they did before. tweets from mps expressing positive engagement with the muslim community have been met with hostile and abusive responses, and hashtags associating the virus with china have an elevated likelihood of abuse, continuing an already noted pattern [ , ] that racism and xenophobia are associated with particularly abusive tweets. previous work has also described a substantial presence of overt islamophobia in dialogue with mps [ ] . xenophobia has not gone away, and indeed has found new fuel in the crisis.

in terms of responses to the handling of the crisis, feelings run high on both sides. elevated levels of abuse are associated with hashtags supporting lockdown as well as those opposing it. labour politicians in favour of stricter measures have likewise received abusive responses.

table lists the search terms (matched in all replies to mps, not case-sensitive), alongside the number of tweets, the number abusive and the percentage abusive for each group:
• "nukechina" or "bombchina" or "deathtochina" or "destroychina" or "nuke china" or "bomb china" or "death to china" or "destroy china" or "#nukechina" or "#bombchina" or "#deathtochina" or "#destroychina"
• "ccpvirus" or "chinaliedpeopledied" or "ccp virus" or "china lied people died" or "#ccpvirus" or "#chinaliedpeopledied"
• "#chinesevirus" or "chinesevirus" or "chinese virus"
• "soros virus" or "israel virus" or "nwo virus" or "sorosvirus" or "israelvirus" or "nwovirus" or "#sorosvirus" or "#israelvirus" or "#nwovirus"
• "#nwo"
• "gates virus" or "cia virus" or "america virus" or "gatesvirus" or "ciavirus" or "americavirus" or "#gatesvirus" or "#ciavirus" or "#americavirus"
• "deepstatevirus" or "deep state virus" or "#deepstatevirus"
• "# gcoronavirus"
• "# gkills"
• "coronahoax" or "corona hoax" or "hoax virus" or "fake virus" or "hoaxvirus" or "fakevirus" or "#coronahoax" or "#hoaxvirus" or "#virushoax" or "#fakevirus"
• "corona bollocks" or "coronabollocks" or "corona bollox" or "coronabollox" or "#coronabollocks" or "#coronabollox"
• "plandemic" or "scamdemic" or "#plandemic" or "#scamdemic" or "#fakepandemic" or "#whereisthepandemic"
• "#filmyourhospital" or "#emptyhospitals"
• "end the lockdown" or "endthelockdown" or "end lockdown" or "endlockdown" or "end lock down" or "#endlockdown" or "#endthelockdown" or "#lockdownend" or "#endthislockdown" or "#endthelockdownuk" or "#endthelockdownnow"
• "newstarterfurlough" or "new starter furlough" or "new starter justice" or "newstarterjustice" or "#newstarterfurlough" or "#newstarterjustice"
• "stayhomesavelives" or "stay home save lives" or "#stayhomesavelives" or "#stayathomesavelives" or "#stayathomeandsavelives" or "#stayhomestaysafe" or "#stayhomeandstaysafe" or "#stayathomestaysafe" or "#stayathomeandstaysafe" or "#stayathome" or "#stayhome"

, replies to mps were found containing hashtags or terms that refute the existence of the virus (e.g. #coronahoax, #coronabollocks; . % of a total . million replies, or % of the number of mentions of "stay home save lives" and variants). evidence of some members of the public believing in covid- conspiracy theories was also found. the high prevalence of disbelief in the existence of the virus is a cause for concern.
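a minimal sketch of how such term-group counts and abuse percentages can be computed (the replies, abuse flags and term groups below are tiny hypothetical stand-ins for the real data and the full table above):

```python
import re
import pandas as pd

# hypothetical replies; the real study matched millions of replies to MPs
replies = pd.Series([
    "this is all a #coronahoax",
    "#stayhomesavelives please everyone",
    "total coronabollocks, open up now",
    "thank you nhs #stayhome and stay safe",
])
abusive = pd.Series([True, False, True, False])  # flags from the abuse classifier

# term groups, mirroring a subset of the table above
groups = {
    "hoax/denial": ["coronahoax", "corona hoax", "coronabollocks", "corona bollocks"],
    "pro-lockdown": ["stayhomesavelives", "stay home save lives", "#stayhome"],
}

lowered = replies.str.lower()
for name, terms in groups.items():
    pattern = "|".join(re.escape(t) for t in terms)
    hits = lowered.str.contains(pattern, regex=True)
    n, n_abusive = hits.sum(), (hits & abusive).sum()
    pct = 100 * n_abusive / n if n else 0.0
    print(f"{name}: {n} tweets, {n_abusive} abusive ({pct:.1f}%)")
```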
science vs conspiracy: collective narratives in the age of misinformation
infodemic and risk communication in the era of cov-
emotional dynamics in the age of misinformation
from incivility to outrage: political discourse in blogs, talk radio, and cable news
misinformation of covid- on the internet: infodemiology study
coronavirus goes viral: quantifying the covid- misinformation epidemic on twitter
coronavirus conspiracy beliefs, mistrust, and compliance with government guidelines in england
how right-leaning media coverage of covid- facilitated the spread of misinformation in the early stages of the pandemic in the us
twits, twats and twaddle: trends in online abuse towards uk politicians
race and religion in online abuse towards uk politicians
and they thought papers were rude
turds, traitors and tossers: the abuse of uk mps via twitter. ecpr joint sessions
partisanship, propaganda and post-truth politics: quantifying impact in online debate
a large-scale crowdsourced analysis of abuse against women journalists and politicians on twitter
exploring misogyny across the manosphere in reddit
local media and geo-situated responses to brexit: a quantitative analysis of twitter, news and survey data
blind trust: large groups and their leaders in times of crisis and terror

key: cord- -qp k fz authors: goswamy, tushar; parmar, naishadh; gupta, ayush; tandon, vatsalya; shah, raunak; goyal, varun; gupta, sanyog; laud, karishma; gupta, shivam; mishra, sudhanshu; modi, ashutosh title: ai-based monitoring and response system for hospital preparedness towards covid- in southeast asia date: - - journal: nan doi: nan sha: doc_id: cord_uid: qp k fz

abstract: this research paper proposes a covid- monitoring and response system to identify the surge in the volume of patients at hospitals and the shortage of critical equipment like ventilators in south-east asian countries, in order to understand the burden on health facilities. this can help authorities in these regions with resource planning measures to redirect resources to the regions identified by the model. due to the lack of publicly available data on the influx of patients in hospitals, or on the shortage of equipment, icu units or hospital beds that regions in these countries might be facing, we leverage twitter data for gleaning this information. the approach has yielded accurate results for states in india, and we are working on validating the model for the remaining countries so that it can serve as a reliable tool for authorities to monitor the burden on hospitals.

social media websites like twitter and facebook encourage frequent user expressions of their thoughts, opinions, and random details of their lives. india has the th largest user base of twitter in the world, with . million users and growing, followed by indonesia with . million users [statista, ] . this highlights the potential for gaining useful insights from the tweets posted by millions of users in these countries. tweets and status updates range from significant events to inane comments. most messages contain little informational value, but the aggregation of millions of messages can generate valuable knowledge. twitter users often publicly describe personal experiences of overcrowding at hospitals, difficulties that they or their relatives face due to equipment shortages, and other issues arising due to the pandemic, which can help in understanding the ground reality of the situation.
previous research has studied the correlation between twitter trends and influenza rates using tweets about the symptoms [paul and dredze, ] . statistical techniques have been used to forecast flu rates using twitter data [santillana et al., ] . influenza rates have been monitored at the local level in the usa during the influenza epidemic of [broniatowski et al., ] . similarly, signorini et al. [signorini et al., ] have studied the correlation between twitter data and h n cases for tracking of the infection. in this study, we are using the twitter data of users to study the surge in hospitalization volumes due to the covid- pandemic. we have focused our work on india, indonesia, and bangladesh for the scope of this study, with plans to extend this approach to other geographies in south-east asia. our research aims to identify incidents of overcrowding at hospitals, shortages of critical equipment like ventilators, and lack of available icu units. this system can help in understanding the medical preparedness levels of the health facilities in these countries and the burden on their hospitals as the pandemic spreads.

the system pipeline includes scraping historical tweets at a granular level to obtain a corpus, processing the corpus using natural language processing tools, calculating signals from the processed data, and finally evaluating the results by comparing ground reports and bulletins. we have deployed neural translation models to account for the usage of regional languages. our primary contribution to the ai community through this research is to demonstrate the application of an nlp-based twitter model to monitor the burden on health facilities due to the covid- pandemic. to the best of our knowledge, this is the first and only approach of its kind which can accurately detect the trends in the worst-hit regions based on twitter data. we are closely working with members from who's regional office for south-east asia (who-searo) to study and monitor our model's signals, and it is intended to help them with monitoring the situation in these countries and with identifying regions which are facing a resource crunch due to the pandemic. our model can thus be used by public health organisations to recommend appropriate actions to the authorities in the regions which the model has identified.

data extraction and pre-processing
natural language processing for tweets
historical tweet extraction
we used the getoldtweets api [mottl, updated ] to scrape and extract historical tweets from the twitter website. unfortunately, twitter has some restrictions due to which we are unable to access all the tweets beyond seven days from the date of scraping. this leads to a misleading spike in the data (fig. ). to address this, we scaled the tweets using the factor of change across the peak. to eliminate noise in the data and extract the important information, we performed the following operations on the tweet corpus (a sketch of the pipeline follows the list):
• removing website links: to prevent the same information from being captured twice.
• removing non-ascii characters: to eliminate noise and focus on relevant keywords only.
• removing stopwords: removed words like 'is', 'an', 'the' to focus on hospital-related words in the frequency analysis.
• tokenisation: we utilized the nltk tweettokenizer api [loper and bird, ] to tokenize tweets. this was done to aid the keyword calculation process in subsequent steps.
• lemmatisation: we implemented lemmatization on the tokens obtained for each tweet to convert inflected forms of each word to their base form.
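a minimal sketch of this pre-processing pipeline using nltk (illustrative only; it assumes the nltk stopword and wordnet resources have been downloaded, and the paper's actual implementation details may differ):

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

# requires nltk data: nltk.download("stopwords"); nltk.download("wordnet")
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True)
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(tweet: str) -> list:
    tweet = re.sub(r"https?://\S+", "", tweet)        # remove website links
    tweet = tweet.encode("ascii", "ignore").decode()  # drop non-ascii characters
    tokens = tokenizer.tokenize(tweet)
    tokens = [t for t in tokens if t not in stop_words and t.isalpha()]
    return [lemmatizer.lemmatize(t) for t in tokens]  # reduce to base forms

print(preprocess("Hospitals are overcrowded in Delhi! https://example.com"))
# ['hospital', 'overcrowded', 'delhi']
```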
we observed that the indonesian tweets were heavily code-mixed between indonesian bahasa and english. we therefore implemented a modified version of the pipeline described by barik et al. [barik et al., ] to normalize and process the indonesian tweets before calculating the scores. for tweets from bangladesh, the majority of the tweets were not code-mixed and were either in the roman english script or in the bengali script. thus, we processed the english tweets using the same set of operations mentioned above, and implemented tokenization and normalization for the bangla tweets.

to shortlist the keywords which are most relevant to our analysis and can yield accurate signals for the trend, we first created a corpus of common words related to the study, like 'hospital', 'icu', etc. this was followed by applying topic modelling using latent dirichlet allocation [blei et al., ] to find words in a similar category to our initial corpus. topic modelling provides clusters of similar words based on their usage, as well as weights indicating how closely the words of a cluster are related. we also performed an n-gram analysis to find the frequency of these keywords in our corpus. this was followed by finding the most similar words to these keywords using word2vec [mikolov et al., ] , which allowed us to create vector representations for all the words in the vocabulary by taking into account the lexical as well as semantic features of each word. the context of all the keywords was studied to minimise noise in our corpus by avoiding irrelevant words/phrases, while at the same time ensuring that the critical signals are captured. finally, based on the approaches outlined above, we shortlisted the following keywords for india, indonesia and bangladesh:
• india: 'hospital', 'medical college', 'beds', 'icu', 'shortage'

we experimented with different combinations of scores for the model, and finally shortlisted the following based on the requirements of the public health agencies who will use this model:
• twitter word count/day: the daily count of the shortlisted keywords for a region, aimed at capturing incidents of overcrowding at hospitals as well as shortages of beds and critical equipment.
• twitter volume/day: the count of all the words in the filtered tweets, indicating the trend in the volume of tweets related to the covid- pandemic in that region.

data adjustment
adjusting the peak
we discovered an abrupt peak in both the plots mentioned in the previous section. after a thorough analysis and observing the trend by re-scraping the data for a week, we found that the peak shifts by a day if we scrape the data again, and always occurs at the th historical day from the date of scraping. this can be attributed to a possible restriction imposed by twitter on accessing historical tweets. to overcome this issue, we normalised the historical tweets older than days using the ratio of values across the peak. this was done since the full volume of tweets is scraped only for the most recent days, and the issue only arises for the tweets which are older than days from our date of scraping. the original and adjusted plots for delhi can be seen in fig. .
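a minimal sketch of the daily keyword score and the peak adjustment (the data are hypothetical, and the scaling rule is our reading of the description above; the paper's exact rule may differ):

```python
import pandas as pd

KEYWORDS = {"hospital", "beds", "icu", "shortage"}  # the india list above

def word_count_score(tokens: list) -> int:
    """per-tweet ingredient of the word count/day signal."""
    return sum(tok in KEYWORDS for tok in tokens)

# hypothetical daily totals; the jump on day 6 mimics the artificial peak
# caused by the scraping restriction on older tweets
daily = pd.Series([40, 42, 38, 41, 39, 120, 115, 118, 121],
                  index=pd.date_range("2020-05-01", periods=9), dtype=float)

# scale the undersampled history by the factor of change across the peak
peak = daily.diff().idxmax()                       # day where the jump occurs
factor = daily.loc[peak] / daily.loc[peak - pd.Timedelta(days=1)]
adjusted = daily.copy()
adjusted.loc[: peak - pd.Timedelta(days=1)] *= factor
```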
when we directly plot the data, it picks up noise, visible as random fluctuations. this can be misleading in the analysis, and thus we 'smooth' the data by statistical techniques. we experimented with the following smoothing techniques and shortlisted the approach which gave the highest correlation with the positive cases data:
• moving averages: we successively plot the average over a window of n days (where n is the window size) to get a smoother curve which captures the overall trend better; different window sizes were compared.
• double exponential smoothing (holt's linear trend method): this includes two smoothing constants, one for the level and one for the trend. two equations, one estimating the local level and one estimating the local trend, are applied iteratively to each point to perform exponential smoothing [nau, updated ] .
we compared the pearson correlation coefficient between the results of these techniques and the positive cases data, and found the -day moving average to give the highest correlation and thus the best results.
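a minimal sketch of this comparison (hypothetical series; the real study compared the smoothed signal against official positive-case data for each region):

```python
import pandas as pd
from scipy.stats import pearsonr
from statsmodels.tsa.holtwinters import Holt

# hypothetical daily signal and official positive-case counts
signal = pd.Series([39, 41, 45, 52, 50, 61, 66, 70, 75, 83], dtype=float)
cases  = pd.Series([10, 12, 15, 14, 18, 22, 25, 24, 30, 33], dtype=float)

# candidate 1: n-day moving average (n is a tunable window size)
ma = signal.rolling(window=3).mean()

# candidate 2: holt's linear trend method (level + trend smoothing)
holt = Holt(signal).fit().fittedvalues

# keep the candidate whose smoothed series correlates best with cases
valid = ma.notna()
r_ma, _ = pearsonr(ma[valid], cases[valid])
r_holt, _ = pearsonr(holt, cases)
print(f"moving average r={r_ma:.2f}, holt r={r_holt:.2f}")
```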
we obtained similar results for the states of tamil nadu, gujarat and west bengal which are the next worst-hit states in india. kerala provides an interesting counter-case study to the examples we have provided above. kerala was the first state in india to identify a confirmed case of covid- [rawat, ] , and has tackled the situation well. it is observing a declining curve for the number of active cases, while the rest of the country continues to witness a surge in numbers. the state was able to ensure that the health facilities do not face shortage of critical equipment [roy and babu, ] , and kept checks on overcrowding at hospitals [biswas, ] . the state performed better compared to other states in the country like maharashtra and delhi and its model to combat the covid- pandemic is being studied as a case study [faleiro, ] . kerala has also reported a low death toll of only deaths , which indicates that the health facilities weren't burdened to the extent other states are suffering. this trend is reflected in our model's plot as the values have remained low since the beginning of the study, and has stagnated at a level https://www.mohfw.gov.in/ close to since th april (fig. ) . the plot, corresponding tweets and news articles validate our claim that the model is successfully able to capture that the state has remained free of any incidences of overcrowding or shortage of critical equipment. from the literature review and results obtained, we can conclude that information obtained from twitter data can provide useful insights about disease spread and its impact on the healthcare system. twitter can provide trends about the ground reality of the burden on medical facilities, which might not be captured in the official government reports. we found increasing signals and spikes, which were in accordance with the increase in the number of covid- cases, as well as the incidences of overcrowding at hospitals as confirmed by the news reports. thus, researchers and epidemiologists can expand their range of methods used for monitoring of the covid- pandemic by using the twitter data model, as described in this paper. however, twitter cannot provide all answers, and it may not be reliable for certain types of information. a significant limitation of the model is that social media is a platform where users can freely post anything, and thus, there is no way to verify the claims of any individual tweet. therefore, we are relying on the assumption that if thousands of people are tweeting an incident, it is real and worth reporting. however, these need to be verified by trustable sources such as verified news articles to establish the claims reported by the twitter data. soutik biswas. coronavirus: how india's kerala state 'flattened the curve'. bbc news national and local influenza surveillance through twitter: an analysis of the - influenza epidemic anonna dutt. % beds in private hospitals to be reserved for covid- surge. hindustan times what the world can learn from kerala about how to fight covid- . mit technology review durgesh jha. covid beds running out in delhi private hospitals. times of india please help': as coronavirus cases soar in delhi, patients are struggling to find hospital beds. scroll india efficient estimation of word representations in vector space statistical forecasting: notes on regression and time series analysis you are what you tweet: analyzing twitter for public health coronavirus in india: tracking country's first covid- cases; what numbers tell. 
combining search, social media, and traditional data sources to improve influenza surveillance.
the use of twitter to track levels of disease activity and public concern in the u.s. during the influenza a h n pandemic.
mumbai runs out of hospital beds for suspected covid- patients, starts a 'waitlist'. the wire.
mumbai: viral video shows bodies of coronavirus victims lying next to patients at sion hospital. india today.
key: cord- -flp s wd authors: lamsal, rabindra title: design and analysis of a large-scale covid- tweets dataset date: - - journal: appl intell doi: . /s - - -z sha: doc_id: cord_uid: flp s wd as of july , , more than thirteen million people have been diagnosed with the novel coronavirus (covid- ), and half a million people have already lost their lives to this infectious disease. the world health organization declared the covid- outbreak a pandemic on march 11, 2020. since then, social media platforms have experienced an exponential rise in content related to the pandemic. in the past, twitter data have been observed to be indispensable in the extraction of situational awareness information relating to any crisis. this paper presents the cov tweets dataset (lamsal a), a large-scale twitter dataset with more than million covid- specific english language tweets and their sentiment scores. the dataset's geo version, the geocov tweets dataset (lamsal b), is also presented. the paper discusses the datasets' design in detail, and the tweets in both datasets are analyzed. the datasets are released publicly, anticipating that they will contribute to a better understanding of the spatial and temporal dimensions of the public discourse related to the ongoing pandemic. as per the stats, the datasets (lamsal a, b) have been accessed over . k times, collectively. during a crisis, whether natural or man-made, people tend to spend relatively more time on social media than normal. as a crisis unfolds, social media platforms such as facebook and twitter become an active source of information [ ] because these platforms break news faster than official news channels and emergency response agencies [ ] . during such events, people usually make informal conversations by sharing their safety status, querying about their loved ones' safety status, and reporting ground-level scenarios of the event [ , ] . this process of continuous creation of conversations on such public platforms leads to the accumulation of a large amount of socially generated data, which can range from hundreds of thousands to millions of records. such data can be (i) trimmed [ ] or summarized [ , , , ] and sent to the relevant department for further analysis, or (ii) used for sketching alert-level heat maps based on the location information contained within the tweet metadata or the tweet body. similarly, twitter data can also be used for identifying the flow of fake news [ , , , ] . if misinformation and unverified rumors are identified before they spread out on everyone's news feed, they can be flagged as spam or taken down. further, in-depth textual analyses of twitter data can help (i) discover how positively or negatively a geographical region is being textually verbal towards a crisis, and (ii) understand the dissemination processes of information throughout a crisis.
as of july , , the number of novel coronavirus (covid- ) cases across the world had reached more than thirteen million, and the death toll had crossed half a million [ ] . states and countries worldwide are trying their best to contain the spread of the virus by initiating lockdowns and even curfews in some regions. as people are bound to work from home, social distancing has become the new normal. with the increase in the number of cases, the seriousness of the pandemic has made people more active in social media expression. multiple terms specific to the pandemic have been trending on social media for months now. therefore, twitter data can prove to be a valuable resource for researchers working in the thematic areas of social computing, including but not limited to sentiment analysis, topic modeling, behavioral analysis, fact-checking and analytical visualization. large-scale datasets are required to train machine learning models or perform any kind of analysis. the knowledge extracted from small datasets and region-specific datasets cannot be generalized because of limitations in the number of tweets and geographical coverage. therefore, this paper introduces a large-scale covid- specific english language tweets dataset, hereinafter termed the cov tweets dataset. as of july , , the dataset has more than million tweets and is available at ieee dataport [ ] . the dataset gets a new release every day. the dataset's geo version, the geocov tweets dataset, is also made available [ ] . as per the stats reported by the ieee platform, the datasets [ , ] have been accessed over . k times, collectively, worldwide. the paper is organized as follows: section reviews related research works. section discusses the design methodology of the cov tweets dataset and its geo version. section focuses on the hydration of tweet ids for obtaining full tweet objects. section presents the analysis and discussions, and section concludes the paper. multiple other studies have also been collecting and sharing large-scale datasets to enable research in understanding the public discourse regarding covid- . some of those publicly available datasets are multi-lingual [ , , , ] , and some are language-specific [ , ] . among those datasets, [ , , ] have significantly large numbers of tweets in their collections. [ ] provides more than million multi-lingual tweets and also an english version as a secondary dataset. however, with the last update released on may , , the dataset [ ] does not seem to be getting frequent releases. [ ] shares around million multi-lingual tweets alongside the most frequently used terms. [ ] provides million multi-lingual tweets, with around million tweets in the english language. however, neither of them [ , ] has english version releases. first, the volume of english tweets in multi-lingual datasets can become an issue. twitter sets limits on the number of requests that can be made to its api. its filtered stream endpoint has a rate limit of requests/ -minutes per app, which is why the maximum number of tweets that can be fetched in hours is just above million. the language breakdown of multi-lingual datasets shows a higher prevalence of english, spanish, portuguese, french, and indonesian languages [ , ] . therefore, multi-lingual datasets contain relatively few english tweets, unless multiple language-dedicated collections are run and merged later. second, the size and multi-lingual nature of large-scale datasets can become a concern for researchers who need only the english tweets.
for that purpose, the entire dataset must be hydrated and then filtered, which can take multiple weeks. recent studies have done sentiment analysis on different samples of covid- specific twitter data. a study [ ] analyzed . million covid- specific tweets collected between february , , and march , , using frequencies of unigrams and bigrams, and performed sentiment analysis and topic modeling to identify twitter users' interaction rate per topic. another study [ ] examined tweets collected between january , , and april , , to understand the worldwide trends of emotions (fear, anger, sadness, and joy) and the narratives underlying those emotions during the pandemic. a regional study [ ] in spain performed sentiment analysis on , conversations collected from various digital platforms, including twitter and instagram, during march and april , to examine the impact of risk communications on emotions in spanish society during the pandemic. in a similar regional study [ ] concerning china and italy, the effect of covid- lockdown on individuals' psychological states was studied using the conversations available on weibo (for china) and twitter (for italy) by analyzing the posts published two weeks before and after the lockdown. multiple studies have performed social network analysis on twitter data related to the covid- pandemic. a case study [ ] examined the propagation of the #filmyourhospital hashtag using social network analysis techniques to understand whether the hashtag's virality was aided by bots or coordination among twitter users. another study [ ] collected tweets containing the # gcoronavirus hashtag between march , , and april , , and performed network analysis to understand the drivers of the g covid- conspiracy theory and strategies to deal with such misinformation. a regional study [ ] concerning south korea used network analysis to investigate the information transmission networks and news-sharing behaviors regarding covid- on twitter. a similar study [ ] investigated the relationship between social network size and incivility using the tweets originating from south korea between february , , and february , , when the korean government planned to bring its citizens back from wuhan. twitter provides two api types: the search api [ ] and the streaming api [ ] . the standard version of the search api can be used to search against a sample of tweets created in the last seven days, while the premium and enterprise versions allow developers to access tweets posted in the previous days ( -day endpoint) or from as early as (full-archive endpoint) [ ] . the streaming api is used for accessing tweets from the real-time twitter feed [ ] . for this study, the streaming api has been in use since march , . the original collection of tweets was started on january , . the study commenced as an optimization design project to investigate how much social media data volume can be analyzed using minimal computing resources. twitter's content redistribution policy restricts researchers from sharing tweet data other than tweet ids, direct message ids and/or user ids. the original collection did not have tweet ids; therefore, tweets collected between january , , and march , , could not be released to the public. hence, a fresh collection was started on march , . figure shows the daily distribution of the tweets in the cov tweets dataset. between march , , and april , , four keywords, "corona," "#corona," "coronavirus," and "#coronavirus," were used for filtering the twitter stream; a minimal collection sketch is given below.
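a filtered collection of this kind takes only a few lines to set up. the sketch below uses the tweepy 3.x streaming interface as one plausible tooling choice (the paper does not name a library); the credential placeholders, the output file and the error handling are assumptions, while the four keywords and the english-only restriction come from the text.

```python
import json
import tweepy

# hypothetical credentials; real values come from a twitter developer app
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

KEYWORDS = ["corona", "#corona", "coronavirus", "#coronavirus"]

class CovidStreamListener(tweepy.StreamListener):
    """append each incoming tweet's json to a local file."""
    def on_status(self, status):
        with open("covid_tweets.jsonl", "a") as fh:
            fh.write(json.dumps(status._json) + "\n")

    def on_error(self, status_code):
        # returning False on 420 (rate limited) disconnects instead of retrying
        return status_code != 420

stream = tweepy.Stream(auth=auth, listener=CovidStreamListener())
# languages=["en"] restricts the stream to machine-detected english tweets
stream.filter(track=KEYWORDS, languages=["en"])
```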
therefore, the number of tweets captured in that period per day, on average, is around k. however, a dedicated collection was started on a linux-based high-performance cpu-optimized virtual machine (vm), with additional filtering keywords, after april , . as of july , , keywords are being tracked for streaming the tweets. the number of keywords has been evolving continuously since the inception of this study. table gives an overview of the filtering keywords currently in use. as the pandemic grew, a lot of new keywords emerged. in this study, n-grams are analyzed every hours using the most recent . million tweets to keep track of emerging keywords. twitter's "worldwide trends" section is also monitored for the same purpose. on may , , twitter also published a list of multi-lingual filtering keywords used in its covid- stream endpoint [ ] . the streaming api allows developers to use up to keywords, , user ids, and location boxes for filtering the twitter stream. the keywords are matched against the tokenized text of the body of the tweet. keywords have been identified as filtering rules for extracting covid- specific tweets. user id filtering was not used. the location box filtering was also avoided, as the intention was to create a global dataset. twitter adds a bcp language identifier based on the machine-detected language of the tweet body. since the aim was to pull only the english tweets, the "en" condition was assigned to the language request parameter. the collection of tweets is only a small portion of the dataset design. the other tasks include the filtration of geo-tagged tweets and the computation of a sentiment score for each captured tweet, all in real time. a dashboard is also required to visualize the information extracted from the collected tweets. a stable internet connection is needed to download the continuously incoming json. the computation of a sentiment score for each captured tweet requires the vm to have cpus powerful enough to avoid a bottleneck scenario. all the information gathered to this point needs to be stored in a database, which necessitates a disk with excellent performance. summing up, a cloud-based vm is required to automate all these tasks. in this study, the vm has to process thousands of tweets every minute. also, the information extracted from the captured data is to be visualized on an active front-end server that requires plotting hundreds of thousands of data points. therefore, a linux-based compute-optimized hyper-threading vm is used for this study. table gives an overview of the vm considered in the dataset design. figure a-e shows the resource utilization graphs for various performance parameters of the vm. a new collection starts between - hrs gmt+ : every day; therefore, the cpu usage and average load increase gradually as more and more tweets get captured. the cpu usage graph, in fig. a, shows that the highest percentage of cpu usage at any given time does not exceed %. a few python scripts and libraries, and a web server, are actively running in the back-end. the majority of the tasks are cpu-intensive; therefore, memory usage does not seem to exceed %, as shown in fig. b. past data show that memory usage exceeds % only when the web traffic on the visualization dashboard increases; otherwise, it is usually constant. the load average graph, in fig. c, shows that the processors do not operate over capacity. the three colored lines, magenta, green and purple, represent the -minute, -minute, and -minute load averages.
the disk i/o graph, in fig. d, interprets the read and write activity of the vm. saving thousands of tweets' information every minute triggers continuous writing activity on the disk. the disk i/o graph shows that the write speed is around . mb/s, and the read speed is insignificant. the bandwidth usage graph, in fig. e, reveals the public bandwidth usage pattern. on average, the vm receives a continuous data stream at mb/s. the vm connects with the backup server's database to download the most recent half a million tweets for extracting a list of unigrams and bigrams. a new list is created every hours; hence the peaks in the bandwidth usage graph. geotagging is the process of placing location information in a tweet. when a user permits twitter to access his/her location via an embedded global positioning system (gps), the geo-coordinates data is added to the tweet location metadata. this metadata gives access to various geo objects [ ] such as "place_type": "city", "name": "manhattan", "full_name": "manhattan, ny", "country_code": "us", "country": "united states" and the bounding box (polygon) of coordinates that encloses the place. previous studies have shown that a significantly small number of tweets are geo-tagged. a study [ ] , conducted between - in southampton city, used local and spatial data to show that around k tweets out of million had "point" geolocation data. similarly, in another work [ ] on online health information, it was evident that only . % of tweets were geo-tagged. further, a multilingual covid- global tweets dataset from crisisnlp [ ] reported having around . % geo-tagged tweets. in this study, the tweets received from the twitter stream are filtered by applying a condition on the ["coordinates"] twitter object to design the geocov tweets dataset. algorithm shows the pseudo-code for filtering the geo-tagged tweets; a worked sketch is also given below. figure shows the daily distribution of tweets present in the geocov tweets dataset. out of million tweets, k tweets ( . %) were found to be geo-tagged. if only the collection after april , , is considered, k ( . %) tweets are geo-tagged. twitter's content redistribution policy restricts the sharing of tweet information other than tweet ids, direct message ids and/or user ids. twitter wants researchers to pull fresh data from its platform, because users might delete their tweets or make their profiles private. therefore, complying with twitter's content redistribution policy, only the tweet ids are released. the dataset is updated every day with the addition of newly collected tweet ids. first, twitter allows developers to stream around % of all new public tweets as they happen, via its streaming api. therefore, the dataset is a sample of the comprehensive covid- tweets collection twitter has on its servers. second, there is a known gap in the dataset: due to technical reasons, the tweets collected between march , , hrs gmt+ : , and march , , hrs gmt+ : could not be retrieved. third, analyzing tweets in a single language increases the risk of missing essential information available in tweets created in other languages [ ] . therefore, the dataset is primarily applicable for understanding the covid- public discourse originating from native english-speaking nations. twitter does not allow the json of tweets to be shared with third parties; the tweet ids provided in the cov tweets dataset must be hydrated to get the original json. this process of extracting the original json from the tweet ids is known as the hydration of tweet ids.
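the geo-filtering condition can be expressed directly in python. this is a minimal sketch assuming tweets are stored one json object per line; the file names are hypothetical, and only the ["coordinates"] check is taken from the text.

```python
import json

def is_geotagged(tweet: dict) -> bool:
    # the "coordinates" object is non-null only when the tweet
    # carries an exact gps point ("point" geolocation)
    return tweet.get("coordinates") is not None

with open("cov_tweets.jsonl") as src, open("geocov_tweets.jsonl", "w") as dst:
    for line in src:
        tweet = json.loads(line)
        if is_geotagged(tweet):
            dst.write(line)
```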
there are multiple libraries/applications, such as twarc (a python library) and hydrator (a desktop application), developed for this purpose. using the hydrator application is relatively straightforward; however, working with the twarc library requires basic programming knowledge. algorithm is the pseudo-code for using twarc to hydrate a list of tweet ids; a sketch is also given below. the tweet data dictionary provides access to a long list of root-level attributes. the root-level attributes, such as user, coordinates, place, entities, etc., further provide multiple child-level attributes. when hydrated, the tweet ids produce json that contains all the root-level and child-level attributes with their values. twitter's documentation [ ] can be referred to for more information on the tweet data dictionary. the cov tweets dataset has global coverage, and it can also be used to extract tweets originating from a particular region. an implementable solution for this is to check whether a tweet is geo-tagged or has a place boundary defined in its data dictionary. if none of these fields are available, the address given on the user's profile can be used. however, twitter does not validate the profile address field for authentic geo-information; even addresses such as "milky way galaxy," "earth," "land," "my dream," etc. are accepted entries. a user can also create a tweet from a particular place while having the address of a different one. therefore, considering the user's profile address might not be an effective solution when dealing with location information. algorithm is the pseudo-code for extracting tweets originating from a region of interest. tweets received from the twitter stream can be analyzed to make multiple inferences regarding an event. the tweets collected between april , , and july , , were considered to generate an overall covid- sentiment trend graph. the sampling time is minutes, which means a combined sentiment score is computed for the tweets captured every minutes. figure shows the covid- sentiment trend based on public discourse related to the pandemic. in fig. , there are multiple drops in the average sentiment over the analysis period. in particular, there are fourteen drops where the scores are negative. among those fourteen drops, seven of the most significant drops were studied. the tweets collected on those dates were analyzed to see which particular terms (unigrams and bigrams) were trending. table lists the most commonly used terms during those seven drops. the tweets are pre-processed before extracting the unigrams and bigrams. the pre-processing steps include transforming the texts to lowercase and removing noisy data such as retweet information, urls, special characters, and stop words [ ] . it should be noted that the removal of stop words from the tweet body results in a different set of bigrams; therefore, the bigrams listed in table should not be considered the sole representative of the context in which the terms might have been used. next, the geocov tweets dataset was used for performing network analysis to extract the underlying relationship between countries and hashtags. only the hashtags that appear more than ten times in the entire dataset were considered. the dataset resulted in , [country, hashtag] relations from countries and territories, and unique hashtags. there were , unique relations when weighted. finally, the resulting relations were used to generate a network graph, as shown in fig. .
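before turning to the network-analysis results, the hydration and region-extraction steps described above can be sketched briefly. the snippet below uses the twarc 1.x hydrate call mentioned in the text; the file names, the bounding-box convention, and the fallback from the exact gps point to the place polygon are illustrative assumptions, not the paper's exact algorithms.

```python
import json
from twarc import Twarc

# hypothetical app credentials
t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

def hydrate_ids(ids_file: str, out_file: str) -> None:
    """turn released tweet ids back into full tweet json, one object per line."""
    with open(out_file, "w") as out:
        for tweet in t.hydrate(open(ids_file)):
            out.write(json.dumps(tweet) + "\n")

def in_region(tweet: dict, bbox: tuple) -> bool:
    """check a tweet against a (min_lon, min_lat, max_lon, max_lat) box,
    preferring the exact gps point over the coarser place boundary."""
    min_lon, min_lat, max_lon, max_lat = bbox
    if tweet.get("coordinates"):
        lon, lat = tweet["coordinates"]["coordinates"]
        return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
    place = tweet.get("place")
    if place and place.get("bounding_box"):
        # use the first corner of the place polygon as a rough proxy
        lon, lat = place["bounding_box"]["coordinates"][0][0]
        return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
    return False  # the unvalidated profile address is deliberately ignored
```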
the graph shows interesting facts about the dataset. the network has a dense block of nodes forming a sphere and multiple sparsely populated nodes connected to the nodes inside the sphere through some relations. the nodes that are outside the sphere are country-specific hashtags. for illustration, fig. a-d shows the country-specific hashtags for new zealand, qatar, venezuela, and argentina. the nodes of these countries are outside the sphere because of outliers in their respective sets of hashtags. however, these countries do have connections with the popular hashtags present inside the sphere. the majority of the hashtags in fig. a-d do not relate directly to the pandemic; therefore, these hashtags can be considered outliers when designing a set of hashtags for the pandemic. the network graph, shown in fig. , is further expanded by a scale factor, as shown in fig. a and b. the network graphs are colored based on the communities detected by a modularity algorithm [ , ] . the algorithm detected communities in the geocov tweets dataset. the weight='weight' and resolution= . parameters were used for the experimentation; a sketch of this step is given below. table gives an overview of the communities identified in the geocov tweets dataset. country names are represented by their iso codes. community constitutes . % of the nodes in the network. the number of members in community was relatively high; therefore, the iso column for that community lists only the countries that have associations with at least different hashtags. for the remaining communities, all the members are listed. communities are formed based on the usage of similar hashtags. the united states has associations with the highest number of different hashtags; it is therefore justified to find most countries in the same group as the united states. however, other native english-speaking nations such as the united kingdom and canada seem to be forming their own communities. this formation of separate communities is because of the differences in their sets of hashtags. for example, the united kingdom appears to mostly use "lockdown," "lockdown ," "isolation," "selfisolation," etc. as hashtags, but the presence of these hashtags in the hashtag set of the united states is limited. the iso codes for each community in table are sorted in descending order (fig. shows the country-specific outlier hashtags detected using network analysis); the country associated with the highest number of unique hashtags is mentioned first. next, a set of popular hashtags and their communities are identified. table lists the top commonly used hashtags, their weighted in-degree, and their respective communities. the community for a hashtag in table means that the hashtag has appeared the most in that particular community. the [country, hashtag] relations can also be used to trace back a hashtag's usage pattern. the hashtags "flattenthecurve," "itsbetteroutside," "quarantine," "socialdistancing," etc. seem to have been first used in tweets originating from the united states. in the fourth week of march , countries such as the united kingdom, india, and south africa experienced their first phase of lockdown; for the same reason, there is an unusual increase in the usage of "lockdown"-related hashtags during that period in those countries. it should be noted that a thorough tracing back of hashtag usage would require analysis of tweets collected since december , when the "first case" of covid- was identified [ ] . as of july , , the number of tweets in the geocov tweets dataset is , .
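the community-detection step maps naturally onto the louvain implementation commonly paired with networkx. the tooling here is an assumption: the paper names only a modularity algorithm with weight='weight' and a resolution parameter, and the toy edge list stands in for the real [country, hashtag] relations.

```python
import networkx as nx
import community as community_louvain  # pip install python-louvain

# toy weighted bipartite edges: (country, hashtag, weight)
relations = [
    ("US", "#covid19", 120), ("US", "#flattenthecurve", 40),
    ("GB", "#lockdown", 80),  ("GB", "#covid19", 60),
    ("IN", "#lockdown", 70),  ("IN", "#covid19", 90),
]

G = nx.Graph()
for country, hashtag, w in relations:
    G.add_edge(country, hashtag, weight=w)

# louvain community detection; the resolution value here is illustrative
partition = community_louvain.best_partition(G, weight="weight", resolution=1.0)
modularity = community_louvain.modularity(partition, G, weight="weight")
print(partition)  # node -> community id
print(f"modularity = {modularity:.3f}")
```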
the dataset was hydrated to create a country-level distribution of the geo-tagged tweets, as shown in table . the united states dominates the distribution with the highest number of geo-tagged tweets (table lists the remaining countries by their iso codes: us, au, ng, za, ae, es, id, ie, mx, pk, sg, fr, be, gh, ke, th, se, at, sa, pt, lb, ug, eg, co, ma, lk, ec, hk, kw, ro, pe, fi, hr, no, zw, pa, tz, vn, bs, pg, hu, bh, cr, bb, om, sx, rs, tw, bg, do, zm, aw, kh, gu, bt, bw, cm, cg, cd, fj, aq, sv, al). next, the geo-tagged tweets were visualized on a map based on their sentiment scores. figures and are the sentiment maps generated from the location information extracted from the tweets collected between march , , and july , . the world view of the covid- sentiment map, in fig. , shows that the majority of the tweets originate from north america, europe, and the indian subcontinent. interestingly, some tweets are also seen to be originating from countries where the government has banned twitter: around . % of the geo-tagged tweets have come from the people's republic of china, and while north korea does appear on the list, the number is insignificant. when a region-specific sentiment map is generated, as shown in fig. , numerous clusters of geo-location points are observed. such clusters can give the authorities a bird's-eye view for creating first-hand sketches of tentative locations from which to start responding to a crisis. for example, the location information extracted from the tweets classified into the "infrastructure and utilities damage" category can help generate near real-time convex closures of the crisis-hit area; a sketch of this computation is given below. such convex closures can prove to be beneficial for the first responders (army, police, rescue teams, first-aid volunteers, etc.) in coming up with actionable plans. in general, the inferences made from geo-specific data can help (i) understand knowledge gaps, (ii) perform surveillance for prioritizing regions, and (iii) recognize the urgent needs of a population [ ] . understanding the knowledge gaps involves identifying the crisis event-related queries posted by the public on social media. the queries can be anything, a rumor, or even some casual inquiry. machine learning models can be trained on large-scale tweet corpora for classifying the tweets into multiple informational categories, including a separate class for "queries." even after the automatic classification, each category still contains hundreds of thousands of tweet conversations, which require further in-depth analysis. those classified tweets can be summarized to extract a concise and important set of conversations. recent studies have used extractive summarization [ , ] , abstractive summarization [ ] , and hybrid approaches [ ] for summarizing microblogging streams. if the queries are identified and duly answered, the public's tendency to panic can be settled to some extent. further, geo-specific data can assist in surveillance. social media messages can be monitored actively to identify those that report a disease's signs and symptoms. if such messages are detected early enough, an efficient response can be targeted to that particular region, and the authorities and decision-makers can come up with effective and actionable plans to minimize possible future severity. furthermore, social media messages can also be analyzed to understand the urgent needs of a population.
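returning to the clustering idea above, the convex closures are cheap to compute once geo-tagged points are collected. the sketch below uses scipy's convex hull; the (lon, lat) points are hypothetical stand-ins for tweets from a single informational category.

```python
import numpy as np
from scipy.spatial import ConvexHull

# hypothetical (lon, lat) points from, e.g., "infrastructure damage" tweets
points = np.array([
    [72.83, 18.94], [72.88, 19.07], [72.86, 19.12],
    [72.92, 19.01], [72.81, 19.05], [72.90, 19.10],
])

hull = ConvexHull(points)
boundary = points[hull.vertices]  # polygon enclosing the reported area
print(boundary)
```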
the requirements might include anything related to everyday essentials (shelter, food, water) and health services (medicines, checkups). the above-discussed research implications fall under the crisis response phase of the disaster management cycle. however, other sub-areas in the social computing domain require computational systems to also understand the psychology and sociology of the affected population/region as part of the crisis recovery phase. the design of such computational systems requires a humongous amount of data for modeling the intelligence within them to track the public discourse relating to any event. therefore, a large-scale twitter dataset for the covid- pandemic was presented in this paper, hoping that the dataset and its geo version will help researchers working in the social computing domain to better understand the covid- discourse. in this paper, a large-scale global twitter dataset, the cov tweets dataset, was presented. the dataset contains more than million english language tweets, originating from different countries and territories worldwide, collected between march , , and july , . earlier studies have shown that geo-specific social media conversations aid in extracting situational information related to an ongoing crisis event. therefore, the geo-tagged tweets in the cov tweets dataset were filtered to create its geo version, the geocov tweets dataset. out of million tweets, it was observed that only k tweets ( . %) had "point" locations in their metadata. the united states dominates the country-level distribution of the geo-tagged tweets and is followed by the united kingdom, canada, and india. designing a large-scale twitter dataset requires a reliable vm to fully automate the associated tasks. five performance metrics (specific to cpu, memory, average load, disk i/o, and bandwidth) were analyzed to see how the vm performed over a -hour period. the paper then discussed techniques to hydrate tweet ids and filter tweets originating from a region of interest. next, the cov tweets dataset and its geo version were used for sentiment analysis and network analysis. the tweets collected between april , , and july , , were considered to generate an overall covid- sentiment trend graph. based on the trend graph, seven significant drops in the average sentiment over the analysis period were studied, and the trending unigrams and bigrams on those particular dates were identified. further, a detailed social network analysis was done on the geocov tweets dataset using [country, hashtag] relations. the analysis confirmed the presence of different communities within the dataset; the formation of communities was based on the usage of similar hashtags. also, a set of popular hashtags and their communities were identified. furthermore, the geocov tweets dataset was used for generating world and region-specific sentiment-based maps, and the research implications of using geo-specific data were briefly outlined.
references:
top concerns of tweeters during the covid- pandemic: infoveillance study.
covid- and the g conspiracy theory: social network analysis of twitter data.
large arabic twitter dataset on covid- .
a large-scale covid- twitter chatter dataset for open scientific research - an international collaboration.
assessing twitter geocoding resolution.
fast unfolding of communities in large networks.
a survey on fake news and rumour detection techniques.
influence of fake news in twitter during the us presidential election.
"right time, right place" health communication on twitter: value and accuracy of location information.
crowd sourcing disaster management: the complex nature of twitter usage in padang indonesia.
big crisis data: social media in disasters and time-critical situations.
tsunami early warnings via twitter in government: net-savvy citizens' coproduction of time-critical public information services.
tracking social media discourse about the covid- pandemic: development of a public coronavirus twitter data set.
a microblogging-based approach to terrorism informatics: exploration and chronicling civilian sentiment and response to terrorism events via twitter.
multilingual sentiment analysis: state of the art and independent comparison of techniques.
omg earthquake! can twitter improve earthquake response?
going viral: how a single tweet spawned a covid- conspiracy theory on twitter.
arcov- : the first arabic covid- twitter dataset with propagation networks.
clinical features of patients infected with novel coronavirus in wuhan, china.
processing social media messages in mass emergency: a survey.
aidr: artificial intelligence for disaster response.
twitter as a lifeline: human-annotated twitter corpora for nlp of crisis-related messages.
using ai and social media multimodal content for disaster response and management: opportunities, challenges, and future directions.
detection of spam-posting accounts on twitter.
prediction and characterization of high-activity events in social media triggered by real-world news.
effects of social grooming on incivility in covid- .
laplacian dynamics and multiscale modular structure in networks.
lamsal r ( b) coronavirus (covid- ) geo-tagged tweets dataset.
lamsal r ( a) coronavirus (covid- ) tweets dataset.
twitter based disaster response using recurrent nets.
using tweets to support disaster planning, warning and response.
sentiment analysis and emotion understanding during the covid- pandemic in spain and its impact on digital ecosystems.
global sentiments surrounding the covid- pandemic on twitter: analysis of twitter trends.
robust classification of crisis-related data on social networks using convolutional neural networks.
efficient online summarization of microblogging streams.
conversations and medical news frames on twitter: infodemiological study on covid- in south korea.
what kind of #conversation is twitter? mining #psycholinguistic cues for emergency coordination.
geocov : a dataset of hundreds of millions of multilingual covid- tweets with location information.
summarizing situational tweets in crisis scenarios: an extractive-abstractive approach.
sumblr: continuous summarization of evolving tweet streams.
examining the impact of covid- lockdown in wuhan and lombardy: a psycholinguistic analysis on weibo and twitter.
communicating on twitter during a disaster: an analysis of tweets during typhoon haiyan in the philippines.
twitter: covid- stream.
twitter: filter realtime tweets.
twitter: standard search api.
twitter: twitter object.
rumor response, debunking response, and decision makings of misinformed twitter users during disasters.
on summarization and timeline generation for evolutionary tweet streams.
spatial, temporal, and content analysis of twitter for wildfire hazards.
automatic identification of eyewitness messages on twitter during disasters.
mining twitter data for improved understanding of disaster resilience.
acknowledgments: the author is grateful to digitalocean and google cloud for funding the computing resources required for this study. the author declares that there is no conflict of interest.
key: cord- - r vycnr authors: chire saire, j. e. title: infoveillance based on social sensors to analyze the impact of covid in south american population date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: r vycnr infoveillance is an application of the infodemiology field, with the aim of monitoring public health and informing public policy. a social sensor is a person providing thoughts and ideas through electronic communication channels (i.e. the internet). the present scenario is the effort to tackle the covid impact all over the world; many countries have the infrastructure and scientists to support their response, and countries took actions to decrease the impact. south american countries have a different context regarding economy, health and research, so infoveillance can be a useful tool to monitor and improve decisions and be more strategic. the motivation of this work is to analyze the capitals of the spanish-speaking countries in south america using a text mining approach with twitter as the data source. the preliminary results help to understand what happened over the previous two weeks and open the analysis to different perspectives, i.e. economic and social. infodemiology is a new research field whose objective is monitoring public health and supporting public policies based on electronic sources, i.e. the internet. usually these data are open, textual and unstructured, and come from blogs, social networks and websites; all these data are analysed in real time. infoveillance refers to applications of this kind for surveillance purposes, i.e. monitoring the h n pandemic with twitter as a data source, monitoring dengue in brazil, and monitoring covid symptoms in bogota, colombia. besides, social sensing means observing what people are doing in order to monitor the environment of the citizens living in a city, state or country. given the open and loosely controlled access to the internet and social networks, people can also share false information (fake news). a disease caused by a kind of coronavirus, named coronavirus disease (covid ), started in wuhan, china at the end of year. this virus showed a fast growth of infections in china, italy and many countries in asia and europe during january and february.
countries in america (central, north, south) started registering infections in the middle of february or the beginning of march. this disease was declared a global concern at the end of january by the world health organization (who). south america has a different context regarding economics, politics and social issues than the rest of the world, and shares a common language: spanish. the decisions made by each government came over time, with different dates and actions, i.e. social isolation, closing borders by air and land. but there is no tool to monitor in real time what is happening across a country, how the people are reacting, which action is more effective and which problems are growing. given the previous context, the motivation of this work is to analyze the capitals of the countries with spanish as the official language, in order to analyze, understand and provide support during this big challenge that we are facing every day. this paper follows the next organization: section explains the methodology for the experiments, section presents results and analysis, section states the conclusions and section introduces recommendations for related studies. the present analysis is inspired by the cross industry standard process for data mining (crisp-dm) steps, whose phases are very frequent in data mining tasks. the steps for this analysis are the next: • select the scope of the analysis and the social network • find the relevant terms to search on twitter • build the query for twitter and collect data • clean the data to eliminate words with no relevance (stopwords) • visualize the data to understand the countries. considering the countries where spanish is the official language, there are countries in south america: argentina, bolivia, chile, colombia, ecuador, paraguay, perú, uruguay and venezuela, and every nation has a different territory size, as table shows. therefore, analyzing each whole country could take a great effort in time, so the scope of this paper considers the capital of each country, because the highest population is found there. at the same time, there are many social networks, like facebook, linkedin, twitter, etc., with different kinds of objectives: entertainment, job search and so on. during the last years, data privacy has become an important concern and these platforms have updated their policies, so considering this restriction, twitter is chosen because of the open access through the twitter api; the api will help us collect the data for the present study. although the free access has a limitation of seven days, the collecting process is performed every week. actually, there are hundreds of news items around the world and dozens of papers about the coronavirus, so to perform the queries it is necessary to select the specific terms and consider the popular names used by the population. the selected terms are the popular names of the disease; ideally, people would use only these terms, but citizens do not write following the official names, and special characters are found, like @, #, -, _. for this reason, variations of coronavirus and covid are created, i.e. { '@coronavirus', '#covid- ', '@covid_ ' }. the extraction of tweets is through the twitter api, using the next parameters (a sketch of such a query is given below): • date: - - to - - , the last two weeks
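a query of this kind can be sketched with the tweepy 3.x search interface. this is a hedged illustration, not the author's script: the paper does not name a library, and the geocode radius for the capital, the result cap and the date bound shown here are assumptions, while the keyword variations follow the text.

```python
import tweepy

# hypothetical credentials
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# keyword variations with @ and # prefixes, as described in the text
base = ["coronavirus", "covid"]
terms = {prefix + t for t in base for prefix in ("", "#", "@")}
query = " OR ".join(terms)

# illustrative geocode for one capital (buenos aires) and a 50 km radius
tweets = tweepy.Cursor(
    api.search,
    q=query,
    geocode="-34.6037,-58.3816,50km",
    lang="es",
    until="2020-04-25",  # assumed end date of a weekly window
    tweet_mode="extended",
).items(1000)

for tw in tweets:
    print(tw.created_at, tw.full_text[:80])
```

running one such query per capital, once a week, would match the seven-day window of the free search endpoint noted above.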
the cleaning step applies the following operations (a sketch is given after the lists below): • change the format of the date to year-month-day • eliminate alphanumeric symbols • change uppercase to lowercase • eliminate words with a size less than or equal to a threshold • add some exceptions to eliminate, i.e. 'https', 'rt'. this step will help to answer some questions to analyze what happens in every country: • how is the frequency of posts every day? • can we trust all the posts? • the date of user account creation • tweets per day, to analyze the increasing number of posts • a cloud of words, to analyze the most frequent terms involved per day. the next graphics present the results of the experiments and answer many questions to understand the phenomenon over the population. at the beginning, a fast preview of the frequency of posts per country helps us to understand how many active users are in every capital. four things are important to highlight from fig. : ( ) venezuela is a smaller country but the number of posts is pretty similar to argentina, ( ) paraguay is almost a third of peru's territory and the numbers of publications are very similar, ( ) chile is a small country but the number of publications is higher than peru's, and ( ) uruguay is the smallest one, with more tweets than bolivia and colombia, and even ecuador has more. on the other hand, considering data from table , there is a strong relationship between internet, social media and mobile connections and the number of tweets in argentina and venezuela, but a different context for colombia: this insight shows us the level of use in bogota and says how the internet users are spread among other cities in colombia. so, a similar behavior to the one explained previously is present in this data. considering the image in fig. , with the number of posts for each country, the total number of tweets is up to five million ( ), close to half a million ( ) per day. so, the question about veracity is important, to filter and analyze what people are thinking, because the noise could be a limitation to understanding what truly happens. by consequence, it is necessary to consider some criterion to filter this data. first, argentina has the highest number of publications in the last two weeks. for example, the first dozen of the top users in buenos aires are: 'portal diario', '.', 'clarín', 'radio dogo', 'camila', 'el intransigente', 'agustina', 'pablo', 'frentedetodos', 'ale', 'lucas', 'diario crónica'. later, searching for these users, one natural finding is that they are related to newspapers, radio or television (mass media), but there are also accounts with many hundreds of tweets and regular people. the next image, fig. , has the names of users and their quantity of posts.
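the listed cleaning operations translate into a small helper function. the sketch below is one plausible implementation: the minimum word length of three and the use of nltk's spanish stop-word list are assumptions, while the lowercasing, the symbol removal and the 'https'/'rt' exceptions come from the text.

```python
import re
from nltk.corpus import stopwords  # run nltk.download('stopwords') beforehand

SPANISH_STOPS = set(stopwords.words("spanish"))
EXCEPTIONS = {"https", "rt"}  # always removed, per the text
MIN_LEN = 3                   # assumed length threshold

def clean_tweet(text: str) -> list:
    text = text.lower()
    # keep letters, accented characters and the # / @ markers; drop the rest
    text = re.sub(r"[^a-záéíóúñü#@\s]", " ", text)
    tokens = []
    for word in text.split():
        if word in EXCEPTIONS or word in SPANISH_STOPS:
            continue
        if len(word) <= MIN_LEN:
            continue
        tokens.append(word)
    return tokens

print(clean_tweet("RT @usuario: Cuarentena total en Lima!! https://t.co/xyz"))
# -> ['@usuario', 'cuarentena', 'total', 'lima']
```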
(figure: cloud of the thirty most frequent terms for bolivia, including 'covid', 'cuarentena', '#coronavirusbo', '#lapaz', 'gobierno', 'salud', 'hospital' and '@jeanineanez'.) to help the visualisation from monday to sunday during the last two weeks, a cloud of words is presented in fig. , showing the first thirty terms per country. it is important to remember that every country promoted different actions on different dates.
references:
infodemiology and infoveillance: tracking online health information and cyberbehavior for public health.
social web mining and exploitation for serious applications: technosocial predictive analytics and related technologies for public health, environmental and national security surveillance.
pandemics in the age of twitter: content analysis of tweets during the h n outbreak.
building intelligent indicators to detect dengue epidemics in brazil using social networks.
what is the people posting about symptoms related to coronavirus in bogota, colombia?
early epidemiological analysis of the coronavirus disease outbreak based on crowdsourced data: a population-level observational study.
who. who statement regarding cluster of pneumonia cases in wuhan, china. beijing: who.
the crisp-dm model: the new blueprint for data mining.
infoveillance based on social sensors with data coming from twitter can help to understand the trends in the population of the capitals. besides, it is necessary to filter the posts before processing the text to get insights about frequency, top users and the most important terms. these data are useful to analyse the population from different perspectives. key: cord- - gd sz z authors: little, jessica s.; romee, rizwan title: tweeting from the bench: twitter and the physician-scientist benefits and challenges date: - - journal: curr hematol malig rep doi: . /s - - - sha: doc_id: cord_uid: gd sz z purpose of review: social media platforms such as twitter are increasingly utilized to interact, collaborate, and exchange information within the academic medicine community. however, as twitter begins to become formally incorporated into professional meetings, educational activities, and even the consideration of academic promotion, it is critical to better understand both the benefits and challenges posed by this platform. recent findings: twitter use is rising amongst healthcare providers nationally and internationally, including in the field of hematology and oncology.
participation on twitter at national conferences such as the annual meetings of the american society of hematology (ash) and the american society of clinical oncology (asco) has steadily increased over recent years. tweeting can be used advantageously to cultivate opportunities for networking or collaboration, promote one's research and increase access to others' research, and provide efficient means of learning and educating. however, given the novelty of this platform and the little formal training on its use, concerns regarding patient privacy, professionalism, and equity must be considered. summary: these new technologies present unique opportunities for career development, networking, research advancement, and efficient learning. from "tweet ups" to twitter journal clubs, physician-scientists are quickly learning how to capitalize on the opportunities that this medium offers. yet caution must be exercised to ensure that the information exchanged is valid and true, that professionalism is maintained, that patient privacy is protected, and that this platform does not reinforce preexisting structural inequalities. social media is a rapidly evolving platform for communication that is increasingly being utilized across the academic medicine community. twitter, a free microblogging platform, enables users to read and post -character messages called "tweets" [ •, ] . twitter provides novel opportunities for physician-scientists to interact and collaborate across institutions and diverse fields. it increases access to research and enables real-time discussion of new publications [ ] . not only does it serve to disseminate information, it also may be utilized as a means to generate data [ , ] . as this platform is increasingly integrated into the academic medical community, it is important to consider both the benefits and potential challenges posed by this technology. twitter offers physicians the opportunity to connect and advance common interests. even trainees at an early stage are able to follow and engage with leading experts in a particular specialty with greater ease, thus advancing their understanding of key scholarship or topics of discussion at the forefront of the field [ , ] . additionally, engagement on twitter prior to and during academic meetings can help build professional relationships and communities that may lead to future collaborations or opportunities for career advancement [ •, , ] . in one recent analysis of tweets during the american society of clinical oncology annual meetings between and , pemmaraju and colleagues found that both the number of individual authors and the overall number of tweets significantly increased over the period [ ] . meeting attendees may tweet responses and commentary on presented scholarship and even arrange "tweet ups," or face-to-face meetings for those who met virtually on twitter [ , ] . and while in the past missing a national or international conference may have meant losing access to important new data, ideas, or opportunities for collaboration, now, as academic meetings are increasingly integrated with social media, physicians can watch presentations, participate in discussions, and network with other attendees remotely [ , , ] . mentorship and academic sponsorship can also be practiced through the medium of twitter. mentors or academic sponsors advanced in their field who have increased influence or impact on twitter can promote the accomplishments of their mentees to increase their individual visibility.
likewise, individuals can promote their own accomplishments, including research publications, academic promotions, or awards, targeting a broader audience, which may result in additional career opportunities [ , ] . and as social media engagement continues to grow, academic institutions such as mayo clinic have even begun to consider ways to incorporate social media scholarship into metrics for academic promotion and tenure [ • ]. while there are considerable potential professional benefits to engaging in social media platforms such as twitter, there are also challenges. social media may blur the line between the professional and personal identity of a physician, and missteps may harm the professional reputation of users [ , ] . it is therefore critical to compose each "tweet" with the understanding that the post will be public and permanent [ ] . in one study led by chretien et al., tweets from self-identified physicians were analyzed over one month. of those, tweets were categorized as unprofessional, with representing potential patient privacy violations, containing profanity, with sexually explicit material, and with discriminatory statements. and amongst the users responsible for privacy violations, ( %) were identifiable by full listed name on their profile, photo, or linked website [ , ] . furthermore, physicians are not simply at risk of disapproval by colleagues and patients or punitive actions by employers. a survey of the directors of medical and osteopathic boards revealed % ( out of respondents) indicating that at least one of several online professionalism violations had been reported to the board. in response, % held disciplinary hearings, and serious disciplinary outcomes including license restriction, suspension, or revocation occurred at % of the boards [ ] . in response to these concerns, the american medical association created guidelines for social media use amongst physicians [ ] . however, this guidance does not provide clear rules of conduct and should serve as simply the first step in the construction of formal policies and training across institutions for physicians on social media. another key issue introduced by the use of twitter is the potential amplification of implicit biases and structural inequality already problematic in academic medicine. while many maintain that twitter can increase equity by opening new channels of communication to diverse individuals across geographic, socioeconomic, and disciplinary barriers, others argue that social media may increase the impact of those who already have the most impact and exacerbate inequality [ ] . gender inequalities have already been identified in many key areas across medicine, and gender bias in the way women are addressed and perceived may affect career advancement [ , ] . how twitter reinforces these biases must be considered. one study by zhu et al. identified twitter users amongst speakers and coauthors presenting at academyhealth's annual research meeting and evaluated their most recent tweets. amongst more than health services researchers, women had less influence on twitter than men, with half the mean number of followers and fewer mean likes and retweets per year. these differences were largest amongst full professors and similar across the distribution of number of tweets [ • ]. further investigation is needed into whether these inequities exist for other underrepresented minorities on twitter.
finally, it is important to acknowledge that twitter may have detrimental effects on the productivity of participants. while small steps are being taken towards acknowledging activity and scholarship on social media at certain institutions, there is still minimal formal recognition of physician use of twitter in a professional sense [ , ] . it can be easy to sacrifice the slower, more laborious work of designing studies, writing papers or book chapters, and keeping up with patient charting when faced with the potential positive feedback loop of a popular tweet. social media, and twitter in particular, have radically transformed the landscape of information sharing, and this is especially relevant to biomedical research. the platform presents opportunities for rapid review of new papers, easy access to multiple journals and expert opinions, increased potential for crowdsourcing, and enhanced post-publication peer review. physicians can follow respected journals, professional societies, and mentors or colleagues who may be sharing important advances in the field. in this way, physicians can stay up to date with minimal time expended. tweets and articles can be saved or "bookmarked" to review in more detail later [ ] . similarly, researchers may increase the impact of their work by using twitter. one study analyzing tweets showed that highly tweeted articles were times more likely to be highly cited than less-tweeted articles [ ] . journals may also utilize social media such as twitter to increase the impact factor of their work. one group recently proposed instituting a tif, or twitter impact factor, for journals to measure the academic reach and impact of a journal on the social media platform [ ] . twitter has also encouraged innovative forms of communicating research findings. another recent prospective case-control crossover study looked at research articles published in the same year in annals of surgery. each article was tweeted in two formats: as the title alone or as the title with a visual abstract. a strong correlation was found between the use of visual abstracts and increased dissemination on social media. additionally, the articles tweeted with a visual abstract received more site visits than the articles without visual abstracts [ ] . one area that has expanded rapidly on twitter is post-publication peer review and twitter journal clubs. journal clubs have long served as important tools for propagating new research, practicing evidence-based medicine, and developing skills to evaluate research design and the validity of findings [ ] [ ] [ ] . recently, a diverse range of twitter journal clubs has arisen, including id journal club, nephjc, and the jgim twitter journal club, among others [ , ] . organizers choose articles and indicate a date and time for the meeting. tweets are organized and referenced by hashtags, and participants can follow along or interact by commenting on individual tweets. content experts or authors may be invited, and physicians at all levels may join in to learn collectively. and while these meetings often cater to physicians and physician-scientists, journal clubs are typically open to any individual, including patients, allowing improved public dissemination of new research advances. crowdsourcing and collaboration during peer review may lead to important findings of design or methodology errors, statistical inconsistencies, or other flaws in publications.
in one case, twitter critics rapidly identified errors in methodology in an article in science titled "genetic signatures of exceptional longevity in humans". within a week, the authors released a statement acknowledging a technical error in the lab test used, and the paper was eventually retracted [ ] . as the speed and breadth of scientific publication increase, twitter remains an important resource to critically appraise the expanding literature. crowdsourcing and network utilization may also be used positively to impact public health efforts by disseminating educational information to communities, amplifying emergency notifications, and enhancing aid efforts when needed [ ] . this has been a particularly useful tool during the covid- pandemic of , as the cdc and local health departments have used twitter to circulate critical health information. while social media provides an immense opportunity for information uptake and dissemination, there are important caveats to this information exchange. misinformation is rampant, and developing the ability to discern true facts from misinformation is increasingly challenging as technology advances. innovations such as the verified badge allow users to know whether accounts are authentic, though this may not apply uniformly. as pershad et al. noted, while a celebrity may be verified due to his/her role in the public eye, that individual's views on healthcare topics such as vaccination may not be valid health information [ ] . additionally, twitter engagement may be purchased unbeknownst to viewers. in one analysis of the asco annual meeting, the second largest number of retweets came from fake engagement, or retweets purchased by a third party [ ] . in another study by desai et al., tweets contained in the official twitter hashtags of thirteen medical conferences from to were analyzed. the twitter influence of third-party commercial entities was found to be similar to that of healthcare providers [ ] . it is critical to curb this fake engagement at professional medical meetings moving forward to reduce bias and promote transparency. even if physician accounts and engagement are authentic, financial conflicts of interest are frequently not revealed on social media. this may also lead to bias in the transmission of information, particularly when audiences with less medical expertise, such as patients, are involved. in one study in jama, out of hematologist-oncologists in the usa who use twitter were found to have some financial conflicts of interest [ ] . however, no clear regulations govern disclosure on physician social media, and this should be duly considered when evaluating information sources. another example of the potential challenges of twitter is the rapid increase in preprints over recent years, most notably during the covid- pandemic. while preprints are beneficial in making novel findings rapidly available, these manuscripts often have not undergone the full peer review process. inexperience of the media and lay public in distinguishing peer-reviewed from non-peer-reviewed publications can lead to magnification of findings that are erroneous [ , ] . twitter not only creates unique opportunities for learning about new research findings but can also provide rich clinical educational content [ ] . "tweetorials," or threaded tweets, are used frequently to present lessons on clinical topics and engage learners at all levels [ ] .
teaching podcasts such as "the curbsiders" and "clinical problem solvers" have also utilized twitter to widen their audience and condense important lessons into easily digestible tweets. one systematic review examined studies that assessed the effect of social media platforms on graduate medical education. these modalities were used to share clinical teaching points, disseminate evidence-based medicine, and circulate conference materials. given the fast-paced nature of medical residency, social media provides a logical space for on-the-go learning and review. one notable finding was that most studies offered mixed results and provided little guidance on how best to incorporate social media platforms formally into graduate medical education [ ] . not only does twitter provide opportunities for trainee and continuing medical education, it may also be used as a critical tool for patient education [ ] . in one survey-based study, a breast cancer social media twitter support community was created. respondents reported increased knowledge about their breast cancer in a variety of areas, and participation led . % to seek a second opinion or bring additional information to the attention of their treatment team [ ] . on twitter, communities can be created for and by patients using disease-specific hashtags [ ] . for rare diseases in particular, these communities can facilitate new avenues for connection, education, and collaboration between patients and physicians working in highly specialized areas [ ] . these networks can even be used as channels to propagate information about available clinical trials to diverse populations [ ] . important limitations to learning via twitter remain. patient privacy issues can arise, particularly as photos, radiology images, and case descriptions are more widely shared [ , , ] . twitter can serve as an echo chamber, where ideas are magnified by like-minded individuals in close networks, reducing the sharing of outside perspectives [ ] . finally, the volume of information can overwhelm users, making it difficult to distinguish valuable knowledge from irrelevant comments. there are significant benefits to the effective utilization of social media platforms such as twitter. physicians and scientists may grow their networks, gain career opportunities, expand the impact of their research, connect with patients, stay up to date on novel discoveries, and much more. however, clear frameworks for professional use of this technology are still being developed. it is vital to better understand the risks to patients and providers in order to safely and deliberately integrate this valuable tool into our institutions and practices.

conflict of interest. the authors declare that there is no conflict of interest.

human and animal rights and informed consent. this article does not contain any studies with human or animal subjects performed by any of the authors.
references.
- the use and impact of twitter at medical conferences: best practices and twitter etiquette
- social medicine: twitter in healthcare
- scientists in the twitterverse
- twitter and beyond: introduction to social media platforms available to practicing hematologist/oncologists
- risks and benefits of twitter use by hematologists/oncologists in the era of digital medicine
- twitter as a tool for communication and knowledge exchange in academic medicine: a guide for skeptics and novices
- trends in twitter use by physicians at the american society of clinical oncology annual meeting
- analysis of the use and impact of twitter during american society of clinical oncology annual meetings from to : focus on advanced metrics and user trends
- social media and the practicing hematologist: twitter for the busy healthcare provider
- tweeting the meeting
- leveraging social media for cardio-oncology
- using social media to promote academic research: identifying the benefits of twitter for sharing academic work
- academics and social networking sites: benefits, problems and tensions in professional engagement with online networking
- first demonstration of one academic institution's consideration of incorporation of social media scholarship into academic promotion
- professionalism in the digital age
- social media and physicians' online identity crisis
- physicians on twitter
- physician violations of online professionalism and disciplinary actions: a national survey of state medical boards
- report of the ama council on ethical and judicial affairs: professionalism in the use of social media
- evaluating unconscious bias: speaker introductions at an international oncology conference
- gender differences in publication rates in oncology: looking at the past, present, and future
- gender differences in twitter use and influence among health policy and health services researchers
- can tweets predict citations? metrics of social impact based on twitter and correlation with traditional metrics of scientific impact
- introducing the twitter impact factor: an objective measure of urology's academic impact on twitter
- visual abstracts to disseminate research on social media
- the journal club and medical education: over one hundred years of unrecorded history
- the evolution of the journal club: from osler to twitter
- the times they are a-changin': academia, social media and the jgim twitter journal club
- peer review: trial by twitter
- quantifying the twitter influence of third party commercial entities versus healthcare providers in thirteen medical conferences from
- financial conflicts of interest among hematologist-oncologists on twitter
- will the pandemic permanently alter scientific publishing?
- how swamped preprint servers are blocking bad coronavirus research
- twitter-based learning for continuing medical education?
- from tweetstorm to tweetorials: threaded tweets as a tool for medical education and knowledge dissemination
- the use of social media in graduate medical education
- cancer patients on twitter: a novel patient community on social media
- twitter social media is an effective tool for breast cancer patient education and support: patient-reported outcomes by survey
- disease-specific hashtags for online communication about cancer care
- rare cancers and social media: analysis of twitter metrics in the first years of a rare-disease community for myeloproliferative neoplasms on social media-#mpnsm
- cancer communication in the social media age
- pathology image-sharing on social media: recommendations for protecting privacy while motivating education

key: cord- - w rtn authors: lewandowsky, stephan; jetter, michael; ecker, ullrich k. h. title: using the president's tweets to understand political diversion in the age of social media date: - - journal: nat commun doi: . /s - - - sha: doc_id: cord_uid: w rtn social media has arguably shifted political agenda-setting power away from mainstream media onto politicians. current u.s. president trump's reliance on twitter is unprecedented, but the underlying implications for agenda setting are poorly understood. using the president as a case study, we present evidence suggesting that president trump's use of twitter diverts crucial media (the new york times and abc news) from topics that are potentially harmful to him. we find that increased media coverage of the mueller investigation is immediately followed by trump tweeting increasingly about unrelated issues. this increased activity, in turn, is followed by a reduction in coverage of the mueller investigation—a finding that is consistent with the hypothesis that president trump's tweets may also successfully divert the media from topics that he considers threatening. the pattern is absent in placebo analyses involving brexit coverage and several other topics that do not present a political risk to the president. our results are robust to the inclusion of numerous control variables and examination of several alternative explanations, although the generality of the successful diversion must be established by further investigation.

on august , , a devastating earthquake maimed and killed thousands in china's yunnan province. within hours, chinese media were saturated with stories about the apparent confession by an internet celebrity to have engaged in gambling and prostitution. news about the earthquake was marginalized, to the point that the chinese red cross implored the public to ignore the celebrity scandal. the flooding of the media with stories about a minor scandal appeared to have been no accident, but represented a concerted effort of the chinese government to distract the public's attention from the earthquake and the government's inadequate disaster preparedness . this organized distraction was not an isolated incident. it has been estimated that the chinese government posts around million social media comments per year , using a -cent army of operatives to disseminate messages. unlike traditional censorship of print or broadcast media, which interfered with writers and speakers to control the source of information, this new form of internet-based censorship interferes with consumers by diverting attention from controversial issues.
inconvenient speech is drowned out rather than being banned outright . in western democracies, by contrast, politicians cannot orchestrate coverage in social and conventional media to their liking. the power of democratic politicians to set the political agenda is therefore limited, and it is conventionally assumed that it is primarily the media, not politicians, that determine the agenda of public discourse in liberal democracies , . several lines of evidence support this assumption. for example, coverage of terrorist attacks in the new york times has been causally linked to further terrorist attacks, with one additional article producing . attacks over the following week . coverage of al-qaeda in premier us broadcast and print media has been causally linked to additional terrorist attacks and increased popularity of the al-qaeda terrorist network . similarly, media coverage has been identified as a driver-rather than an echo-of public support for right-wing populist parties in the uk . further support for the power of the media emerges from two quasi-experimental field studies in the uk. in one case, when the sun tabloid switched its explicit endorsement from the conservative party to labour in , and back again to the conservatives in , each switch translated into an estimated additional , votes for the favored party at the next election . in another case, the longstanding boycott of the anti-european sun tabloid in the city of liverpool (arising from its untruthful coverage of a tragic stadium incident in with multiple fatalities among liverpool soccer fans) rendered attitudes towards the european union in liverpool more positive than in comparable areas that did not boycott the sun . in the united states, the gradual introduction of fox news coverage in communities around the country has been directly linked to an increase in republican vote share . finally, a randomized field experiment in the us that controlled media coverage of local papers by syndicating selected topics on randomly chosen days identified strong flow-on effects into public discourse. the intervention increased public discussion of an issue by more than % . this conventional view is, however, under scrutiny. more nuanced recent analyses have invoked a market in which the elites, mass media, and citizens seek to establish an equilibrium . in particular, the rapid rise of social media, including the microblogging platform twitter, has provided new avenues for political agenda setting that have an increasingly discernible impact. for example, the content of twitter discussions of the hpv vaccine explains differences in vaccine uptake beyond those explainable by other socioeconomic variables. greater spread of misinformation and conspiracy theories on twitter is associated with lower vaccination rates . similarly, fake news (fabricated stories that are presented as news on social media) has controlled the popularity of many issues in us politics , mainly owing to the responsiveness of partisan media outlets. the entanglement of partisan media and social media is of considerable generality and can sometimes override the agenda-setting power of leading outlets such as the new york times . one important characteristic of twitter is that it allows politicians to directly influence the public's political agenda . for example, as early as , a sample of journalists acknowledged their reliance on twitter to generate stories and obtain quotes from politicians .
with the appearance of donald trump on the political scene, twitter has been elevated to a central role in global politics. president trump has posted around , tweets as of february . to date, research has focused primarily on the content of donald trump's tweets [ ] [ ] [ ] [ ] . relatively less attention has been devoted to the agenda-setting role of the tweets. some research has identified the number of retweets trump receives as a frequent positive predictor of news stories (though see for a somewhat contrary position). during the election campaign, trump's tweets on average received three times as much attention as those of his opponent, hillary clinton, suggesting that he was more successful at commanding public attention . here we focus on one aspect of agenda-setting, namely donald trump's presumed strategic deployment of tweets to divert attention away from issues that are potentially threatening or harmful to the president , . unlike the chinese government, which has a -cent army at its disposal, diversion can only work for president trump if he can directly move the media's or the public's attention away from harmful issues. anecdotally, there are instances in which this diversion appears to have been successful. for example, in late , president-elect trump repeatedly criticized the cast of a broadway play via twitter after the actors publicly pleaded for a "diverse america." this twitter event coincided with the revelation that trump had agreed to a $ million settlement (including a $ million penalty ) of lawsuits against his (now defunct) trump university. an analysis of people's internet search behavior using google trends confirmed that the public showed far greater interest in the broadway controversy than in the trump university settlement , attesting to the success of the presumed diversion. however, to date evidence for diversion has remained anecdotal. this article provides the first empirical test of the hypothesis that president trump's use of twitter diverts attention from news that is politically harmful to him. in particular, we posit that any increase in harmful media coverage may be followed by increased diversionary twitter activity. in turn, if this diversion is successful, it should depress subsequent media coverage of the harmful topic. to operationalize the analysis, we focused on the mueller investigation as a source of potentially threatening or harmful media coverage. special prosecutor robert mueller was appointed in march to investigate russian interference in the election and potential links between the trump campaign and russian officials. given that legal scholars discussed processes by which a sitting president could be indicted even before mueller delivered his report , , and given that mueller indicted associates of trump during the investigation , there can be no doubt that this investigation posed a serious political risk to donald trump during the first years of his presidency. the center panel of fig. provides an overview of the presumed statistical paths of this diversion. any increase in harmful media coverage, represented by the word cloud on the left, should be followed by increased diversionary twitter activity, which is captured by the word cloud on the right. the expected increase is represented by the "+" sign along that path. if this diversion is successful, it should depress subsequent media coverage of the harmful topic (path labeled by "−" to represent the opposite direction of the association).
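the two paths just described can be written schematically as a pair of regression equations (the notation is ours, not the paper's; the full models described in "methods" add separate week intercepts, linear and quadratic date trends, and k lagged terms of the dependent variable):

```latex
% schematic only: T_t = diversionary keyword count in tweets on day t,
% C_t = count of threatening (russia-mueller) media items on day t,
% X_t = control variables; expected signs match the figure
\[
\begin{aligned}
T_t &= \alpha_0 + \alpha_1 C_t + \gamma^{\top} X_t + \varepsilon_t
  && \text{(diversion; expected } \alpha_1 > 0\text{)}\\
C_t &= \beta_0 + \beta_1 T_{t-1} + \beta_2 C_{t-1} + \delta^{\top} X_t + \eta_t
  && \text{(suppression; expected } \beta_1 < 0\text{)}
\end{aligned}
\]
```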
each of the two paths is represented by a regression model that relates twitter activity (represented by the number of relevant tweets) with media coverage (represented by the number of reports concerning russia-mueller). we approached the analysis in two ways: first, we used ordinary least squares (ols), in which each of the two paths was captured by a separate regression model. second, we used three-stage least squares (3sls) to estimate both path coefficients in a single model simultaneously. we obtained daily news coverage from two acknowledged benchmark sources in tv and print media during the first years of trump's presidency ( january - january ): the american broadcasting corporation's abc world news tonight, a -min daily evening news show that ranks first among all evening news programs in the us , and the new york times (nyt), which is widely considered to be the world's most influential newspaper by leaders in business, politics, science, and culture . the nyt was also ranked first in the world, based on traffic flow, by the international search engine imn in (https://www. imn.com/top /). a recent quantitative analysis of more than , news articles in the nyt through a combination of machine learning and human judgment (involving a sample of nearly judges) has identified the new york times as being quite close to the political center, with a slight liberal leaning (which was found to be smaller than, e.g., the corresponding conservative slant of the wall street journal) . similarly, abc news is known to be favored by centrist voters without, however, being shunned by partisans on either side . in the online news ecosystem, abc (combined with yahoo!) and nyt form some of the most central nodes in the network, and nearly the entire online news audience tends to congregate at those brand-name sites . specification of diversionary topics on twitter is challenging a priori because potentially any topic, other than the mueller investigation itself, could be recruited by the president as a diversionary effort. we addressed this problem in two ways. first, we conducted a targeted analysis in which the diversionary topics were stipulated a priori to be those that president trump prefers to talk about, based on our analysis of his political position and rhetoric during the first years of his presidency. second, we conducted an expanded analysis that considered the president's entire twitter vocabulary as a potential source of diversion. this analysis allowed for the possibility that trump would divert by highlighting topics other than those that he consistently favors. both analyses included a number of controls and robustness checks, such as randomization, sensitivity analyses, and the use of placebo keywords, to rule out artifactual explanations. both analyses were approached in two different ways. the first approach fitted two independent ordinary least squares (ols) regression models that (a) predicted diversion from mueller coverage and (b) captured a suppression of mueller coverage in the media as a downstream consequence of diversion (see fig. and "methods"). the second approach used a three-stage least squares (3sls) regression model. in a 3sls regression, multiple equations are estimated simultaneously; in our case there are two equations that capture diversion and suppression, respectively. this approach is particularly suitable when phenomena may be reciprocally causal.
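a rough r sketch of the first (ols) approach, using the covariance estimators named in the methods section (vcovhc, neweywest, coeftest); the data frame d and its column names are our own invention, with one row per day containing the keyword and coverage counts, their one-day lags, and the controls described in the methods:

```r
library(sandwich)  # vcovHC(), NeweyWest()
library(lmtest)    # coeftest()

# hypothetical daily data frame d: tweets = diversionary keyword count,
# coverage = russia-mueller item count, lag1_* = one-day lags,
# week = week index, date/date2 = linear and quadratic trends
# (the full models add the remaining k lagged terms)
diversion <- lm(tweets ~ coverage + lag1_tweets +
                  factor(week) + date + date2, data = d)
suppression <- lm(coverage ~ lag1_tweets + lag1_coverage +
                    factor(week) + date + date2, data = d)

# conventional robust standard errors vs. newey-west standard errors
coeftest(diversion,   vcov. = vcovHC(diversion))
coeftest(suppression, vcov. = NeweyWest(suppression))
```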
targeted analysis. the targeted analysis focused on the association between media coverage of the mueller investigation and president trump's use of twitter to divert attention from that coverage. the analysis also asked whether that diversion, if it is triggered, might in turn suppress media coverage of the mueller investigation. we assumed that the president's tweets would divert attention from mueller to his preferred topics. we considered the three keywords "china," "jobs," and "immigration" as markers of those topics and explored all combinations of those words being used in trump's tweets. our choice of keywords was based on the following considerations. first, at the time of analysis, which predates the covid- crisis, the us unemployment rate was at its lowest in at least a decade ( . % in , monthly rates provided by the bureau of labour statistics averaged through june; https://www.bls.gov/webapps/legacy/cpsatab .htm), and president trump is routinely claiming credit for job creation . the president has also made china-related issues some of his main international policy topics (e.g., when it comes to international trade) , suggesting that this is also one of his focal areas of political activity. finally, controlling and curtailing immigration was central to trump's election campaign and continues to be a major policy plank. in further support of our choice of keywords, an analysis of donald trump's campaign promises in by independent fact-checker politifact revealed that the top promises were consonant with our topic keywords . early in his term, and during our sampling period, job growth in the us and withdrawing from trade agreements together with action on immigration were again identified as being among the three issues the president "got most right" . at the time of this writing, a website by the trump campaign (https://www.promiseskept.com/) that is recording the president's accomplishments lists "economy and jobs," "immigration," and "foreign policy" as the first three items. the "foreign policy" page, in turn, mentions china times (with another nine occurrences of "chinese"). the only other countries mentioned more than once were israel ( ), iran ( ), canada ( ), and japan ( ).

fig. : the center panel shows a conceptual model of potential strategic diversion by donald trump via twitter (where his handle is @realdonaldtrump). the word cloud on the left contains the most frequent words from all articles in the nyt that contained "russia" or "mueller" as keywords. the nyt articles captured by the word cloud contained a total of , unique words. we excluded "president" and "trump" because of their outlying high frequency. in addition to the words shown here, the top % of high-frequency items included terms such as "collusion," "impeachment," "conspiracy," and numerous names of actors relevant to the investigation, such as "mueller," "putin," "comey," "manafort," and so on. the word cloud on the right represents the most frequent words occurring in donald trump's tweets that were chosen, on the basis of keywords, to represent his preferred topics. the expected sign of the regression coefficient is shown next to each path and is identical for both approaches (ols and 3sls). the delay between media coverage and diversionary tweets is assumed to be shorter than the subsequent response of the media to the diversion. this reflects the relative sluggishness of the news cycle compared to donald trump's instant-response capability on twitter.

the word cloud on the right of fig.
summarizes the content of the diversionary tweets by showing the most frequent words used in the tweets (omitting function words and stop words). table summarizes the results for two different variants of a pair of independent ols linear regression models using those keywords (see "methods" for details). standard errors derived from conventional ols analyses are displayed in parentheses, whereas newey-west adjusted standard errors that accommodate potential autocorrelations are reported in brackets (see "methods" for a detailed discussion of the variables in the regression models). the supplementary information reports a full exploration of the autocorrelation structure of the data (tables s -s ). all analyses included the relevant lags to model autocorrelations. the first model predicted diversion, represented by the number of times the three diversionary keywords appeared in tweets, from adverse media coverage on the same day and is shown in the first three columns of the table. if the president's tweets about his preferred issues lead to diversion, then regression coefficients for the diversion model should be positive and statistically significant. the table shows that this was indeed observed for all media coverage (nyt, abc, and the combination of the two formed by averaging their standardized coverage). the magnitude of the associations is illustrated in the top row of panels in fig. . numerically, each additional abc news headline containing russia or mueller would have been associated with . additional mentions of one of our keywords in tweets (column in table ). the second model, which predicted media coverage as a function of the number of diversionary tweets on the previous day, is shown in the rightmost three columns in table . if the diversion was successful, then these regression coefficients are expected to be negative, indicating suppression of potentially harmful media coverage by the president's tweets. the table shows that threatening media coverage was negatively associated with diversionary tweets. the magnitude of the association is illustrated in the bottom row of panels in fig. . each additional keyword mention in tweets is associated with a decrease of nearly one-half of an occurrence of "russia" or "mueller" from the next day's nyt (column in table ). table provides an existence proof for the relationships of interest. the supplementary information reports an additional, more nuanced set of analyses for different combinations and subsets of the three critical keywords (tables s -s ). these analyses generally confirm the overall pattern in table . one problematic aspect of this initial analysis is that artifactual explanations for the pattern cannot be ruled out. in particular, although one interpretation of these estimates is consistent with our hypotheses of ( ) media coverage causing diversion and ( ) the diversion in turn suppressing media coverage, the available data do not permit an unequivocal interpretation. specifically, remaining endogeneity concerns (measurement error, reverse causality, and omitted variables) threaten a pure interpretation of these results as causal. we address each of those concerns in turn, and the associated conclusions suggest endogeneity would be unlikely to fully explain away our findings. first, measurement error is unlikely to explain the relationships we found given that we draw from the universe of all nyt articles, abc news segments, and trump tweets in our sample period. 
even if we were to miss some articles, news segments, or tweets (perhaps because our keywords did not fully catch all relevant articles or news segments), it is not clear how this would produce a systematic bias in one direction that could fundamentally influence our estimates. we support this judgment by displaying word clouds of all selected content (e.g., fig. ). if we systematically mismeasured the content of tweets and media coverage, the word clouds would reveal the error through intrusion of unexpected content or absence of content that would be expected to be present based on knowledge of the topic. similarly, there are reasons to believe that reverse causality cannot fully account for our results. for the first model (diversion; columns - in table ), it is less likely that trump's diversionary tweets could generate more news about the mueller investigation on the same day than the reverse, namely that more news generate more tweets. given the lag time of news reports, even in the digital age, there is limited opportunity for a tweet to generate coverage within the same -h period. moreover, even putting aside that timing constraint, there is no obvious mechanism that would explain why the media would systematically report more on mueller/russia because president trump tweeted on "china," "jobs," or "immigration"-we are not able to formulate a hypothesis why the media would respond in this manner. the reverse, however, motivated our analysis, namely the hypothesis that when trump is confronted with uncomfortable coverage, he diverts attention by tweeting about his preferred, unrelated topics. by contrast, we consider the hidden role of omitted variables to be the largest threat to causal identification. in the absence of controlled experimentation (or another empirical identification strategy suited to identify causality), one can never be certain that an effect is not caused by hidden omitted variables that interfere in the presumed causal path. this is an in-principle problem that no observational study can overcome with absolute certainty. it is, however, possible to test whether omitted variables are likely to explain the observed pattern. our first line of attack was to conduct a sensitivity analysis to obtain a robustness value for the diversion and suppression models involving average media coverage (columns and in table ) . the robustness value captures the minimum strength of association that any unobserved omitted variables must have with the variables in the model (predictor and outcome) to change the statistical conclusions. the details of the sensitivity analysis are reported in the supplementary information ( fig. s and table s ). the results further lend support to our hypothesis that adverse media coverage causes the president to engage in diversion, and that this diversion, in turn, causes the media to reduce that coverage, although endogeneity from potentially omitted variables remains less likely to be a concern for the diversion model than the suppression model (see fig. s and table s for detailed quantification). we additionally tackled the omitted-variable problem by fitting both models (diversion and suppression) simultaneously using 3sls (see "methods"). table reveals that the 3sls results replicated the overall pattern of the ols analysis, although the significance of the suppression is attenuated. a noteworthy aspect of our 3sls analysis is that it used two ways to model the temporal offset between tweets and subsequent, potentially suppressed, media coverage. in panel a in table , we used yesterday's tweets to predict today's coverage.
this parallels the ols analysis. in panel b, by contrast, suppression was modeled by relating today's tweets to tomorrow's coverage. the 3sls results further diminish the likelihood of an artifactual explanation: for omitted variables to explain the observed joint pattern of diversion and suppression, those confounders would have to simultaneously explain a positive association between two variables on the same day and a negative association from one day to the next across two different intervals-namely from yesterday to today as well as from today to tomorrow. moreover, those omitted variables would have to exert their effect in the presence of more than other control variables and a large number of lagged variables. we consider this possibility to be unlikely. finally, to further explore whether the observed pattern of diversion and suppression was a specific response to harmful coverage, we conducted a parallel analysis using brexit as a placebo topic. like the mueller inquiry, brexit was a prominent issue throughout most of the sampling period and not under president trump's control. unlike mueller, however, brexit was not potentially harmful to the president-on the contrary, british campaigners to leave the european union were linked to trump and his team . table shows the results of a model predicting diversionary tweets using the same three twitter keywords but nyt coverage of brexit as a predictor (using days of lagged variables as suggested by an analysis of autocorrelations). figure illustrates the content of the brexit coverage and confirms that the topic does not touch on issues that are likely to be politically harmful to the president. abc news did not report on brexit with sufficient frequency to permit analysis. neither of the coefficients involving the nyt is statistically significant, as one would expect for media coverage that is of no concern to the president. to provide a more formal contrast between brexit and russia-mueller, we combined the two models (brexit: column of table ; russia-mueller: column of table ) into a single system of equations for a seemingly unrelated regression (sur) analysis (see the sketch at the end of this section). within a sur framework, the consequences of constraining individual parameters can be jointly estimated for the two models. we found that forcing the coefficient for nyt russia-mueller coverage to be zero led to a significant loss of fit, χ ( ) = . , p = . , whereas setting the coefficient to zero for brexit coverage had no effect, χ ( ) = . , p > . . (forcing both coefficients to be equal entailed no significant loss of fit, owing to the imprecision with which the brexit coefficient was estimated, with a % confidence interval that spanned zero and was nearly five times wider than for russia-mueller.) considered as a whole, these targeted analyses suggest that, during the sampling period, president trump's tweets about his preferred topics diverted attention from inconvenient media coverage. that diversion, in turn, appears to be followed by suppression of the inconvenient coverage. because we have no experimental control over the data, this conclusion must be caveated by allowing for the possibility that the results instead reflect the operation of hidden variables. however, additional analyses designed to explore that possibility produced results that discount it, at least for the diversion model. we acknowledge that the status of the suppression effect is less robust in statistical terms. the expanded analysis further buttresses our conclusion by showing its generality and robustness.
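the sur comparison above can be sketched in r with the systemfit package; this is our choice of implementation (the paper says only that analyses were run in r and stata), the variable names are ours, and the restriction string follows systemfit's eqlabel_regressor coefficient-naming convention:

```r
library(systemfit)
library(lmtest)  # lrtest()

# hypothetical daily data d: nyt_mueller / nyt_brexit = article counts,
# tweets = diversionary keyword count; lags and controls abbreviated
eq_mueller <- tweets ~ nyt_mueller + lag1_tweets + date + date2
eq_brexit  <- tweets ~ nyt_brexit  + lag1_tweets + date + date2
eqs <- list(mueller = eq_mueller, brexit = eq_brexit)

unrestricted <- systemfit(eqs, method = "SUR", data = d)

# force the russia-mueller coverage coefficient to zero and test the loss of fit
restricted <- systemfit(eqs, method = "SUR", data = d,
                        restrict.matrix = "mueller_nyt_mueller = 0")
lrtest(restricted, unrestricted)
```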
expanded analysis. although the twitter keywords for the targeted analysis were chosen to reflect the president's preferred topics, the president may divert using other issues as well. the expanded analysis therefore considered all pairs of words in the president's twitter vocabulary (see "methods"). for each word pair, we modeled diversion as a function of russia-mueller media coverage and suppression of subsequent coverage, either using two independent ols models or using a single 3sls model.

table : three-stage least squares (3sls) models predicting diversionary tweets from threatening media coverage (russia-mueller; columns - ) and suppression from diversionary tweets (columns - ) simultaneously.

figure shows results from the expanded analysis for coverage of russia-mueller in the nyt (panels a-c), abc news (d-f), and the average of both (g-i). each data point represents a pair of words whose co-occurrence in tweets is predicted by media coverage (position along x-axis) and whose association with subsequent media coverage is also observed (position along y-axis). each point thus represents a diversion and a suppression regression simultaneously. the further to the right a point is located, the more frequently the corresponding word pair occurs on days with increasing russia-mueller coverage. the lower a point is located, the less russia-mueller is covered in the media on the following day as a function of increasing use of the corresponding word pair. if there is no association between the president's twitter vocabulary and surrounding media coverage, then all points should lie in the center and largely within the significance bounds represented by the red lines. if there is only diversion, then the point cloud should be shifted to the right. if there is diversion followed by suppression, then a notable share of the point cloud should fall into the bottom-right quadrant. figure shows that irrespective of how the data were analyzed (ols or two variants of 3sls; columns of panels from left to right), in each instance a notable share of the point cloud sits outside the significance bounds (red lines) in the bottom-right quadrant (summarized further in the supplementary table s ). these points represent word pairs that occur significantly more frequently in donald trump's tweets when russia-mueller coverage increases (i.e., they lie to the right of the vertical red line), and that are in turn followed by significantly reduced media coverage of russia-mueller on the following day (i.e., they lie below the horizontal red line). the results are remarkably similar across rows of panels, suggesting considerable synchronicity between the nyt (top row) and abc (center). the synchronicity is further highlighted in the bottom row of panels, which shows the data for the average of the standardized values of coverage in the nyt and abc. to provide a chance-alone comparison, the figure also shows the results for the same set of regressions when the twitter timeline is randomized for each word pair. this synthetic null distribution is represented by the gray contour lines in each panel (the red perimeters represent the % boundary). the contrast between the observed data and what would be expected from randomness alone is striking. to illustrate the linguistic aspects of the observed pattern, the word cloud in fig.
visualizes the words present in all the pairs of words in tweets that occurred significantly more often in response to russia/mueller coverage in nyt and abc news (average of standardized coverage) and that were associated with successful suppression of coverage the next day. the prominence of the keywords from our targeted analysis is immediately apparent. the supplementary information presents additional quantitative information about those tweets (table s ). we performed two control analyses involving placebo keywords to explore whether the observed pattern of diversion and suppression in fig. reflected a systematic interaction between the president and the media rather than an unknown artifact. the first control analysis involved nyt brexit coverage (abc coverage was too infrequent for analysis, with only a single mention during the sampling period). for this analysis, articles that contained "russia" or "mueller" were excluded to avoid contamination of the placebo coverage by the threatening topic. the pattern for brexit (fig. ) differs from the results for russia-mueller. although brexit coverage stimulates twitter activity by president trump (i.e., points to the right of the vertical red line), the word pairs that fall outside the significance boundary tend to be distributed across all four quadrants. the second control analysis examined other placebo topics using the ols approach, represented by the nyt keywords "skiing," "football," "economy," "vegetarian," and "gardening" (fig. ). the keywords were chosen to span a variety of unrelated domains and were assumed not to be harmful or threatening to the president. for this analysis, the corpus was again restricted to articles that contained neither "russia" nor "mueller" to guard against contamination of the coverage by a threatening topic. for most of these keywords, abc news headlines had zero or one mention only. the exceptions were "football" and "economy," which had and occurrences, respectively. we analyzed abc news for those two keywords and found the same pattern as for all placebo topics in the nyt.

fig. : results for the nyt (panels (a-c)); the center (d-f) and bottom (g-i) rows of panels show results for abc news and the average of both media outlets, respectively. the left column of panels (a, d, g) shows results from two independent ols models, the center column (b, e, h) is for a single 3sls model in which suppression is modeled by relating yesterday's tweets to today's coverage, and the right column (c, f, i) is a 3sls model in which suppression is modeled by relating today's tweets to tomorrow's coverage. in each panel, the axes show jittered t-values of the regression coefficients for diversion (x-axis) and suppression (y-axis). each point represents diversion and suppression for one pair of words in the twitter vocabulary. red vertical and horizontal lines denote significance thresholds (± . ). word pairs that are triggered by mueller coverage (p < . ) and affect subsequent coverage (p < . ) are plotted in red. the gray contour lines in each panel show the distribution of points obtained if the timeline of tweets is randomized (red perimeter represents % cutoff, see "methods"). the blue rugs represent univariate distributions.

across the keywords, less than . % of the word pairs ( out of for ols) fell into the
our analysis presents empirical evidence that is consistent with the hypothesis that president trump's use of social media leads to systematic diversion, which in turn may suppress media coverage that is potentially harmful to him. this association was observed after controlling for long-term trends (linear and quadratic), week-to-week fluctuations, and accounting for substantial levels of potential autocorrelations in the dependent variable. the pattern was observed when diversionary topics were chosen a priori to represent the president's preferred political issues, and it also emerged when all possible topics in the president's twitter vocabulary were considered. crucially, in our analysis the diversion and suppression were absent with placebo topics that present no political threat to the president, ranging from brexit to various neutral topics such as hobbies and food preferences. our data thus provide empirical support for anecdotal reports suggesting that the president may be employing diversion to escape scrutiny following harmful media coverage and, ideally, to reduce additional harmful media coverage , , . our evidence for diversion is strictly statistical and limited to two media outlets-albeit those commonly acknowledged to be agenda setting in american public discourse-and it is possible that other outlets might show a different pattern. it is also possible that the observed associations do not reflect causal relationships. these possibilities do not detract from the fact, however, that leading media organs in the us are intertwined with the president's public speech in an intriguing manner. we also cannot infer intentionality from these data. it remains unclear whether the president is aware of his strategic deployment of twitter or acts on the basis of intuition. it is notable in this context that a recent content analysis of donald trump's tweets identified substantial linguistic differences between factually correct and incorrect tweets, permitting out-of-sample classifications with % accuracy . the existence of linguistic markers for factually incorrect tweets makes it less likely that those tweets represent random errors and suggests that they may be crafted more systematically. in other contexts, deliberate deception has also been shown to affect language use . questions surrounding intentionality also arise with the recipients of the diversion. it is particularly notable that results consistent with suppression were observed for the new york times, whose coverage has responded strongly to accusations from the president that it spreads fake news, treating those accusations as a badge of honor for professional journalism . the nyt explicitly warns of the impact of trump's presidency on journalistic standards such as self-censorship, thus curtailing the president's interpretative power. these actions render it unlikely that the nyt would intentionally reduce its coverage of topics that are potentially harmful to the president. the fact that suppression nonetheless occurs implies that important editorial decisions may be influenced by contextual variables without the editors' intention-or indeed against their stated policies. this finding is not without precedent. other research has also linked media coverage to extraneous variables that are unlikely to have been explicitly considered by editors or journalists. 
for example, opinion articles about climate change in major american media have been found to be more likely to reflect the scientific consensus after particularly warm seasons, whereas "skeptical" opinions are more prevalent after cooler temperatures . it is worth drawing links between our results and the literature on the diversionary theory of war . although it is premature to claim consensual status for the notion that politicians launch wars to divert attention from domestic problems, recent work in history and political science has repeatedly shown an association between domestic indicators, such as poor economic performance or waning electoral prospects, and the use of military force [ ] [ ] [ ] [ ] . perhaps ironically, this association is particularly strong in democracies . against this background, the notion that the president's tweets divert attention from inconvenient coverage appears unsurprising. finally, we connect our analysis of diversion to other rhetorical techniques linked to president trump . each of these presents a rich avenue for further exploration. one related technique involves deflection, which differs from diversion by directly attacking the media (e.g., accusing them of spreading fake news) . another technique involves pre-emptive framing, by launching information with a new angle . an extension of pre-emptive framing may involve the notion of inoculation, which is the idea that by anticipating inconvenient information, the public may be more resilient to its impact . inoculation has been shown to be particularly useful in the context of protecting the public against misinformation , and it remains to be seen whether pre-emptive tweets by the president may affect the media or the public in their response to an unfolding story. the availability of social media has provided politicians with a powerful tool. our data suggest that, whether intentionally or not, president trump exploits this tool to divert the attention of mainstream media. further investigation is needed to understand if future us presidents or other global leaders will use the tool in a similar way. our findings have implications for journalistic practice. the american media have, for centuries, given much emphasis to the president's statements. this tradition is challenged by presidential diversions in bites of characters. how journalistic practice can be adapted to escape those diversions is one of the defining challenges to the media for the twenty-first century.

materials. the sampling period covered days, from donald trump's inauguration ( january ) through the end of his 2nd year in office (january , ). we sampled content items from three sources: ( ) all of donald trump's tweets from the @realdonaldtrump handle (tweets that only contained weblinks or were empty after filtering of special characters and punctuation were removed), ( ) all nyt articles, and ( ) all abc world news tonight items during the sampling period (table s ).

search keys for the targeted analysis. threatening media content. we used the search keys "russia or mueller" to identify media content that was potentially threatening to president trump. for each day in the sampling period, we counted how many news items in each of our media sources, nyt and abc, contained one or both of those keywords. diversionary twitter topics. we used the search keys "china," "jobs," and "immigration" to identify potentially diverting content in donald trump's tweets. for each day in the sampling period, we counted the number of times any of those keywords appeared in the tweets for that day.
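a minimal sketch of this daily counting step, assuming data frames tweets_df and nyt_df with date and text columns (all names ours); the paper does not say which string-matching tools were used, so stringr and dplyr here are assumptions:

```r
library(stringr)
library(dplyr)

keys_threat <- c("russia", "mueller")
keys_divert <- c("china", "jobs", "immigration")

# total occurrences of any keyword in each text, case-insensitive
count_keys <- function(text, keys) {
  Reduce(`+`, lapply(keys, function(k) str_count(tolower(text), fixed(k))))
}

# tweets: number of keyword occurrences per day
daily_tweets <- tweets_df %>%
  group_by(date) %>%
  summarise(tweets = sum(count_keys(text, keys_divert)))

# media: number of items per day containing at least one threatening keyword
daily_nyt <- nyt_df %>%
  group_by(date) %>%
  summarise(coverage = sum(count_keys(text, keys_threat) > 0))
```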
fig. : the top row of panels shows the results of the expanded analysis using two independent ols models for the placebo keywords "economy" (panel (a)), "football" (b), "gardening" (c), "skiing" (d), and "vegetarian" (e). in each panel, the axes show jittered t-values of the regression coefficients for diversion (x-axis) and suppression (y-axis). each point represents diversion and suppression for one pair of words in the twitter vocabulary. red vertical and horizontal lines denote significance thresholds (± . ). word pairs that are triggered by coverage of the corresponding keyword (p < . ) and affect subsequent coverage (p < . ) are plotted in red. the gray contour lines in each panel show the distribution of points obtained if the timeline of tweets is randomized (red perimeter represents % cutoff, see "methods"). the blue rugs represent univariate distributions. the word clouds accompanying each plot represent the most frequent words found in the nyt articles selected on the basis of the corresponding keyword.

keywords for the expanded analysis. the expanded analysis tested all possible pairs of words in donald trump's twitter vocabulary. the vocabulary comprised words that occurred at least times and were not stopwords. (the results are not materially affected if the occurrence threshold is lowered to .) stopwords (such as "the" or "are") are considered unimportant for text analysis. here we identified stopwords using the smart option for the r package tm. we also removed numbers and web links (urls). because the focus was on diversion, we also excluded "russia," "mueller," and "collusion," yielding a final vocabulary of n = (n = unique pairs). each pair was used as a set of keywords in a separate regression analysis. the average number of vocabulary items in a tweet was . (s = . ), and we therefore considered two vocabulary words to be sufficient to uniquely identify the topic of a tweet (a code sketch of this step appears below, after the ols model specification).

control variables. all regression models included at least control variables. a separate intercept was modeled for each of the weeks (n = ) during the sampling period to account for potential endogenous short-term fluctuations. the date (number of days since january , ) and the square of the date were entered as two additional control variables to account for long-term trends. in addition, lagged variables were included as suggested by a detailed examination of the underlying autocorrelation structures.

autocorrelation structure. we established the autocorrelation structure for each of the variables under consideration by regressing the observations on day t onto the control variables and a varying number of lagged observations of the dependent variable from days t − 1, t − 2, …, t − k. supplementary tables s -s report the analyses, which identified the maximum lag (k) for each variable. for tweets, the suggested lag was k = , and for nyt, abc, and the average it was k = , k = , and k = , respectively. all regressions (ols and 3sls; see below) included the appropriate k lagged variables.

ols regression models. the ordinary least-squares (ols) analyses involved two independently estimated regression models. the first model examined whether donald trump's tweets divert the media from threatening media coverage. given the president's ability to respond nearly instantaneously to media coverage, we considered media coverage and tweets on the same day. that is, we regressed the number of diverting keywords in tweets posted on day t on the number of threatening news items also on day t.
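returning to the vocabulary construction described above, a sketch in r: the smart stopword list via the tm package is taken from the text, while the tokenization rule, the occurrence threshold value (not legible in this copy), and all object names are ours:

```r
library(tm)  # stopwords("SMART")

# hypothetical corpus: tweets_df$text, one tweet per row; tokenizing on
# non-letters also drops the numbers and url fragments the paper removed
tokens <- unlist(strsplit(tolower(tweets_df$text), "[^a-z']+"))
tokens <- tokens[tokens != ""]
freq <- table(tokens)

min_count <- 50  # placeholder: the paper's threshold is not legible here
vocab <- names(freq)[freq >= min_count]
vocab <- setdiff(vocab, stopwords("SMART"))                  # drop stopwords
vocab <- setdiff(vocab, c("russia", "mueller", "collusion")) # keep focus on diversion

pairs <- combn(vocab, 2)  # one column per unique word pair
ncol(pairs)               # number of pairwise regressions to run
```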
the second model examined whether donald trump's diverting tweets affected media coverage relating to the potentially threatening topics. this analysis used the number of diverting keywords in the tweets on day t − 1 as a predictor for threatening coverage on day t. the second model thus assumed a lag between tweets (on day t − 1) and an association between those tweets and media coverage (day t) to reflect the delay inherent in the news cycle. this model again also included media coverage on day t − 1 to capture the inherent momentum of a topic. all analyses were conducted in r and stata, either using a robust heteroskedasticity-consistent estimation of the covariance matrix of the coefficient estimates (i.e., function vcovhc in the sandwich library with the hc option) or using newey-west standard errors that are designed to deal with autocorrelated time-series data (the neweywest option for function coeftest in r). for a subset of the targeted analysis, the results were reproduced in stata as a cross-check. all statistical results reported in all tables here and in the supplementary information are based on two-tailed t-tests for the coefficients and are not adjusted for multiple tests. because the dependent variables were event counts, a negative binomial regression is often considered to be the most appropriate model for analysis. we report a negative binomial analysis of our main results in the supplementary information (table s ). however, negative binomial regressions are known to suffer from frequent convergence problems, and the present analysis was no exception, with the suppression model for abc failing to converge. in addition, the suppression model for average media coverage cannot be analyzed by a negative binomial model because the dependent variable (the average of standardized coverage for each media outlet) does not involve counted data. for those reasons, we report the ols results.

3sls regression models. given that we expected a tight coupling between diversion and suppression, we applied a three-stage-least-squares (3sls) regression approach to analyze the coupled system of equations . the 3sls model jointly predicted trump's distracting tweets today as a function of adverse coverage on the same day, and mueller/russia news today as a function of those tweets yesterday. because the two components of the 3sls analysis (diversion and suppression) span different days, we explored another approach in which the suppression was estimated as tomorrow's media coverage as a function of today's tweets. the 3sls models were fit using stata's reg3 command (called from r via the rstata package).

noise-only distribution for expanded analysis. to create a distribution of t-values for the coefficients of the expanded analysis when only noise was present, the expanded analysis was repeated for each word pair with the twitter timeline randomized anew. randomization disrupts any potential relationship between media coverage and the president's tweets. the density of the bivariate distribution of the t-values associated with the regression coefficients was estimated via a two-dimensional gaussian kernel using the kde2d function in the r package mass. the distribution from this randomized analysis should be centered on the origin and be confined largely within the bivariate significance bounds. the contours of the estimated densities are shown in figs. , , and as gray lines, with a perimeter line (in red) representing the th percentile.
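the noise-only distribution can be sketched as follows, reusing the pairs matrix from the earlier sketch; fit_pair() and tweet_days are hypothetical placeholders (a helper that fits the diversion and suppression models for one word pair and returns their two t-values, and a per-day list of tweet texts), while kde2d is the mass function named in the methods:

```r
library(MASS)  # kde2d()

# fit_pair(pair, tweets, coverage) -> c(t_diversion, t_suppression)
# is a hypothetical helper wrapping the two regressions described above

null_t <- apply(pairs, 2, function(pair) {
  perm <- sample(seq_along(tweet_days))  # randomize the twitter timeline anew
  fit_pair(pair, tweets = tweet_days[perm], coverage = d$coverage)
})

# bivariate gaussian kernel density of the noise-only t-values
dens <- kde2d(null_t[1, ], null_t[2, ], n = 100)
contour(dens)  # the gray contours shown in the figures
```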
reporting summary. further information on research design is available in the nature research reporting summary linked to this article. a reporting summary is available as a supplementary information file. all data is available at https://osf.io/f bqx/.
all source code for analysis is available at https://osf.io/f bqx/.
key: cord- -rqerh u authors: patel, v.; haunschild, r.; bornmann, l.; garas, g. title: a call for governments to pause twitter censorship: a cross-sectional study using twitter data as social-spatial sensors of covid- /sars-cov- research diffusion date: - - journal: nan doi: . / . . .
sha: doc_id: cord_uid: rqerh u
objectives: to determine whether twitter data can be used as social-spatial sensors to show how research on covid- /sars-cov- diffuses through the population to reach the people that are especially affected by the disease. design: cross-sectional bibliometric analysis conducted between rd march and th april . setting: three sources of data were used in the analysis: ( ) deaths per number of population for covid- /sars-cov- retrieved from the coronavirus resource center at johns hopkins university and worldometer, ( ) publications related to covid- /sars-cov- retrieved from the who covid- database of global publications, and ( ) tweets of these publications retrieved from altmetric.com and twitter. main outcome(s) and measure(s): to map twitter activity against the number of publications and deaths per number of population worldwide and in the usa states, and to determine the relationship between the number of tweets as dependent variable and deaths per number of population and number of publications as independent variables. results: deaths per one hundred thousand population for countries ranged from to , and deaths per one million population for usa states ranged from to . the total number of publications used in the analysis was , and the total number of tweets used in the analysis was , . mapping of worldwide data illustrated that high twitter activity was related to high numbers of covid- /sars-cov- deaths, with tweets inversely weighted with the number of publications. poisson regression models of worldwide data showed a positive correlation between the national deaths per number of population and tweets when holding the country's number of publications constant (coefficient . , s.e. . , p< . ). conversely, this relationship was negatively correlated in usa states (coefficient - . , s.e. . , p< . ). conclusions: this study shows that twitter can play a crucial role in the rapid research response during the covid- /sars-cov- global pandemic, especially to spread research with prompt public scrutiny. governments are urged to pause censorship of social media platforms during these unprecedented times to support the scientific community's fight against covid- /sars-cov- .
what is already known on this topic:
• twitter is progressively being used by researchers to share information and knowledge transfer.
• tweets can be used as 'social sensors', which is the concept of transforming a physical sensor in the real world through social media analysis.
• previous studies have shown that social sensors can provide insight into major social and physical events.
what this study adds:
• using twitter data as social-spatial sensors, we demonstrated that twitter activity was significantly positively correlated to the numbers of covid- /sars-cov- deaths, when holding the country's number of publications constant.
• twitter can play a crucial role in the rapid research response during the covid- /sars-cov- global pandemic.
keywords: altmetrics, twitter, spatial maps, covid- /sars-cov- .
twitter is a social network created in , which brings together hundreds of millions of users around its minimalist concept of microblogging, allowing users to post and interact with messages known as 'tweets'.( ) twitter has short delays in reflecting what its users perceive, and its principle of "following" users without obligatory reciprocity, together with a very open application programming interface, makes it an ideal medium for the study of online behaviour.( ) tweets can be used as 'social sensors', which is the concept of transforming a physical sensor in the real world through social media analysis. tweets can be regarded as sensory information and twitter users as sensors. studies have demonstrated that tweets analysed as social sensors can provide insight into major social and physical events like earthquakes ( ), sporting events ( ), celebrity deaths ( ), and presidential elections.( ) twitter data contain location information which can be converted into geo-coordinates and be spatially mapped. in this way tweets can be used as social-spatial sensors to demonstrate how research diffuses within a population.( ) researchers are increasingly using twitter as a communication platform, and tweets often contain citations to scientific papers.( ) twitter citations can form part of a rapid dialogue between users which may express and transmit academic impact and support traditional citation analysis. twitter citations are defined 'as direct or indirect links from a tweet to a peer-reviewed scholarly article online' ( , ), and reflect a broader discussion crossing traditional disciplinary boundaries, as well as representing 'attention, popularity or visibility' rather than influence.( )
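as an illustration of the geo-coordinate conversion mentioned above, the following hedged sketch geocodes a user's free-text profile location with the geopy library; the choice of the nominatim service, the user-agent string, and the input field are assumptions, not the authors' pipeline.

```python
# Turn a tweet's user-profile location string into (latitude, longitude).
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="research-diffusion-demo")

def tweet_to_coordinates(user_location):
    """Geocode a free-text location such as 'Boston, MA'; None if not found."""
    place = geolocator.geocode(user_location)
    if place is None:
        return None
    return (place.latitude, place.longitude)

print(tweet_to_coordinates("Boston, MA"))  # e.g. (42.36, -71.06)
```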
we use twitter data as social-spatial sensors to demonstrate how research on covid- /sars-cov- diffuses through the population and to investigate whether research reaches the people that are especially affected by the disease. we used three sources of data in this study: ( ) deaths per number of population for covid- /sars-cov- , ( ) publications related to covid- /sars-cov- , and ( ) tweets of these publications. all data was retrieved and analysed between rd march and th april . we used deaths per number of population as a measure of the severity of the outbreak of the virus in countries and usa states. we used deaths per one hundred thousand population for country-specific data, which was retrieved from the coronavirus resource center at johns hopkins university.( ) we used deaths per one million population for us state-specific data, which was retrieved from worldometer, a provider of global covid- statistics. the deaths per one hundred thousand population for countries ranged from (ethiopia) to (san marino). the deaths per one million population for usa states ranged from (wyoming) to (new york). the total number of publications that were used in the analysis was , and the total number of tweets that were used in the analysis was , (see supplementary material). one of the problems with twitter data in the context of this study is that twitter activity is generally high where more research is done (e.g., western europe or the boston region in figure ). since this is not the activity which we intended to measure, we inversely weighted the tweets with the number of publications. we did not only use the twitter data as social-spatial sensors to investigate global trends, but also within a single country. we calculated poisson regression models with deaths per number of population and number of publications as independent variables and number of tweets as dependent variable. table reports the results (notes: *** p < . ). the results are based on usa states (out of ) since only usa states with at least one tweet were considered. the percentage changes in expected counts in table point out that deaths per number of population and twitter activities are negatively correlated: for a standard deviation increase in the deaths per number of population of a usa state, the expected number of tweets in that state decreases by . %, holding the usa state's number of publications constant. the results in table further show that the influence of the number of publications is significantly greater than that of the deaths per number of population (and positive). in the usa states, there is a strong dependency of twitter data on the number of publications.
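a hedged sketch of such a poisson model in python's statsmodels (the authors used stata); the file name and column names are illustrative assumptions.

```python
# Poisson regression: tweets ~ deaths per population + number of publications.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

states = pd.read_csv("usa_states.csv")  # assumed columns, one row per state
model = smf.poisson("tweets ~ deaths_per_million + n_pubs", data=states).fit()
print(model.summary())

# Percentage change in the expected tweet count for a one-standard-deviation
# increase in deaths per population, holding publications constant.
b = model.params["deaths_per_million"]
sd = states["deaths_per_million"].std()
print(100 * (np.exp(b * sd) - 1))
```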
figure s demonstrates that at the time of the analysis the usa was an outlier because of lower national deaths per number of population and higher numbers of publications and tweets, when compared to other countries that were significantly impacted by covid- /sars-cov- (e.g., uk, france, spain, and italy). social media can be an effective tool for broadcasting research both within and beyond the academic community.( ) twitter is one of the best social media platforms for sharing scientific research and knowledge because it allows users to post links to recent publications. our study suggests that governments should consider relaxing censorship of social media at times of global crisis, such as the covid- /sars-cov- pandemic. moreover, allowing researchers greater access to platforms such as twitter during a global pandemic can aid the scientific community's fight against misinformation and pseudoscience.( ) the usa appears to be an outlier in the worldwide data, and the country-specific data shows that the usa has a different relationship between tweets and deaths, both of which may be due to the pandemic reaching the usa later than most other countries in the northern hemisphere. before concluding, it is important to consider the limitations of this study. we have analysed tweets mentioning publications in a quantitative manner which does not account for the association of the tweet with the publication (i.e., a tweet may reference a valid study but claim it to be 'fake news' or have another negative overtone). we have not performed any thematic analysis of the tweets in terms of their content (e.g., are tweets referring to testing for covid- /sars-cov- , therapies, or vaccines), quality (e.g., whether tweets are referring to randomised controlled trials or letters), or who tweeted these (e.g., individual researchers, members of the public, universities or pharmaceutical industries). moreover, no distinction was made between tweets and retweets (of original tweets), which raises the question whether a different handling of retweets could yield different results. these questions might be an interesting topic for further research. despite these limitations, our study has a number of strengths. we have used an evidence-based and robust methodology (see supplementary material) to clean and analyse data, as well as extracting data from several well-established databases containing real-world evidence updated in real time.( - ) our study comes at a very critical point in time, when a rapid research response is vital to develop therapies and vaccines to slow the covid- /sars-cov- pandemic and lessen the damage caused by the disease. our study utilising twitter data as social-spatial sensors can serve as proof-of-concept for future studies on twitter and the evolving pandemic.
covid- /sars-cov- began as a cluster of cases of pneumonia in wuhan, hubei province, but the outbreak quickly progressed from a pheic to a pandemic, which highlights the dynamic process of the spread of an infectious disease.( , ) our study has simply investigated a snapshot of the relationship between this pandemic, research outputs, and twitter activity, and demonstrates how social media platforms can be crucial to spread research with rapid scrutiny, which may also impede the degree of misinformation. we urge governments to pause censorship of social media platforms such as twitter during these unprecedented times to support the scientific community's battle against covid- /sars-cov- .
acknowledgements: meta-data for publications were downloaded via the dimensions api. twitter data were retrieved from the altmetric.com api. tweets with their location information were retrieved from the twitter api. the authors thank rodrigo costas (cwts) and stacy konkiel (altmetric.com) for helpful discussions regarding the analysis of location information of twitter users. the lead author affirms that the manuscript is an honest, accurate, and transparent account of the study being reported. no important aspects of the study have been omitted, and any discrepancies from the study as planned have been explained. the full data set and the statistical code can be obtained, upon request, from the corresponding author.
references:
• a social network analysis of twitter: mapping the digital humanities community
• earthquake shakes twitter users: real-time event detection by social sensors
• twitter as social sensor: dynamics and structure in major sporting events. alife
• twitterstand: news in tweets. proceedings of the th acm sigspatial international conference on advances in geographic information systems
• tweet the debates: understanding community annotation of uncollected sources
• are papers addressing certain diseases perceived where these diseases are prevalent? the proposal of twitter data to be used as social-spatial sensors
• how and why scholars cite on twitter
• can alternative indicators overcome language biases in citation counts? a comparison of spanish and uk research
• statement on the second meeting of the international health regulations ( ) emergency committee regarding the outbreak of novel coronavirus ( -ncov). world health organization
• who director-general's opening remarks at the media briefing on covid-
• worldometer's covid- data: worldometer
• global research on coronavirus disease (covid- ): world health organization
• statacorp. stata statistical software: release
• working with spmap and maps
• guide to creating maps with stata
• limited dependent variable models and probabilistic prediction in informetrics
• modelling count data
• regression models for categorical dependent variables using stata
• detecting influenza epidemics using search engine query data
• twitter as a tool for health research: a systematic review
• social impact in social media: a new method to evaluate the social impact of research
• peer review: trial by twitter
• tweeting biomedicine: an analysis of tweets and citations in the biomedical literature
• academic information on twitter: a user survey
• chinese and iranian scientific publications: fast growth and poor ethics
• pseudoscience and covid- -we've had enough already. nature
• crew b, jia h. leading research institutions in the nature index annual tables: nature index
key: cord- -plw dukq authors: chire saire, j. e.; oblitas, j. title: covid surveillance in peru on april using text mining date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: plw dukq
the present outbreak caused by the coronavirus covid has generated a big impact across the world. south american countries had their own limitations and challenges, and the pandemic has highlighted what needs to improve. peru got off to a good start with quarantine and social distancing policies, but the policies were not enough as the weeks went by. therefore, an analysis of april is performed through infoveillance, using posts from different cities to analyze what the population was experiencing or worried about during this month. the results show a high concern about the international context and the national situation; besides, economy and politics are issues to solve. by contrast, religion and transport are not very important for peruvian citizens. public health vigilance is the practice of public health agencies that collect, manage, analyze and interpret data systematically and continuously, and spread such data to programs facilitating measures in public health [ ]. in this field, many ways of public health analysis appear, among them infodemiology, an emerging area of research studying the relationship between information technology and consumer health, as well as the tools of infometrics and web analysis whose final objective is to inform and collaborate with public health and public policies [ ]. along with this, it is necessary to detect disease outbreaks in advance in order to reduce their impact on the population. the supposed advantage of information provided by automated systems falls short in the face of the impossibility of accessing data in real time, as well as inter-operational fragmented systems, which lead to longer data transfer and processing [ ]. this kind of technology has been used for diseases such as whooping cough [ ], flu [ ], and immunosuppressive diseases [ ], among others. currently, we are facing coronavirus disease (covid- ), a highly pathogenic viral infection caused by sars-cov- that is already causing global health concern [ ]. officially declared a global pandemic by the world health organization (who) on march , , the covid- outbreak has evolved at an unprecedented rate [ ].
to support public health monitoring and better decision-making, twitter has proved to be an important source of health-related information on the internet, due to the volume of information shared by citizens and official sources. twitter provides researchers an information source on public health, in real time and globally; thus, it could be very important for public health research [ ]. within the context of covid , users from all over the world may use it to quickly identify the main thoughts, attitudes, feelings and matters on their minds regarding this pandemic. this may help policy makers, health professionals and the public in general to identify the main problems that concern everybody and deal with them more properly [ ]. this research is aimed at identifying the main topics published by twitter users related to the covid pandemic. analyzing that information may help policy makers and healthcare organizations to assess the needs of interest groups and to deal with them properly. the remainder of the paper is organized as follows: section presents related works regarding the retrieval of infectious disease information from social media. in section , the data collection methodology for extracting relevant information on covid- from twitter is presented. section describes experimental findings and a discussion related to the analysis. finally, conclusions and future work are described in section . surveillance aims to observe what happens in one population, region or city in order to support policy decisions. one good advantage is cost and time, because surveys usually involve two components, collection and processing, and both can take many days, even months. sinnenberg [ ] performs a study about twitter as a tool for research on public health; it is worth highlighting that researchers usually rely on traditional databases for their studies, while twitter can provide useful data directly from people. of the papers in the review, the main research fields were public health ( ) and infectious disease ( ). breland [ ] explains that in social media people create content, exchange information and use the tool for communication, and lists five benefits of its use: a) disseminate research in the public health field, b) fight against misinformation, c) influence policies, d) aid public health research, and e) enhance
professional development. yepes [ ] supports this affirmation: twitter is a source of useful data for surveillance, considering relevant terms and geographical locations. more applications using twitter and natural language processing can be found: monitoring h n flu [ ], dengue in brazil [ ], covid symptomatology in colombia [ ], covid infoveillance in south american countries [ ] and monitoring mexico city [ ]. finally, ear [ ] found that peruvian internal agencies have overlapping functions, which can limit collaboration, and that there is not enough technical capacity and resources outside the capital, lima. besides, cultural diversity and geographical issues can present challenges to fighting a disease infection. therefore, the use of an infoveillance tool based on text mining can provide support to the government and the creation of public policies.
the process to analyze the situation in peru follows the next steps:
• select the relevant terms related to the covid pandemic
• set the parameters to collect related posts
• pre-processing
• visualization
the scope of the analysis is peru and its regions, so, considering news about covid- , the relevant terms were selected. the collection process uses the twitter search function, with the next parameters:
• date: - - to - -
• terms: the chosen words mentioned in the previous subsection
• geolocalization: the capital of every state of peru
this step is very important to take relevant words, and it is the source used to create the graphics that help in understanding the country. the pre-processing consists of the following steps (a sketch is given below):
• uppercase to lowercase
• eliminate alphanumeric symbols
• eliminate words with size less than or equal to the chosen threshold
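a minimal python sketch of these pre-processing steps, assuming spanish-language tweet text; the regular expressions and the word-length threshold are illustrative assumptions, not the authors' exact rules.

```python
# Lowercase, strip links and non-letter symbols, and drop very short words.
import re

def preprocess(tweet_text, min_len=3):  # threshold is an assumption
    text = tweet_text.lower()                     # uppercase -> lowercase
    text = re.sub(r"http\S+", " ", text)          # remove web links
    text = re.sub(r"[^a-záéíóúñü\s]", " ", text)  # remove alphanumeric symbols
    return [w for w in text.split() if len(w) > min_len]

print(preprocess("Cuarentena en Lima! https://t.co/abc123 #COVID"))
# ['cuarentena', 'lima', 'covid']
```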
the next graphics present the results of the experiments and answer some questions to understand the phenomenon of the pandemic over the peruvian population. helping the visualisation from monday to sunday during the last two weeks, a cloud of words is presented in fig. . analyzing lima (fig. ), the one hundred most frequent terms are related to cases of coronavirus. extracting a value according to the data analysis, it can be seen that the regions including words related to "entertainment" issues, such as lambayeque, la libertad, piura and loreto, are those which have the highest level of contagion in peru. this issue is related to the social and cultural differences of the northern coast of the country. a similar case occurs when searching words related to "religious" issues, where regions such as cajamarca, cuzco and huanuco include them due to their traditions, common in the zone (see fig. ). since the current situation has made the population "abandon" some customs and adopt other new ones, for many people it was hard to adopt these recommended measures at first, so they have had to look for them by using social networks. another point to consider is that, besides the information on covid , international information related to the situation in other countries is present in every region. this kind of information is followed by domestic issues from the national situation. publications referring to regional or local issues are scarcely present (see fig. ). this may be because, even though health is an aspect that causes society concern regarding prevention, information coming from the national government is preferred. a similar scenario is present in all the regions and mass media. the social network explored is useful to provide data for an exploratory analysis, to know what concerns citizens may have and to map the issues per city so public policies can be more efficient and localized. peruvian citizens have a high concern related to covid , the international context and the economy, and politics from the national context, with minor worries about religion and transport.
references:
• digital disease detection -harnessing the web for public health surveillance
• infodemiology: tracking flu-related searches on the web for syndromic surveillance
• infodemiology for syndromic surveillance of dengue and typhoid fever in the philippines
• monitoring public interest toward pertussis outbreaks: an extensive google trends-based analysis
• infodemiology and infoveillance: tracking online health information and cyberbehavior for public health
• disease monitoring and health campaign evaluation using google search activities for hiv and aids, stroke, colorectal cancer, and marijuana use in canada: a retrospective observational study
• novel coronavirus disease (covid- ): a pandemic (epidemiology, pathogenesis and potential therapeutics)
• covid- , sars and mers: a neurological perspective
• using twitter for public health surveillance from monitoring and prediction to public response
• top concerns of tweeters during the covid- pandemic: infoveillance study
• twitter as a tool for health research: a systematic review
• social media as a tool to increase the impact of public health research
• investigating public health surveillance using twitter
• pandemics in the age of twitter: content analysis of tweets during the h n outbreak
• building intelligent indicators to detect dengue epidemics in brazil using social networks
• what is the people posting about symptoms related to coronavirus in bogota, colombia
• infoveillance based on social sensors to analyze the impact of covid in south american population
• text mining approach to analyze coronavirus impact: mexico city as case of study
• towards effective emerging infectious diseases surveillance: evidence from kenya, peru, thailand, and the u.s.-mexico
key: cord- - pd mamv authors: shisode, parth title: using twitter to analyze political polarization during national crises date: - - journal: nan doi: nan sha: doc_id: cord_uid: pd mamv
democrats and republicans have seemed to grow apart in the past three decades. since the united states as we know it today is undeniably bipartisan, this phenomenon would not appear as a surprise to most. however, there are triggers which can cause spikes in disagreements between democrats and republicans at a higher rate than how the two parties have been growing apart gradually over time. this study has analyzed the idea that national events which are generally detrimental to all individuals can be one of those triggers. by testing polarization before and after three events (hurricane sandy [ ], n. korea missile test surge [ ], covid- [ ]) using twitter data, we show that a measurable spike in polarization occurs between the democrat and republican parties. in order to measure polarization, sentiments of twitter users aligned with the democrat and republican parties are compared on identical entities (events, people, locations, etc.). using hundreds of thousands of data samples, a . % increase in polarization was measured during times of crisis compared to times where no crises were occurring. regardless of the reasoning that the gap between political parties can increase so much during times of suffering and stress, it is definitely alarming to see that, among other aspects of life, the partisan gap worsens during detrimental national events. currently, the covid- pandemic is posing a danger to the welfare of virtually every single nation. however, as noted in bol et al. [ ], one thing generally measured to improve is a population's relationship with its government, given that it is a democracy.
in the context of covid- , they also found a general trend of increased approval for the current government-affiliated political party, as well as government authorities in general. it should be stated that, on average, the majority of a population also approves of a government's response to the covid- epidemic. however, according to the pew research center [ ], this has not been the case for the united states, where % of americans say president trump was too slow in the initial national covid- response. attitudes regarding restrictions differ significantly between political parties: % of dem./dem.-leaning americans are concerned about public restrictions being lifted too quickly, while the same figure is % for rep./rep.-leaning americans. the purpose of this research study is to explore the degree to which levels of bipartisanship in the us have changed during times of crisis, beyond the background change which has already been occurring. our hypothesis is that during times of crisis, levels of partisanship in the united states will increase, with both democrats and republicans becoming more polarized on average in the messages they send via twitter. partisanship is defined in this study using sentiment differences around a singular, concrete set of entities; if the perception of a single set of facts still leads to different interpretations and opinions, then bias and polarization are responsible for those differing sentiments. as to how polarization and partisanship will be measured, twitter has proved to be a resourceful tool for scraping public sentiment on entities such as popular organizations, trending celebrities or political figures, geographic regions, and even twitter handles (ringsquandl et al., [ ]). the crises tested will be non-acute, in that they did not occur during one singular day or moment: hurricane sandy, the north korean missile tests, and lastly, the covid- pandemic. non-acute events were chosen because it would be more likely to see bias develop over time, once information regarding the event is widespread over a considerable time length. additionally, these events were chosen due to the fact that the twitter data surrounding these time periods was accessible relative to other events. prior research has shown that there is often a uniform sense of grief after trauma and tragedy, typically resulting in solidarity throughout nations (kropf et al. [ ]). it is interesting that we didn't see this reaction from the american public during the u.s. government's response to the covid- pandemic, given how democrats and republicans still disagree on government-mandated restrictions due to covid- [ ]. we classify the political perspective of a twitter user by comparing the number of democrat political figureheads with the number of republican political figureheads that they follow. from demszky et al. [ ], we have received a "follower list" of democrat and republican figureheads, used to classify the political perspective of a tweet based on its user. using corenlp, entities were then extracted from tweets into a dataset alongside entity metadata, including corresponding sentiments from the users. then, polarity for every individual entity with sentiments from both democrats and republicans is calculated by analyzing the difference in sentiment between the two political parties. the polarity across all entities as a single value is then calculated by taking a weighted average of the polarities for every individual entity.
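a minimal sketch of this follower-based party assignment; the set-based inputs and function name are illustrative assumptions rather than the study's code.

```python
# Assign a party label by which set of political figureheads a user follows more.
def assign_party(followed_ids, dem_figureheads, rep_figureheads):
    """followed_ids and the two figurehead collections are sets of account ids."""
    d = len(followed_ids & dem_figureheads)
    r = len(followed_ids & rep_figureheads)
    if d > r:
        return "democrat"
    if r > d:
        return "republican"
    return None  # tie (including 0-0): the tweet is excluded from the analysis
```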
as a result, it is possible to examine how a single set of events and facts can be seen through a different lens based on political party. this process is further touched upon in the "methodology" section of this study. after analysis of the results, it was determined that polarization during a national detrimental event can be greater than the polarization between political parties that is already expected to occur. it was also very interesting to note that for the majority of measured sentiments, republicans tended to have a lower sentiment on average than democrats. the "baseline" tweets were taken for a period of weeks before the event started, while the "crisis" tweets were taken for a period of weeks after the event started; the goal here was to provide an equal timeframe to account for the fact that events may take time in order for the effect of changing sentiment to take place.
related work. the hypothesis of the "partisanship perceptual screen" plays a critical role in how this study chooses to structure its methodology, as well as its overall goal. according to mary mcgrath, a perceptual screen is capable of "causing adherents of opposing parties to perceive different information from the same set of facts" (mcgrath [ ]). this has been based on the research of gerber and huber, analyzing the effect of the partisanship perceptual screen on real-world economic behavior (gerber et al., [ ]). these papers served to aid in understanding how to analyze the set of tweets and define partisanship; in this study, only specific entities with measurable sentiment from both the democrat and republican parties are considered. this study is also examining the effect of "unity after tragedy", a social phenomenon where during or after a tragic event, the victims and those surrounding victims are able to display social solidarity (hawdon, j., & ryan, j. [ ], sweet [ ]). the hypothesis surrounding our study states that this phenomenon may not occur in the united states between members of the democrat and republican parties. therefore, this phenomenon is undergoing validation through this study. according to the pew research center [ ], a further gap in political ideology occurred between democrats and republicans once donald trump initiated his response to the covid- pandemic. this raises the question: does the "unity after tragedy" phenomenon still occur when the differences between individuals are in political values? a research study with a similar purpose was that of demszky et al. [ ], analyzing political polarization regarding mass shootings, with data being fed from twitter. while that study takes a more complex approach to understanding the intricacies of how democrats and republicans react differently to a significant tragic event, its goals are aligned with those of this paper. while that paper chooses to examine topic choice, framing, affect, and illocutionary force, this study focuses almost solely on identifying whether a partisan perceptual screen continues to exist and affect how americans view the same set of facts surrounding tragic incidents. additionally, the way demszky et al. choose to calculate partisanship (derived from the leave-out estimator in gentzkow et al., forthcoming [ ]) has a large influence on the calculation method in this study. partisanship is measured here based on existing values of polarization calculated for sets of entities which contain measurable sentiment from both political parties.
additionally, partisan assignment is performed using the same method. a list of twitter handles of democrat and republican figureheads was supplied by demszky et al., along with a list of the ids of every follower the political figureheads have. analyzing whether an individual follows more democrat or republican political figureheads led to a corresponding party assignment.
methodology. the hypothesis of this study is that partisanship will increase in times of crises. a sub-hypothesis is that the difference in partisanship between parties will be even greater than background differences due to an ever-growing gap in political values between democrats and republicans, according to a study by the pew research center. the definition of partisanship in this study is based on the partisan "perceptual screen", the idea that surrounding a single entity or set of facts are multiple sets of opinions, as a result of different interpretations based on different political parties (mcgrath [ ]). as a result, the definition of partisanship in this study will be a measurement of disagreement based around the differing sentiment on concrete entities. the definition of crisis in this study is based on a definition from igi global: "a situation or time at which a nation faces intense difficulty, uncertainty, danger or serious threat to people and national systems and organizations and a need for non-routine rules and procedures emerge accompanied with urgency". the crises in this study, as listed, will be hurricane sandy, the north korean missile test surge, and the covid- pandemic. these crises were chosen because they were not "acute tragedies", but rather long-term events; the response to hurricane sandy lasted multiple months, north korea had several tests over the span of many days, and covid- has been infecting individuals from march to the present, over many months. for this research study, it is necessary for every tweet to hold a political party assignment of democrat, republican, or neither. this is crucial in understanding the perspective and background that come alongside every sentiment of a tweet. an assumption being made is that a person has a binary party affiliation, not taking into consideration that a person could identify as a moderate or party-leaning individual. in order to assign a party, we simply reference the accounts they are following and, based on whether a user follows more democrat or republican politicians or figureheads, we are able to assign a class. this technique is like that of volkova et al. [ ] and demszky et al. ( ), with the latter stating that this method of party assignment "takes advantage of homophily in the following behavior of users on twitter" (halberstam and knight, [ ]). additionally, it is assumed that external national political events during the period being studied would not skew partisanship significantly. examples of these would be state or national elections as well as any national presidential debates. the timeframes here are only in respect to the events tested (hurricane sandy, n.k. missile testing, covid- ) and do not take any other events in the same time period into account. lastly, it is assumed that, in the statistical definition, the automated sentiment classifier being used (stanford corenlp) is not biased and will be able to yield accurate results.
this sentiment classifier has been previously trained on movie reviews and is able to classify between "very negative", "negative", "neutral", "positive", and "very positive". there is an assumption that this model will be unbiased in this study, where it is required to predict sentiment on text in a political context. it is also assumed that the sentiment of an entire sentence applies to every entity located in the sentence; this is how sentiments are assigned to entities. an example of an entity instance would be "donald trump", or additional instances of the previously mentioned entity types. the overarching goal of this study is to assess the polarization surrounding entities from the democrat and republican political parties; single entities with opposing sentiment between the two political parties represent partisanship. the input data itself is structured as a tweet object, a dictionary-style data structure which contains information about the user's account, the text of the tweet, and the time posted. in order to transform these objects into a data form that can be analyzed to create a single polarization value, the tweet objects are converted into csv files which contain a collection of entities alongside their associated democrat and republican sentiments. in order to perform this data transformation, the following steps are taken; they are completed separately for all tweets with a "democrat" standpoint and all tweets with a "republican" standpoint. corenlp is a linguistic annotator able to provide information on a text regarding sentiment, named entities, dependency, and parts of speech, as well as other aspects not utilized within this study. the input required for this step is the tweet's text, while the result is a dictionary-style object with information on the aspects of the text we would like to examine or could possibly reference in the future (sentiment, ner, dependency, pos for this study) for every entity instance. it should be noted that past deleted tweets were present in the dataset alongside current tweets; deleted tweets were not annotated or used in the study at all. from the information regarding the names of figurehead twitter handles and respective follower ids provided by demszky et al., we are able to assign a political party to each tweet based on the user that posted it. for each id, we count the number of democrat figureheads whose follower lists it appears in, yielding a value d. this process is replicated for the republican figureheads, yielding r. if d = r (including d = r = 0), then the tweet will not be used, as only tweets with a partisan bias are useful in this study. after each named entity e is identified inside the tweet's text, a new datum is created. it should be noted that, in order to use the most relevant input data, entities tagged by corenlp as the following types were discarded and not used to create data: ("email", "date", "number", "percent", "time", "money", "url"). there is an assumption that entity types that are events, locations, organizations, and people are more likely to create any meaningful measurement of polarization. for the listed tags of entities not used, it does not seem plausible that these types of entities create any meaningful difference of opinion between democrats and republicans.
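a hedged sketch of this annotation step using the pycorenlp wrapper against a locally running corenlp server; the server url, annotator list, and the use of the sentence-level sentiment value for each entity mention are assumptions consistent with the description above.

```python
# Extract (entity, sentence-sentiment) pairs from one tweet via CoreNLP.
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP("http://localhost:9000")  # assumes a local server
KEEP = {"PERSON", "LOCATION", "ORGANIZATION"}   # drop email/date/number/etc.

def entities_with_sentiment(tweet_text):
    ann = nlp.annotate(tweet_text, properties={
        "annotators": "tokenize,ssplit,pos,ner,parse,sentiment",
        "outputFormat": "json",
    })
    pairs = []
    for sentence in ann["sentences"]:
        sentiment = int(sentence["sentimentValue"])  # 0 (very neg) .. 4 (very pos)
        for mention in sentence.get("entitymentions", []):
            if mention["ner"] in KEEP:
                pairs.append((mention["text"].lower(), sentiment))
    return pairs
```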
a single datum is then created which contains the following elements: the name of the entity, the user id of the entity's original tweet, the entity sentiment (calculated using the general sentiment of the entity's original sentence), and the associated political party of the tweet. this is repeated for every entity derived from the source of tweets. each datum is then compiled as a single row of one csv file, with the primary key being the entity name. entities that appear in multiple tweets will inevitably appear in multiple rows of the csv containing the data of the entities compiled from the input of tweets. having entities in multiple rows does not allow for calculation of average sentiment, which is why this csv has to be transformed into a new dictionary format. the keys are the names of each entity, while the values are a list of two elements (average sentiment s, number of mentions n from the original csv). in order to ensure that entity names that differ only because of capitalization (e.g., "Donald Trump" vs "donald trump") are considered, all entity names are made lowercase. if two entities from the csv are found to be identical, the dictionary value for this entity is recalculated to take the average sentiment s of both entities, and the number of mentions n is updated to represent the number of repeats. an entity mention would be a version of an entity instance that has been repeated; for example, "donald trump" found in several different tweets. essentially, the multiple entity mentions are being reduced to entity instances as a result of this step. once the two dictionaries, one for each political party, are compiled containing the average sentiments on entities and the number of times each appeared throughout the data input, we can then calculate polarization using the formula
p_e = |s_d − s_r| / 4
this calculates the amount of polarization p_e across a single entity e. s_d represents the average sentiment across an entity from users identified as democrats, while s_r represents the counterpart for users identified as republicans. in order to normalize the value of polarization to be between 0 and 1, the absolute difference of s_d and s_r is divided by 4. the scale for sentiment analysis used by corenlp ranges from 0 (very negative) to 4 (very positive), allowing 5 possible options for the sentiment value. in order to calculate polarization across all entities for the time frame of an event, we can utilize
p = Σ_e (n_d + n_r) · p_e / Σ_e (n_d + n_r)
the value p will be utilized later to measure the presence of partisanship. it is calculated as a weighted average of the p_e values, weighted by the total number of mentions each entity has had, summing n_d and n_r.
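a minimal python sketch of these two formulas, assuming the dictionary layout described above (entity mapped to its average sentiment and mention count); names are illustrative.

```python
# Weighted-average polarization over entities seen by both parties.
def polarization(dem, rep):
    """dem/rep: {entity: (avg_sentiment_0_to_4, n_mentions)} -> P in [0, 1]."""
    num = den = 0.0
    for entity in dem.keys() & rep.keys():  # sentiment from both parties only
        s_d, n_d = dem[entity]
        s_r, n_r = rep[entity]
        p_e = abs(s_d - s_r) / 4.0           # per-entity polarization
        num += (n_d + n_r) * p_e             # weight by total mentions
        den += n_d + n_r
    return num / den if den else 0.0
```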
table : democrat/republican baseline and crisis sentiment and polarization per crisis. it can be observed from the table that in all three events, polarization increased from the baseline time period to the crisis time period. this seems to be strong evidence that in times of crisis, the average polarization between the two parties increases. on average between the three events, the polarization between baseline and crisis increased by . %, from . % to . %. it is possible that the slight increase in baseline polarization from to represents the ever-growing gap between the democrat and republican parties. additionally, it is interesting to note that on average, republicans held a slightly more negative sentiment than democrats when referring to entities for the baseline and crisis periods of all events. the single exception to this is the average democrat and republican sentiments during hurricane sandy. an unexpected outcome from this data was the slightly lower sentiment in regard to the n. korea missile test surge. in terms of pure consequence and resources lost as a result of a detrimental event, the north korea missile test surge is significantly lower than that of hurricane sandy and covid- . in the u.s., hurricane sandy caused destruction of infrastructure, a $ billion loss, and deaths. the covid- pandemic has caused , deaths at the time of the writing of this paper, and multiple trillions of dollars of loss for the u.s. government and citizens. meanwhile, the n. korea missile test surge hasn't had nearly as significant an effect on the american population. yet, the overall sentiment is lowest for this time period. during the course of this study, there were limitations which may have an effect on the data collection, leading to potential slight inaccuracies in the polarity value calculations. when entities were being collected, alongside their respective sentiments, they were simply selected if the corenlp tool was able to classify them as a location, organization, or person. however, there was no method implemented to ensure that each entity was relevant to the political world or political ideologies. to account for this, there is an assumption that all entities mentioned in a tweet are being chosen based on the viewpoint of a political party. for example, the entity keywords "trump" and "bernie" are politically relevant terms and are likely to garner opposing sentiment. however, the entity "arnold schwarzenegger" may lead to differences in sentiment that may not be directly apparent. the assumption, for example, would be that arnold schwarzenegger is a democrat, or that another confounding variable leads him to receive a higher sentiment from users labeled as republican. due to the fact that, in terms of percentage points, the polarity between democrats and republicans here appears to be less than that listed by the pew research center, it may be possible that individuals are less likely to express a political bias on social media. compared to the data collection method the pew research center employs (individually completed surveys), there simply is not the same amount of opportunity to express one's political opinion through bias. for example, adults are specifically asked about issues such as gun control legislation while directly indicating their political preference. this directly leads to data that specifically exists for the purpose of understanding differences in political preference. twitter data, however, does not fulfill this purpose, due to the fact that it is a social media platform. in the future, it would be beneficial to examine the differences in sentiment within the same party. in this study, two data sets of entity sentiments were used, with one set coming from democrat users and the other from republican users. however, it would be very valuable to assess the validity of this study by replicating it using two data sets from the same political party. if the difference in polarities from the study compared to the altered study is statistically significant, then this study's methodology will need to be revised in order to account for this. additionally, as mentioned as a limitation, it would be useful to include an algorithm that is capable of detecting the amount of relevance of an entity to the topic which is being assessed.
for example, in this study the topic would be the democrat and republican parties, as well as any entities that have significance in the political world. understanding how the sentiment surrounding a single entity changes over time is also valuable. analyzing how long an event affects the population's general sentiment is a good future research question. filtering entities more heavily by popularity could ultimately lead to more accurate estimates; because of low data availability during this study, this aspect was not emphasized. however, with more data, testing only entities with a high number of mentions from both parties could be useful. when taking a look at the entities, it will be useful to analyze what the study would look like if only a 'person' entity were recognized, and whether the result truly differs from the data obtained during this study. although this study was conducted with the democrat and republican parties based in the united states, the results should be applicable to other nations that follow a multi-party political culture. applying this study and analyzing how a foreign population's polarization changes after an event in another country may further validate the results of this study and show that they can be applied on a broader scale. in addition, citizens of other countries may not necessarily use twitter and may use other social media. even in the united states, it would be very useful to analyze data from different sources such as a political forum, which may even aid in making the data more politically relevant. changing the country or the social media source of this study's data may change the way polarization occurs, because of a potential change in user demographics such as region and average age. even in the united states, for example, understanding user demographics would be important: on average, more rural and older citizens tend to align with the republican party, while younger voters in urban areas tend to align with the democrat party.

conclusion

after collecting data both before and after the start of a detrimental national event, it has been deduced that such an event can increase polarization between political parties, such as the democrat and republican parties, beyond any background polarization which has already been occurring. by the scale used in this study, polarization increased at a rate of . % on average between political parties during time periods of national crisis. by utilizing twitter data of users that identify as democrats and republicans and analyzing their sentiment on common entities, polarization can be analyzed: if the sentiment towards common entities differs significantly between democrats and republicans, a high polarization exists. political parties for individuals are deduced in this study based on whether they follow more democrat or republican politicians/figureheads on twitter. polarization data was collected on democrat and republican users for the following three events (hurricane sandy [ ] , n. korea missile test surge [ ], covid- [ ]). sentiment/polarization data was calculated using the stanford corenlp tool, accessed through python. because this data was taken through twitter, the political relevance of the tested entities could be improved in future studies, where an alternative could be a politically based social media platform or forum.
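the party-assignment rule summarised in this conclusion (label a user by whichever party's politicians and figureheads they follow more) reduces to a simple majority count. a minimal sketch, assuming curated id sets of democrat and republican accounts are available (the paper does not publish its lists):

    def infer_party(followed_ids, dem_accounts, rep_accounts):
        """label a user 'D' or 'R' by which party's figures they follow
        more; returns None on a tie (such users carry no signal)."""
        d = len(followed_ids & dem_accounts)
        r = len(followed_ids & rep_accounts)
        if d == r:
            return None
        return "D" if d > r else "R"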
significant future work could also include a replication of this study on multi-party systems outside of the united states. if the results still hold true, then this study's results would be validated for applicability outside the united states.

acknowledgments

this paper was written under the mentorship of gabor angeli. i would like to thank him for his helpful guidance during the research process. additionally, i would like to thank dora demszky for aid in handling twitter data for this project.

references

the effect of covid- lockdowns on political support: some good news for democracy
most americans say trump was too slow in initial response to coronavirus threat (pew research center)
analyzing political sentiment on twitter
when public tragedies happen: community practice approaches in grief, loss, and recovery
analyzing economic behavior and the partisan perceptual screen
partisanship and economic behavior: do partisan differences in economic forecasts predict real economic behavior
from individual to community: the "framing" of - and the display of social solidarity
the effect of a natural disaster on social cohesion: a longitudinal study
americans' growing partisan divide: key findings (pew research center)
measuring polarization in high-dimensional data: method and application to congressional speech
inferring user political preferences from streaming communications
homophily, group size, and the diffusion of political information in social networks: evidence from twitter

key: cord- -i i clgq
authors: salik, jonathan r.
title: from cynic to advocate: the use of twitter in cardiology
date: - -
journal: j am coll cardiol
doi: . /j.jacc. . .
sha: doc_id: cord_uid: i i clgq

while the majority of americans continue to use social media for personal communication, individuals have increasingly begun to utilize social media as a primary source of news. as of , more than two-thirds of americans ( %) report that they access news on social media, and % state that they do so "often" ( ) . though facebook remains the dominant social media platform globally, twitter has gained particular traction within the medical and scientific community. in large part, this may be attributable to twitter's "microblog" format, which limits posts to characters instead of the free-text formats found on other platforms such as facebook and instagram. twitter's streamlined and rapid interface is thus uniquely suited to stimulate academic discussion and promote the circulation of ideas and information. in addition, twitter users can affix hashtags to their tweets, allowing posts to be collated and grouped. (table fragment: only partial examples of tweet types survive, including a single tweet that contains a high-yield piece of medical information or "pearl"; a tweet that provides real-time updates during an academic conference or meeting; and a fragment mentioning a faculty mentor who will be presenting at the conference.) one true positive is that everyone has an opportunity to speak out and voice an opinion. and social media ("some") has provided much needed social contact and a bit of humor to ease our stress and relieve combat fatigue. today, some is no longer an optional tool for cardiologists. it is an essential resource. in addition to being a vital tool for teaching, research, mentoring, and advocacy, it also serves as a lifeline for providing the best and most timely care to our patients.
by being directly connected to the global cardiovascular community, some makes us better physicians, team members, leaders, and activists for our patients and colleagues.

references

reviewing social media use by clinicians
news use across social media platforms
improving the safety of pci: tribulations, trials, transfusions, and twitter. presented at: cardiology grand rounds
the kardashian index of cardiologists: celebrities or experts?
how do researchers use social media and scholarly collaboration networks (scns)? nature.com, of schemes and memes blog
social media and emergency preparedness in response to novel coronavirus
can tweets predict citations? metrics of social impact based on twitter and correlation with traditional metrics of scientific impact
more than likes and tweets: creating social media portfolios for academic promotion and tenure
the kardashian index: a measure of discrepant social media profile for scientists

key: cord- -unoiwi g
authors: yu, jingyuan; lu, yanqin; muñoz-justicia, juan
title: analyzing spanish news frames on twitter during covid- —a network study of el país and el mundo
date: - -
journal: int j environ res public health
doi: . /ijerph
sha: doc_id: cord_uid: unoiwi g

while covid- is becoming one of the most severe public health crises of the twenty-first century, media coverage of this pandemic is more important than ever to keep people informed. drawing on data scraped from twitter, this study aims to analyze and compare the news updates of two main spanish newspapers, el país and el mundo, during the pandemic. through an automatic process combining topic modeling and network analysis methods, this study identifies eight news frames for each newspaper's twitter account. furthermore, the whole pandemic development process is split into three periods—the pre-crisis period, the lockdown period and the recovery period—and the networks of the computed frames are visualized for these three segments. this paper contributes to the understanding of how spanish news media cover public health crises on social media platforms.

as covid- became a global health crisis, it was declared a pandemic by the world health organization (who, geneva, switzerland) on march [ ]. three days later, spain being one of the most infected countries, spanish prime minister pedro sánchez declared a state of alarm. this was the second time that spain had declared a national lockdown, so the influence of the pandemic on spain is substantial. as the situation of the pandemic became stable, the spanish government announced on may a -step plan for the transition to a new normality (plan para la transición hacia una nueva normalidad), signaling that the pandemic was gradually coming under control. news media are important information sources for the public during epidemic crises [ ] , serving as interactive community bulletin boards, as well as global or regional monitors [ ] . with the prevalence of social media, news media organizations have been using these emerging tools to reach and engage broader audiences during crises [ ] . twitter, being one of the most popular social media, has attracted a great number of traditional newspapers to digitalize real-time core information within characters. while newspaper articles tend to use conflict, responsibility, consequence and savior frames in the coverage of epidemics, their twitter accounts often post real-time updates, scientific evidence and actions [ ] .
the tones adopted in the two kinds of news are also different, with newspaper articles using more alarming and reassuring tones and twitter updates using more neutral tones [ ] . scholars have been using network analysis techniques to study news content. for example, guo [ ] proposed a network agenda setting model (nas) to analyze the salience of the network relationships among objects and/or attributes. inspired by this method, this study conducts network analysis on the twitter posts, analyzing and comparing the news frames of the two most important general-interest and nationally-circulated spanish newspapers (el país and el mundo) during different stages of the covid- crisis. the two selected newspapers are considered different regarding their political stance [ ] , with el país representing the political center-left media and el mundo seen as a political center-right media outlet [ , ] . discussing the two media outlets allows us to better explore their particular news focus in light of their divergent political ideologies, thus illustrating a more comprehensive landscape of spanish news coverage on the pandemic. moreover, as this study focuses on the analysis of their twitter content, it is worth noting that, compared with other newspapers, el país and el mundo have the largest numbers of online followers, reflecting their substantial influence online. two research gaps are filled in this paper. from the empirical perspective, despite the fact that the two spanish newspapers have been widely studied during past epidemic crises [ ] [ ] [ ] , their news posts on twitter deserve more investigation in communication research. from the methodological perspective, a manual coding process is generally applied in most network news agenda and news frame studies [ , ] . to enhance efficiency and minimize the biases involved in manual coding, this study combines an unsupervised machine learning technique and a network visualization method to make a fully automatic network study, which is a major methodological contribution to the news frame literature. framing is an important research focus in communication studies because how an issue is reported in news can influence how it is understood by audiences [ ] . entman [ ] defined framing as "to select some aspects of a perceived reality and make them more salient in a communication text, in such a way as to promote a particular problem definition, causal interpretation, moral evaluation, and/or treatment recommendation for the item described" (p. ). frames in news media coverage can affect the topical focus and evaluative implications perceived by the audience, as well as their subsequent decision making about public policy [ ] . news frames about health issues and diseases have been found to affect audiences' understanding of health problems and their attitudes and behaviors [ , ] . regarding the ongoing covid- pandemic, the severity of the virus and preventive actions should be communicated to the public effectively. in this case, news media play an important role in enhancing the public's understanding of the highly contagious disease, as well as in influencing the attitudinal and behavioral response on prevention, containment, treatment and recovery [ ] . empirical studies about news frames have been conducted during past epidemic crises.
for example, lee and basnyat [ ] focused on the news articles of the singaporean straits times during the h n pandemic and identified nine dominant frames via manual coding: basic information, preventive information, treatment information, medical research, social context, economic context, political context, personal stories and other (open-ended). their study revealed that the news coverage focused more on h n information updates and prevention than on other frames. in another of their articles [ ] , four additional news themes were found: imported disease, war/battle metaphors, social responsibility and lockdown policy. shih, wijaya and brossard [ ] focused on news coverage about mad cow disease, west nile virus and avian flu from the new york times by examining six frames: consequence, uncertainty, action, reassurance, conflict, and new evidence. the results of their study revealed that the newspaper emphasized the consequence and action frames consistently across diseases, but media concerns and journalists' narrative considerations regarding epidemics did change across different phases of development and across diseases. according to the association for media research (asociación para la investigación de medios de comunicación, http://reporting.aimc.es/index.html#/main/diarios), el país and el mundo are the two most read general-interest newspapers in spain in the first quarter of . comparative studies about these two newspapers have been conducted in various contexts. for example, baumgartner and chaqués-bonafont [ ] found that there are important news coverage differences between these two newspapers when they make explicit reference to individual political parties. regarding negative news about corruption, el país tends to mention the right-wing political party, while el mundo mentions the left-wing political party more often. the comparison between these newspapers in their news coverage about cannabis has also shown significant differences: el país focused more on news about marijuana legalization, while el mundo focused more on police and crime news on drug consumption [ ] . during the ebola outbreak, ballester and villafranca [ ] studied the two newspapers together by comparing their news coverage of ebola with that of other rare diseases. the word "terror" appears more frequently in ebola-related news, generating a higher level of anxiety toward ebola than toward other diseases. catalan-matamoros et al. [ ] studied the visual contents of the two newspapers, and the authors draw two main conclusions. first, the "conflict" frame dominates the visual coverage of the two newspapers, which conveyed alarming messages to the audience. second, they found the total amount of visual content increased rapidly in the first two days of the crisis and decreased from the fifth day. in sum, the authors described the first two days as the "high risk phase" of the epidemic outbreak and the period from the fifth day onward as the "less severe phase." regarding the ongoing covid- crisis, researchers have found a significantly larger amount of coronavirus news in the spanish state-of-alarm phase than in the pre-alarm period, and the total number of relevant news items reported by el mundo is much higher than that of el país [ ] . thanks to the ease of information exchange on social media platforms, masip et al. [ ] indicate that spanish citizens are more informed during the coronavirus crisis than before. in this case, an in-depth analysis of social media news is warranted.
latent dirichlet allocation (lda) is frequently used to extract latent topics from large-scale textual data and has also been widely applied in social media studies [ ] [ ] [ ] . according to the developers of this technique, "lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document" [ ] (p. ). previous research has suggested that lda is an appropriate method to study news media coverage [ ] . for example, heidenreich, lind, eberl and boomgaarden [ ] used this method to identify frames from european refugee crisis news across five countries. regarding covid- related studies, poirier et al. [ ] applied lda to identify six news frames (chinese outbreak, economic crisis, health crisis, helping canadians, social impact, western deterioration) from canadian media sources. in addition, network analysis methods have been widely adopted in communication studies. for example, regarding mad cow disease, lim, berry and lee [ ] visualized the core word networks of four groups (bureaucrats, citizens, scientists and interest groups) across four policy stages based on newspaper articles. they found the four groups focused on different policy issues and the news coverage did change over the different stages. this study demonstrated that semantic network analysis is a powerful method for understanding issue framing in the policy process. fu and zhang [ ] used word co-occurrence networks to study ngos' hiv/aids discourse on social media and websites. their study revealed overlapping themes about hiv/aids across social media and websites, and showed that ngos use social media to engage with the government, as well as with other health care resources. kang et al. [ ] examined vaccine sentiment on twitter by constructing and analyzing semantic networks of related information and found that the semantic network of positive vaccine sentiment has a greater cohesiveness than the less-connected network of negative vaccine sentiment. this study sheds light on discovering online information with a combination of natural language processing and network methods. on the other hand, bail [ ] conceptualized a method combining natural language processing and network analysis to examine how advocacy organizations stimulate conversation on social media. the author's idea is to convert the content of different documents into bags-of-words and then find the similarities (edges) between the documents by word co-occurrence. this method has been further developed as a visualization tool to display a text network at the group-word level [ , ] . in our case, each of the computed news frames (latent topics) is considered as a group of its relevant content, represented as nodes on the network, and the edges between the frames are visualized according to the co-occurrence of the content and weighted by term frequency-inverse document frequency (tf-idf). to be clearer, tf-idf is a numerical statistic measuring how relevant a word is to a document in a corpus [ ] ; it has been widely applied in text mining research, including the abovementioned work by bail [ ] .
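for reference, one textbook variant of the tf-idf weight of a term t in a document d, within a corpus of N documents of which df(t) contain t, is

    \text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}

(the paper does not spell out which variant it uses). similarly, the lda step can be sketched in code. the authors fit the model with the r package "topicmodels" and gibbs sampling, as described in the next section; the python sketch below uses gensim, whose default inference is variational rather than gibbs, so it is an analogous illustration rather than a reproduction:

    from gensim import corpora, models

    def fit_lda(docs, num_topics=8):
        """fit lda on tokenized tweets.

        docs: list of token lists (already cleaned, lowercased, lemmatized).
        returns the top seven words per topic, mirroring the
        frame-keyword tables in the paper.
        """
        dictionary = corpora.Dictionary(docs)
        corpus = [dictionary.doc2bow(doc) for doc in docs]
        lda = models.LdaModel(corpus, num_topics=num_topics,
                              id2word=dictionary, passes=10, random_state=1)
        return [lda.show_topic(k, topn=7) for k in range(num_topics)]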
our data are hydrated from the open access institutional and news media tweet dataset for covid- social science research [ ] , which includes the twitter posts of the two selected spanish newspapers from the end of february. the first step is data cleaning, in which all the retweets are removed. then we deleted all the attached external website addresses, hashtags (#hashtags), mentions (@mentions), emojis, arabic numerals and stopwords (e.g., prepositions, pronouns, etc.), because such information is considered less meaningful in computational text analysis [ ] . in addition, all capital letters were converted to lower case (to standardize all the words) and we normalized the text with lemmatization (which refers to grouping together the inflected forms of a word) before the data were ready for the lda model analyses. using the lda function of the r package "topicmodels" [ ] , we computed eight topics for each newspaper's twitter posts. the number of topics was chosen because too few topics make news frames less specific and too many topics make the network less interpretable [ ] . in order to make the performance of the topic model more efficient, we used the gibbs sampling method [ , ] , one of the most widely used statistical sampling techniques for probabilistic models and short-text classification [ ] [ ] [ ] [ ] . after having obtained the computed topics (news frames), we re-assigned each of the news tweets to its frame, producing a new dataset with the tweets of each newspaper categorized by news frame. as the news focus regarding epidemics did change across different phases of a pandemic's development [ ] , and following the work of pan and meng, who adopted a three-stage model to analyze news frames during a previous pandemic [ ] , we split each dataset into three periods. the pre-crisis period includes tweets before march, when the spanish national lockdown was announced; this is the period in which pandemic information was being reported but no official alarm had been raised by the spanish government. the lockdown period includes tweets between march and may, the period in which the spanish government adopted a strict national confinement. the recovery period includes the tweets from may (the day when spain stepped into the first stage of social recovery) to june (the last day of data collection). finally, a network of relationships between news frames was generated from their word co-occurrence matrix for each newspaper during each time period; therefore, a total of six networks are constructed. for the el país dataset, a total of , tweets were collected from february to june . after removing retweets, , original tweets were saved for our in-depth analysis. eight news frames have been successfully computed: "livelihood" (family life and children), "public health professional" (news about the department of public health), "pandemic update" (contagion and death toll), "madrid" (news about madrid), "politics" (general political news), "state of alarm" (spanish government and pm's announcements and policy updates), "economy" (the effect of the pandemic on the spanish economy) and "covid information" (general information about the pandemic). table presents the details of the eight news frames of el país with their top seven relevant words. figure presents the news frame network for the three segmented periods.
each node represents a news frame, and the size of a node indicates the strength of the node, also known as weighted node degree: the sum of the edge weights of the edges adjacent to each node [ ] , reflecting the importance of a node in a weighted network. the edges between the nodes represent the connection strength between two frames (normalized by tf-idf); each edge weight is the sum of the tf-idf values of the co-occurring words. table presents the detailed information about the news frames in each of the three periods, with the node strength, the number of tweets in each news frame and their proportion of the total number in each segment. table presents the most weighted edges in the three time segments, identifying the news frames with the strongest similarity ties. overall, "livelihood," "public health professional," "pandemic update" and "politics" are the most important news frames of el país. as the crisis gradually came under control, "pandemic update" became less prominent in the recovery period. "livelihood" is the most prominent news frame of el país, and it shows a strong connection with "politics," "economy" and "public health professional" in the pre-crisis stage, suggesting a close connection with government policy and the economic situation. in the next two periods, it started to have a more significant relation with "madrid." this is understandable because the spanish capital suffered the most during the covid- pandemic. under the policy in force at the time, the community of madrid was one of the last regions to step into the recovery plan [ ] , and this can also explain why the proportion of "madrid" increased across the three time segments. in addition, we indeed observed a news framing change in the different stages of the pandemic outbreak. for example, the "politics" frame is less reported in the second period, while the "state of alarm" and "covid information" frames received more attention during this stage. it is worth noting that although both frames have connections with others, no connections are observed between these two during the three periods, suggesting they are independent from each other. "state of alarm" is a policy-oriented news frame, while "covid information" focuses more on general sanitary information. as the crisis was gradually controlled, the pandemic-related news frames ("pandemic update," "state of alarm," "public health professional" and "covid information") became less prominent in the recovery period. media interest in general political news ("politics") decreased during the most difficult time but soon recovered as the crisis situation became stable. regarding the network, the "politics" frame has the strongest connection with "livelihood" during all three periods. it also has a significant relation with "public health professional" (weight: . ) and "economy" (weight: . ) during the pre-crisis period, but the two connections developed along different trends. while "politics" and "public health professional" remained connected in the other two periods, the connection between "politics" and "economy" turned less significant; instead, the "politics" frame became more connected with "state of alarm" and "madrid." for the el mundo dataset, a total of , tweets were collected from february to june . after removing retweets, , original tweets were saved for our in-depth analysis. eight news frames are computed, six of which are considered the same as el país.
they are "madrid," "state of alarm," "covid information," "economy," "pandemic update," "politics." the two unique el mundo frames are "lockdown" (news about the confinement) and "hospital" (news related to hospital, doctor and patient). table presents the news frames with their most relevant keywords. figure presents the network of the three segmented periods, table provides the detailed information of the news frames across time and table presents the detailed information of the most weighted edges. generally speaking, "madrid," "state of alarm" and "lockdown" are the three most prominent news frames during the pre-crisis period, along with the crisis becoming more severe, "covid information" is paid more attention by the newspaper. and finally these four frames are the most prominent news frames during the recovery period. table provides the detailed information of the news frames across time and table presents the detailed information of the most weighted edges. generally speaking, "madrid," "state of alarm" and "lockdown" are the three most prominent news frames during the pre-crisis period, along with the crisis becoming more severe, "covid information" is paid more attention by the newspaper. and finally these four frames are the most prominent news frames during the recovery period. pre-crisis period lockdown period recovery period "madrid" is the most prominent news frame of el mundo of all the time. the proportion of this topic is greatly changed from the second period to the third. as we have explained in the previous section, madrid is the last region that stepped into recovery plan, so this change is understandable. the "madrid" frame has the strongest connection strength with "lockdown" and "state of alarm" during the first two periods and the association between "madrid" and "covid information" becomes more and more eye-catching during the last two periods. the second most important news frame is "state of alarm," it has been paid less attention during the lockdown period but still, shared a significant proportion of the total news coverage. the "state of alarm" frame has the highest connection strength with "madrid" and "lockdown," similar to "madrid," the relation between "state of alarm" and "covid information" is becoming stronger during the second and third time segments (weight in the nd period: . , in the rd period: . ). as spain started to get recovered from the strict national lockdown, the proportion of the relevant news frames "lockdown" and "hospital" decrease during the recovery period but the connection between these two topics have been strengthened in this stage. as the "lockdown" frame is highly associated with "madrid" and "state of alarm," we assume this frame is strongly policy orientated. on the other hand, the "hospital" frame includes both health and social news, so it is naturally associated the most with "madrid" and "lockdown." regarding the "economy" frame, the proportion of this topic arrived its peak at the second period. it is significantly different from the frame "politics," which has been less adopted during the same period. both of them have strong ties with "madrid" and "state of alarm" but no significant connections have been exposed between these two frames. given that the frames "covid information" and "pandemic update" have almost no proportion changes during the three time periods, these two news frames are considered as stable news frames, tweets about "pandemic update" is slightly fewer than "covid information." 
regarding the network, like many other el mundo news frames, both of them have their strongest connection with "madrid," and the tie between these two frames gets more and more meaningful over time. significant differences are observed between el país (ep) and el mundo (em) in the frames used in their twitter news posts. first, the most prominent news frames of the two spanish newspapers are different: while ep focused on "livelihood," em tended to adopt the "madrid" frame most frequently. despite the fact that "madrid" is also a frame in the ep dataset, it is considered a peripheral news frame there. both frames have the strongest connections with other topics in their networks, so these two frames can be seen as the motor themes of their newspapers on twitter. second, both newspapers have two unique news frames. while the ep news coverage on twitter features "livelihood" and "public health professional," we observed the "lockdown" and "hospital" frames in the em twitter posts. the "livelihood" frame is somewhat similar to "hospital," because both news frames contain social and living attributes. nevertheless, their connection strengths with the other common frames are different. while "livelihood" associates the most with "politics" and "public health professional" in the ep networks, "hospital" associates the most with "madrid" and "lockdown" in the em networks. a possible interpretation of this difference is that "livelihood" is linked to government policy (including relevant government departments) while "hospital" is more linked to news about specific regions. also, ep shows higher attention to the ministry of health and the professional perspective by adopting the "public health professional" frame, while em focuses more on the effect of confinement from a social perspective with the "lockdown" frame. third, although six common news frames are identified in the twitter posts of both newspapers, the longitudinal changes in their proportions over time are different. for example, "economy"-related news tweets became increasingly scarce over time in the ep dataset, but for em, such information was posted most during the second time period (the lockdown period). another significant example can be seen in the "politics" frame: the ep twitter account posted more politics-related news during the recovery period than during the lockdown period, but for em, the increase over the same periods is not as salient as for ep. "state of alarm" is the second most important news frame for em on twitter, but this frame is not as prominent in ep twitter posts. although the most relevant keywords of this frame are almost the same in the two datasets, its connections differ across the networks. during the first two periods, "state of alarm" is most associated with "lockdown" and "madrid" in the em network, while it is mostly linked to "livelihood" and "public health professional" in the ep network. during the recovery period, the link between "state of alarm" and "politics" is strengthened in the ep network, while the connection between "state of alarm" and "covid information" is more eye-catching in the em network. this finding implies that, as the pandemic crisis came under control, twitter posts about "state of alarm" became more related to political news on ep but more closely connected to health news in the em twitter coverage.
this study analyzed and compared the frames of twitter news posts in the two most important spanish newspapers during the covid- pandemic crisis. with a combination of topic modeling and network analysis methods, a general landscape of the news coverage of the two newspapers has been illustrated. we found that the center-left media outlet focused the most on family life and living issues ("livelihood"), while the center-right outlet focused the most on news about the spanish capital ("madrid"). from the distribution and proportion of news frames, it can be concluded that el país focused the most on public health professionals and real-time alarming ("pandemic update") information during the first two periods, while the el mundo coverage on twitter focused on the state of alarm and confinement ("lockdown") related information. during the recovery period, the proportion of general political news ("politics") updates increased markedly in el país, becoming the third most prominent news frame in this stage; nevertheless, no such changes are observed in the results of el mundo. our results are consistent with the thesis proposed by shih et al. [ ] that media coverage about epidemics did change across different phases of the crises. given our limited data collection timespan and the unique characteristics of twitter data, a more comprehensive analysis is needed in future studies. from the methodological approach, our method combination provides a dynamic overview of news frames' evolution over time. the weighted node degree and the most weighted edges in each of the stages have been reported. each of the motor themes ("livelihood" for el país, "madrid" for el mundo) is the leading topic across all three time segments. given the strong connections of the two topics with other frames, we observed a more unbalanced network structure in the el mundo dataset. specifically, a second-level community is identified, consisting of "madrid," "lockdown" and "state of alarm" in the pre-crisis period; the community is enlarged to include "covid information" in the last two periods. this implies that the content of these four news frames has a high degree of co-occurrence and that they are relatively more independent from the other frames. since this second-level community cannot be clearly observed in the el país network, we believe that the news frames of el mundo are more centralized than those of el país. finally, several limitations of our study should be mentioned. first, previous literature has indicated that twitter-based short-text news updates are different from full-length articles [ ] . in this case, it is worth noting that our results are solely based on the twitter posts, which may not generalize to a comparison between the contents of the two newspapers' full articles. second, news coverage may focus less on health issues in the pre-crisis period than in later stages, and our adopted topic modeling method is highly dependent on a vast dataset; since the number of tweets in the pre-crisis period is much smaller than in the two other periods, news frames in the first period may not be perfectly classified. finally, although we have analyzed the two most important spanish newspapers with different political stances, the number of research objects is still limited, and we would like to include more newspapers and use a larger dataset as improvement strategies for the future.
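as a methodological footnote, the frame networks analysed above can be rebuilt in outline: frames are nodes, edges are weighted by the summed tf-idf of co-occurring words, and node size encodes node strength (weighted degree). a sketch with networkx, assuming the pairwise co-word weights have already been computed (the authors' exact weighting code is not published):

    import networkx as nx

    def build_frame_network(edge_weights):
        """edge_weights: dict mapping (frame_a, frame_b) -> summed tf-idf
        of their co-occurring words (hypothetical input format)."""
        G = nx.Graph()
        for (a, b), w in edge_weights.items():
            G.add_edge(a, b, weight=w)
        # node strength = weighted degree, used for node size in the figures
        strength = dict(G.degree(weight="weight"))
        return G, strength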
references

director-general's opening remarks at the media briefing on covid- -
consumo informativo y cobertura mediática durante el confinamiento por el covid- : sobreinformación, sesgo ideológico y sensacionalismo. el prof
between global and local: the glocalization of online news coverage on the trans-regional crisis of sars
the complementary relationship between the internet and traditional mass media: the case of online news and information
newspaper ebola articles differ from twitter updates
the application of social network analysis in agenda setting research: a methodological exploration
all news is bad news: newspaper coverage of political parties in spain
analysis of the press coverage of queen sofia in el país and el mundo
political clientelism and the media: southern europe and latin america in comparative perspective
visual content published by the press during a health crisis: the case of ebola, spain
el virus del ébola: análisis de su comunicación de crisis en españa. opción rev
the impact of the ebola virus and rare diseases in the media and the perception of risk in spain
coverage of the iraq war in the united states, mainland china, taiwan and poland
whose story wins on twitter? visualizing the south china sea dispute
framing, agenda setting, and priming: the evolution of three media effects models
framing: toward clarification of a fractured paradigm
switching trains of thought: the impact of news frames on readers' cognitive responses
from press release to news: mapping the framing of the h n a influenza pandemic
an examination of the quantity and construction of health information in the news media
framing of influenza a (h n ) pandemic in a singaporean newspaper
media coverage of public health epidemics: linking framing and issue attention cycle toward an integrated theory of print news coverage of epidemics
treatment of cannabis in the spanish press
noticias sobre covid- y -ncov en medios de comunicación de españa: el papel de los medios digitales en tiempos de confinamiento
big social data analytics in journalism and mass communication
empirical study of topic modeling in twitter
characterizing microblogs with topic models
knowledge discovery through directed probabilistic topic models: a survey
media framing dynamics of the 'european refugee crisis': a comparative topic modelling approach
(un)covering the covid- pandemic: framing analysis of the crisis in canada. can
stakeholders in the same bed with different dreams: semantic network analysis of issue interpretation in risk policy related to mad cow disease
ngos' hiv/aids discourse on social media and websites: technology affordances and strategic communication across media platforms
semantic network analysis of vaccine sentiment in online social media
combining natural language processing and network analysis to examine how advocacy organizations stimulate conversation on social media
mining of massive datasets
open access institutional and news media tweet dataset for covid- social science research,
arxiv
stop word lists in free open-source software packages
topicmodels: an r package for fitting topic models
twitter topic modeling by tweet aggregation
gibbslda++: a c/c++ implementation of latent dirichlet allocation (lda)
robust detection of extreme events using twitter: worldwide earthquake monitoring
detecting offensive tweets via topical feature discovery over a large scale twitter corpus
indian elections on twitter: a comparison of campaign strategies of political parties
a simple introduction to markov chain monte-carlo sampling
media frames across stages of health crisis: a crisis management approach to news coverage of flu pandemic
the architecture of complex weighted networks
así llega madrid a la fase : paro disparado, sin turistas y con miles de ciudadanos en las colas del hambre

acknowledgments: this work belongs to the framework of the doctoral programme in person and society in the contemporary world of the autonomous university of barcelona. the authors declare no conflict of interest.

key: cord- -g iish x
authors: aguilar-gallegos, norman; romero-garcía, leticia elizabeth; martínez-gonzález, enrique genaro; garcía-sánchez, edgar iván; aguilar-Ávila, jorge
title: dataset on dynamics of coronavirus on twitter
date: - -
journal: data brief
doi: . /j.dib. .
sha: doc_id: cord_uid:

in this data article, we provide a dataset of , , twitter posts around the coronavirus global health crisis. the data were collected through the twitter rest api search; we used the rtweet r package to download the raw data. the term searched was "coronavirus", which included the word itself and its hashtag version. we collected the data over days, from january to february , . the dataset is multilingual, with english, spanish, and portuguese prevailing. we include a new variable created from four other variables; it is called the "type" of tweets, and it is useful for showing the diversity of tweets and the dynamics of users on twitter. the dataset comprises seven databases which can be analysed separately. on the other hand, they can be crossed to set up other research on, among other things, the trends and relevance of different topics, types of tweets, the embeddedness of users and their profiles, retweet dynamics, and hashtag analysis, as well as to perform social network analysis. this dataset can attract the attention of researchers from different fields of knowledge, such as data science, social science, network science, health informatics, tourism, and infodemiology, among others.

- the data collection started when there was a boom in the use of hashtags related to the coronavirus outbreak in china. those hashtags became trending topics very quickly. this way, the dataset covers the initial dynamics of the coronavirus outbreak on twitter.
- seven filtered and analysed databases are provided, as well as a glance at their composition. further analysis can be done by crossing these databases.
- a new variable, called the "type" of tweets, was created. this variable categorises each post into ( ) tweets without mentions, ( ) tweets with mentions, ( ) retweets, and ( ) replies. through its use, it is possible to see different interactions and dynamics on twitter.

we compiled a dataset of , , twitter posts (tweets). these tweets reflect the early discussion around the coronavirus global health crisis on this social network.
all the collected data were retrieved using the keyword "coronavirus". this implies that tweets containing this word were collected, as well as posts including its hashtag version (#coronavirus). the . m tweets were gathered from january to february , , i.e., days. we selected those dates since, on the first one ( -feb- ), there was a boom around the topic on twitter, where several hashtags related to the coronavirus outbreak in china became trending topics [ ] . then, the data collection was closed ( -february- ) when the world health organization (who) changed the official name of this disease from "coronavirus" to "covid- " (link to the official communication). compared to other publications which also tracked diseases on twitter, this paper conglomerates a considerable quantity of twitter posts within days. for instance, chew and eysenbach [ ] archived over million tweets related to h n or swine flu over eight months, while stefanidis et al. [ ] collected . m tweets regarding the zika outbreak over weeks (three months). more recently, cinelli et al. [ ] analysed the social media infodemic and gathered around . m posts on twitter in days. another example, from another field of knowledge, is the data on olympic-themed tweets, with . m tweets collected over four months [ ] . fig. shows the daily distribution of the downloaded twitter posts. it is possible to appreciate that the process of downloading data retrieved some tweets from -jan- . it is worth mentioning that the amount of information that we could gather depended on two factors: ( ) the increasing discussion of the topic on twitter, and ( ) the times at which the code for downloading the data was run. when we started to see the quantity of information on twitter, the code was run as often as possible. this way, the trend in fig. shows that there was increasing attention on the topic on twitter. since the raw data were vast, we had to filter them and create different databases. all of them are available in a mendeley dataset. this dataset has been created in line with twitter's terms & conditions [ ] ; none of the databases contains the text of the collected tweets. table shows the name of each database and a brief description of its variables. it is worth mentioning that all the databases contain the variable "status_id", which can be used to join them. this way, the mendeley dataset can be further applied to cross databases and, thus, analyse different topics, depending on the objective of the analysis or the researchers' interests. if you have further questions about the dataset in table or you need more information about the raw variables, please contact us. furthermore, through the "status_id" variable, it is possible to hydrate twitter content using the twitter apis, and thus to get all the information related to the ids. in the database " .type.tweets", a new variable was created based on four other variables which are not included in the dataset but were present in the downloaded data: "is_retweet", "reply_to_status_id", "mentions_screen_name", and "reply_to_user_id". since a user can tweet different messages with different characteristics, each one of the posts was classified into one of four categories, based on the following: . tw, used for original tweets, when the post did not include any mention; . mt, used when the user mentioned other users within the tweet; . rt: retweets, this category was used for those tweets which retweet one post; and .
re: replies, this type of tweet was used when one user replied to another one (a code sketch of this classification is given further below). by doing the above, fig. shows that a lot of the interaction and discussion on twitter around the coronavirus topic had been driven by retweets (category . rt): . % of the . m twitter posts were retweets. fig. also shows that users had been publishing original tweets without any mention ( . tw, without mentions, . %). the other databases and analyses provided in this paper are based on this differentiation of tweets by their "type". by including this variable in different databases (see table ), we want to show that these categories strongly influence the interaction dynamics on twitter. this, at the same time, enriches the applications and uses that the databases can have beyond this paper. the raw data of the . m twitter posts were downloaded from the twitter rest api search using the "rtweet (version . . )" r package, which was designed to simplify the interaction with twitter's apis and make them accessible to a wider range of users [ ] . the collection process ran from january to february , , i.e., days. we searched for the term "coronavirus", which included the word itself and its hashtag version (#coronavirus). it is worth mentioning that the procedure to search tweets only returns data from the past - days, and it typically can return up to , twitter statuses in a single call (see this link). based on this, we ran the process several times, creating a total of files, which summed , , records. we eliminated the duplicated ones, and the final number of retained tweets was , , , all unique. all the collected data are multilingual, since we did not use any kind of language restriction. the most used languages in the tweets were english (en= . %), spanish (es= . %), portuguese (pt= . %), french (fr= . %) and italian (it= . ); but in total, the tweets were written in different languages (table ) . since the raw data contained variables, we had to filter them and create different databases according to the purpose of each analysis (see table ). this also enabled us to handle the data more easily. we did this taking into account twitter's terms & conditions [ ] . this way, in the next subsection, the results of the analyses related to each database explained above are shown. two of the main features of twitter posts are whether they contain links and media in the text. this is because tweets with these characteristics have a higher probability of being retweeted [ ] . in order to explore that, we created a database called " .links.media.tweets" (see table ). this way, first, we excluded the posts whose type was " . rt" (see fig. ); thus, , , tweets ( . %) were retained. we found that almost % of the twitter posts contained a link; tweets that quote other tweets are also included. links were more used in the types . tw and . mt (fig. ) . in the case of media, we conversely found that a considerable proportion ( %) of the tweets did not contain any kind of media (pictures, gifs, videos, etc.). twitter posts with media were more used in type . tw (fig. ) . further information can be found in this database, which includes the number of retweets received by each tweet; in this case, . % of the , , tweets did not receive any retweet. it also includes the complete url of each post, the urls of the external links, as well as the urls of media. based on the database called
rt" (see table ), we analyse the number of retweets received in each tweet and, we delve deeply into the tweets most retweeted. fig. shows that the most retweeted post had , retweets, and , other users retweeted the th post. we manually checked each tweet, as well as the accounts. by doing so, it was possible to determine that these tweets come from different actors, among them: journalists (e.g., @atomaraullo), tv news anchors (e.g., @teeratr), politicians (e.g., @realdonaldtrump, @risahontiveros, @chrismurphyct), actors (e.g., @realjameswoods), news (e.g., @quicktake, @voanews ), official institutes (e.g., @kkmputrajaya), political activists (e.g., @ realcandaceo) and, stand-up comedians (e.g., @kunalkamra ). it was interesting to see that these tweets were on the type " . tw", there was not any tweet type " . mt" or " . re". within these tweets, the types of accounts are from very outstanding people or institutions; for instance, four tweets from @realdonaldtrump appear in fig. . another interesting fact is that the tweets in fig. were written in different languages, which highlight the relevance of this multilingual dataset. other researchers can be interested in this information to explore the characteristics of the tweets and the number of retweets received. to go to the tweets, follow the links, respectively: . @atomaraullo, . @celsolamounier, . @teeratr, . @realdonaldtrump, . @realdonaldtrump, . @v_shakthi, . @risahontiveros, . @kathbarbadoro, . @kemenkesri, . @nycjim, . @chrismurphyct, . @realjameswoods, . @quicktake, . @realdonaldtrump, . @voanews, . @realdonaldtrump, . @howroute, . @teeratr, . @jmmulet, . @kkmputrajaya, . @abscbnnews, . @realcandaceo, . @kunalkamra , . @bbcbreaking, . @spectatorindex in order to know how many users had been involved so far in the twitter discussion around the coronavirus topic, we created a database called " .users.and.type" (see table ). after analysing it, we found that , , unique users had tweeted the . m tweets. interestingly, fig. shows that a considerable proportion of users tweeted only once ( %). but also, surprisingly, there was only one user that published more than , tweets. going more in-deep, when we included the types of tweets (fig. ) , the analysis got enriched, and it was possible to see that more than % of the users who tweeted only once, they did it through the retweet ( . rt). % of the accounts that also posted once did it by using the tweet without mentions ( . tw). meanwhile, % replied to another tweet, and they did it in only one occasion. the last proportion was the type " . mt", with . %. for the other categories based on the number of twitter posts, the interpretation goes in the same direction. all this means that people on twitter had been interacting by using different types of interaction (tw, mt, rt, and re); people were posting, others replying, and many others retweeting. these types of patterns can attract the attention of other researchers. following fig. , it is possible to identify that there were users more dynamic than the rest; they tweeted more than , tweets. but, when the variable of the types of tweets was introduced, in fig. we could appreciate that there were only users. it shows that people who tweeted a lot was because they were mixing the types of interactions. thus, they were classified into different categories in fig. . to exemplify this type of mixing interaction, in fig. we present the most active accounts by its total number of posts and the types of tweets. 
the use of hashtags in twitter posts is one of the primary mechanisms to get inserted into different trends and topics [ , ] . thus, we created a database called " .hashtags" (see table ). first, we excluded the " . rt" tweet type; we then worked with , , tweets. in this set of data, , , ( . %) tweets contained one or more hashtags in the text; the rest of the tweets ( , , - . %) did not include any kind of hashtag. within these . m tweets, we found , hashtag terms, but . % of those terms ( , ) were used only once. this way, we filtered the hashtags that were used more than times; , terms were retained. finally, we screened out the hashtags that we considered rare because they were based on numbers or symbols or were too long; , terms were obtained. by using a wordcloud, we show the diversity of multilingual hashtags used around the coronavirus topic ( fig. ) . for visual reasons only, we removed four hashtags that were used more than , times because they distorted the wordcloud: coronavirus, china, wuhan and coronavirusoutbreak. in table , the hashtags most used are presented. we are sure that some researchers will be interested in the use of hashtags in the tweets; for instance, this can be applied in infodemiology research (see eysenbach [ ] ), and also in studying the terms used in tweets (co-mentions), how these affect the trends on twitter, and the embeddedness of users.

fig. . wordcloud of hashtags used in the tweets around the coronavirus topic.
table : the hashtags most used around the coronavirus topic on twitter (only the column headers, hashtag and freq., survive from the flattened table).

the links are weighted, since pairs of nodes could be linked several times through different tweets. also, we computed the number of components in the global network, and we found that , subnetworks constituted it; but there were , components ( . %) of size one, meaning that those nodes did not get any kind of interaction. the biggest component comprised , , nodes ( . % of the total) and , , links ( . % of the total). a diversity of researchers could be interested in these two databases, since they allow analysing different types of interactions, network structures, and node properties. in order to provide an example of the relevance of these databases and the use of an sna approach, we analysed the network of the account @realdonaldtrump. we filtered the database and obtained the users who interacted with this account, as well as how they interacted among themselves. fig. shows a network shaped by , nodes and , links. interestingly, @realdonaldtrump tweeted only four tweets without any mention, but he was mentioned , times, retweeted , times and replied to , times. fig. also shows other prominent nodes based on degree centrality. both together will let other researchers study the network structures that emerge from the interaction among the twitter users involved in the discussion of the coronavirus topic. sna has been applied in other twitter data-based papers.
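the construction of the interaction network described in this section (users as nodes; mention, retweet and reply links; parallel links collapsed by summing their recurrence; loops removed; components counted) can be sketched with networkx. the edge list is assumed to be pooled from the edge databases mentioned in this section:

    import networkx as nx

    def build_interaction_network(edges):
        """edges: iterable of (source_user, target_user) pairs."""
        G = nx.DiGraph()
        for u, v in edges:
            if u == v:
                continue                   # remove loops
            if G.has_edge(u, v):
                G[u][v]["weight"] += 1     # sum link recurrence
            else:
                G.add_edge(u, v, weight=1)
        components = list(nx.weakly_connected_components(G))
        giant = max(components, key=len)
        return G, len(components), G.subgraph(giant)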
references:
return of the coronavirus: -ncov
pandemics in the age of twitter: content analysis of tweets during the h n outbreak
zika in twitter: temporal variations of locations, actors, and concepts
data on sentiments and emotions of olympic-themed tweets
developer agreement and policy
rtweet: collecting and analyzing twitter data
want to be retweeted? large scale analytics on factors impacting retweet in twitter network, soc. comput. . ieee int. conf. soc. comput. / ieee int. conf. privacy, secur
conversational aspects of retweeting on twitter
big questions for social media big data: representativeness, validity and other methodological pitfalls
infodemiology and infoveillance: framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the internet
análisis de redes en twitter para la inserción en comunidades: el caso de un producto agroindustrial
elecciones europeas : viralidad de los mensajes en twitter
network analysis in the social sciences
análisis de redes sociales: conceptos clave y cálculo de indicadores, uach, ciestaam. serie: metodologías y herramientas para la investigación
the igraph software package for complex network research

the authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.

key: cord- -isjacgq authors: alanazi, e.; alashaikh, a.; alqurashi, s.; alanazi, a. title: identifying and ranking common covid- symptoms from arabic twitter date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: isjacgq

objective: the aim of this study is to identify the most common symptoms reported by covid- patients in the arabic language and to order the symptoms' appearance based on the collected data. methods: we search the arabic content of twitter for personal reports of covid- symptoms from march st to may th, . we identify arabic users who tweeted testing positive for covid- and extract the symptoms they publicly associate with covid- . furthermore, we ask them directly through personal messages to opt in and rank the appearance of the first three symptoms they experienced right before (or after) being diagnosed with covid- . finally, we track their twitter timelines to identify additional symptoms that were mentioned within ± days of the day of tweeting about having covid- . in summary, a list of covid- reports was collected and symptoms were (at least partially) ranked from early to late. results: the collected reports contained roughly symptoms originating from % (n= ) male and % (n= ) female twitter users. the majority ( %) of the tracked users were living in saudi arabia ( %) and kuwait ( %). furthermore, % (n= ) of the collected reports were asymptomatic. out of the users with symptoms (n= ), % (n= ) provided a chronological order of appearance for at least three symptoms. fever % (n= ), headache % (n= ), and anosmia % (n= ) were found to be the top three symptoms mentioned in the reports. they also account for the most common first symptoms: % (n= ) said their covid journey started with a fever, % (n= ) with a headache, and % (n= ) with anosmia. out of the saudi symptomatic reported cases (n= ), the most common three symptoms were fever % (n= ), anosmia % (n= ), and headache % (n= ).
the ongoing vigorous covid- outbreak has had a great impact on human health and well-being and has radically enforced rigorous changes in societies' lifestyles, undermining their prosperity. along with this catastrophe, we have witnessed a great effort from diverse research communities to study this disease in all its aspects. in recent years, social networks have become an unavoidable source of information where users expose and share ideas, opinions, thoughts, and experiences on a multitude of topics. several studies have utilized the abundance of information offered by social platforms to conduct non-clinical medical research. for example, twitter has been the source of data for many health and medical studies, such as surveillance and monitoring of flu and cancer timelines and distribution across the usa using twitter [ ], analyzing the spread of influenza in the uae based on geotagged arabic tweets [ ], surveillance and monitoring of influenza in the uae based on arabic and english tweets [ ], identifying symptoms and diseases in saudi arabia using twitter [ ], and most recently analyzing covid- symptoms on twitter [ ] and analyzing the chronological and geographical distribution of covid- infected tweeters in the usa [ ]. the twitter platform enables obtaining multiple features (such as age, sex, geo-location, etc.) along with informative messages that, using appropriate data mining and analysis techniques, can potentially yield useful insights about a specific health condition [ ].

extracting common symptoms associated with a disease from publicly available data has the potential to help control the spread of the disease and identify users at high risk. it also gives new insights that call for early intervention and control. for example, figure highlights a tweet (from saudi arabia) explicitly mentioning the loss of smell and taste as one distinctive symptom of covid- . the nation-wide self-testing questionnaire app was updated almost a couple of weeks after the tweet to include the loss of smell and taste as one potential sign of covid- [ ]. tracking covid- symptoms in real-time from public data on twitter could have shortened that gap.

in this paper, we study covid- symptoms as reported by arabic tweeters. initially, we sifted through arabic tweets searching for tweets with covid- symptoms and also collected tweets from users who reported themselves infected through a clinical test. in addition, we asked users who had been marked as infected about the first three symptoms they experienced, via a voluntary survey template sent over private message. our method for data collection is outlined in figure ; we tracked keywords that translate roughly to "i have been diagnosed". such keywords are likely to filter out reports that were not associated with a formal test result. an initial list of users was collected, and two independent freelancers were asked to further read the users' timelines and extract symptoms that were explicitly mentioned as being related to covid- , along with their order of appearance if mentioned. additional information such as user gender, date of infection, and country of residence was also collected. we assume the date of tweeting about being covid- positive to be the date of infection in case no other information was available.
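a minimal sketch of the keyword-based candidate filtering described above, assuming the tweets are already available as (user, date, text) records; the actual arabic keywords and data source are not reproduced here, so the keyword list is a placeholder:

```python
# KEYWORDS is a placeholder for the two Arabic diagnosis phrases
# tracked in the study (roughly, "i have been diagnosed")
KEYWORDS = ["<arabic keyword 1>", "<arabic keyword 2>"]

def self_report_candidates(tweets):
    """tweets: iterable of (user, date, text) tuples."""
    seen = set()
    for user, date, text in tweets:
        if user in seen:
            continue  # keep only the first matching report per user
        if any(k in text for k in KEYWORDS):
            seen.add(user)
            # the tweet date is assumed to be the infection date
            # when no other information is available
            yield user, date, text
```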
through twitter personal messages, we asked users to order the first three symptoms they experienced right before or after testing positive for covid- . we record the symptoms from first to last based on the received responses and what is available publicly in the users' tweets. in case no order was given, an implicit order is assumed following the order in which the symptoms were mentioned by the user (a sketch of this ordering rule is given after this paragraph). tracking tweets containing specific keywords is simply not enough to get the big picture about the disease dynamics [ ]. many patients detail their experience while infected; hence, knowing their health condition and sentiment, and tracking useful information, may lead to a better understanding of the disease symptoms. in particular, we found tweets that were posted within ± days of the infection date to contain valuable information about early symptoms, allowing us to process and rank the symptoms. figure highlights three tweets by three different covid- patients that indirectly relate symptoms before or after being diagnosed with covid- . for simplicity, we set a fake date ( apr, ) for all three using the tweetgen service [ ]. user.. was tested positive on apr and tweeted on apr about the loss of smell. user… tested positive may st, three days after complaining about a headache. user.. tested positive apr th, one day after tweeting her wish to be able to taste food.

the example highlighted in figure demonstrates that mining twitter for covid- symptoms requires more than a simple keyword search. in principle, the context of the tweet, narrated by a covid- patient, is also important. therefore, it is important to look not only at the wording of the tweet but also at its context. to build a high-quality database of covid- symptoms based on arabic tweets, we have relied on manual symptom extraction. however, we have used the twitter api to construct a social network graph of the users and the software gephi [ ] to visualize the resulting graph.

the majority of the cases were recorded in may ( %, n= ), followed by april ( %, n= ) and march ( %, n= ). this surge of may reports is understandable, as most of the world's countries, not least the arabic-speaking countries, witnessed a great increase in the number of confirmed cases. as for the demography, users from saudi arabia, kuwait, and the uae constitute % of the reports, with saudi arabia being the largest cluster of reports ( %). the other countries (egypt, iraq, bahrain, qatar, uk, usa, belgium, and germany) together constitute the remaining %. needless to say, some of the adopted strategies to prevent further spread of the virus (e.g., active screening by the ministry of health in saudi arabia [ ]) may also have helped in finding more reports in may compared to other months. we have witnessed this firsthand, as some of the asymptomatic reports were mainly a result of early active screening. the fact that almost half of the reports came from saudi arabia is not surprising, as it is one of the top countries participating on twitter, with more than million users [ ]. we have collected almost symptoms from the reports (as shown in figure ). the daily number of collected tweets is also highlighted in figure . figure indicates that most of the reported cases experience between and symptoms, whereas % of the reported cases were asymptomatic.
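a minimal sketch of the implicit ordering rule mentioned above (symptoms ranked by the order in which they first appear in a post); the symptom lexicon here is a small illustrative placeholder, not the study's actual list, which mapped arabic terms to scientific names manually:

```python
# rank symptoms by their first position of mention in a post
SYMPTOMS = {"fever": ["fever"],
            "headache": ["headache"],
            "anosmia": ["loss of smell", "anosmia"]}

def implicit_order(text):
    text = text.lower()
    positions = []
    for name, terms in SYMPTOMS.items():
        hits = [text.find(t) for t in terms if t in text]
        if hits:
            positions.append((min(hits), name))
    return [name for _, name in sorted(positions)]

print(implicit_order("Started with a headache, then fever and loss of smell."))
# ['headache', 'fever', 'anosmia']
```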
table lists the frequency of each symptom, ordered from the most prevalent symptom to the least. only fever was experienced by more than % of the patients. the frequency of symptoms appears to be consistent across male and female patients (corr. coeff. = . ). further, table lists the top three most reported symptoms. fever and headache were commonly the first reported symptoms. the top symptoms that coincide with fever were headache ( . %), cough ( . %), anosmia ( . %), and ageusia ( . %). other symptoms have a relatively lower frequency with fever. in addition, table .

we have constructed a social graph of the users to discover insights related to covid- communities in social networks, as shown in figure . an edge between x and y represents the fact that x follows y, y follows x, or both follow each other on twitter. saudi cases were colored green, kuwaiti cases blue, and other places of residence orange. the graph shows the nodes, each with a size proportional to its degree in the graph. the remaining users were found to be disconnected from all other users and were thus removed from the network. the graph was built with the software gephi [ ], and relations were extracted via the twitter api. clearly, kuwaiti users are more connected than others, and some form a clique of three or more users.

this work identified common covid- symptoms from arabic personal reports on twitter. such a study complements other recent studies [ ] [ ] [ ] that were focused on english tweets or specific demographic groups. the study was carried out in a way that reports not only the symptoms but also their timeline as narrated by users. social networks have become the de facto communication channel for a large number of people. many individuals around the globe write, interact, or even just browse the content of social networks countless times a day. social networks have the property of being continuously updated with new information provided by other global citizens. as such, it is crucial to monitor their content to identify health issues [ ] [ ]. one potential benefit of analyzing social networks is understanding covid- symptoms and identifying people at high risk [ ]. anosmia being one of the top three reported symptoms, mentioned in % of the reports, was a surprising result in our study. several tweets complained about how long it lasted before they started to smell again. our sample size is still too small to make any firm judgement in this regard. however, recent clinical studies have reported finding anosmia in . % of the mild cases of covid- [ ], which is relatively close to our estimation from arabic tweets.
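returning to the frequency analysis above, a minimal sketch of the symptom frequency table and the male/female consistency check, assuming a hypothetical flat export reports.csv with one row per (user, gender, symptom):

```python
import pandas as pd
from scipy.stats import pearsonr

# reports.csv is an assumed export of the manually extracted symptoms
reports = pd.read_csv("reports.csv")

# frequency of each symptom, most prevalent first (as in the table)
freq = reports["symptom"].value_counts()
print(freq)

# consistency across genders: correlate the per-symptom counts of
# male and female patients
counts = pd.crosstab(reports["symptom"], reports["gender"])
r, p = pearsonr(counts["male"], counts["female"])
print(f"corr. coeff. = {r:.2f} (p = {p:.3f})")
```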
indeed, the size of the self-reports reflects the testing capacity in different countries. as of june , , saudi arabia has done almost one million tests and kuwait has roughly exceeded , tests [ ]. the reported cases from egypt, the arabic country with the largest population, were inadequately representative. this could be attributed to factors such as the preferred social platform (e.g., facebook), the dialect, and the use of local idioms. our study tracked two widely used keywords to identify arabic covid- patients on twitter and then manually extracted symptoms. more complex keywords could reveal more interesting patterns about symptoms, and this brings the need to establish a comprehensive medical dictionary for the different local arabic dialects. the dictionary can be utilized when mining different health opinions and conditions from the arabic content in social networks. our main motivation is to extract symptoms from users who likely took the disease test and, hence, tweeted based on its result. in this study, we have not used other covid- sources. specifically, studying the arabic content of personal reports from both facebook and twitter would enrich the study. the noticeable increase in may reports compared to other months shows the importance of developing a real-time surveillance system based on the symptoms reported in the arabic content of twitter. it also suggests further studies into the information sharing behavior [ ] in different communities.

the collected tweets of symptoms provided some insights into covid- symptoms and their chronological order. we have shown the most common symptoms associated with arabic self-reports of symptoms. we have analyzed the first, second, and third most common symptoms experienced by the users. furthermore, we analyzed symptom prevalence in the two largest clusters found in our database: saudi arabia and kuwait.

references:
real-time disease surveillance using twitter data: demonstration on flu and cancer
analysis and prediction of influenza in the uae based on arabic tweets
tweetluenza: predicting flu trends from twitter data
sehaa: a big data analytics tool for healthcare symptoms and diseases detection using twitter, apache spark, and machine learning
self-reported covid- symptoms on twitter: an analysis and a research
a chronological and geographical analysis of personal reports of covid- on twitter
twitter as a tool for health research: a systematic review
machine learning to detect self-reporting of symptoms, testing access, and recovery associated with covid- on twitter: retrospective big data infoveillance study
gephi the open graph viz platform
saudi arabia's active mass testing contains covid- spread
social media mining for public health monitoring and surveillance
assessing the online social environment for surveillance of obesity prevalence
anosmia and dysgeusia in patients with mild sars-cov- infection
use of twitter among local health departments: an analysis of information sharing, engagement, and action

this work was supported by king abdulaziz city for science and technology (grant number: - - - - ). eisa alanazi and abdulaziz alashaikh designed the study and wrote the manuscript. sarah alqurashi developed the social network analysis part and collected related tweets from the twitter api. aued alanazi extracted and translated the symptoms from the collected personal reports to their scientific names. all authors approved the final version of the manuscript.
none declared.

key: cord- -mjtlhh e authors: pellert, max; lasser, jana; metzler, hannah; garcia, david title: dashboard of sentiment in austrian social media during covid- date: - - journal: nan doi: nan sha: doc_id: cord_uid: mjtlhh e

to track online emotional expressions of the austrian population close to real-time during the covid- pandemic, we build a self-updating monitor of emotion dynamics using digital traces from three different data sources. this enables decision makers and the interested public to assess issues such as the attitude towards counter-measures taken during the pandemic and the possible emergence of a (mental) health crisis early on. we use web scraping and api access to retrieve data from the news platform derstandard.at, twitter and a chat platform for students. we document the technical details of our workflow in order to provide materials for other researchers interested in building a similar tool for different contexts. automated text analysis allows us to highlight changes in language use during covid- in comparison to a neutral baseline. we use special word clouds to visualize that overall difference. longitudinally, our time series show spikes in anxiety that can be linked to several events and media reporting. additionally, we find a marked decrease in anger. the changes last for remarkably long periods of time (up to weeks). we discuss these and more patterns and connect them to the emergence of collective emotions. the interactive dashboard showcasing our data is available online under http://www.mpellert.at/covid _monitor_austria/. our work has attracted media attention and is part of a web archive of resources on covid- collected by the austrian national library.

in , the outbreak of covid- in europe led to a variety of countermeasures aiming to limit the spread of the disease. these include temporary lockdowns, the closing of kindergartens, schools, shops and restaurants, the requirement to wear masks in public, and restrictions on personal contact. health infrastructure was re-allocated with the goal of providing additional resources to tackle the emerging health crisis triggered by covid- . such large-scale disruptions of private and public life can have a tremendous influence on the emotional experiences of a population. governments have to build on the compliance of their citizens with these measures. forcing the population to comply by instituting harsh penalties is not sustainable in the longer run, especially in developed countries with established democratic institutions, as in most of europe. on the scale of whole nations, very strict policing also faces technical limits and diverts resources from other duties. in addition, recent research shows that, when compared to enforcement, the recommendation of measures can be a better motivator for compliance [ ]. non-intrusive monitoring of the emotional expressions of a population enables identifying problems early on, with the hope of providing the means to resolve them. due to the rapid development of the response to covid- , it is desirable to produce up-to-date observations of public sentiment towards the measures, but it is hard to quantify sentiment at large scales and high temporal resolution. policy decisions are usually accompanied by representative surveys of public sentiment that, however, suffer from a number of shortcomings. first, surveys depend on explicit self-reports which do not necessarily align with actual behaviour [ ].
in addition, conducting surveys among larger numbers of people is time consuming and expensive. lastly, a survey is always just a snapshot of public sentiment at a single point in time. often, by the time a questionnaire is constructed and the survey has been conducted, circumstances have changed and the results of the survey are only partially valid. online communities are a complementary data source to surveys when studying current and constantly evolving events. their digital traces reveal collective emotional dynamics almost in real-time. we gather these data in the form of text from platforms such as twitter and news forums, where large groups of users discuss timely issues. we observe a lot of activity online, with clear increases during the nation-wide lock down of public life. for example, our data shows austrian twitter saw a % increase in posts from march compared to before ( - - until - - ). livetickers at news platforms are a popular format that provides small pieces of very up-to-date news constantly over the course of a day. this triggers fast posting activity in the adjunct forum. by collecting these data in regular intervals, we face very little delay in data gathering and analysis and provide a complement to survey-based methods. our setup has the advantage of bearing low cost while featuring a very large sample size. the disadvantages include more noise in the signal due to our use of automated text analysis methods, such as sentiment analysis. additionally, if only information from one platform is considered, this might result in sampling a less representative part of the population than in surveys where participant demographics are controlled. however, systematic approaches to account for errors at different stages of research have been adapted to digital traces data [ ] . we showcase the monitoring of social media sentiment during the covid- pandemic for the case of austria. austria is located in central europe, serving as a small model for both western europe (especially germany [ ] ) and eastern europe (e.g. hungary [ ] ). therefore, the developments around covid- in austria have been closely watched by the rest of europe. as the virus started spreading in europe on a larger scale in february , stringent measures were implemented comparatively early in austria [ ] . using data from austria allows us to build a quite extensive, longitudinal account of first hand discussions on covid- . additionally, austria's political system and its public health system have all the capacities of a developed nation to tackle a health crisis [ ] . therefore, we expect the population to express the personal, emotional reaction to the event, without being overwhelmed by lack of resources and resulting basic issues of survival. interactive online dashboards are an accessible way to summarize complex information to the public. during covid- , popular dashboards have conveyed information about the evolution of the number of covid- cases in different regions of austria [ ] and globally [ ] . other dashboards track valuable information such as world-wide covid- registry studies [ ] . developers of dashboards include official governmental entities like the national ministry of health as well as academic institutions and individual citizens. to our knowledge, the overwhelming majority of these dashboards display raw data together with descriptive statistics of "hard" facts and numbers on covid- . 
to fill a gap, we build a dashboard with processed data from three different sources to track the sentiment in austrian social media during covid- . it is easily accessible online and updated on a daily basis to give feedback to authorities and the interested general public. we retrieve data from three different sites: a news platform, twitter and a chat platform for students. all data for this article was gathered in compliance with the terms and conditions of the platforms involved. twitter data was accessed through crimson hexagon (brandwatch), an official twitter partner. the platform for students and derstandard.at gave us their permission to retrieve the data automatically from their systems. a daily recurring task is set up on a server to retrieve and process the data, and to publish the updated page online (for a description of the workflow see figure ). the news platform derstandard.at was an internet pioneer as it was the first german language newspaper to go online in . from february , it started entertaining an active community, first in a chatroom [ ] . in , the chatroom was converted to a forum that is still active today and allows for posting beneath articles. users have to register to post and they can up-and down-vote single posts. in a platform change made voting more transparent by showing which user voted both positive or negative. according to a recent poll [ ] , derstandard.at is considered both the most trustful and most useful source of information on covid- in austria. visitors come from austria, but also from other parts of the german-speaking area. in , derstandard.at was visited by , , unique users per month that stay on average : minutes on the site and request a total of , , subpages [ ] . to cover the developments around covid- , daily livetickers (except sundays) were set up on derstandard.at. figure s in the supplementary information shows an example of the web interface of such a liveticker. as no dedicated api exists for data retrieval from derstandard.at, we use web-scraping to retrieve the data (under permission from the site). first, we request a sitemap and identify the relevant urls of livetickers. second, we query each small news item of each of the livetickers. we receive data in json format and flatten and transform the json object to extract the id of each small news piece. third, we query the posts attached to that id in batches. this is necessary because derstandard.at does not display all the posts at once beneath a small news item. instead, the page loads a new batch of posts as soon as the user reaches the bottom of the screen. this strategy is chosen to not overcrowd the interface, as the maximum number of posts beneath one small news item can be very high (up to posts in our data set). by following our iterative workflow to request posts, we are able to circumvent issues of pagination. finally, after we have received all posts, we transform the json objects to tabulator-separated value files for further analysis. this approach is summarised in the upper part of figure . to retrieve daily values for our indicators from twitter, we rely on the forsight platform by crimson hexagon, an aggregation service of data from various platforms, including twitter. twitter has an idiosyncratic user base in austria, mainly composed of opinion makers, like journalists and politicians. in the case of studying responses to a pandemic, studying these populations gives us an insight into public sentiment due to their influence in public opinion. 
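before moving on, a minimal sketch of the batched post retrieval from derstandard.at described above; the endpoint path and parameter names are assumptions for illustration only (the real, undocumented api of the site is not reproduced here), and the site's permission is required for such scraping:

```python
import requests

BASE = "https://www.derstandard.at"

def fetch_posts(item_id, batch_size=50):
    """fetch all forum posts below one small news item, batch by batch,
    mirroring the site's infinite-scroll loading to avoid pagination issues."""
    posts, offset = [], 0
    while True:
        r = requests.get(f"{BASE}/api/postings",            # assumed path
                         params={"item": item_id,           # assumed params
                                 "skip": offset, "take": batch_size})
        batch = r.json()
        if not batch:
            break  # empty batch: all posts collected
        posts.extend(batch)
        offset += len(batch)
    return posts
```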
yet, one should keep in mind that twitter users are younger, more liberal, and have higher formal education than the general population [ ] . as a third and last source, we include a discussion platform for young adults in austria . the discussions on the platform are organized in channels based on locality, with an average of ± (mean ± standard deviation) posts per day from - - to - - . the typical number of posts per day on the platform dropped from ± (january to april) to ± (april to may). this drop occurred due to the removal of the possibility to post anonymously on april th in order to prevent hate speech. based on data from this platform, we study the reaction of the special community of young adults in different austrian locations, with the majority of posts originating in vienna ( %), graz ( %) and other locations ( %). to assess expressions of emotions and social processes, we match text in posts on all three platforms to word classes in the german version of the linguistic inquiry and word count (liwc) dictionary [ ] , including anxiety, anger, sadness, positive emotions and social terms. liwc is a standard methodology in psychology for text analysis that includes validated lexica in german. it has been shown that liwc, despite its simplicity, has an accuracy to classify emotions in tweets that is comparable to other state of the art tools in sentiment analysis benchmarks [ ] . previous research has shown that liwc, when applied to large-scale aggregates of tweets, has similar correlations with well-being measures as other, more advanced text analysis techniques [ , ] . since within the scope of this study only text aggregates will be analysed, liwc is an appropriate method and can be applied to all sorts of text data that is collected for the monitor. for the prosocial lexicon, we translated a list of prosocial terms used in previous research [ ] , including for example words related to helping, empathy, cooperating, sharing, volunteering, and donating. we adapt the dictionaries to the task at hand by excluding most obvious terms that can bias the analysis, as done in recent research validating twitter word frequency data [ ] . specifically, we cleaned the lists for ( ) words which are likely more frequently used during the covid- pandemic e.g. by news media and do not necessarily express an emotion (sadness: tot*; anger: toete*, tt*, tte*; positive: heilte, geheilt, heilt, heilte*, heilung; prosocial: heilverfahren, behandlung, behandlungen, dienstpflicht, ffentlicher dienst, and digitale dienste all matching dienst*), ( ) potential mismatches unrelated to the respective emotion (sadness: harmonie/harmlos matching harm*; positive: uerst; prosocial: dienstag matching dienst*) ( ) specific austria-related terms like city names (sadness: klagenfurt matching klagen*) or events (sadness: misstrauensantrag matching miss*), and ( ) twitter-related terms for the analysis of tweets only (prosocial: teilen, teilt mit). for text from derstandard.at, we average the frequency of terms per post to take into account the varying lengths of posts. as twitter has a strict character limit of characters per post, crimson hexagon provides the number of tweets containing at least one of the terms, based on which we calculate the proportion of such posts. posts have a median length of characters in derstandard.at, characters in twitter, and characters in the chat platform for young adults. 
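a minimal sketch of the dictionary matching described above: liwc-style entries may end in "*" to match word prefixes, and the per-post score is the fraction of tokens matching a category. the tiny category lists are placeholders, not the actual (licensed) liwc lexica:

```python
# placeholder category lists; entries ending in "*" match any prefix
CATEGORIES = {"anxiety": ["angst*", "panik*", "risiko"],
              "positive": ["gut", "liebe*", "dank*"]}

def match(word, entry):
    return word.startswith(entry[:-1]) if entry.endswith("*") else word == entry

def liwc_scores(post):
    """fraction of tokens in a post matching each category,
    normalising for the varying length of posts."""
    tokens = post.lower().split()
    scores = {}
    for cat, entries in CATEGORIES.items():
        hits = sum(any(match(t, e) for e in entries) for t in tokens)
        scores[cat] = hits / len(tokens) if tokens else 0.0
    return scores

print(liwc_scores("Vielen Dank, das ist gut gegen die Panik"))
```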
to exclude periodic weekday effects, we correct for the daily baseline of our indicators by computing relative differences to mean daily baseline values. for derstandard.at data, the baseline is computed from all posts to derstandard.at articles in the year . we use the main website articles for this instead of livetickers because during , livetickers were mainly used to cover sport events (for an example see https://www.derstandard.at/jetzt/livebericht/ /bundesliga-livelask-sturm) or high-profile court cases (https://www.derstandard.at/jetzt/livebericht/ /buwog-prozessvermoegensverwalter-stinksauer-auf-meischberger). we thereby choose a slightly different medium for our baselines to avoid a topic bias. nonetheless, it comes from the same platform with the same layout and functionalities and an overlapping user base: users ( % of total unique users in the livetickers) in our data set that are active in the livetickers also post below normal articles. the speed of posting may differ slightly, because an article is typically posted in its final format, whereas small news pieces are added constantly in livetickers. for the other data sources, we correct by computing the baseline for the indicators from the start of the period available to us (twitter back to - - , chat platform for young adults back to - - ) to january .

finally, we combine the processed data and render an interactive website. for this, we use "plotly" [ ], "flexdashboard" [ ] and "wordcloud" [ ] in r [ ], and the "git" protocol to upload the resulting html page to github pages. using version control allows us to easily revert the page to a previous state in case of an error.

we track the sentiment of the austrian population during covid- and make our findings available as an interactive online dashboard that is updated daily. we display the time series almost in real-time, with a small delay to catch all available data (see figure using derstandard.at as a data source). it has features such as the option to display the number of observations by hovering over a data point, or to isolate lines and compare only a subset of indicators. the dashboard can be accessed online at http://www.mpellert.at/covid _monitor_austria/. table shows several descriptive statistics of the data sets used. for derstandard.at, we retrieved livetickers with small news items. on average, users publish ± posts under each of those items in the time period of interest ( - - to - - ). posts have a median length of characters (see figure s for a histogram of the length of posts). posts provide immediate reactions by the users of derstandard.at: the median is at . seconds for the first post to appear below a small news item.

we use word clouds (figure ) to visualize the emotional content of posts. while the livetickers on covid- cover the time period from - - until - - , the baseline includes normal articles on derstandard.at from . to highlight changes in language use during covid- , our word clouds compare word frequency in the livetickers with the baseline: the size of words in the clouds is proportional to |log(prob_livetickers / prob_baseline)|, where prob_baseline and prob_livetickers refer to the frequency of the dictionary term relative to the frequency of all matches of terms in that category, in the baseline and the livetickers, respectively. the color of words corresponds to the sign of this quantity: red means positive, i.e.
the frequency of the word increased in the livetickers, whereas blue signifies that the usage of the word decreased. by combining this information, our word clouds give an impression of how the composition of terms in the dictionary categories changed during covid- .

our dashboard analyses a part of public discourse. we assume that the lockdown of public life increased the tendency of the population to move debates online. users that take part in these discussions often form very active communities that sometimes structure their whole day around their posting activities. this is reflected in our data in the word clouds of figure by the increased usage of greetings (category "social") marking the start or the end of a day, such as "moin"/"good morning" or "gn"/"good night". we identified the following events in austria corresponding to anxiety spikes in expressed emotions in social media. unrelated to covid- , there was reporting on a terrorist attack in hanau, germany on - - . the first reported covid- case in austria was on - - and the first death on - - . the first press conference, announcing bans of large public events and university closures as first measures, happened on - - . it was followed by strict social distancing measures announced on - - , starting on the day after. the overall patterns in the monitor of sentiment in figure show that austrian users' expressions of anxiety increased, whereas anger decreased in our observation period. we go into detail on this in section .

the sentiment dynamics on social media platforms can be influenced by content that spreads fear and other negative emotions. timely online emotion monitoring could help to quickly identify such campaigns by malicious actors. even legitimately elected governments can follow the controversial strategy of steering emotions to alert the population to the danger of a threat. for example, democratically elected actors can deliberately elicit emotions such as fear or anxiety to increase compliance from the top down. such a strategy has been followed in austria [ ] and in other countries like germany [ ]. reports about the deliberate stirring of fear by the austrian government are reflected in a spike of anxiety on - - in figure . the spikes of anxiety at the beginning of march in the early stages of the covid- outbreak may have been reinforced by anxiety-eliciting strategies.

in an effort to provide an archive of austrian web resources for future reference, the austrian national library (önb) monitors the dashboard and stores changes. there are a number of such initiatives in other nations as well [ ], with the earliest and most famous example being archive.org. through selective harvesting of resources connected to covid- , the dashboard is part of the önb collection "coronavirus " (https://webarchiv.onb.ac.at/).

our results show patterns in the change of language use during covid- . in the anger category, words related to violence and crime are less frequent in the livetickers since covid- compared to , indicating that reports and discussions about violent events, or possibly even these events themselves, became less frequent as the public discourse focused on events related to the pandemic. for anxiety, the most remarkable change is a reduction in words related to terror and abuse, accompanied by a smaller increase of terms linked to panic, risk and uncertainty. in the sadness category, the verb "verabschiede"/"saying goodbye" appears almost times more often in the livetickers.
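a minimal sketch of the word-cloud weighting described above: size as the absolute log-ratio of a term's within-category frequency in the livetickers versus the baseline, and color from the sign of the ratio. the term counts passed in are placeholder inputs:

```python
import math

def wordcloud_weights(live_counts, base_counts):
    """live_counts / base_counts: dicts mapping a dictionary term to its
    count among all category matches in the livetickers and the baseline."""
    live_total = sum(live_counts.values())
    base_total = sum(base_counts.values())
    weights = {}
    for term in set(live_counts) & set(base_counts):
        p_live = live_counts[term] / live_total
        p_base = base_counts[term] / base_total
        ratio = math.log(p_live / p_base)
        # size ~ |log ratio|; red if usage increased, blue if it decreased
        weights[term] = (abs(ratio), "red" if ratio > 0 else "blue")
    return weights

print(wordcloud_weights({"panik": 30, "terror": 5},
                        {"panik": 10, "terror": 40}))
```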
for prosocial words, terms referring to helping, community and encouragement increased. among the social terms, the word "empfehlungen"/"recommendations" occurs slightly more frequently, while topics of migration, integration and patriarchy are less often discussed. finally, the positive terms that increase the most are the expression of admiration "aww*" and "hugs", indicating that people send each other virtual hugs instead of physical ones.

dynamics of collective emotions may be different in times of crisis. while they typically vary fast [ ] and return to the baseline within a matter of days even after catastrophic events like natural disasters or terrorist attacks [ , ], changes during the covid- pandemic in austria have lasted several weeks for most analysed categories (up to weeks in some cases). in contrast to one-off events, the threat from a disease like covid- is more diffuse, and the emotion-eliciting events are distributed in time. in addition, measures that strongly affect people's daily lives over a long period of time, as well as a high level of uncertainty, likely contribute to the unprecedented changes of collective emotional expression in online social media.

the dashboard illustrates early and strong increases in anxiety across all three analysed platforms, starting at the time of the first confirmed cases in austria (end of february ). a first spike of anxiety terms occurs on all three platforms around the time the first positive cases were confirmed and news about the serious situation in italy was broadcast in austria. about two weeks later, levels rise again together with the number of confirmed cases, reaching particularly high levels in the week before the lockdown on march. afterwards, they gradually drop again. in total, levels of anxiety expression did not return to the baseline for more than six weeks, from - - until - - , on twitter. on derstandard.at, levels also remained above the baseline for more than four weeks in a row. the timelines for twitter and derstandard.at also show a clear and enduring decrease of anger-related words starting in the week before the lockdown, as discussions of potentially controversial topics other than covid- became scarcer. this decrease lasts for four weeks on derstandard.at ( february - april), but is particularly stable on twitter, where anger terms remain less frequent than in for . months in a row ( - - to - - ). in contrast, prosocial and social terms show opposing trends on these two platforms: they increase slightly, but do so for more than months, on twitter, where people share not only news but also talk about their personal lives. meanwhile, they decrease for more than months in a row on derstandard.at, where people mostly discuss specific political events or topics. the increase of sadness-related expressions is smaller than the changes in anxiety and anger, but also lasted for about a month on twitter and two weeks on derstandard.at. interestingly, positive expressions were used slightly more frequently on all three platforms for long periods since the outbreak. this trend is visible from the beginning of march on the student platform and derstandard.at, and further increases since restrictions on people's lives were reduced. in total, positive expressions are more frequent than the baseline during the last . months (as of the th of june) on derstandard.at.
an analysis of collective emotions in reddit comments from users in eight us cities found results similar to ours, including spikes in anxiety and the decrease in anger [ ], which suggests that some of our findings might generalize to other platforms and countries. the dashboard gives opinion makers and the interested public a way to observe collective sentiment vis-a-vis the crisis response in the context of a pandemic. it has gained attention from austrian media [ ], and from the covid- future operations clearing board [ ], an interdisciplinary platform for exchange and collaboration between researchers put in place by the federal chancellery of the republic of austria. especially during the first weeks of the crisis, multiple newspapers reported on the changes of emotional expressions on online platforms [ , , , , ]. timely knowledge about the collective emotional state and expressed social attitudes of the population is valuable for adapting emergency and risk communication, as well as for improving the preparedness of (mental) health services. supplementary material is included. the dashboard can be accessed at http://www.mpellert.at/covid _monitor_austria/. the source code is available at https://github.com/maxpel/covid _monitor_austria. the data sets accumulated daily by updating the dashboard will be released in the future.

table : descriptive statistics showing relevant aspects of the data sources. numbers refer to the time period from march to june ( days). the total number of twitter users in austria in january is taken from the report of datareportal [ ]. fractions refer to the number of posts containing at least one term from the relevant dictionary category in liwc, divided by the total number of posts.

references:
the differential impact of physical distancing strategies on social contacts relevant for the spread of covid- . medrxiv
psychology as the science of self-reports and finger movements: whatever happened to actual behavior?
a total error framework for digital traces of humans
austria's kurz says germany copied his country's lockdown easing plan. reuters
coronavirus orbán: gradual restart of life planned in nd phase of measures, preparation for surprises. hungary today
kurier.at. österreich bei intensivbetten weit über oecd-schnitt
austrian ministry for health. amtliches dashboard covid
csse. covid- dashboard by the center for systems science and engineering
a real-time dashboard of clinical trials for covid- . the lancet digital health
der standard chatroom: die bar, die nicht mehr ist. der standard
sales team. derstandard.at media data
how twitter users compare to the general public
computergestützte quantitative textanalyse - äquivalenz und robustheit der deutschen version des linguistic inquiry and word count
sentibench - a benchmark comparison of state-of-the-practice sentiment analysis methods
tracking "gross community happiness" from tweets
estimating geographic subjective well-being from twitter: a comparison of dictionary and data-driven language methods
moral actor, selfish agent
flexdashboard: r markdown format for flexible dashboards
wordcloud : create word cloud by htmlwidget
r: a language and environment for statistical computing. r foundation for statistical computing
regierungsprotokoll: angst vor infektion offenbar erwünscht
wie wir covid- unter kontrolle bekommen
a survey on web archiving initiatives
the individual dynamics of affective expression on social media
a novel surveillance approach for disaster mental health
collective emotions and social resilience in the digital traces after a terrorist attack
the unfolding of the covid outbreak: the shifts in thinking and feeling. understanding people and groups
republic of austria. covid- future operations clearing board - bundeskanzleramt österreich
online-emotionen in foren während der coronakrise
coronavirus: twitter spiegelt ängste und sorgen der menschen wider - derstandard
gefühle und videokonferenzen - wiener komplexitätsforscher finden bei online-emotionen nach einem deutlichen anstieg zu beginn der krise nun weniger ängstlichkeit. mensch - wiener zeitung online
online-emotionen: mehr trauer als wut. science.orf.at

we thank christian burger from derstandard.at for providing data access, and julia litofcenko and lena müller-naendrup for their support in translating the prosocial dictionary to german. access to crimson hexagon was provided via the project v!brant emotional health grant suicide prevention media campaign oregon to thomas niederkrotenthaler. the authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. mp and dg designed the research. mp retrieved the derstandard.at data, processed and analyzed all data, and implemented the dashboard. jl retrieved data for the platform for young adults. hm retrieved data for twitter, and wrote the methods and result reports for the dashboard. mp, jl and hm wrote the draft of the manuscript. all authors provided input for writing and approved the final manuscript.

key: cord- - gv y authors: bello-orgaz, gema; jung, jason j.; camacho, david title: social big data: recent achievements and new challenges date: - - journal: inf fusion doi: . /j.inffus. . . sha: doc_id: cord_uid: gv y

big data has become an important issue for a large number of research areas such as data mining, machine learning, computational intelligence, information fusion, the semantic web, and social networks. the rise of different big data frameworks such as apache hadoop and, more recently, spark, for massive data processing based on the mapreduce paradigm has allowed for the efficient utilisation of data mining methods and machine learning algorithms in different domains. a number of libraries such as mahout and sparkmlib have been designed to develop new efficient applications based on machine learning algorithms. the combination of big data technologies and traditional machine learning algorithms has generated new and interesting challenges in other areas such as social media and social networks. these new challenges are focused mainly on problems such as data processing, data storage, data representation, and how data can be used for pattern mining, analysing user behaviours, and visualizing and tracking data, among others. in this paper, we present a revision of the new methodologies that are designed to allow for efficient data mining and information fusion from social media, and of the new applications and frameworks that are currently appearing under the "umbrella" of the social networks, social media and big data paradigms.
data volume and the multitude of sources have experienced exponential growth, creating new technical and application challenges; data generation has been estimated at . exabytes ( exabyte = . . terabytes) of data per day [ ] . these data come from everywhere: sensors used to gather climate, traffic and flight information, posts to social media sites (twitter and facebook are popular examples), digital pictures and videos (youtube users upload hours of new video content per minute [ ] ), transaction records, and cell phone gps signals, to name a few. the classic methods, algorithms, frameworks, and tools for data management have become both inadequate for processing this amount of data and unable to offer effective solutions for managing the data growth. the problem of managing and extracting useful knowledge from these data sources is currently one of the most popular topics in computing research [ , ] . in this context, big data is a popular phenomenon that aims to provide an alternative to traditional solutions based on databases and data analysis. big data is not just about storage or access to data; its solutions aim to analyse data in order to make sense of them and exploit their value. big data refers to datasets that are terabytes to of challenges in obtaining valuable knowledge for people and companies (see value feature). • velocity: refers to the speed of data transfers. the data's contents are constantly changing through the absorption of complementary data collections, the introduction of previous data or legacy collections, and the different forms of streamed data from multiple sources. from this point of view, new algorithms and methods are needed to adequately process and analyse the online and streaming data. • variety: refers to different types of data collected via sensors, smartphones or social networks, such as videos, images, text, audio, data logs, and so on. moreover, these data can be structured (such as data from relational databases) or unstructured in format. • value: refers to the process of extracting valuable information from large sets of social data, and it is usually referred to as big data analytics. value is the most important characteristic of any big-data-based application, because it allows to generate useful business information. • veracity: refers to the correctness and accuracy of information. behind any information management practice lie the core doctrines of data quality, data governance, and metadata management, along with considerations of privacy and legal concerns. some examples of potential big data sources are the open science data cloud [ ] , the european union open data portal, open data from the u.s. government, healthcare data, public datasets on amazon web services, etc. social media [ ] has become one of the most representative and relevant data sources for big data. social media data are generated from a wide number of internet applications and web sites, with some of the most popular being facebook, twitter, linkedin, youtube, instagram, google, tumblr, flickr, and wordpress. the fast growth of these web sites allow users to be connected and has created a new generation of people (maybe a new kind of society [ ] ) who are enthusiastic about interacting, sharing, and collaborating using these sites [ ] . this information has spread to many different areas such as everyday life [ ] (e-commerce, e-business, e-tourism, hobbies, friendship, ...), education [ ] , health [ ] , and daily work. 
in this paper, we assume that social big data comes from joining the efforts of the two previous domains: social media and big data. therefore, social big data will be based on the analysis of vast amounts of data that could come from multiple distributed sources but with a strong focus on social media. hence, social big data analysis [ , ] is inherently interdisciplinary and spans areas such as data mining, machine learning, statistics, graph mining, information retrieval, linguistics, natural language processing, the semantic web, ontologies, and big data computing, among others. their applications can be extended to a wide number of domains such as health and political trending and forecasting, hobbies, e-business, cybercrime, counterterrorism, time-evolving opinion mining, social network analysis, and humanmachine interactions. the concept of social big data can be defined as follows: "those processes and methods that are designed to provide sensitive and relevant knowledge to any user or company from social media data sources when data sources can be characterised by their different formats and contents, their very large size, and the online or streamed generation of information." the gathering, fusion, processing and analysing of the big social media data from unstructured (or semi-structured) sources to extract value knowledge is an extremely difficult task which has not been completely solved. the classic methods, algorithms, frameworks and tools for data management have became inadequate for processing the vast amount of data. this issue has generated a large number of open problems and challenges on social big data domain related to different aspects as knowledge representation, data management, data processing, data analysis, and data visualisation [ ] . some of these challenges include accessing to very large quantities of unstructured data (management issues), determination of how much data is enough for having a large quantity of high quality data (quality versus quantity), processing of data stream dynamically changing, or ensuring the enough privacy (ownership and security), among others. however, given the very large heterogeneous dataset from social media, one of the major challenges is to identify the valuable data and how analyse them to discover useful knowledge improving decision making of individual users and companies [ ] . in order to analyse the social media data properly, the traditional analytic techniques and methods (data analysis) require adapting and integrating them to the new big data paradigms emerged for massive data processing. different big data frameworks such as apache hadoop [ ] and spark [ ] have been arising to allow the efficient application of data mining methods and machine learning algorithms in different domains. based on these big data frameworks, several libraries such as mahout [ ] and sparkmlib [ ] have been designed to develop new efficient versions of classical algorithms. this paper is focused on review those new methodologies, frameworks, and algorithms that are currently appearing under the big data paradigm, and their applications to a wide number of domains such as e-commerce, marketing, security, and healthcare. finally, summarising the concepts mentioned previously, fig. 
shows the conceptual representation of the three basic social big data areas: social media as a natural source for data analysis; big data as a parallel and massive processing paradigm; and data analysis as a set of algorithms and methods used to extract and analyse knowledge. the intersections between these clusters reflect the concept of mixing those areas. for example, the intersection between big data and data analysis shows some machine learning frameworks that have been designed on top of big data technologies (mahout [ ], mlbase [ , ], or sparkmlib [ ]). the intersection between data analysis and social media represents the concept of current web-based applications that intensively use social media information, such as applications related to marketing and e-health that are described in section . the intersection between big data and social media is reflected in some social media applications such as linkedin, facebook, and youtube that currently use big data technologies (mongodb, cassandra, hadoop, and so on) to develop their web systems. finally, the centre of this figure represents the main goal of any social big data application: knowledge extraction and exploitation.

currently, the exponential growth of social media has created serious problems for traditional data analysis algorithms and techniques (such as data mining, statistics, machine learning, and so on) due to their high computational complexity for large datasets. this type of method does not scale properly as the data size increases. for this reason, the methodologies and frameworks behind the big data concept are becoming very popular in a wide number of research and industrial areas. this section provides a short introduction to the methodology based on the mapreduce paradigm and a description of the most popular framework that implements this methodology, apache hadoop. afterwards, apache spark is described as an emerging big data framework that improves on the current performance of the hadoop framework. finally, some implementations and tools for the big data domain related to distributed data file systems, data analytics, and machine learning techniques are presented.

mapreduce [ , ] is presented as one of the most efficient big data solutions. this programming paradigm and its related algorithms [ ] were developed to provide significant improvements in large-scale data-intensive applications in clusters [ ]. the programming model implemented by mapreduce is based on the definition of two basic elements: mappers and reducers. the idea behind this programming model is to design map functions (or mappers) that are used to generate a set of intermediate key/value pairs, after which the reduce functions will merge (reduce can be used as a shuffling or combining function) all of the intermediate values that are associated with the same intermediate key. the key aspect of the mapreduce algorithm is that if every map and reduce is independent of all other ongoing maps and reduces, then the operations can be run in parallel on different keys and lists of data. although three functions, map(), combining()/shuffling(), and reduce(), are the basic processes in any mapreduce approach, usually they are decomposed as follows:

1. prepare the input: the mapreduce system designates map processors (or worker nodes), assigns the input key value k that each processor will work on, and provides each processor with all of the input data associated with that key value.
2. the map() step: each worker node applies the map() function to the local data and writes the output to a temporary storage space. the map() code is run exactly once for each k key value, generating output that is organised by key values k . a master node arranges it so that for redundant copies of input data only one is processed.
3. the shuffle() step: the map output is sent to the reduce processors, which assign the k key value that each processor should work on, and provide that processor with all of the map-generated data associated with that key value, such that all data belonging to one key are located on the same worker node.
4. the reduce() step: worker nodes process each group of output data (per key) in parallel, executing the user-provided reduce() code; each function is run exactly once for each k key value produced by the map step.
5. produce the final output: the mapreduce system collects all of the reduce outputs and sorts them by k to produce the final outcome.

fig. shows the classical "word count problem" using the mapreduce paradigm. as fig. shows, initially a process will split the data into a subset of chunks that will later be processed by the mappers. once the key/values are generated by the mappers, a shuffling process is used to mix (combine) these key values (combining the same keys in the same worker node). finally, the reduce functions are used to count the words, generating a common output as a result of the algorithm. as a result of the execution of the mappers/reducers, the output will be a sorted list of word counts from the original text input.
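a minimal sketch of the word-count example in python, simulating the map, shuffle, and reduce phases sequentially (a real deployment would distribute these steps across worker nodes, e.g. via hadoop streaming):

```python
from collections import defaultdict

def mapper(line):
    # emit an intermediate (word, 1) pair for every word in the line
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    # group all intermediate values by key, as the shuffle phase does
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # merge all values associated with the same intermediate key
    return key, sum(values)

lines = ["i thought of thinking", "of thanking you"]
pairs = (kv for line in lines for kv in mapper(line))
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(sorted(counts.items()))  # sorted list of word counts
```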
the key aspect of the mapreduce algorithm is that if every map and reduce is independent of all other ongoing maps and reduces, then the operations can be run in parallel on different keys and lists of data. although three functions, map(), combining()/shuffling(), and reduce(), are the basic processes in any mapreduce approach, they are usually decomposed as follows:

1. prepare the input: the mapreduce system designates map processors (or worker nodes), assigns the input key value k1 that each processor will work on, and provides each processor with all of the input data associated with that key value.
2. the map() step: each worker node applies the map() function to the local data and writes the output to a temporary storage space. the map() code is run exactly once for each k1 key value, generating output that is organised by key values k2. a master node arranges it so that for redundant copies of input data only one is processed.
3. the shuffle() step: the map output is sent to the reduce processors, which assign the k2 key value that each processor should work on, and provide that processor with all of the map-generated data associated with that key value, such that all data belonging to one key are located on the same worker node.
4. the reduce() step: worker nodes process each group of output data (per key) in parallel, executing the user-provided reduce() code; each function is run exactly once for each k2 key value produced by the map step.
5. produce the final output: the mapreduce system collects all of the reduce outputs and sorts them by k2 to produce the final outcome.

fig. shows the classical "word count problem" using the mapreduce paradigm. as the figure shows, a process initially splits the data into a subset of chunks, e.g. (1, "i thought i"), (2, "thought of thinking"), (3, "of thanking you"), that will later be processed by the mappers. once the key/values are generated by the mappers, a shuffling process is used to mix (combine) these key values, combining the same keys in the same worker node. finally, the reduce functions are used to count the words, generating a common output as a result of the algorithm. as a result of the execution of the mappers/reducers, the output will be a sorted list of word counts from the original text input. before applying this paradigm, it is essential to understand whether the algorithm can be translated to mappers and reducers or whether the problem should be analysed using traditional strategies. mapreduce provides an excellent technique to work with large sets of data when the algorithm can work on small pieces of that dataset in parallel, but if the algorithm cannot be mapped into this methodology, it may be "trying to use a sledgehammer to crack a nut".
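to make the mapper/shuffle/reducer pipeline concrete, the following is a minimal, framework-free python sketch of the word-count example above; it is a single-machine simulation for illustration only, whereas a real system distributes each stage across worker nodes:

```python
from collections import defaultdict
from itertools import chain

def mapper(_, text):
    # map(): emit an intermediate (word, 1) pair for every word in the chunk
    for word in text.split():
        yield word, 1

def shuffle(pairs):
    # shuffle(): group all intermediate values by their key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(word, counts):
    # reduce(): merge all values associated with the same key
    yield word, sum(counts)

chunks = [(1, "i thought i"), (2, "thought of thinking"), (3, "of thanking you")]
intermediate = chain.from_iterable(mapper(k, v) for k, v in chunks)
result = dict(chain.from_iterable(reducer(k, vs) for k, vs in shuffle(intermediate)))
print(sorted(result.items()))
# [('i', 2), ('of', 2), ('thanking', 1), ('thinking', 1), ('thought', 2), ('you', 1)]
```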
any mapreduce system (or framework) is based on a mapreduce engine that allows for implementing the algorithms and distributing the parallel processes. apache hadoop [ ] is an open-source software framework written in java for the distributed storage and distributed processing of very large datasets using the mapreduce paradigm. all of the modules in hadoop have been designed under the assumption that hardware failures (of individual machines or of racks of machines) are commonplace and thus should be automatically handled in software by the framework. the core of apache hadoop comprises a storage area, the hadoop distributed file system (hdfs), and a processing area (mapreduce). the hdfs (see section . . ) spreads multiple copies of the data across different machines. this not only offers reliability without the need for raid-controlled disks but also allows for multiple locations to run the mapping. if a machine with one copy of the data is busy or offline, another machine can be used. a job scheduler (in hadoop, the jobtracker) keeps track of which mapreduce jobs are executing; schedules individual maps, reduces, and intermediate merging operations to specific machines; monitors the successes and failures of these individual tasks; and works to complete the entire batch job. the hdfs and the job scheduler can be accessed by the processes and programs that need to read and write data and to submit and monitor the mapreduce jobs. however, hadoop presents a number of limitations:

1. for maximum parallelism, the maps and reduces need to be stateless and must not depend on any data generated in the same mapreduce job. you cannot control the order in which the maps or the reductions run.
2. hadoop is very inefficient (in both cpu time and power consumed) if you are running similar searches repeatedly. a database with an index will always be faster than running a mapreduce job over un-indexed data. however, if that index needs to be regenerated whenever data are added, and data are being added continually, mapreduce jobs may have an edge.
3. in the hadoop implementation, reduce operations do not take place until all of the maps have been completed (or have failed and been skipped). as a result, you do not receive any data back until the entire mapping has finished.
4. there is a general assumption that the output of the reduce is smaller than the input to the map; that is, you are taking a large data source and generating smaller final values.

apache spark [ ] is an open-source cluster computing framework that was originally developed in the amplab at the university of california, berkeley. spark had over contributors in june , making it a very high-activity project in the apache software foundation and one of the most active big data open-source projects. it provides high-level apis in java, scala, python, and r and an optimised engine that supports general execution graphs. it also supports a rich set of high-level tools including spark sql for sql and structured data processing, spark mllib for machine learning, graphx for graph processing, and spark streaming. the spark framework allows for reusing a working set of data across multiple parallel operations, which supports many iterative machine learning algorithms as well as interactive data analysis tools while retaining the scalability and fault tolerance of mapreduce. to achieve these goals, spark introduces an abstraction called resilient distributed datasets (rdds). an rdd is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. in contrast to hadoop's two-stage disk-based mapreduce paradigm (mappers/reducers), spark's in-memory primitives provide performance up to times faster for certain applications by allowing user programs to load data into a cluster's memory and to query it repeatedly. one of the many interesting features of spark is that this framework is particularly well suited to machine learning algorithms [ ].
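as an illustration of the rdd abstraction, the following small pyspark sketch expresses the same word-count task; it assumes a local spark installation, and the input path is a hypothetical placeholder:

```python
from pyspark import SparkConf, SparkContext

# local[*] = local mode with one executor thread per cpu core
sc = SparkContext(conf=SparkConf().setMaster("local[*]").setAppName("wordcount"))

lines = sc.textFile("data/corpus.txt")                # hypothetical input path
counts = (lines.flatMap(lambda line: line.split())    # emit the words of each line
               .map(lambda word: (word, 1))           # intermediate (key, value) pairs
               .reduceByKey(lambda a, b: a + b)       # merge the values of each key
               .cache())                              # keep the rdd in cluster memory

print(counts.take(10))                                # first ten (word, count) pairs
```

the cache() call is what distinguishes spark's style from hadoop's: the resulting rdd stays in memory and can be queried repeatedly without being recomputed or re-read from disk.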
from a distributed computing perspective, spark requires a cluster manager and a distributed storage system. for cluster management, spark supports stand-alone (native spark cluster), hadoop yarn, and apache mesos. for distributed storage, spark can interface with a wide variety of systems, including the hadoop distributed file system, apache cassandra, openstack swift, and amazon s3. spark also supports a pseudo-distributed local mode that is usually used only for development or testing purposes, when distributed storage is not required and the local file system can be used instead; in this scenario, spark runs on a single machine with one executor per cpu core. a list of big data implementations and mapreduce-based applications was compiled by mostosi [ ]. although the author finds that "it is [the list] still incomplete and always will be", his "big-data ecosystem table" [ ] contains more than references related to different big data technologies, frameworks, and applications and, to the best of this author's knowledge, is one of the best (and most exhaustive) lists of available big data technologies. this list comprises different topics related to big data, and a selection of those technologies and applications was chosen. the topics are related to: distributed programming, distributed file systems, document data models, key-value data models, graph data models, machine learning, applications, business intelligence, and data analysis. this selection attempts to reflect some of the recently popular frameworks and software implementations that are commonly used to develop efficient mapreduce-based systems and applications.

• apache pig. pig provides an engine for executing data flows in parallel on hadoop. it includes a language, pig latin, for expressing these data flows. pig latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.

• apache storm. storm is a complex event processor and distributed computation framework written primarily in the clojure programming language [ ]. it is a distributed real-time computation system for rapidly processing large streams of data. storm is based on a master-workers architecture, so a storm cluster mainly consists of master and worker nodes, with coordination done by zookeeper [ ].

• stratosphere [ ]. stratosphere is a general-purpose cluster computing framework. it is compatible with the hadoop ecosystem, accessing data stored in the hdfs and running with hadoop's new cluster manager, yarn. the common input formats of hadoop are supported as well. stratosphere does not use hadoop's mapreduce implementation; it is a completely new system that brings its own runtime. the new runtime allows for defining more advanced operations that include more transformations than only map and reduce. additionally, stratosphere allows for expressing analysis jobs using advanced data flow graphs, which are able to resemble common data analysis tasks more naturally.

• apache hdfs. the most widespread and popular distributed file system for mapreduce frameworks and applications is the hadoop distributed file system. the hdfs offers a way to store large files across multiple machines. hadoop and hdfs were derived from the google file system (gfs) [ ].

• apache cassandra. cassandra is a recent open-source fork of a stand-alone distributed non-sql dbms that was initially coded by facebook and derived from what was known of the original google bigtable [ ] and google file system designs [ ]. cassandra uses a system inspired by amazon's dynamo for storing data, and mapreduce can retrieve data from cassandra.
cassandra can run without the hdfs or on top of it (the datastax fork of cassandra).

• apache giraph. giraph is an iterative graph processing system built for high scalability. it is currently used at facebook to analyse the social graph formed by users and their connections. giraph originated as the open-source counterpart to pregel [ ], the graph processing framework developed at google (see section . for a further description).

• mongodb. mongodb is an open-source document-oriented database system and is part of the nosql family of database systems [ ]. it provides high performance, high availability, and automatic scaling. instead of storing data in tables as is done in a classical relational database, mongodb stores structured data as json-like documents, which are data structures composed of field and value pairs. its index system supports faster queries and can include keys from embedded documents and arrays. moreover, this database allows users to distribute data across a cluster of machines.

• apache mahout [ ]. the mahout(tm) machine learning (ml) library is an apache(tm) project whose main goal is to build scalable libraries that contain implementations of a number of conventional ml algorithms (dimensionality reduction, classification, clustering, and topic models, among others). in addition, this library includes implementations of a set of recommender systems (user-based and item-based strategies). the first versions of mahout implemented the algorithms on the hadoop framework, but recent versions include many new implementations built on the mahout-samsara environment, which runs on spark and h2o. the new spark item-similarity implementations enable the next generation of co-occurrence recommenders that can use entire user click streams and contexts in making recommendations.

• spark mllib [ ]. mllib is spark's scalable machine learning library, which consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives. it supports writing applications in java, scala, or python and can run on any hadoop/yarn cluster with no preinstallation. the first version of mllib was developed at uc berkeley by contributors, and it provided a limited set of standard machine learning methods. however, mllib is currently experiencing dramatic growth, and it has over contributors from over organisations.

• mlbase [ ]. the mlbase platform consists of three layers: ml optimizer, mllib, and mli. ml optimizer (currently under development) aims to automate the task of ml pipeline construction. the optimizer solves a search problem over the feature extractors and ml algorithms included in mli and mllib. mli [ ] is an experimental api for feature extraction and algorithm development that introduces high-level ml programming abstractions. a prototype of mli has been implemented against spark and serves as a test bed for mllib. finally, mllib is apache spark's distributed ml library. mllib was initially developed as part of the mlbase project, and the library is currently supported by the spark community.

• pentaho. pentaho is an open-source data integration (kettle) tool that delivers powerful extraction, transformation, and loading capabilities using a groundbreaking, metadata-driven approach. it also provides analytics, reporting, visualisation, and a predictive analytics framework that is directly designed to work with hadoop nodes.
it provides data integration and analytic platforms based on hadoop in which datasets can be streamed, blended, and then automatically published into one of the popular analytic databases.

• sparkr. there are a number of important r-based applications for mapreduce and other big data applications. r [ ] is a popular and extremely powerful programming language for statistics and data analysis. sparkr provides an r frontend for spark. it allows users to interactively run jobs from the r shell on a cluster, automatically serialises the necessary variables to execute a function on the cluster, and also allows for easy use of existing r packages.

social big data analytics can be seen as the set of algorithms and methods used to extract relevant knowledge from social media data sources whose contents may be heterogeneous, very large, and constantly changing (stream or online data). it is inherently interdisciplinary and spans areas such as data mining, machine learning, statistics, graph mining, information retrieval, and natural language processing, among others. this section provides a description of the basic methods and algorithms related to network analytics, community detection, text analysis, information diffusion, and information fusion, which are the areas currently used to analyse and process information from social-based sources. today, society lives in a connected world in which communication networks are intertwined with daily life. for example, social networks are one of the most important sources of social big data; specifically, twitter generates over million tweets every day [ ]. in social networks, individuals interact with one another and provide information on their preferences and relationships, and these networks have become important tools for collective intelligence extraction. these connected networks can be represented using graphs, and network analytic methods [ ] can be applied to them to extract useful knowledge. graphs are structures formed by a set of vertices (also called nodes) and a set of edges, which are connections between pairs of vertices. the information extracted from a social network can be easily represented as a graph in which the vertices or nodes represent the users and the edges represent the relationships among them (e.g., a re-tweet of a message or a favourite mark in twitter). a number of network metrics can be used to perform social analysis of these networks. usually, the importance, or influence, of a node in a social network is analysed through centrality measures. these measures have a high computational complexity in large-scale networks.
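as a small illustration of graph-based social analysis, the sketch below builds a hypothetical toy retweet network with networkx and ranks users by betweenness centrality, one of the standard influence measures:

```python
import networkx as nx

# hypothetical retweet network: an edge means "retweeted a message of"
G = nx.Graph()
G.add_edges_from([("ana", "bob"), ("bob", "carol"), ("bob", "dave"),
                  ("carol", "dave"), ("dave", "eve")])

# betweenness centrality: fraction of shortest paths passing through a node
influence = nx.betweenness_centrality(G)
for user, score in sorted(influence.items(), key=lambda kv: -kv[1]):
    print(f"{user}: {score:.2f}")   # the users bridging the network rank first
```

exact centrality computation scales poorly with network size (brandes' algorithm is o(nm) on unweighted graphs), which is precisely what motivates the distributed graph frameworks discussed next.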
to solve this problem, with a focus on large-scale graph analysis, a second generation of frameworks based on the mapreduce paradigm has appeared, including hama, giraph (based on pregel), and graphlab, among others [ ]. pregel [ ] is a graph-parallel system based on the bulk synchronous parallel (bsp) model [ ]. a bsp abstract computer can be interpreted as a set of processors that may follow different threads of computation, in which each processor is equipped with fast local memory and interconnected by a communication network. accordingly, a platform based on this model comprises three major components:

• components capable of processing and/or local memory transactions (i.e., processors).
• a network that routes messages between pairs of these components.
• a hardware facility that allows for the synchronisation of all or a subset of components.

taking this model into account, a bsp algorithm is a sequence of global supersteps, each of which consists of three components:

1. concurrent computation: every participating processor may perform local asynchronous computations.
2. communication: the processes exchange data from one processor to another, facilitating remote data storage capabilities.
3. barrier synchronisation: when a process reaches this point (the barrier), it waits until all other processes have reached the same barrier.

hama [ ] and giraph are two distributed graph processing frameworks on hadoop that implement pregel. the main difference between the two frameworks is the matrix computation using the mapreduce paradigm. apache giraph is an iterative graph processing system in which the input is a graph composed of vertices and directed edges. computation proceeds as a sequence of iterations (supersteps). initially, every vertex is active, and in each superstep every active vertex invokes the "compute method" that implements the graph algorithm to be executed; this means that algorithms implemented using giraph are vertex-oriented (a minimal simulation of this style is sketched below). apache hama does not only allow users to work with pregel-like graph applications; this computing engine can also be used to perform compute-intensive general scientific applications and machine learning algorithms. moreover, it currently supports yarn, the resource management technology that lets multiple computing frameworks run on the same hadoop cluster using the same underlying storage; therefore, the same data could be analysed using mapreduce or spark. in contrast, graphlab is based on a different concept. whereas pregel is a one-vertex-centric model, this framework uses vertex-to-node mapping in which each vertex can access the state of adjacent vertices. in pregel, the interval between two supersteps is defined by the run time of the vertex with the largest neighbourhood. the graphlab approach improves on this by splitting vertices with large neighbourhoods across different machines and synchronising them. finally, elser and montresor [ ] present a study of these data frameworks and their application to graph algorithms. the k-core decomposition algorithm is adapted to each framework; the goal of this algorithm is to compute the centrality of each node in a given graph. the results obtained confirm the improvement achieved in terms of execution time by these hadoop-based frameworks. however, from a programming paradigm point of view, the authors recommend pregel-inspired (vertex-centric) frameworks, which are the better fit for graph-related problems.
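to give the flavour of the vertex-centric style, here is a minimal single-machine simulation of pregel-like supersteps computing connected components by min-label propagation; it sketches the programming model only and is not giraph's actual api:

```python
def pregel_components(adjacency):
    """pregel-style supersteps: every vertex adopts the smallest label it hears."""
    label = {v: v for v in adjacency}            # per-vertex state, initially own id
    active = set(adjacency)                      # vertices that must message neighbours
    while active:                                # each loop iteration = one superstep
        outbox = {v: [] for v in adjacency}
        for v in active:                         # send current label along edges
            for u in adjacency[v]:
                outbox[u].append(label[v])
        active = set()
        for v, msgs in outbox.items():           # compute(): adopt the smallest label
            if msgs and min(msgs) < label[v]:
                label[v] = min(msgs)
                active.add(v)                    # changed vertices stay active
    return label                                 # halt when no vertex is active

print(pregel_components({1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}))
# {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}  -> two components, labelled 1 and 4
```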
the community detection problem in complex networks has been the subject of many studies in the fields of data mining and social network analysis. the goal of community detection is similar to the idea of graph partitioning in graph theory [ , ]: a cluster in a graph can be easily mapped into a community. despite the ambiguity of the community definition, numerous techniques have been used for detecting communities; random walks, spectral clustering, modularity maximisation, and statistical mechanics have all been applied [ ]. these algorithms are typically based on the topology information of the graph or network. regarding graph connectivity, each cluster should be connected; that is, there should be multiple paths connecting each pair of vertices within the cluster. it is generally accepted that a subset of vertices forms a good cluster if the induced sub-graph is dense and there are few connections from the included vertices to the rest of the graph [ ]. considering both connectivity and density, a possible definition of a graph cluster could be a connected component or a maximal clique [ ], i.e., a sub-graph to which no vertex can be added without losing the clique property. one of the most well-known algorithms for community detection was proposed by girvan and newman [ ]. this method uses a similarity measure called "edge betweenness" based on the number of shortest paths between all vertex pairs. the algorithm identifies the edges that lie between communities and successively removes them, achieving the isolation of the communities. its main disadvantage is its high computational complexity on very large networks. modularity is the most used and best-known quality measure for graph clustering techniques, but its maximisation is an np-complete problem. however, there are currently a number of algorithms based on good approximations of modularity that are able to detect communities in a reasonable time. the first greedy technique to maximise modularity was a method proposed by newman [ ]. this was an agglomerative hierarchical clustering algorithm in which groups of vertices were successively joined to form larger communities such that modularity increased after the merging. the matrix update in the newman algorithm involved a large number of useless operations owing to the sparseness of the adjacency matrix; the algorithm was improved by clauset et al. [ ], who used the matrix of modularity variations to make the algorithm perform more efficiently. despite the improvements to and modifications of these greedy algorithms, they perform poorly when compared against other techniques. for this reason, newman reformulated the modularity measure in terms of eigenvectors by replacing the laplacian matrix with the modularity matrix [ ], called the spectral optimisation of modularity. this improvement can also be applied to improve the results of other optimisation techniques [ , ]. random walks can also be useful for finding communities. if a graph has a strong community structure, a random walker spends a long time inside a community because of the high density of internal edges and the consequent number of paths that could be followed. zhou and lipowsky [ ], based on the fact that walkers move preferentially towards vertices that share a large number of neighbours, defined a proximity index that indicates how close a pair of vertices is to all other vertices. communities are detected with a procedure called netwalk, an agglomerative hierarchical clustering method in which the similarity between vertices is expressed by their proximity.
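as an illustration of the algorithms above, networkx ships an implementation of the girvan-newman method; the sketch below runs it on zachary's karate club network (a standard benchmark) and scores the first split with modularity:

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

G = nx.karate_club_graph()               # classic benchmark social network

# girvan-newman iteratively removes the edge with the highest betweenness;
# each step of the returned generator yields the next, finer partition
partitions = girvan_newman(G)
first_split = next(partitions)           # communities after the first split

print([sorted(c) for c in first_split])
print("modularity:", round(modularity(G, first_split), 3))
```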
a number of these techniques focus on finding disjoint communities: the network is partitioned into dense regions in which nodes have more connections to each other than to the rest of the network. however, in some domains a vertex can belong to several clusters; for instance, it is well known that people in a social network form natural memberships in multiple communities. therefore, overlap is a significant feature of many real-world networks. to address this, fuzzy clustering algorithms applied to graphs [ ] and overlapping approaches [ ] have been proposed. xie et al. [ ] reviewed the state of the art in overlapping community detection algorithms. this work noted that for networks with low overlapping density, slpa, oslom, game, and copra offer better performance, while for networks with high overlapping density and high overlapping diversity, both slpa and game provide relatively stable performance. however, the test results also suggested that detection in such networks is still not fully resolved. a common feature observed by various algorithms in real-world networks is the relatively small fraction of overlapping nodes (typically less than %), each of which belongs to only a small number of communities. a significant portion of the unstructured content collected from social media is text. text mining techniques can be applied for the automatic organisation, navigation, retrieval, and summarisation of huge volumes of text documents [ ] [ ] [ ]. this concept covers a number of topics and algorithms for text analysis, including natural language processing (nlp), information retrieval, data mining, and machine learning [ ]. information extraction techniques attempt to extract entities and their relationships from texts, allowing for the inference of new meaningful knowledge; these techniques are the starting point for a number of text mining algorithms. a usual model for representing the content of documents or text is the vector space model. in this model, each document is represented by a vector of frequencies of the terms it contains [ ]. the term frequency (tf) relates the number of occurrences of a particular word in the document to the number of words in the entire document. another commonly used function is the inverse document frequency (idf); typically, documents are represented as tf-idf feature vectors. using this data representation, a document is a data point in an n-dimensional space, where n is the size of the corpus vocabulary. text data tend to be sparse and high-dimensional. a text document corpus can be represented as a large sparse tf-idf matrix, and applying dimensionality reduction methods to represent the data in a compressed format [ ] can be very useful. latent semantic indexing [ ] is an automatic indexing method that projects both documents and terms into a low-dimensional space that attempts to represent the semantic concepts in the documents. this method is based on the singular value decomposition of the term-document matrix, which constructs a low-rank approximation of the original matrix while preserving the similarity between the documents. another family of dimension reduction techniques is based on probabilistic topic models such as latent dirichlet allocation (lda) [ ]. this technique provides a mechanism for identifying patterns of term co-occurrence and using those patterns to identify coherent topics. standard lda implementations read the documents of the training corpus numerous times and in a serial way, but new, efficient, parallel implementations of the algorithm have appeared [ ] to improve its efficiency.
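the following scikit-learn sketch, with a toy three-document corpus, shows the tf-idf vector space representation and a truncated-svd projection, which is the core operation of latent semantic indexing:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the flu outbreak spreads quickly",
        "flu vaccine campaign announced",
        "stock market closes higher today"]

# vector space model: each document becomes a sparse tf-idf vector
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)            # shape: (n_docs, vocabulary size)

# latent semantic indexing: low-rank svd projection of the term-document matrix
lsi = TruncatedSVD(n_components=2)
Z = lsi.fit_transform(X)                 # dense, low-dimensional document vectors

print(X.shape, Z.shape)                  # e.g. (3, 12) (3, 2)
```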
unsupervised machine learning methods can be applied to any text data without the need for a previous manual labelling process. specifically, clustering techniques are widely studied in this domain to find hidden information or patterns in text datasets. these techniques can automatically organise a document corpus into clusters of similar groups based on a blind search in an unlabelled data collection, grouping the data with similar properties into clusters without human supervision. generally, document clustering methods can be categorised into two types [ ]: partitioning algorithms, which divide a document corpus into a given number of disjoint clusters that are optimal in terms of some predefined criterion functions [ ], and hierarchical algorithms, which group the data points into a hierarchical tree structure or dendrogram [ ]. both types of clustering algorithms have strengths and weaknesses depending on the structure and characteristics of the dataset used. zhao and karypis [ ] performed a comparative assessment of different clustering algorithms (partitioning and hierarchical) using different similarity measures on high-dimensional text data. the study showed that partitioning algorithms perform better and can also be used to produce hierarchies of higher quality than those returned by hierarchical algorithms. in contrast, the classification problem is one of the main topics in the supervised machine learning literature. nearly all of the well-known techniques for classification, such as decision trees, association rules, bayes methods, nearest neighbour classifiers, svm classifiers, and neural networks, have been extended for automated text categorisation [ ]. sentiment classification has been studied extensively in the area of opinion mining research, and this problem can be formulated as a classification problem with three classes: positive, negative, and neutral. therefore, most of the existing techniques designed for this purpose are based on classifiers [ ]. however, the emergence of social networks has created massive and continuous streams of text data, and new challenges have arisen in adapting the classic machine learning methods because of the need to process these data under a one-pass constraint [ ]: data mining tasks must be performed online and only once, as the data come in. for example, the online spherical k-means algorithm [ ] is a segment-wise approach that was proposed for streaming text clustering. this technique splits the incoming text stream into small segments that can be processed effectively in memory; then, a set of k-means iterations is applied to each segment in order to cluster them. moreover, a decay factor is included so that older documents have less influence during the clustering process.
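the one-pass constraint can be illustrated with scikit-learn's mini-batch k-means, which is not the online spherical k-means itself but follows the same segment-wise idea; the stream below is a stand-in generator with invented messages:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

def stream_of_text_segments():
    # stand-in for a real tweet stream: yields small batches of messages
    yield ["flu outbreak in town", "feeling sick today", "new phone release"]
    yield ["vaccine campaign starts", "match tonight", "fever and cough"]

# a hashing vectorizer needs no fitted vocabulary, so it suits unbounded streams
vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False)
km = MiniBatchKMeans(n_clusters=2, random_state=0)

for segment in stream_of_text_segments():
    X = vectorizer.transform(segment)    # vectorise this segment only
    km.partial_fit(X)                    # one-pass update of the centroids
```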
according to their analytics, these diffusion models can be categorised into explanatory and predictive approaches [ ].

• explanatory models: the aim of these models is to discover the hidden spreading cascades once the activation sequences are collected. these models can build a path that helps users to easily understand how the information has been diffused. the netint method [ ] applies sub-modular, function-based iterative optimisation to discover the spreading cascade (path) that maximises the likelihood of the collected dataset. in particular, for working with missing data, a k-tree model [ ] has been proposed to estimate the complete activation sequences.

• predictive models: these are based on learning processes over the observed diffusion patterns. depending on the previous diffusion patterns, there are two main categories of predictive models: (i) structure-based models (graph-based approaches) and (ii) content-analysis-based models (non-graph-based approaches).

moreover, further approaches to understanding information diffusion patterns exist. the projected greedy approach for non-sub-modular problems [ ] was recently proposed to select useful seeds in social networks; this approach can identify partial optimisations for understanding information diffusion. additionally, an evolutionary dynamics model was presented in [ , ] that attempts to understand the temporal dynamics of information diffusion over time. one of the relevant topics for analysing information diffusion patterns and models is the concept of time and how it can be represented and managed. one popular approach is based on time series; a time series can be defined as a chronological collection of observations or events. the main characteristics of this type of data are large size, high dimensionality, and continuous change. in the context of data mining, the main problem is how to represent the data: an effective mechanism for compressing the vast amount of time series data is needed in the context of information diffusion. based on such a representation, different data mining techniques can be applied, such as pattern discovery, classification, rule discovery, and summarisation [ ]. in lin et al. [ ], a new symbolic representation of time series is proposed that allows for dimensionality/numerosity reduction; this representation is tested on classic data mining tasks such as clustering, classification, query by content, and anomaly detection (a toy sketch of this style of representation is given below).
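as a rough illustration of symbolic time-series compression in the style of lin et al.'s representation (a simplified sketch, not their exact algorithm), the function below z-normalises a diffusion curve, averages it piecewise, and quantises the pieces into letters:

```python
import numpy as np

def sax_like(series, n_segments=8, alphabet="abcd"):
    """symbolic representation: piecewise means quantised into letters."""
    x = (series - series.mean()) / series.std()       # z-normalise the series
    paa = x.reshape(n_segments, -1).mean(axis=1)      # piecewise aggregate approximation
    breakpoints = np.array([-0.67, 0.0, 0.67])        # equiprobable gaussian bins
    return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa)

# hypothetical diffusion curve: cumulative adopters of a message over time
cascade = np.array([1, 2, 4, 9, 20, 35, 38, 40, 41, 41,
                    42, 42, 43, 43, 43, 43], dtype=float)
print(sax_like(cascade))   # a short string such as 'aabccddd'
```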
based on the mathematical models mentioned above, we need to compare the various applications that can support users in many different domains. one of the most promising applications is detecting meaningful social events and popular topics in society. such meaningful events and topics can be discovered by well-known text processing schemes (e.g., tf-idf) and statistical approaches (e.g., lda, gibbs sampling, and the tste method [ ]). in particular, not only the time domain but also the frequency domain has been exploited to identify the most frequent events [ ]. social big data from various sources needs to be fused to provide users with better services. this fusion can be done in different ways and affects different technologies, methods, and even research areas. two of these areas are ontologies and social networks; how each could benefit from information fusion in social big data is briefly described next:

• ontology-based fusion. semantic heterogeneity is an important issue in information fusion. social networks have inherently different semantics from other types of networks. such semantic heterogeneity includes not only linguistic differences (e.g., between 'reference' and 'bibliography') but also mismatches between conceptual structures. to deal with these problems, in [ ] ontologies are exploited from multiple social networks and, more importantly, semantic correspondences are obtained by ontology matching methods. more practically, semantic mashup applications have been demonstrated: to remedy the data integration issues of traditional web mashups, the semantic technologies use linked open data (lod) based on the rdf data model as the unified data model for combining, aggregating, and transforming data from heterogeneous data resources to build linked data mashups [ ].

• social network integration. the next issue is how to integrate distributed social networks. as many kinds of social networking services have been developed, users join multiple services for social interactions with other users and collect a large amount of information (e.g., statuses on facebook and tweets on twitter). an interesting framework has been proposed for social identity matching (sim) across these multiple services [ ]. the proposed approach can protect user privacy because only public information (e.g., the username and the social relationships of the users) is employed to find the best matches between social identities. additionally, a cloud-based platform has been applied to build the software infrastructure where social network information can be shared and exchanged [ ].

social big data analysis can be applied to social media data sources to discover relevant knowledge that can be used to improve the decision making of individual users and companies [ ]. in this context, business intelligence can be defined as the techniques, systems, methodologies, and applications that analyse critical business data to help an enterprise better understand its business and market and to support business decisions [ ]. this field includes methodologies that can be applied to different areas such as e-commerce, marketing, security, and healthcare [ ]; more recent methodologies have been applied to treat social big data. this section provides short descriptions of some applications of these methodologies in domains that intensively use social big data sources for business intelligence. marketing researchers believe that big social media analytics and cloud computing offer a unique opportunity for businesses to obtain opinions from a vast number of customers, improving traditional strategies. a significant market transformation has been accomplished by leading e-commerce enterprises such as amazon and ebay through their innovative and highly scalable e-commerce platforms and recommender systems. social network analysis extracts user intelligence and can provide firms with the opportunity to generate more targeted advertising and marketing campaigns. maurer and wiegmann [ ] present an analysis of advertising effectiveness on social networks. in particular, they carried out a case study using facebook to determine users' perceptions of facebook ads. the authors found that most of the participants perceived the ads on facebook as annoying or not helpful for their purchase decisions.
however, trattner and kappe [ ] show how ads placed in users' social streams, generated by facebook tools and applications, can increase the number of visitors and the profit and roi of a web-based platform. in addition, the authors present an analysis of real-time measures to detect the most valuable users on facebook. a study of microblogging (twitter) utilisation as an ewom (electronic word-of-mouth) advertising mechanism is carried out in jansen et al. [ ]. this work analyses the range, frequency, timing, and content of tweets in various corporate accounts. the results obtained show that % of microblogs mention a brand and that, of the branding microblogs, nearly % contained some expression of brand sentiment. therefore, the authors conclude that microblogging reports what customers really feel about the brand and its competitors in real time, and it is a potential advantage to explore it as part of companies' overall marketing strategies. customers' brand perceptions and purchasing decisions are increasingly influenced by social media services, and these offer new opportunities to build brand relationships with potential customers. another approach that uses twitter data is presented in asur et al. [ ] to forecast box-office revenues for movies. the authors show how a simple model built from the rate at which tweets are created about particular topics can outperform market-based predictors. moreover, sentiment extraction from twitter is used to improve the forecasting power of social media. because of the exponential growth in the use of social networks, researchers are actively attempting to model the dynamics of viral marketing based on the information diffusion process. ma et al. [ ] proposed modelling social network marketing using heat diffusion processes. heat diffusion is a physical phenomenon in which heat always flows from a position with higher temperature to a position with lower temperature. the authors present three diffusion models along with three algorithms for selecting the best individuals to receive marketing samples. these models can diffuse both positive and negative comments on products or brands in order to simulate the real opinions within social networks. moreover, the authors' complexity analysis shows that the model is also scalable to large social networks. table shows a brief summary of the previously described applications, including the basic functionality and methods of each:

table: basic features related to social big data applications in the marketing area.
ref. | summary | methods
trattner and kappe [ ] | targeted advertising on facebook | real-time measures to detect the most valuable users
jansen et al. [ ] | twitter as an ewom advertising mechanism | sentiment analysis
asur et al. [ ] | using twitter to forecast box-office revenues for movies | topic detection, sentiment analysis
ma et al. [ ] | viral marketing in social networks | social network analysis, information diffusion models

criminals tend to have repetitive behaviour patterns, and these behaviours depend on situational factors; that is, crime will be concentrated in environments whose features facilitate criminal activities [ ]. the purpose of crime data analysis is to identify these crime patterns, allowing for detecting and discovering crimes and their relationships with criminals. the knowledge extracted by applying data mining techniques can be very useful in supporting law enforcement agencies.
communication between citizens and government agencies occurs mostly through telephones, face-to-face meetings, email, and other digital forms. most of these communications are saved or transformed into written text and then archived in a digital format, which has created opportunities for automatic text analysis using nlp techniques to improve the effectiveness of law enforcement [ ]. a decision support system that combines nlp techniques, similarity measures, and classification approaches is proposed by ku and leroy [ ] to automate and facilitate crime analysis. filtering reports and identifying those that are related to the same or similar crimes can provide useful information for analysing crime trends, which allows for apprehending suspects and improving crime prevention. traditional crime data analysis techniques are typically designed to handle one particular type of dataset and often overlook geospatial distribution. geographic knowledge discovery can be used to discover patterns of criminal behaviour that may help in detecting where, when, and why particular crimes are likely to occur. based on this concept, phillips and lee [ ] present a crime data analysis technique that allows for discovering co-distribution patterns between large, aggregated, heterogeneous datasets. in this approach, aggregated datasets are modelled as graphs that store the geospatial distribution of crime within given regions, and these graphs are then used to discover datasets that show similar geospatial distribution characteristics. the experimental results obtained in this work show that it is possible to discover geospatial co-distribution relationships among crime incidents and socio-economic, socio-demographic, and spatial features. another analytical technique now in heavy use by law enforcement agencies to visually identify where crime tends to be highest is hotspot mapping. this technique is used to predict where crime may happen, using data from the past to inform future actions. each crime event is represented as a point, allowing for the analysis of the geographic distribution of these points. a number of mapping techniques can be used to identify crime hotspots, such as point mapping, thematic mapping of geographic areas, spatial ellipses, grid thematic mapping, and kernel density estimation (kde), among others. chainey et al. [ ] conducted a comparative assessment of these techniques, and the results showed that kde consistently outperformed the others. moreover, the authors offered a benchmark for comparison with the results of other techniques and other crime types, including comparisons between advanced spatial analysis techniques and prediction mapping methods. another novel approach, using spatio-temporally tagged tweets for crime prediction, is presented by gerber [ ]. this work shows the use of twitter, applying linguistic analysis and statistical topic modelling to automatically identify discussion topics across a city in the united states. the experimental results showed that adding twitter data improved crime prediction performance over a standard approach based on kde.
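a minimal kde-based hotspot sketch with scikit-learn is shown below; the incident coordinates are hypothetical, and a real analysis would use projected coordinates and a tuned bandwidth:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# hypothetical geocoded incidents as (longitude, latitude) pairs
incidents = np.array([[-0.13, 51.51], [-0.12, 51.52], [-0.14, 51.50],
                      [-0.12, 51.51], [-0.13, 51.52]])

kde = KernelDensity(kernel="gaussian", bandwidth=0.01).fit(incidents)

# score a regular grid of candidate locations; high density = predicted hotspot
lon, lat = np.meshgrid(np.linspace(-0.15, -0.10, 50), np.linspace(51.49, 51.53, 50))
grid = np.c_[lon.ravel(), lat.ravel()]
density = np.exp(kde.score_samples(grid))     # log-density back to density

print("predicted hotspot:", grid[density.argmax()])
```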
finally, the use of data mining in fraud detection is very popular, and there are numerous studies in this area. kirkos et al. [ ] analysed the effectiveness of data mining classification techniques (decision trees, neural networks, and bayesian belief networks) for identifying fraudulent financial statements, and the experimental results concluded that bayesian belief networks provided the highest accuracy for fraud classification. another approach, for detecting fraud in real-time credit card transactions, was presented by quah and sriganesh [ ]. their system uses a self-organising map to filter and analyse customer behaviour to detect fraud; the main idea is to learn the patterns of the legal cardholder and of fraudulent transactions through neural network learning and then to develop rules for these two different behaviours. one typical fraud in this area is the atm phone scam, which attempts to transfer a victim's money into fraudulent accounts. in order to identify the signs of fraudulent accounts and the patterns of fraudulent transactions, li et al. [ ] applied bayesian classification and association rules; detection rules were developed based on the identified signs and applied to the design of a fraudulent account detection system. table shows a brief summary of all of the applications previously mentioned, with the basic functionality and main methods of each:

ref. | summary | methods
phillips and lee [ ] | technique to discover geospatial co-distribution relations among crime incidents | network analysis
chainey et al. [ ] | comparative assessment of mapping techniques to predict where crimes may happen | spatial analysis, mapping methods
gerber [ ] | identify discussion topics across a city in the united states to predict crimes | linguistic analysis, statistical topic modelling
kirkos et al. [ ] | identification of fraudulent financial statements | classification (decision trees, neural networks, and bayesian belief networks)
quah and sriganesh [ ] | detect fraud in real-time credit card transactions | neural network learning, association rules
li et al. [ ] | identify the signs of fraudulent accounts and the patterns of fraudulent transactions | bayesian classification, association rules

epidemic intelligence can be defined as the early identification, assessment, and verification of potential public health risks [ ] and the timely dissemination of the appropriate alerts. this discipline includes surveillance techniques for the automated and continuous analysis of unstructured free text or media information available on the web from social networks, blogs, digital news media, and official sources. text mining techniques have been applied to biomedical text corpora for named entity recognition, text classification, terminology extraction, and relationship extraction [ ]. these methods are human language processing algorithms that aim to convert unstructured textual data from large-scale collections into a specific format, filtering them according to need. they can be used to detect words related to diseases or their symptoms in published texts [ ]. however, this can be difficult because the same word can refer to different things depending on context. furthermore, a specific disease can have multiple associated names and symptoms, which increases the complexity of the problem. ontologies can help to automate human understanding of key concepts and the relationships between them, and they allow for achieving a certain level of filtering accuracy.
in the health domain, it is necessary to identify and link term classes such as diseases, symptoms, and species in order to detect the potential focus of disease outbreaks. currently, a number of biomedical ontologies are available that contain all of the necessary terms. for example, the biocaster ontology [ ] is based on the owl semantic web language, and it was designed to support automated reasoning across terms in several languages. the increasing popularity of microblogging services such as twitter has recently made them a valuable new data source for web-based surveillance because of their message volume and frequency. twitter users may post about an illness, and their relationships in the network can give us information about whom they could be in contact with. furthermore, user posts retrieved from the public twitter api can come with gps-based location tags, which can be used to locate the potential centre of disease outbreaks. a number of works have already appeared that show the potential of twitter messages to track and predict outbreaks. a document classifier to identify relevant messages was presented in culotta [ ]. in this work, twitter messages related to the flu were gathered, and a number of classification systems based on different regression models were compared to correlate these messages with cdc statistics; the study found that the best model had a correlation of . (simple model regression). aramaki [ ] presented a comparative study of various machine-learning methods to classify tweets related to influenza into two categories: positive and negative. the experimental results showed that an svm model using a polynomial kernel achieved the highest accuracy (f-measure of . ) with the lowest training time. well-known regression models were evaluated on their ability to assess disease outbreaks from tweets in bodnar and salathé [ ]. regression methods such as linear, multivariable, and svm regression were applied to the raw count of tweets containing at least one keyword related to a specific disease, in this case "flu". the models also validated that, even when using irrelevant tweets and randomly generated datasets, regression methods were able to assess disease levels comparatively well. a new unsupervised machine learning approach to detect public health events was proposed in fisichella et al. [ ]; it can complement existing systems because it allows for identifying public health events even when no matching keywords or linguistic patterns can be found. this approach defines a generative model for predictive event detection from documents by modelling the features based on trajectory distributions.
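a toy version of this kind of tweet classifier is sketched below with scikit-learn, using a polynomial-kernel svm in the spirit of the study by aramaki; the labelled tweets are invented placeholders, and a real system would train on thousands of annotated messages:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# hypothetical training data: 1 = author reports an actual infection, 0 = other mention
tweets = ["i caught the flu and have a fever",
          "flu vaccines are available at the pharmacy",
          "feeling awful, flu symptoms all week",
          "interesting article about flu season"]
labels = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), SVC(kernel="poly", degree=2))
model.fit(tweets, labels)

print(model.predict(["stuck in bed with the flu"]))   # classify an unseen message
```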
however, in recent years, a number of surveillance systems have appeared that apply these social mining techniques and that are widely used by public health organizations such as the world health organization (who) and the european centre for disease prevention and control [ ]. tracking and monitoring mechanisms for early detection are critical in reducing the impact of epidemics through rapid responses. one of the earliest surveillance systems is the global public health intelligence network (gphin) [ ], developed by the public health agency of canada in collaboration with the who. it is a secure, web-based, multilingual warning tool that continuously monitors and analyses global media data sources to identify information about disease outbreaks and other events related to public healthcare. the information is filtered for relevance by an automated process and is then analysed by public health agency of canada gphin officials. from to , this surveillance system was able to detect the outbreak of sars (severe acute respiratory syndrome). the biocaster system [ ] for monitoring online media data arose in from the biocaster ontology. the system continuously analyses documents reported from over rss feeds, google news, who, promed-mail, and the european media monitor, among other data sources. the extracted text is classified based on its topical relevance and plotted onto a google map using geo-information. the system has four main stages: topic classification, named entity recognition, disease/location detection, and event recognition. in the first stage, the texts are classified as relevant or non-relevant using a naive bayes classifier. then, for the relevant document corpora, entities of interest are searched for, drawn from the concept types of the ontology related to diseases, viruses, bacteria, locations, and symptoms. the healthmap project [ ] is a global disease alert map that uses data from different sources, such as google news, expert-curated discussions such as promed-mail, and official organization reports such as those from the who or eurosurveillance; it is an automated real-time system that monitors, organises, integrates, filters, visualises, and disseminates online information about emerging diseases. another system that collects news from the web related to human and animal health and plots the data on google maps is epispider [ ]. this tool automatically extracts information on infectious disease outbreaks from multiple sources, including promed-mail and medical websites, and it is used as a surveillance system by public healthcare organizations, a number of universities, and health research organizations. additionally, this system automatically converts the topic and location information from the reports into rss feeds. finally, lyon et al. [ ] conducted a comparative assessment of three of these systems (biocaster, epispider, and healthmap) regarding their ability to gather and analyse information relevant to public health.
epispider obtained more relevant documents than the others in this study. however, depending on the language of each system, the ability to acquire relevant information from different countries differed significantly; for instance, biocaster gives special priority to languages from the asia-pacific region, whereas epispider only considers documents written in english. table shows a summary of the previous applications and their related functionalities and methods:

table: basic features related to social big data applications in the health care area.
ref. | summary | methods
culotta [ ] | track and predict outbreaks using twitter | classification (regression models)
aramaki et al. [ ] | classify tweets related to influenza | classification
bodnar and salathé [ ] | assess disease outbreaks from tweets | regression methods
fisichella et al. [ ] | detect public health events | modelling trajectory distributions
gphin [ ] | identify information about disease outbreaks and other events related to public healthcare | classification of documents for relevance
biocaster [ ] | monitoring online media data related to diseases, viruses, bacteria, locations, and symptoms | topic classification, named entity recognition, event recognition
healthmap [ ] | global disease alert map | mapping techniques
epispider [ ] | human and animal disease alert map | topic and location detection

big data from social media needs to be visualised for better user experiences and services. for example, large volumes of numerical data (usually in tabular form) can be transformed into different formats, increasing user understandability. the capability of supporting timely decisions based on visualising such big data is essential to various domains, e.g., business success, clinical treatments, cyber and national security, and disaster management [ ]. thus, user-experience-based visualisation is regarded as important for supporting decision makers in making better decisions. more particularly, visualisation is also regarded as a crucial data analytic tool for social media [ ]; it is important for understanding users' needs in social networking services. there have been many visualisation approaches to collecting (and improving) user experiences. one of the most well-known is interactive data analytics: based on a set of features of the given big data, users can interact with a visualisation-based analytics system. such systems include r-based software packages [ ] and ggobi [ ]. moreover, some systems have been developed using statistical inference; a bayesian inference scheme-based multi-input/multi-output (mimo) system [ ] has been developed for better visualisation. we can also consider life-logging services that record all user experiences [ ], also known as the quantified self. various sensors can capture continuous physiological data (e.g., mood, arousal, and blood oxygen levels) together with user activities. in this context, life caching has been presented as a collaborative social action of storing and sharing users' life events in an open environment; more practically, this collaborative user experience has been applied to gaming to encourage users. systems such as insense [ ] are based on wearable devices and can collect users' experiences into a continually growing and adapting multimedia diary. the insense system uses the patterns in sensor readings from a camera, a microphone, and accelerometers to classify the user's activities and automatically collect multimedia clips when the user is in an interesting situation. moreover, visualisation systems such as many eyes [ ] have been designed to upload datasets and create visualisations in collaborative environments, allowing users to upload data, create visualisations of those data, and leave comments on both the visualisations and the data, providing a medium to foment discussion among users. many eyes is designed for ordinary people and does not require any extensive training or prior knowledge to take full advantage of its functionalities. other visual analytics tools provide graphical visualisations to support efficient analytics of the given big data. in particular, tweetpulse [ ] builds social pulses by aggregating identical user experiences in social networks (e.g., twitter) and visualises the temporal dynamics of the thematic events.
finally, table provides a summary of those applications related to the methods used for visualisation based on user experiences. with the large number and rapid growth of social media systems and applications, social big data has become an important topic in a broad array of research areas. the aim of this study has been to provide a holistic view and insights for potentially helping to find the most relevant solutions that are currently available for managing knowledge in social media. as such, we have investigated the state-of-the-art technologies and applications for processing big data from social media. these technologies and applications were discussed with regard to the following aspects: (i) what are the main methodologies and technologies available for gathering, storing, processing, and analysing big data from social media? (ii) how does one analyse social big data to discover meaningful patterns? and (iii) how can these patterns be exploited as smart, useful user services through the currently deployed examples of social-based applications? more practically, this survey paper shows and describes a number of existing systems (e.g., frameworks, libraries, software applications) that have been developed and that are currently being used in various domains and applications based on social media. the paper has avoided describing or analysing straightforward applications such as facebook and twitter that currently make intensive use of big data technologies, instead focusing on other applications (such as those related to marketing, crime analysis, or epidemic intelligence) that could be of interest to potential readers. although it is extremely difficult to predict which of the different issues studied in this work will be the next "trending topic" in social big data research, from among all of the problems and topics that are currently under study in different areas, we selected some "open topics" related to privacy issues, streaming and online algorithms, and data fusion and visualisation, providing some insights and possible future trends. in the era of online big data and social media, protecting the privacy of users on social media has been regarded as an important issue. ironically, as the analytics introduced in this paper become more advanced, the risk of privacy leakage grows. as such, many privacy-preserving studies have been proposed to address privacy-related issues, and we can note two main well-known approaches. the first one is to exploit "k-anonymity", which is a property possessed by certain anonymised data [ ]: given the private data and a set of specific fields, the system (or service) has to make the data practically useful without identifying the individuals who are the subjects of the data. the second approach is "differential privacy", which can provide an efficient way to maximise the accuracy of queries from statistical databases while minimising the chances of identifying its records [ ]. however, there are still open issues related to privacy. social identification is an important issue when social data are merged from the available sources, and secure data communication and graph matching are potential research areas [ ]. the second issue is evaluation: it is not easy to evaluate and test privacy-preserving services with real data. therefore, it would be particularly interesting in the future to consider how to build useful benchmark datasets for evaluation.
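to make the second approach more concrete, the sketch below shows the laplace mechanism that underlies many differential privacy schemes; it is an illustrative example rather than a reference implementation from the surveyed systems, and the count and the privacy budget epsilon are invented values.

    # a minimal sketch of the laplace mechanism for differential privacy:
    # a count query is released with noise scaled to sensitivity / epsilon
    import numpy as np

    def private_count(true_count, epsilon, sensitivity=1.0):
        # a counting query has sensitivity 1: one user changes it by at most 1
        rng = np.random.default_rng()
        return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

    print(private_count(1042, epsilon=0.5))  # smaller epsilon -> more noise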
moreover, we have to consider these data privacy issues in many other research areas. in the context of (also international) law enforcement, data must be protected from any illegal usage, whereas governments tend to override user privacy for the purpose of national security. also, developing educational programs for technicians (and students) is important [ ]; how (and with what contents) to design a curriculum for data privacy is still an open issue. one of the current main challenges in data mining related to big data problems is to find adequate approaches to analysing massive amounts of online data (or data streams). because classification methods require previous labelling, these methods demand great effort for real-time analysis. however, because unsupervised techniques do not need this previous process, clustering has become a promising field for real-time analysis, especially when these data come from social media sources. when data streams are analysed, it is important to consider the analysis goal in order to determine the best type of algorithm to be used. we were able to divide data stream analysis into two main categories:
• offline analysis: we consider a portion of data (usually large data) and apply an offline clustering algorithm to analyse it.
• online analysis: the data are analysed in real time. these kinds of algorithms are constantly receiving new data and are not usually able to keep past information.
a new generation of online [ , ] and streaming [ , ] algorithms is currently being developed in order to manage social big data challenges, and these algorithms require high scalability in both memory consumption [ ] and computation time (a minimal sketch of this online setting is given after the list of challenges below). some new developments related to traditional clustering algorithms, such as k-means [ ] and em [ ], which have been modified to work with the mapreduce paradigm, and more sophisticated approaches based on graph computing (such as spectral clustering), are currently being developed [ ] [ ] [ ] into versions more efficient than the state-of-the-art algorithms [ , ]. finally, data fusion and data visualisation are two clear challenges in social big data. although both areas have been intensively studied with regard to large, distributed, heterogeneous, and streaming data fusion [ ] and data visualisation and analysis [ ], the current, rapid evolution of social media sources jointly with big data technologies creates some particularly interesting challenges related to:
• obtaining more reliable methods for fusing the multiple features of multimedia objects for social media applications [ ].
• studying the dynamics of individual and group behaviour, characterising patterns of information diffusion, and identifying influential individuals in social networks and other social media-based applications [ ].
• identifying events [ ] in social media documents via clustering, and using similarity metric learning approaches to produce high-quality clustering results [ ].
• the open problems and challenges related to visual analytics [ ], especially those arising because the capacity to collect and store new data is rapidly increasing, including the ability to analyse these data volumes [ ], to record data about the movement of people and objects at a large scale [ ], and to analyse spatio-temporal data and solve spatio-temporal problems in social media [ ], among others.
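as a concrete illustration of the online category referenced above, the following minimal sketch updates a mini-batch k-means model incrementally, so past batches need not be kept in memory; the stream here is simulated with random feature vectors, and the batch size and cluster count are placeholders.

    # a minimal sketch of online clustering over a data stream
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    model = MiniBatchKMeans(n_clusters=5, random_state=0)
    for _ in range(100):                  # each iteration = one incoming batch
        batch = np.random.rand(256, 10)   # e.g. 256 new items, 10 features
        model.partial_fit(batch)          # update centroids, discard the batch
    print(model.cluster_centers_.shape)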
ibm, big data and analytics
the data explosion in minute by minute
data mining with big data
analytics over large-scale multidimensional data: the big data revolution!
ad - d-data-management-controlling-data-volume-velocity-and-variety.pdf
the importance of 'big data': a definition, gartner, stamford
the rise of big data on cloud computing: review and open research issues
an overview of the open science data cloud
big data: survey, technologies, opportunities, and challenges
media, society, world: social theory and digital media practice
who interacts on the web?: the intersection of users' personality and social media use
users of the world, unite! the challenges and opportunities of social media
the role of social media in higher education classes (real and virtual)-a literature review
the dynamics of health behavior sentiments on a large online social network
big social data analysis, big data comput
trending: the promises and the challenges of big social data
big data: issues and challenges moving forward
business intelligence and analytics: from big data to big impact
hadoop: the definitive guide
spark: cluster computing with working sets
mllib: machine learning in apache spark
mlbase: a distributed machine-learning system
mli: an api for distributed machine learning
mapreduce: simplified data processing on large clusters
mapreduce: simplified data processing on large clusters
mapreduce algorithms for big data analysis
improving mapreduce performance in heterogeneous environments
proceedings of the acm sigmod international conference on management of data, sigmod '
useful stuff
the big-data ecosystem table
clojure programming, o'really
the chubby lock service for loosely-coupled distributed systems
the stratosphere platform for big data analytics
the google file system
bigtable: a distributed storage system for structured data
pregel: a system for large-scale graph processing
mongodb: the definitive guide
the r book, st
twitter now seeing million tweets per day, increased mobile ad revenue, says ceo
an introduction to statistical methods and data analysis
an evaluation study of bigdata frameworks for graph processing
a bridging model for parallel computation
hama: an efficient matrix computation with the mapreduce framework
finding local community structure in networks
community detection in graphs
on clusterings-good, bad and spectral
the maximum clique problem, in: handbook of combinatorial optimization
community structure in social and biological networks
fast algorithm for detecting community structure in networks
finding community structure in very large networks
modularity and community structure in networks
spectral tri partitioning of networks
a vector partitioning approach to detecting community structure in complex networks
network brownian motion: a new method to measure vertex-vertex proximity and to identify communities and subcommunities
a hierarchical clustering algorithm based on fuzzy graph connectedness
adaptive k-means algorithm for overlapped graph clustering
overlapping community detection in networks: the state-of-the-art and comparative study
web document clustering: a feasibility demonstration
information retrieval: data structures & algorithms
introduction to information retrieval
text analytics in social media
principal component analysis
indexing by latent semantic analysis
latent dirichlet allocation
proceedings of the th acm sigkdd international conference on knowledge discovery and data mining
data clustering: a review
fast and effective text mining using linear-time document clustering
evaluation of hierarchical clustering algorithms for document datasets
empirical and theoretical comparisons of selected criterion functions for document clustering
machine learning in automated text categorization
thumbs up?: sentiment classification using machine learning techniques
data streams: models and algorithms
efficient online spherical k-means clustering
scalable influence maximization for prevalent viral marketing in large-scale social networks
real-time event detection on social data stream
information diffusion in online social networks: a survey
inferring networks of diffusion and influence
correcting for missing data in information cascades
seeding influential nodes in nonsubmodular models of information diffusion
graphical evolutionary game for information diffusion over social networks
evolutionary dynamics of information diffusion over social networks
a review on time series data mining
a symbolic representation of time series, with implications for streaming algorithms
emerging topic detection on twitter based on temporal and social terms evaluation
privacy-preserving discovery of topic-based events from social sensor signals: an experimental study on twitter
integrating social networks for context fusion in mobile service platforms
semantic information integration with linked data mashups approaches
privacy-aware framework for matching online social identities in multiple social networking services
a social compute cloud: allocating and sharing infrastructure resources via social networks
competing on analytics: the new science of winning
effectiveness of advertising on social network sites: a case study on facebook
social stream marketing on facebook: a case study
twitter power: tweets as electronic word of mouth
predicting the future with social media
mining social networks using heat diffusion processes for marketing candidates selection
opportunities for improving egovernment: using language technology in workflow management
a decision support system: automated crime report analysis and classification for e-government
mining co-distribution patterns for large crime datasets
the utility of hotspot mapping for predicting spatial patterns of crime
predicting crime using twitter and kernel density estimation
data mining techniques for the detection of fraudulent financial statements
real-time credit card fraud detection using computational intelligence
identifying the signs of fraudulent accounts using data mining techniques
epidemic intelligence: a new framework for strengthening disease surveillance in europe
a survey of current work in biomedical text mining
nowcasting events from the social web with statistical learning
an ontology-driven system for detecting global health events
towards detecting influenza epidemics by analyzing twitter messages
twitter catches the flu: detecting influenza epidemics using twitter
validating models for disease detection using twitter
detecting health events on the social web to enable epidemic intelligence
the landscape of international event-based biosurveillance
the global public health intelligence network and early warning outbreak detection
biocaster: detecting public health rumors with a web-based text mining system
surveillance sans frontieres: internet-based emerging infectious disease intelligence and the healthmap project
use of unstructured event-based reports for global infectious disease surveillance
comparison of web-based biosecurity intelligence systems: biocaster, epispider and healthmap
big-data visualization
visualization of entities within social media: toward understanding users' needs
parallelmcmccombine: an r package for bayesian methods for big data and analytics
ggobi: evolving from xgobi into an extensible framework for interactive data visualization
a visualization framework for real time decision making in a multi-input multi-output system
insense: interest-based life logging
manyeyes: a site for visualization at internet scale
social data visualization system for understanding diffusion patterns on twitter: a case study on korean enterprises
k-anonymity: a model for protecting privacy
proceedings of th international conference on theory and applications of models of computation
educating engineers: teaching privacy in a world of open doors
online algorithms: the state of the art
ultraconservative online algorithms for multiclass problems
better streaming algorithms for clustering problems
a survey on algorithms for mining frequent itemsets over data streams
a multi-objective genetic graph-based clustering algorithm with memory optimization
parallel k-means clustering based on mapreduce
map-reduce for machine learning on multicore
parallel spectral clustering in distributed systems
a co-evolutionary multi-objective approach for a k-adaptive graph-based clustering algorithm
gany: a genetic spectral-based clustering algorithm for large data analysis
on spectral clustering: analysis and an algorithm
learning spectral clustering, with application to speech separation
dfuse: a framework for distributed data fusion
visual analytics: definition, process, and challenges
multiple feature fusion for social media applications
the role of social networks in information diffusion
event identification in social media
learning similarity metrics for event identification in social media
visual analytics
visual analytics tools for analysis of movement data
space, time and visual analytics

this work has been supported by several research grants: co-

key: cord- - shbiee
authors: santarone, kristen; boneva, dessy; mckenney, mark; elkbuli, adel
title: hashtags in healthcare: understanding twitter hashtags and online engagement at the american association for the surgery of trauma – meetings
date: - -
journal: trauma surg acute care open
doi: . /tsaco- -
sha: doc_id: cord_uid: shbiee

background: social media amplifies the accessibility, reach and impact of medical education and conferences alike. the use of hashtags at medical conferences allows material to be discussed and improved on by the experts via online conversation on twitter. we aim to investigate the utilization of hashtags at the american association for the surgery of trauma (aast) meetings from to and their potential role in knowledge dissemination and meeting participation. methods: symplur signals software was used to analyze hashtags for the aast meetings by year: #aast , #aast , #aast , #aast . results: the number of tweets decreased significantly from to ( to to to , respectively, p< . ). retweets also decreased significantly from to ( to to to , respectively, p< . ). users decreased from to ( to to to , respectively, p< . ). despite this decrease, impressions were . million in , increasing to . million in , then . million in and finally peaking in , when impressions reached million (p< . ). the top influencer for – was the aast twitter account.
conclusion: twitter #aast – online engagement and interactions have declined during the last years while impressions have grown steadily, indicating potential widespread dissemination of trauma-related knowledge and evidence-based practices, and increased online utilization of conference material by trauma surgeons, residents and fellows, trauma scientists, other physicians and the lay public. #aast online engagement and impressions did not have an influence on meeting attendance rates. the use of internet platforms has changed the way that modern technology and novel research influence the world of medicine. online discussion boards are being used to grade medical students, lectures are available on various streaming resources and telehealth is becoming a more popular avenue for improving accessibility to healthcare. with all the advancements in medicine, the field of trauma surgery would benefit from adopting similar practices. one of the new trends in medical education is holding online journal clubs using forums such as twitter or whatsapp. a dermatology program highlighted several of the benefits of using twitter as a platform for education: the wider audience allows more comprehensive topics to be reviewed, as more participants allow for a more thorough exploration of new articles; participation is not limited by geographic area; and there is a public record of topic discussion that can be returned to and viewed at any time. twitter is useful for subspecialties such as dermatology or trauma surgery by allowing the public to have access to relevant cutting-edge information. conversation on twitter also facilitates discussions between different specialties, allowing for improved comprehension of the topic. the benefits of twitter can be used for many educational endeavors, including journal clubs, undergraduate medical education and conferences. twitter is available to the public, free of charge and simple to operate. by including lay people, medical students and other health professionals in the same forum as subject matter experts, twitter allows mutual participation in topic discussions. academic resources can be incredibly costly for students and potentially inaccessible to the public. accessibility to experts and their breadth of knowledge is an advantage to the public as well as to those in medicine who may be unable to attend the conference. twitter's format allows physicians to share links to research articles, youtube videos and websites, along with text on their own intellectual interpretation of the material. the structure of this format is beneficial for facilitating education and communication. for these reasons, twitter provides a distinct advantage in comparison to other social media platforms. research on the use of social media in trauma surgery, specifically in regard to hashtag use at the american association for the surgery of trauma (aast) meeting, an international surgical meeting, is limited. we aim to investigate the utilization of hashtags at the aast meetings from to and their potential role in knowledge dissemination and meeting participation. this is a review of twitter data collected by symplur regarding the use of the aast meeting hashtag from to . information was collected for each of the following hashtags: #aast , #aast , #aast and #aast . to obtain this information we used symplur signals software (upland, ca), which is designed specifically for healthcare-related hashtags.
this software allowed us to gather data on tweets, retweets, impressions, influencers, users and the titles of research articles tweeted during the conference. filters were placed on the data sets so that the most accurate information could be obtained from the software, and languages other than english were excluded. additional filters were placed to ensure that we only captured data during the month in which each conference took place (september to september ), meaning data were analyzed yearly and not cumulatively. the aast twitter hashtag data were analyzed within the time frame of all of the meetings. meeting dates were september - , ; september - , ; september - , ; and september - , . using the information from symplur on the top influencers, data on the number of followers and specialty were obtained by analyzing their twitter profiles. ibm spss statistics software v. was used for analysis and statistical significance was defined as p< . . certain definitions were used to describe our data. engagement is an active form of social media interaction, defined in this study as a tweet or retweet incorporating the meeting hashtag. impressions are a passive form of interaction, indicating that content containing the meeting hashtag was viewed but no action (such as a comment or retweet) was taken. influencers are users who had high utilization of the meeting hashtag. twitter users agreed to participation via the general twitter terms and conditions. this study was reviewed by our institutional review board and the western institutional review board and was deemed exempt. the number of tweets decreased significantly from to , from to , respectively (p< . ). the number of retweets also declined significantly from to , from to , respectively (p< . ). though tweets and retweets declined, impressions increased significantly from to , from . million to million (p< . ), potentially indicating that content from aast conference presentations was still being widely viewed and disseminated. twitter users dropped significantly from to as well, from to (p< . ) (figure ). the account that used the conference hashtag the most, thus becoming the top influencer, was the aast organization (@traumadoctors) twitter account, from to . online engagement, measured by tweets and retweets combined, decreased significantly from to , from to (p< . ). attendance remained relatively constant throughout the years - (figure ). figure illustrates the international distribution of members of the aast by country of membership. twitter data from the aast conferences between the years of and yielded intriguing results. tweets, retweets and users decreased over the years while impressions significantly increased, indicating increased reachability and dissemination of conference materials throughout the platform. engagement did not increase at the same rate that impressions did. despite the fact that engagement did not increase, conference attendance continued to grow from to . the significant increase in impressions over time is likely due to a combination of the scope of influence of the aast meeting, the content presented at the meeting and also the increasing use of twitter by medical professionals. the aast has over members in countries and is well established as a leader in the field of trauma surgery. this influence likely plays a role in the increasing number of impressions during aast conferences. the increasing use of twitter by medical professionals is also likely to contribute to growing impressions.
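as an illustration of how the definitions above translate into the yearly figures, the sketch below aggregates a hypothetical export of hashtag activity per year; the column names and the follower-based impressions proxy are assumptions for illustration, not the actual symplur schema.

    # a minimal sketch: yearly engagement and impressions from hashtag activity
    import pandas as pd

    posts = pd.DataFrame({
        "year": [2016, 2016, 2017],
        "is_retweet": [False, True, False],
        "user": ["a", "b", "a"],
        "followers": [120, 50, 130],
    })
    yearly = posts.groupby("year").agg(
        tweets=("is_retweet", lambda s: int((~s).sum())),
        retweets=("is_retweet", "sum"),
        users=("user", "nunique"),
        impressions=("followers", "sum"),  # crude proxy: posts seen by followers
    )
    yearly["engagement"] = yearly["tweets"] + yearly["retweets"]
    print(yearly)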
twitter is the most popular form of social media used by healthcare professionals due to the ease and freedom of connection, information sharing and communication. the prevalence of healthcare professionals on twitter is likely the cause of an increase in impressions across several surgical conferences, as seen in figure : aast and the eastern association for the surgery of trauma had consistent upward growth over the years, while asc and acscc peaked during . the decreasing engagement at the aast conference may be related to the concept of social media fatigue. social media users are constantly bombarded with new content that they must read and respond to, while simultaneously creating their own unique posts. initially, social media participation was easy to encourage because it was a new and exciting concept, but as its use has become more common, active participation has decreased. an increasing number of social media users are becoming 'lurkers', viewing content without acting on it. this phenomenon could contribute to the decreased amount of engagement. the usage of online platforms to improve the diffusion of research, particularly to students, residents and the public, has the potential to revolutionize medical education. one orthopedic residency program put this theory to the test by starting a journal club using whatsapp as the platform for discussion. they found that the electronic platform allowed participants to still partake in discourse without the confines of scheduling problems, when given a -day window in which to respond. this method increases flexibility, which can contribute to resident satisfaction and willingness to participate. another study found that up to % of a surgical resident's workday is spent on non-educational endeavors, such as waiting for operating rooms to be ready. using online forums such as whatsapp or twitter can help to bridge this downtime during the workday. previous studies have analyzed the use of twitter at conferences, but very little was known about hashtag use at large trauma conferences. the majority of studies found that the use of hashtags increased meeting attendance, tweets and retweets throughout the years within fields such as critical care, cardiology and anesthesiology [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . in , the spanish association of surgeons studied the users who had been using their meeting hashtag over the previous four meetings. their research found that originally physicians made up % of the conference influencers, but by they only represented % of influencers. these data suggest that physician influencers play a vital role in engaging the public to participate in online conversation and education. twitter has the potential to entirely overturn academic communication, particularly during events such as the global covid- pandemic. many physicians are already using twitter as an advocacy platform, to exchange the most current information on treatment and to improve education of the public, in addition to encouraging other physicians to join as well. essential scientific information travels through the 'twittersphere' much faster than it ever could through the rigid structure of online journals and databases. one recent article stated that research would be made public in 'days rather than months', but the reality of the current situation is that many patients do not have days to spare for their physicians even to obtain treatment information.
looking at the international distribution of aast membership as illustrated in figure , nearly every continent is well represented by members, which is both extraordinary and essential for an international trauma organization. including trauma surgeons and scientists from across the globe in scientific evidence-based online conversations is beneficial to furthering trauma science and advancing patient care. the only continent that was relatively underrepresented was africa. the inclusion and expansion of aast's membership and online reach in africa could be facilitated and enhanced by magnifying efforts from the aast communication committee and the aast twitter handle @traumadoctors. this would potentially result in forming more connections with physicians in africa via twitter and could be reflected in aast meeting attendance and participation in the near future. to improve the online presence of trauma surgery as a specialty, the aast could implement a specialized social media team, similar to that of the journal of neurosurgery. this team was composed of neurosurgery residents and medical students dedicated to the field, who were assigned roles as designated editors. these editors created visual abstracts of current research topics, which were subsequently posted on various social media sites. after implementing this team approach to social media accounts (twitter, facebook), impressions and online viewing of scientific materials increased significantly. the use of social media creates increased connectivity between specialists and experts. of note, the new generation of physicians will be proficient in the use of social media, and it will likely play a more significant role in the dissemination of research over time. all medical conferences should take advantage of the increased participation that twitter can provide when properly used. including links, polls, photos and videos in postings can increase interactions between users. the use of the meeting hashtag should be encouraged as frequently as possible by displaying the hashtag on name tags, meeting handouts and materials, as well as at the beginning and end of presentations. for those who are attending and pursuing continuing medical education credit, hashtags could be added to those materials as well. conferences could have a social media station alongside the registration station where participants could directly engage on social media with improved ease of access. presentations could be set up so that questions are asked via a specific twitter hashtag, allowing a broader audience to participate, including those not in attendance. limitations of this study are primarily related to the hashtag filters from symplur software that were used to filter out information irrelevant to conference materials. since hashtags are publicly available for use, anyone can tweet out information using the conference hashtag, even if it does not pertain to the meeting. filters were placed to remove from the data tweets that were not published in english or in the north american time zone. another limitation is that twitter engagement may relate to the previous year's attendance. future studies should investigate the long-term effects of conference hashtag use on knowledge dissemination after the conclusion of the conference. in addition, future research should examine the influence conference material has on twitter conversations during and after the aast meeting. impressions do not directly indicate information translation.
absorption of information may not occur simply because the content of a tweet was viewed. however, there are indications that increased impressions and engagement may reflect overall scientific advancement. one study at the conference of the american urological association found that the more likes and retweets a presentation received, the more likely it was to be published. in addition, tweets can be formatted in a manner that fosters user engagement and thus the likelihood that users will absorb the information. a twitter blog published in suggested that by including hashtags, photos and links a tweet was more likely to be interacted with by users. medical conferences should continue to promote the use of hashtags at their events in an effort to promote an online presence. the network of scientists created through the use of hashtags at conferences is beneficial to patients and students, and can be accessed in almost any circumstance. twitter #aast - online engagement and interactions have declined during the last years while impressions have grown steadily, indicating potential widespread dissemination of trauma-related knowledge and evidence-based practices, and increased online utilization of conference material by trauma surgeons, fellows, residents, trauma scientists, other physicians and the lay public. #aast online engagement and impressions did not have an influence on meeting attendance rates. medical conferences and education alike should take advantage of online platforms, such as twitter, to facilitate information sharing, stimulate online conversations, and advance trauma science. funding: the authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors. map disclaimer: the depiction of boundaries on this map does not imply the expression of any opinion whatsoever on the part of bmj (or any member of its group) concerning the legal status of any country, territory, jurisdiction or area or of its authorities. this map is provided without any warranty of any kind, either express or implied.

twitter journal clubs: medical education in the era of social media
american association for the surgery of trauma (aast)
a study on personality traits and social media fatigue-example of facebook users
a prospective review of a novel electronic journal club format in an orthopedic residency unit
resident work hours: what they are really doing
social media in critical care: fad or a new standard in medical education?
an analysis of international critical care conferences between and
#cgs : an evaluation of twitter use at the canadian geriatrics society annual scientific meeting
social media expands the reach of the asc annual meeting
'not your daughter's facebook': twitter use at the european society of cardiology
i tweet, therefore i learn: an analysis of twitter use across anesthesiology conferences
trends in twitter use during the annual meeting of the spanish society of allergology and clinical immunology
twitter expands the reach and engagement of a national scientific meeting: the irish society of urology
the social media revolution is changing the conference experience: analytics and trends from eight international meetings
tweeting the meeting: twitter use at the american society of breast surgeons annual meeting -
twitter® use and its implications in spanish association of surgeons meetings and congresses
strange days
specialized social media team increases online impact and presence: the journal of neurosurgery experience (preprint)
association between twitter reception at a national urology conference and future publication status
using twitter to increase content dissemination and control educational content with presenter initiated and generated live educational tweets (piglets)

provenance and peer review: not commissioned; externally peer reviewed. data availability statement: data are available in a public, open access repository. all data relevant to the study are included in the article or uploaded as supplementary information. open access: this is an open access article distributed in accordance with the creative commons attribution non commercial (cc by-nc . ) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made are indicated, and the use is non-commercial. see: http://creativecommons.org/licenses/by-nc/ . /. adel elkbuli http://orcid.org/ - - - x

key: cord- -uy dykhg
authors: albanese, federico; lombardi, leandro; feuerstein, esteban; balenzuela, pablo
title: predicting shifting individuals using text mining and graph machine learning on twitter
date: - -
journal: nan
doi: nan
sha: doc_id: cord_uid: uy dykhg

the formation of majorities in public discussions often depends on individuals who shift their opinion over time. the detection and characterization of this type of individual is therefore extremely important for the political analysis of social networks. in this paper, we study changes in individuals' affiliations on twitter using natural language processing techniques and graph machine learning algorithms. in particular, we collected million twitter messages from . million users and constructed the retweet networks. we identified communities with an explicit political orientation and topics of discussion associated with them, which provide the topological representation of the political map on twitter in the analyzed periods. with these data, we present a machine learning framework for social media user classification which efficiently detects "shifting users" (i.e. users that may change their affiliation over time). moreover, this machine learning framework allows us to identify not only which topics are more persuasive (using low dimensional topic embedding), but also which individuals are more likely to change their affiliation given their topological properties in a twitter graph.
technologically mediated social networks flourished as a social phenomenon at the beginning of this century, with exponents such as friendster ( ) or myspace ( ) [ ], but other popular websites soon took their place. twitter is an online platform where news or data can reach millions of users in a matter of minutes [ ]. twitter is also of great academic interest, since individuals voluntarily express their opinions openly and can interact with other users by retweeting the others' tweets. in particular, in the last decade there has been an increase in interest from computational social scientists, and numerous political studies have been published using information from this platform [ ] [ ] [ ] [ ] [ ] [ ]. previous works applied different machine learning models to these datasets. xu et al. collected tweets using the streaming api and implemented an unsupervised machine learning framework for detecting online wildlife trafficking using topic modeling [ ]. kurnaz et al. proposed a methodology which first extracts features from a tweet's text and then applies deep sparse autoencoders in order to classify the sentiment of tweets [ ]. pinto et al. detected and analyzed the topics of discussion in the text of tweets and news articles, using non-negative matrix factorization [ ], in order to understand the role of mass media in the formation of public opinion [ ]. on the other hand, kannangara implemented a probabilistic method so as to identify the topic, sentiment and political orientation of tweets [ ]. some other works focus on political analysis and the interaction between users, such as that of aruguete et al., which described how twitter users frame political events by sharing content exclusively with like-minded users, forming two well-defined communities [ ]. dang-xuan et al. downloaded tweets during the parliament elections in germany and characterized the role of influencers utilizing the retweet network [ ]. stewart et al. used community detection algorithms over a network of retweets to understand the behavior of trolls in the context of the #blacklivesmatter movement [ ]. conover et al. [ ] also used similar techniques over a retweet network and showed the segregated partisan structure, with extremely limited connection between clusters of users with different political ideologies, during the u.s. congressional midterm elections. the same polarization of the twitter network can be found in other contexts and countries (canada [ ], egypt [ ], venezuela [ ]). opinion shifts in group discussions have been studied from different points of view. in particular, it has been stated that opinion shifts can be produced by the interchange of arguments, according to the persuasive arguments theory (pat) [ , , ]. primario et al. applied this theory to measure the evolution of the political polarization on twitter during the us presidential election [ ]. in the same line, borge-holthoefer et al. analyzed the egyptian polarization dynamics on twitter [ ]. they classified the tweets into two groups (pro/anti military intervention) based on their text and estimated the overall proportion of users that changed their position. these works analyzed the macro dynamics of polarization, rather than focusing on the individuals. in contrast, we found it interesting not only to characterize the twitter users who change their political opinion, but also to predict these "shifting voters".
therefore, the focus of this paper is centered on the individuals rather than the aggregated dynamics, using machine learning algorithms. moreover, once we were able to correctly determine these users, we sought to distinguish between persuasive and non-persuasive topics (we use the terms persuasive and non-persuasive for the topics that are, respectively, relevant and non-relevant to those individuals). in this paper, we examined three twitter network datasets constructed with tweets from: the argentina parliamentary elections, the argentina presidential elections and tweets of donald trump. three datasets were constructed and used in order to show that the methodology can be easily generalized to different scenarios. for each dataset, we analyzed two different time periods and identified the larger communities corresponding to the main political forces. using graph topological information and detecting topics of discussion of the first network, we built and trained a model that effectively predicts when an individual will change his/her community over time, identifying persuasive topics and relevant features of the shifting users. our main contributions are the following: 1. we described a generalized machine learning framework for social media user classification, in particular, for detecting a user's affiliation at a given time and whether the user will change it in the future. this framework includes natural language processing techniques and graph machine learning algorithms in order to describe the features of an individual. 2. we observed that the proposed machine learning model has a good performance for the task of predicting changes of the user's affiliation over time. 3. we experimentally analyzed the machine learning framework by performing a feature importance analysis. while previous works used text, twitter profiles and some twitting behavior characteristics to automatically classify users with machine learning [ ] [ ] [ ] [ ], here we showed the value of adding graph features in order to identify the label of a user, and in particular the importance of the "pagerank" for this specific task. 4. we also identified the topics that are considerably more relevant and persuasive to the shifting users. identifying these key topics has a valuable impact for social science and politics. the paper is organized as follows. in the data collection section, we describe the data used in the study. in the methods section, we describe the graph unsupervised learning algorithms and other graph metrics that were used, the natural language processing tools applied to the tweets and the machine learning model. in the results section, we analyze the performance of the model for the task of detecting shifting individuals. finally, we interpret these results in the conclusions section. the code is on github (omitted for anonymity reasons). twitter has several apis available to developers. among them is the streaming api, which allows the developer to download in real time a sample of the tweets that are uploaded to the social network, filtering them by language, terms, hashtags, etc. [ , ] (a sketch of this collection step is given below). the data are composed of the tweet id, the text, the date and time of the tweet, and the user id and username, among other features. in the case of a retweet, it also contains the information of the original tweet's user account. for this research, we collected datasets: argentina parliamentary elections ( arg), argentina presidential elections ( arg) and united states tweets of donald trump ( us).
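as referenced above, the sketch below shows one way such keyword-filtered collection can be written today; it assumes tweepy's v2 filtered-stream client (tweepy >= 4) rather than the original streaming api endpoint used for these datasets, and the bearer token and rule are placeholders.

    # a minimal sketch of keyword-filtered tweet collection with tweepy (v4+)
    import tweepy

    class KeywordStream(tweepy.StreamingClient):
        def on_tweet(self, tweet):
            print(tweet.id, tweet.text)   # store id, text, user, timestamp, etc.

    stream = KeywordStream("BEARER_TOKEN")           # placeholder credential
    stream.add_rules(tweepy.StreamRule("realdonaldtrump lang:en"))
    stream.filter()                                  # blocks, delivering tweets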
for the argentinian datasets, the streaming api was used during the week before the primary elections and the week before the general elections took place. keywords were chosen according to the four main political parties present in the elections; details and context can be found in the appendix. for the us dataset, "realdonaldtrump" (the official account of president donald trump) was used as the keyword. twitter messages are in the public domain and only public tweets filtered by the twitter api were collected for this work. for the purpose of this research, we have analyzed more than million tweets and more than . million individuals in total. the specific start and end collection dates and the total number of tweets and users can be seen in table . in this section, we will introduce the methodology used to characterize the twitter users: first the retweet networks (section . ) and the algorithm to find communities (section . ); then the different metrics which describe the interaction networks among them (section . ); after that, the features obtained by analyzing the text of the tweets (section . ); and finally the supervised learning model which uses the individuals' characteristics as instances and predicts the shifting users. we represent the interactions among individuals in terms of a graph, where users are nodes and retweets between them (one or more) are edges (undirected and unweighted). isolated nodes (never retweeting nor retweeted) were not taken into account for this analysis. in figure , we can visualize the retweet network for each time period and dataset. in the case of the us dataset, most of the users are concentrated in two groups, which allows us to visualize the political polarization. on the other hand, in the argentinean datasets we can identify two large groups and also some smaller ones. the graph visualizations are produced with the force atlas layout using gephi software [ ]. in a given graph, a community is a set of nodes largely connected among themselves and with little or no connection with nodes of other communities [ ]. we implement an algorithm to detect communities in large networks which allows us to characterize the users by their relationships with other users. in this context, the modularity is defined as the fraction of the edges that fall within a given community minus the expected fraction if edges were distributed at random [ ]. the louvain method for community detection [ ] seeks to maximize modularity by using a greedy optimization algorithm. this method was chosen to perform the analysis due to the characteristics of the database: while other algorithms such as label propagation are good for large data networks, their performance decreases if the clusters are not well defined [ ]; in contrast, in these cases the louvain or infomap methods obtain better results. however, given that the number of nodes is in the order of hundreds of thousands and the number of edges in the order of one million, the louvain method has a better performance [ ] than the other ones. despite having found several communities, we just considered the largest ones for each case. for the arg and arg datasets we used the four biggest communities because, when examining the text of the tweets and the users with the highest degree, each one had a clear political orientation corresponding to the four biggest political parties in the election.
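a minimal sketch of these two steps, assuming an edge list of (retweeter, retweeted author) user pairs and the python-louvain package; it illustrates the procedure rather than reproducing the project's code.

    # a minimal sketch: undirected, unweighted retweet graph + louvain communities
    import networkx as nx
    import community as community_louvain  # pip install python-louvain

    edges = [("u1", "u2"), ("u2", "u3"), ("u4", "u5")]  # retweeter -> author
    G = nx.Graph()
    G.add_edges_from(edges)        # one edge per pair, whatever the retweet count

    partition = community_louvain.best_partition(G)  # greedily maximizes modularity
    print(partition)                                 # node -> community id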
these communities are labeled as "cambiemos", "unidad ciudadana", "partido justicialista" and " pais" for arg and "frente de todos", "juntos por el cambio", "consenso federal" and "frente de izquierda-unidad" for arg (the electoral context is provided in the appendix). regarding the us dataset, we used the biggest communities because of the bipartisan political system of the united states (republicans and democrats) and the clear structure present in the retweet networks, where only two big clusters concentrate almost all of the users and interactions (see figure ). in contrast, the argentinean election datasets have two principal communities and some minor communities as well. considering the fact that our dataset has more than million tweets and more than . million users, it was not feasible to determine true labels of political identification of the users for this task, nor was it viable to manually assign them. therefore, we decided to use the community labels of the retweet network as a proxy for political membership, and to interpret changes in a user's label as changes in affiliation over time. this decision is supported by previous literature, where it is shown that communities identify a user's ideology and political membership [ , , , , , ]. moreover, taking into account the stochasticity of the louvain method and following [ ], we decided to use for the machine learning task only the nodes that were always assigned to the same community, in order to minimize the possibility of incorrect labeling. additionally, we did not use individuals with less than retweets, since we might have insufficient data to correctly classify them. finally, we also manually sampled and checked users from different communities to verify their political identification. with the intention of characterizing the users of the primary election network topologically, we computed the following metrics: the degree of each user in the network (i.e., the number of users that have retweeted a given one), pagerank [ ], betweenness centrality [ ], clustering coefficient [ ] and cluster affiliation (the community detected by the louvain method). we used all these metrics as features in the machine learning classification task. in order to determine the topics of discussion during the primary election, we analyzed the text of the tweets using natural language processing and calculated a low dimensional embedding for each user. the tweets were represented as vectors through the term frequency - inverse document frequency (tf-idf) representation [ ]. each value in the vector corresponds to the frequency of a word in the tweet (the term frequency, tf) weighted by a factor which measures its degree of specificity (the inverse document frequency, idf). we used -grams and a modified stop-words dictionary that contained not only articles, prepositions, pronouns and some verbs but also the names of the candidates, the parties and words like "election". then, we constructed a matrix m by concatenating the tf-idf vectors, with dimensions the number of tweets times the number of terms. we performed topic decomposition using non-negative matrix factorization (nmf) [ ] on the matrix m. nmf is an unsupervised topic model which factorizes the matrix m into two matrices h and w with the property that all three matrices have no negative elements. we selected the nmf algorithm because this non-negativity makes the resulting matrices easier to inspect and their meaning easier to understand.
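a minimal sketch of this decomposition with scikit-learn, where the example tweets, the stop-word list and the number of topics are placeholders; following the naming above, h holds the tweet-topic memberships and w the topic-term weights described next.

    # a minimal sketch: tf-idf matrix m factorized by nmf into h (tweets x
    # topics) and w (topics x terms), using the paper's naming convention
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    tweets = ["prices and inflation rise", "crisis in venezuela", "jobs and prices"]
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words=["and", "in"])
    M = vectorizer.fit_transform(tweets)

    nmf = NMF(n_components=2, random_state=0)
    H = nmf.fit_transform(M)     # tweets represented in topic space
    W = nmf.components_          # term combinations describing each topic
    terms = vectorizer.get_feature_names_out()
    for k, row in enumerate(W):
        print(k, [terms[i] for i in row.argsort()[-3:]])  # top terms per topic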
the matrix h contains a representation of the tweets in the topic space, in which the columns are the degrees of membership of each tweet to a given topic. on the other hand, the matrix w provides the combination of terms which describes each topic [ ]. the results obtained by analyzing just the tweets corresponding to the first time period are detailed in the appendix. the decomposition dimension was swept between and , and for each dataset we chose a number of topics in the corpus such that each topic had a clear interpretation. the same methodology was used and described in [ , ]. once we collected all this information, twitter users were also characterized by a vector of features where each cell corresponds to one of the topics and its value to the percentage of tweets the user tweeted with that topic. given that our objective was to identify shifting individuals and persuasive arguments, we implemented a predictive model whose instances are the twitter users who were active during both time periods [ ] and belonged to one of the biggest communities in both time periods' networks. consequently, the number of users used at this stage was reduced. individuals were characterized by a feature vector with components corresponding to the topological metrics depicted in section . and others corresponding to the percentage of tweets in each one of the topics extracted in section . . the information used to construct these embeddings was gathered from the whole first time period retweet network. the target was a binary vector that takes the value of one if the user changed communities between the first and the second time periods and zero otherwise. a summary of the datasets is shown in table . considering the percentage of positive targets, this is clearly a class imbalance scenario, especially in us, which is reasonable given the bipartisan retweet network with big and opposed communities [ ]. the gradient boosting technique uses an ensemble of predictive models to perform the task of supervised classification or regression [ ]. the predictive models are optimized iteration by iteration using the gradient of the cost function of the previous iteration. in this scenario, xgboost, a particular implementation of this technique, has proven to be efficient in a wide variety of supervised scenarios, outperforming previous models [ ]. we used a / random split between train and test. in order to do hyperparameter tuning, we used the randomized search method [ ] over the training dataset with -fold cross-validation, which consists of trying different random combinations of parameters and then keeping the optimum. with the objective of measuring the efficiency and performance of our machine learning model, two other models, namely random and polar, were taken as baselines for comparison. in the former, the selected user changes community with a probability of %. in the latter, a user that belongs to one of the two biggest communities in the network is predicted to stay in that community, while a user that belongs to a smaller community changes to one of the two main communities with equal probability. this polar model is inspired by the idea that in a polarized election, members of the smallest communities shift and are attracted to the biggest communities; it was used in the argentinean datasets.
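a minimal sketch of this supervised stage: user vectors (graph metrics plus topic percentages) predict the binary "shifted" target, with hyperparameters tuned by randomized search under cross-validation. the feature matrix, the split ratio, the number of folds and the parameter grid are placeholders, not the paper's actual settings.

    # a minimal sketch: xgboost classifier with randomized hyperparameter search
    import numpy as np
    from xgboost import XGBClassifier
    from sklearn.model_selection import RandomizedSearchCV, train_test_split

    X = np.random.rand(500, 12)            # placeholder user feature vectors
    y = np.random.randint(0, 2, 500)       # placeholder "shifted" targets
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    search = RandomizedSearchCV(
        XGBClassifier(eval_metric="logloss"),
        param_distributions={"max_depth": [3, 5, 7],
                             "learning_rate": [0.01, 0.1, 0.3],
                             "n_estimators": [100, 300, 500]},
        n_iter=10, cv=5, scoring="roc_auc", random_state=0)
    search.fit(X_tr, y_tr)
    print(search.best_params_, search.score(X_te, y_te))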
we trained three different gradient boosting models for each dataset: the first one was trained only with the features obtained via text mining (how many tweets the user devotes to each of the selected topics); the second one was trained just with the features obtained through complex network analysis (degree, pagerank, betweenness centrality, clustering coefficient and cluster affiliation); and the last one was trained with all the data. in this way, we could compare the importance of natural language processing and complex network analysis for this task. in figure we can see the roc [ ] of the different models for each dataset. the best performance is obtained in all cases by the machine learning model built with all the characteristics of the users, which is able to efficiently predict which users are shifting individuals. this result is expected, since an ensemble of models manages to have sufficient depth and robustness to understand the network information, the topics of the tweets and the graph characteristics of the users. we performed random permutations of the feature values among users in order to understand which of them are the most important for the performance of our model (the so-called permutation feature importance algorithm [ ]). in figure , we observe that the most important feature in all cases corresponds to the node's connectivity: pagerank, meaning that shifting individuals are the peripheral and least important nodes of the big communities. the result is verifiable when comparing the pagerank averages of users who changed their affiliation ( arg pr = . e - , arg pr = . e - and us pr = . e - ) with those of users who did not ( arg pr = . e - , arg pr = . e - and us pr = . e - ), the latter being at least % higher. this is also consistent with the fact that the model trained with network features gets a better auc than the model trained with the texts of user tweets in all datasets. previous works have used text, twitter profile and some twitting behavior characteristics to automatically classify users with machine learning, but none of them have incorporated the use of these graph metrics [ ] [ ] [ ] [ ]. our work shows the importance of also including these graph features in order to identify shifting individuals. this result has a relevant sociological meaning: unpopular individuals are more prone to change their opinion. besides the importance of the mentioned topological properties, some discussed topics are also relevant to the classifier model. a simple analysis of the most spoken-about topics in the network does not differentiate between topics discussed by shifting individuals and by other users. considering that most users do not change their affiliation, it is interesting to analyze those that do. the persuasive arguments theory affirms that changes in opinion occur when people exchange strong (or persuasive) arguments [ , , ]. consequently, we defined a "persuasive topic" as a topic used primarily by shifting individuals and not used by non-shifting individuals. with the intention of doing a deeper analysis of the topic embedding for the arg dataset, we first enumerate the main topics in that corpus; an equivalent analysis can be done with the other two corpora, and the topic decomposition for each can be found in the appendix. in figure , the most important topics for the classifier are "venezuela", "economy" and "santiago maldonado".
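a minimal, self-contained sketch of the permutation importance computation: each feature column is shuffled and the resulting drop in auc measures its contribution; the data, the model and the feature names are placeholders.

    # a minimal sketch of permutation feature importance
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.inspection import permutation_importance

    X = np.random.rand(500, 5)              # placeholder feature matrix
    y = np.random.randint(0, 2, 500)        # placeholder targets
    model = GradientBoostingClassifier().fit(X, y)

    result = permutation_importance(model, X, y, scoring="roc_auc",
                                    n_repeats=20, random_state=0)
    names = ["pagerank", "degree", "betweenness", "clustering", "community"]
    for i in result.importances_mean.argsort()[::-1]:   # most important first
        print(names[i], round(result.importances_mean[i], 4))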
we can contextualize these results by looking at the main topics discussed in each community, as well as the ones discussed among the users that change between them, as shown in figure . we can see that "venezuela" is one of the most discussed topics among the people remaining in the four communities, and "santiago maldonado" is a relevant topic in the communities "unidad ciudadana" and " pais". when we look at the main topics discussed by users that changed their communities between elections, we can observe that "venezuela" identifies those that went from "partido justicialista (pj)" to " pais" and "cambiemos", while "santiago maldonado" is a key topic among those who arrived at "unidad ciudadana" from "partido justicialista (pj)" and " pais". considering that these topics are considerably more used by the shifting twitter users than by the other users, it can be affirmed that these are "persuadable topics". in contrast, other topics such as "economy" or "santa cruz" were also commonly used by most of the users but not by the shifting individuals. in this paper we presented a machine learning framework to identify shifting individuals and persuasive topics that, unlike previous works, focused on the persuadable users rather than studying political polarization on social media as a whole. the framework includes natural language processing techniques and graph machine learning algorithms to describe the features of an individual. three datasets were used for the experimentation: arg , arg and us. these datasets were constructed with tweets from countries, during different political contexts (during a parliamentary election, during a presidential election, and during a non-election period), and in a multi-party system and a two-party system. the machine learning framework was applied to these different datasets with similar results, showing that the methodology can be easily generalized. the implemented predictive models effectively detected whether a user will change his or her political affiliation. we showed that better performance can be achieved when representing the individuals with their community and other graph features rather than the topic embedding alone. therefore, our results indicate that the proposed features do a reasonable job at identifying user characteristics that determine whether a user changes opinion, features that were neglected in previous works on user classification on twitter [ ] [ ] [ ] [ ] . in particular, pagerank was the most relevant according to the permutation feature importance analysis in all datasets, showing that popular people have a lower tendency to change their opinion. finally, the proposed framework also identifies which of the topics are the persuasive topics and good predictors of individuals changing their political affiliation. consequently, this methodology could be useful for a political party to see which issues should be prioritized in its agenda with the intention of maximizing the number of individuals that migrate to its community. understanding the characteristics and the topics of interest of politically shifting individuals in a polarized environment can provide an enormous benefit for social scientists and political parties. the implications of this research supplement them with tools to improve their understanding of shifting individuals and their behavior.
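as one illustration of the "persuasive topic" notion used in this work, the sketch below ranks topics by how much more shifting users employ them than non-shifting users; the dataframe layout and column names are hypothetical, not the authors' exact computation.

```python
# sketch: score each topic by the ratio of its usage among shifting
# users versus non-shifting users; column names are assumptions.
import pandas as pd

def persuasiveness(df: pd.DataFrame, topic_cols: list) -> pd.Series:
    # df holds one row per user; topic columns are tweet shares per topic,
    # and "shifted" is 1 when the user changed communities.
    shifters = df.loc[df["shifted"] == 1, topic_cols].mean()
    stayers = df.loc[df["shifted"] == 0, topic_cols].mean()
    # values well above 1 indicate topics used primarily by shifting users
    return (shifters / (stayers + 1e-9)).sort_values(ascending=False)
```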
the percentages on the arrows indicate the share of users that changed from one community to the other (when the percentage was less than %, the corresponding arrow is not drawn); the topics on the arrows show the most important topics among the users that change between those communities.
• the president of argentina and the governor of the province of buenos aires at the time of the elections (i.e., "mauriciomacri", "macri" and "mariuvidal"). these last two were added, despite not being actively present in the lists, due to their political importance and their relevance and participation during the campaign. in addition, the tweets were restricted to spanish.
the electoral context is the following: former president and opposition leader cristina fernández de kirchner (former "unidad ciudadana") and sergio massa (former " pais") created a new party, "frente de todos", with alberto fernández as candidate for president. on the other hand, mauricio macri (former "cambiemos") ran for reelection as the candidate of "juntos por el cambio". the socialist nicolás del caño of "frente de izquierda-unidad" and roberto lavagna of "consenso federal" were also candidates for president, among others. considering the previous subsection and the candidates for senate, deputy, and governor, the following terms were chosen as twitter keywords: "elisacarrio", "ofefernandez ", "patobullrich", "macri", "macrismo", "mauriciomacri", "pichetto", "miguelpichetto", "juntosporelcambio", "alferdez", "cfkargentina", "cfk", "kirchner", "kirchnerismo", "frentetodos", "frentedetodos", "lavagna", "rlavagna", "urtubey", "urtubeyjm", "consensofederal", " consensofederal", "delcao", "nicolasdelcano", "delpla", "rominadelpla", "fitunidad", "fdeizquierda", "fte izquierda", "castaeira", "manuelac ", "mulhall", "nuevomas", "espert", "jlespert", "frentedespertar", "centurion", "juanjomalvinas", "hotton", "cynthiahotton", "biondini", "venturino", "frentepatriota", "romeroferis", "partidoautonomistanacional", "vidal", "mariuvidal", "kicillof", "kicillofok", "bucca", "buccabali", "chipicastillo", "larreta", "horaciorlarreta", "lammens", "matiaslammens", "tombolini", "matiastombolini", "solano", "solanopo", "lousteau", "gugalusto", "recalde", "marianorecalde", "ramiromarra", "maxiferraro", "fernandosolanas", "marcolavagna", "myriambregman", "cristianritondo", "massa", "sergiomassa", "gracielacamano", "nestorpitrola". in addition, the tweets were restricted to spanish.
• sergio massa of " pais" (former chief of the cabinet of ministers of cristina kirchner, then leader of the opposition against cristina kirchner in when he won his provincial election) and florencio randazzo of …
twitter keywords: considering the previous subsection, the following terms were chosen as keywords for twitter:
• candidates for senate of the main four parties: their name and official user on twitter.
c. tweets of donald trump: the following term was used as the keyword for the twitter api: "realdonaldtrump". in addition, the tweets were restricted to english. the topic decomposition (the topic embedding obtained with non-negative matrix factorization) is:
• president donald trump: the president of the united states.
• obamagate: the accusation that barack obama is conspiring against donald trump.
• world health organization: president trump announcing the us will pull out of the world health organization.
• thank you: individuals thanking president trump for his actions in regard to the covid- pandemic.
• fake news: individuals discussing and claiming that certain news are fake.
• president barack obama: the president of the united states and his administration.
references:
• from friendster to myspace to facebook: the evolution and deaths of social networks. long island press.
• emotions in health tweets: analysis of american government …
• what the hashtag? a content analysis of canadian politics on twitter. information, communication & society.
• political communication and influence through microblogging: an empirical analysis of sentiment in …
• analyzing the digital traces of political manipulation: the russian interference twitter campaign.
• politics, sentiments, and misinformation: an analysis of the twitter discussion on the mauricio …
• interest communities and flow roles in directed networks: the twitter network of the uk riots. journal of the royal society interface.
• donald j. trump and the politics of debasement. critical studies in media communication.
• using machine learning to detect cyberbullying.
• sentiment analysis in data of twitter using …
• quantifying time-dependent media agenda and public opinion by topic modeling. physica a: statistical mechanics and its applications.
• a scalable tree boosting system. proceedings of the nd acm sigkdd international conference on knowledge discovery and data mining.
• tree boosting with xgboost: why does xgboost win "every" machine learning competition?
• random search for hyper-parameter optimization.
• comparing effect sizes in follow-up studies: roc area, cohen's d, and r. law and human behavior.
• permutation importance: a corrected feature importance measure.
• identifying communicator roles in twitter. proceedings of the st international conference on world wide web.
• use of machine learning to detect illegal wildlife product promotion and sales on twitter. frontiers in big data.
• analyzing mass media influence using natural language processing and time series analysis.
• quantifying controversy in social media. proceedings of the ninth acm international conference on web search and data mining.
• mining twitter for fine-grained political opinion polarity classification, ideology detection and sarcasm detection. proceedings of the eleventh acm international conference on web search and data mining.
• consensus clustering in complex networks. scientific reports.
• measuring polarization in twitter enabled in online political conversation: the case of the us presidential election.
• judgments and group discussion: effect of presentation and memory factors on polarization. sociometry.
• why do humans reason? arguments for an argumentative theory.
• persuasive arguments theory, group polarization, and choice shifts. personality and …
• content and network dynamics behind egyptian political polarization on twitter. proceedings of the th acm conference on computer supported cooperative work & social computing.
• measuring political polarization: twitter shows the two sides of venezuela. chaos.
• investigating political polarization on twitter: a canadian perspective. policy & internet.
• testing two classes of theories about group induced shifts in individual choice.
key: cord- - lcs uf authors: bilal, mohammad; simons, malorie; rahman, asad ur; smith, zachary l.; umar, shifa; cohen, jonah; sawhney, mandeep s.; berzin, tyler m.; pleskow, douglas k. title: what constitutes urgent endoscopy?
a social media snapshot of gastroenterologists' views during the covid- pandemic date: - - journal: endosc int open doi: . /a- - sha: doc_id: cord_uid: lcs uf background and study aims: there is a consensus among gastroenterology organizations that elective endoscopic procedures should be deferred during the covid- pandemic. while the decision to perform urgent procedures and to defer entirely elective procedures is mostly evident, there is a wide "middle ground" of time-sensitive but not technically urgent or emergent endoscopic interventions. we aimed to survey gastroenterologists worldwide using twitter to help elucidate these definitions using commonly encountered clinical scenarios during the covid- pandemic. methods: a -question survey was designed by the authors to include common clinical scenarios that do not have clear guidelines regarding the timing or urgency of endoscopic evaluation. this survey was posted on twitter. the survey remained open to polling for hours. during this time, multiple gastroenterologists and fellows with a prominent social media presence were tagged to disseminate the survey. results: the initial tweet had , impressions with a total of engagements. there was significant variation in responses from gastroenterologists regarding the timing of endoscopy in these semi-urgent scenarios. there were only three of scenarios for which more than % of gastroenterologists agreed on procedure timing. for example, significant variation was noted in regard to the timing of upper endoscopy in patients with melena, with . % of respondents believing that everyone with melena should undergo endoscopic evaluation at this time. similarly, about % of respondents thought that endoscopic retrograde cholangiopancreatography should only be performed in patients with choledocholithiasis with abdominal pain or jaundice. conclusion: our analysis shows that there is currently a lack of consensus among gastroenterologists in regard to the timing of semi-urgent or non-life-threatening procedures during the covid- pandemic. these results support the need for the ongoing development of societal guidance for these "semi-urgent" scenarios to help gastroenterologists in making difficult triage decisions. in march , the world health organization (who) declared the sars-cov- /novel coronavirus- (covid- ) a global pandemic. as of march , , more than , cases have been reported worldwide [ ] . patients can present with varying degrees of symptoms and, according to one report, % of all infections were undocumented prior to january , [ ] . therefore, the risk of infection reported in healthcare workers is substantial [ ] . in particular, while performing gastrointestinal endoscopy, there is a risk of exposure for the endoscopist as well as the endoscopy team, including nurses, endoscopy technicians, and anesthesia staff [ ] . while upper gastrointestinal endoscopy is an aerosol-generating procedure, there are now data to suggest that the risk may not be limited to upper endoscopy alone, as recent reports have detected sars-cov- in stool samples [ ] . this has led to recommendations that all elective and non-urgent endoscopic procedures be cancelled or postponed at this time [ , ] . however, important questions have emerged regarding how to define an urgent procedure vs a non-urgent procedure, or a procedure that can be deferred for a discrete period of time. in some clinical scenarios, the decision to perform or delay a procedure is evident. for example, there is clear consensus that procedures for indications such as suspected variceal bleeding, non-variceal upper gastrointestinal bleeding, acute cholangitis, foreign body removal, and cancer-related care (i.e., tissue acquisition for diagnosis, loco-regional staging, and palliative procedures) are urgent and should continue to be performed [ ] [ ] [ ] . similarly, endoscopic evaluations of chronic symptoms such as diarrhea and gastroesophageal reflux disease (gerd), or screening for colorectal cancer in average-risk individuals, are considered non-urgent and should be deferred. between these definitions exists a large array of potentially time-sensitive but not technically urgent or emergent endoscopic interventions. these grey areas or "semi-urgent" indications pose a clinical dilemma for the gastroenterologist with regard to proceeding with or deferring the procedure during this unprecedented time. we aimed to survey gastroenterologists worldwide using twitter to help elucidate these definitions using commonly encountered clinical scenarios during the covid- pandemic. we hypothesized that there would be significant variability regarding procedures and indications considered urgent or non-urgent, highlighting the need for further guidance and standardization in identifying time-sensitive procedures. a -question survey was designed by the authors.
the goal was to choose common clinical scenarios that do not have clear guidelines regarding the timing or urgency of endoscopic evaluation or treatment during the current covid- pandemic [ , ] . these questions were posted on twitter using the "twitter poll" option (by the author mb). this author was chosen given that he has more than , followers, the majority being gastroenterology fellows or gastroenterologists from across the world. the initial tweet described the framework of the survey in the context of the current pandemic. the questions were posted under the comments section of the initial tweet. numerous gastroenterologists and gastroenterology fellows from across the world with a prominent presence on twitter were tagged to help disseminate the survey. four additional questions were added based on requests from other gastroenterologists. the questions were open for polling to the twitter audience for hours, after which twitter automatically closes the survey to polling. the survey was completely anonymous. the results were analyzed using the "tweet activity" function on twitter. regarding twitter analytics, two definitions are important to understand. impressions are the number of times a tweet appears on the timeline or "feed" of twitter users. engagements refer to the number of times a user becomes involved in a tweet; these engagements include retweets, likes, replies, or, as is important to this study, poll answers. for the purposes of this manuscript, endoscopies that were not classified as urgent/emergent or elective were described as "semi-urgent." a semi-urgent endoscopy was defined as a procedure that could reasonably be deferred for at least weeks without negatively impacting an important patient outcome (e.g., upstaging of a new cancer diagnosis). the initial tweet had , impressions and a total of , engagements. the details of the tweet were expanded , times. the number of votes received on the initial polls ranged from to , providing an estimated response rate ranging from . % to . %. the four additional questions added later had a lower response rate, as expected, with an average of votes polled. the summary of the results is outlined in table . the actual twitter analytics report can be accessed at: https://twitter.com/bilalmohammadmd. scenario focused on patients with positive fecal immunochemical testing (fit) or fecal fit-dna tests, and % of respondents suggested that colonoscopy was semi-urgent. scenario involved barrett's esophagus with dysplasia and/or nodularity needing endoscopic treatment, and . % of respondents deemed this semi-urgent. scenario included patients with a benign ampullary adenoma needing endoscopic resection, and . % of respondents voted this as semi-urgent in the current setting. scenario questioned respondents regarding patients with melena, and . % thought that "any melena" needs urgent upper endoscopy, while . % thought only patients with "ongoing melena" or "hemodynamic instability" should undergo endoscopic evaluation. scenario discussed patients presenting with hematochezia, and the majority ( %) thought only patients with hemodynamic instability should get an inpatient colonoscopy. in patients with cirrhosis who had symptoms of upper gastrointestinal bleeding (scenario ), . % thought that they should get an urgent upper endoscopy. for patients presenting with dysphagia (scenario ), . % of participants suggested performing esophagogastroduodenoscopy (egd) during this time only if the dysphagia was acute in onset.
scenario concerned patients with a double duct sign on cross-sectional imaging but without a discrete mass seen, and . % suggested it was semi-urgent. in patients presenting with isolated, unexplained weight loss (scenario ), . % thought this was semi-urgent. . % thought that all colonic endoscopic mucosal resections (emr; scenario ) could be deferred at this time. in regard to endoscopic submucosal dissection (esd) for early gastric cancer (scenario ), . % of respondents thought this was semi-urgent. for patients with common bile duct (cbd) stones without cholangitis (scenario ), . % thought that urgent endoscopic retrograde cholangiopancreatography (ercp) is only needed if patients have symptoms or jaundice, % felt that ercp was not urgent in the absence of cholangitis, while . % deemed all cbd stones urgent. scenario discussed planned endoscopic removal or exchange of plastic biliary stents previously placed for conditions that had since resolved or for which the patient was currently asymptomatic; . % thought this was semi-urgent and could be deferred (fig. ). for patients who had a pancreatic duct (pd) stent placed during a prior ercp (scenario ), % of respondents favored deferring this as semi-urgent. in patients with iron deficiency anemia (scenario ) without overt gastrointestinal bleeding, . % thought this was semi-urgent. in a patient with long-standing ulcerative colitis with a recent diagnosis of dysplasia (scenario ), % of respondents voted to defer performing chromoendoscopy at this time. our results show that there is significant variability among gastroenterologists in regard to the timing of endoscopic procedures for semi-urgent indications during the covid- pandemic. there were only three of scenarios in which greater than % of gastroenterologists agreed on procedure timing. these scenarios were deferring colonoscopy for patients who had fecal fit-dna or fit positive testing, performing urgent endoscopy for patients with cirrhosis presenting with melena and hematemesis, and deferring endoscopic evaluation for unexplained isolated weight loss. in regard to other scenarios, such as endoscopic treatment for patients with dysplastic barrett's esophagus, patients with an ampullary adenoma needing resection, and patients with a double duct sign on imaging (without a discrete mass), there was no clear consensus. gastrointestinal bleeding is routinely considered an indication for urgent endoscopy. however, in our survey regarding patients presenting with hematochezia, the majority of respondents indicated that colonoscopy should only be pursued if a patient has hemodynamic instability. similarly, in patients with melena, fewer than half ( %) of respondents thought that every patient with melena warranted endoscopy, while the remainder opted for endoscopy only if ongoing melena or hemodynamic instability was present. these findings are interesting since they might represent a change in typical gastroenterology management pathways necessitated by the covid- pandemic. the highest degree of variability was seen in answers related to ercp in patients with a cbd stone without cholangitis. one-fourth of respondents indicated that all cbd stones are urgent, while % suggested performing ercp only if symptoms were present, and approximately % indicated that ercp was semi-urgent if cholangitis was not present. this highlights that there is currently no consensus on the optimal timing of ercp in patients with asymptomatic choledocholithiasis, even prior to the covid- pandemic.
another interesting finding was that esd for early gastric cancer was suggested to be deferred by % of respondents at this time, but % of respondents indicated this was an urgent procedure. one hypothesis for this could be that esd can be a long procedure, and gastroenterologists might be concerned that longer procedure times increase the risk of covid- transmission. secondly, esd also carries a higher adverse event rate than routine endoscopy [ ] , so there could be concerns that, in case of complications such as perforation, operating room services would be needed in an already resource-constrained environment. it is interesting to note that the highest degree of variation in responses was for premalignant conditions such as barrett's esophagus, ampullary adenomas, and ulcerative colitis with dysplasia. this suggests that there is some uncertainty among gastroenterologists regarding deferring treatment of premalignant conditions in the current environment. (fig. : demonstration of the variation among gastroenterologists regarding the timing of procedures for semi-urgent procedural indications.) there was also variation in answers regarding patients who underwent prior ercp with biliary and pd stent placement. while most respondents suggested these procedures should be deferred during this time, some indicated that ercp for biliary stent exchange and pd stent removal should be prioritized and performed during this period. there could be several reasons for the variations in responses seen in our survey. the respondents are from all over the world, and there may be practice variation in different regions of the world. in addition, the covid- pandemic is in different phases throughout the world, and as the crisis worsens, the definition of semi-urgent endoscopy may narrow. it is plausible that respondents from hard-hit western european countries such as spain and italy have a stricter definition of what warrants endoscopic evaluation during this pandemic. this cross-sectional analysis captures the opinions of the respondents at a specific time in the pandemic, the severity of which varies by locality. as the crisis worsens, a longitudinal study may well show that these opinions on what constitutes "semi-urgent" endoscopy narrow over time as disease prevalence increases. the variability of responses could also be driven by the fact that there is currently no defined duration for this pandemic, and some gastroenterologists might be concerned about the uncertainty of how long a deferred patient would have to wait until their procedure is finally performed. this becomes especially important in patients with premalignant conditions. the joint gi society message on covid- does state that some non-urgent procedures are higher priority, with examples including prosthesis (e.g., stent) removal and evaluation of significant symptoms [ ] . however, as the number of covid- cases rises exponentially in many regions, another major concern for performing any endoscopy is the amount of personal protective equipment (ppe) needed to perform a single procedure safely. ppe is an important resource at this time and needs to be used judiciously. hence, many gastroenterologists are opting to defer many of the procedures above, even though in a "non-pandemic" situation these would likely be performed sooner. our study has several limitations. given that this survey was conducted on twitter, we do not have an exact response rate.
we did, however, use the engagements and the number of times the details of the tweet were expanded as the denominator, and the number of votes as the respondents, to estimate the response rate. our estimated response rate is low; however, previously reported response rates apply to traditional methods like email and regular mail and may not apply to social media platforms like twitter. this highlights the need to develop new standards for data acquisition and surveys, as social media platforms are going to be increasingly and effectively used for this purpose in the future. also, we cannot tell from which countries the votes were cast or at what career level the respondents were (gastroenterologists, gastroenterology fellows, internal medicine physicians). expanding on this, there were poll questions regarding procedures such as ercp and esd, which a minority of the respondents actually perform in clinical practice. it is unclear whether answers would vary if only physicians credentialed in these techniques were allowed to respond. as previously mentioned, this crisis is in different phases throughout the world. at the time of this manuscript, spain and italy continue to have dire situations regarding ppe availability, and inpatient volume has exceeded the critical care threshold capacity in many of these centers. we were not able to stratify responses by geographical region to assess for variability, which would certainly have provided important information. in addition, releasing a poll on twitter will only capture physicians who are currently engaged in using this particular social media platform, and the popularity of twitter use varies among different countries. however, we did find significant engagement from across the world (north america, south america, europe, asia, africa and australia), and participation from gastroenterologists in both academic and community settings. despite these limitations, our analysis provides a real-time snapshot of the current thoughts of gastroenterologists around the world during the covid- pandemic. it also highlights that there is currently a lack of consensus regarding how to prioritize certain potentially time-sensitive endoscopic procedures. although each patient is unique and many clinical decisions must be made on a case-by-case basis, our analysis will provide some perspective and guidance to gastroenterologists dealing with these clinical scenarios. our findings also strongly support the need for developing societal guidance for these "semi-urgent" scenarios to assist during the current covid- pandemic, and we are aware of numerous gastroenterology societies presently engaged in this endeavor. finally, this study shows how social media platforms can be positively used to gain instantaneous and clinically useful information from around the globe in response to rapidly changing situations.
references:
• coronavirus disease (covid- ) situation report.
• substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (sars-cov- ).
• clinical characteristics of hospitalized patients with novel coronavirus-infected pneumonia in wuhan, china.
• considerations in performing endoscopy during the covid- pandemic.
• covid- : gastrointestinal manifestations and potential fecal-oral transmission.
• (covid- ) outbreak: what the department of endoscopy should know.
• asge guideline: the role of endoscopy in acute non-variceal upper-gi hemorrhage.
• management of acute upper gastrointestinal bleeding.
• emergent versus urgent ercp in acute cholangitis: a systematic review and meta-analysis.
• aga institute clinical practice update: endoscopic submucosal dissection in the united states.
acknowledgments: the authors thank dr heiko pohl for his critical review of the manuscript and feedback on improving it. the authors also acknowledge the gastroenterologists and gi fellows from around the world who participated in this survey. conflicts of interest: author tb is a consultant for boston scientific and medtronic; author dp is a consultant for boston scientific, olympus, fuji film, nine point, csa, and medtronic; the remaining authors have no conflicts of interest relevant to this manuscript.
key: cord- -uih jf w authors: li, diya; chaudhary, harshita; zhang, zhe title: modeling spatiotemporal pattern of depressive symptoms caused by covid- using social media data mining date: - - journal: int j environ res public health doi: . /ijerph sha: doc_id: cord_uid: uih jf w by may , the coronavirus disease (covid- ) caused by sars-cov- had spread to countries, infecting more than . million people and causing , deaths. governments issued travel restrictions, gatherings of institutions were cancelled, and citizens were ordered to socially distance themselves in an effort to limit the spread of the virus. fear of being infected by the virus and panic over job losses and missed education opportunities have increased people's stress levels. psychological studies using traditional surveys are time-consuming and contain cognitive and sampling biases, and therefore cannot be used to build large datasets for a real-time depression analysis. in this article, we propose a corexq algorithm that integrates a correlation explanation (corex) learning algorithm and a clinical patient health questionnaire (phq) lexicon to detect covid- related stress symptoms at a spatiotemporal scale in the united states. the proposed algorithm overcomes the common limitations of traditional topic detection models and minimizes the ambiguity that is caused by human interventions in social media data mining. the results show a strong correlation between stress symptoms and the number of increased covid- cases for major u.s. cities such as chicago, san francisco, seattle, new york, and miami. the results also show that people's risk perception is sensitive to the release of covid- related public news and media messages. between january and march, fear of infection and unpredictability of the virus caused widespread panic and people began stockpiling supplies, but later in april, concerns shifted as financial worries in western and eastern coastal areas of the u.s. left people uncertain of the long-term effects of covid- on their lives. in december , an outbreak of pneumonia caused by a novel coronavirus (covid- ) occurred in wuhan and spread rapidly throughout the globe [ ] .
the covid- outbreak has forced people to change their regular routines and practice social distancing. such a sudden change can drastically increase people's stress levels and lead to other mental health issues. the difficulties caused by the covid- outbreak in different geographic regions can determine the cause and degree of stress in people, which corresponds to their risk of developing serious depression [ ] . according to a poll [ ] , nearly half ( %) of adults in the united states reported that their mental health has been negatively impacted due to worry and stress over the virus. as the pandemic continues, it is likely that the mental health burden will increase as people's sense of normalcy continues to be disrupted by social distancing, business and school closures, and shelter-in-place orders. preexisting stress, constant unpredictability, and lack of resources lead to even greater isolation and financial distress. traditional mental health studies rely on information primarily collected through personal contact with a healthcare professional or through survey-based methods (e.g., via phone or online questionnaire). for instance, the patient health questionnaire (phq) is a self-administered version of the primary care evaluation of mental disorders (prime-md) diagnostic instrument for common mental disorders [ ] . however, these survey-based methods are time-consuming and suffer from cognitive and sampling biases, and therefore cannot be used to build large datasets for a real-time depression analysis [ ] . furthermore, understanding spatial epidemic trends and geographic distribution patterns of covid- provides timely information on people's risk perception of epidemics, but these important spatial and environmental leading factors are difficult to include in a survey-based method to model covid- related mental stress. geographic information systems (gis) and social media data mining have become essential tools with which to examine the spatial distribution of infectious diseases [ ] [ ] [ ] , and can be used to investigate the spatiotemporal pattern of mental stress caused by the pandemic. for instance, social media data (e.g., twitter data) provide a unique opportunity to learn about users' moods, feelings, and behaviors as they experience daily struggles [ ] [ ] [ ] . many articles have focused on using feature-based approaches to perform sentiment and emotional analysis using twitter data [ ] [ ] [ ] [ ] . for instance, go and colleagues [ ] investigated the usage of unigrams, bigrams, and their combination in training classifiers for sentiment analysis of tweets. various supervised classifiers were trained, including maximum entropy, naïve bayes [ ] , and support vector machine (svm) classifiers, and their performance on the n-grams was compared. however, some previously used methods [ ] have become outdated; for instance, they took emoticons into account for their sentiment index, but nowadays many twitter users use emojis more frequently [ ] . barbosa and feng [ ] showed that n-grams are not useful in classifying tweets, as unseen words in tweets can cause problems during classifier training. pak and paroubek [ ] proposed the usage of microblogging features like hashtags, emoticons, re-tweets, and comments to train an svm classifier and showed that it resulted in higher accuracy than training using n-grams. several articles address the effect of using part-of-speech (pos) tag features in text classifiers [ , ] .
abadi and colleagues [ ] investigated pos, lexicon, and microblogging features. the results showed that the most relevant features are those that combine prior polarity with the pos tags of the words. however, mixed results have been reported on the usage of pos tags. go and colleagues [ ] showed that pos tags caused reduced performance, although pos tags can be strong indicators of emotions in text and serve as a helpful feature in opinion or sentiment analysis [ ] . moreover, bootstrapping approaches, which rely on a seed list of opinion or emotion words to find other such words in a large corpus, are becoming more popular and have proven effective [ ] [ ] [ ] [ ] . mihalcea, banea, and wiebe [ ] described two types of methods for bootstrapping subjectivity lexicons: dictionary-based and corpus-based. their research began with a small seed set of hand-picked subjective words and, with the help of an online dictionary, produced a larger lexicon of potential candidate words. a similar bootstrapping model was effectively used to build a sentiment analysis system for extracting user-generated health reviews about drugs and medication [ ] . however, all the aforementioned methods only detect the general emotion of tweets and lack the ability to model depression levels in detail. latent dirichlet allocation (lda) is one of the most commonly used unsupervised topical methods, where a topic is a distribution of co-occurring words [ ] . however, the topics learned by lda are not specific enough to correspond to depressive symptoms and human judgments [ ] . the unsupervised method can work with unclassified text, but it often causes topic overlap [ ] . later, the lda method was extended by using terms strongly related to phq- depression symptoms as seeds of the topical clusters, guiding the model to aggregate semantically related terms into the same cluster [ ] . however, this approach only detects the presence, duration, and frequency of stress symptoms, ignoring the spatial context and environmental factors that are important in modeling covid- related mental stress. to identify phq related text and unrelated text, a sentiment analysis index generated by python textblob was used [ ] , which only calculates the average polarity and subjectivity over each word in a given text using a fixed dictionary [ , ] . work based on the lda probabilistic generative model was found to have limitations related to interpreting high-dimensional human input factors, which makes it difficult to generalize generative models without detailed and realistic assumptions about the data generation process [ ] [ ] [ ] . in this article, we propose a corexq algorithm that integrates the correlation explanation (corex) learning algorithm and a clinical phq lexicon to detect covid- related stress symptoms at a spatiotemporal scale in the united states. we aim to investigate people's stress symptoms in different geographic regions caused by the development of the covid- spread. since twitter data are high-dimensional human input data with diverse terms used to express emotions, we used the corex algorithm, a method intended to bypass the limitations of lda implementations and minimize human intervention [ ] . after that, we developed a fuzzy accuracy assessment model to visualize the uncertainty of the analytical results on the map.
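as a reference point for the textblob index mentioned above, the snippet below shows the polarity and subjectivity scores it produces for a single text.

```python
# the kind of lexicon-based sentiment index discussed above: textblob
# averages per-word polarity and subjectivity from a fixed dictionary.
from textblob import TextBlob

blob = TextBlob("staying home again, feeling hopeless about everything")
print(blob.sentiment.polarity)      # in [-1.0, 1.0]
print(blob.sentiment.subjectivity)  # in [0.0, 1.0]
```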
the rest of the article is organized as follows: section introduces the materials and methods used in the research, including the data collection and processing methods, the basilisk and machine learning classifiers, and the proposed corexq algorithm. the results and discussion are presented in sections and , respectively. section draws conclusions. twitter data used in this article were collected through the twitter api from january to april for the continental united states. the collected data contained million tweets (~ gb), which posed significant computational challenges for a traditional gis computing environment. to address this challenge, we used a jupyter computing environment deployed on the texas a&m high performance computer. we filtered the collected twitter data using coronavirus related entities (e.g., hashtags, trends, and news). then, we removed irrelevant information (e.g., non-english language tweets, punctuation, missing data, messy code, urls, usernames, hashtags, numbers, and query terms) from the filtered tweets. some adjustments and normalizations (e.g., uniform lower case, normalized vectorized tweets, standardized time-sliced tweets) were also made in order to fulfill the common requirements of machine learning models. however, the stop words were removed later, when applying the proposed algorithm to match tweet phrases with the lexicon. after that, the tweets were tokenized using the natural language toolkit's (nltk) tweettokenizer [ ] . we also truncated repeated character sequences to a length of three for any sequences of length three or greater ( +), since users often extend words or add redundant characters to express strong feelings. tweets with an exact geospatial tag and timestamp were mapped to the corresponding county using a reverse geocoding method [ , ] . other tweets (e.g., without geotags but containing user-defined location information in the user's profile) were geocoded to their corresponding county using a fuzzy set search method and a city alias dataset [ ] . we excluded tweets that had neither geotags nor user-defined location information. one of the key innovations in our research was to map the covid- caused stress symptoms at a temporal scale. in this case, we set the temporal scale to biweekly starting from january , so that the number of tweets collected in each county could be sufficient for accurate and reliable analysis. we used the basilisk bootstrapping algorithm to find semantic lexicons that could be used to divide the tweets into two categories: stressed and non-stressed. the bootstrapping approach to semantic lexicon induction using semantic knowledge, also known as the basilisk algorithm, was developed by thelen and riloff in [ ] . this approach can be extended to divide tweets into multiple categories across different areas [ ] . it employs a bootstrapping method to determine high-quality semantic lexicons of nouns. the algorithm takes a huge unannotated corpus from which it finds new related words and assigns them to the different semantic categories (e.g., stressed and non-stressed in our case). it is a form of categorization that is based on the seed words manually provided to the algorithm. these seed words are bootstrapped to identify new words that fall within the two categories. basilisk must be seeded with carefully selected terms for it to be effective. the two categories of seeds used for this task consisted of words each (table ) [ ].
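a minimal sketch of the tweet cleaning and tokenization steps described above; the regular expressions are illustrative stand-ins for the exact patterns, and nltk's tweettokenizer has a reduce_len option that performs the length-three truncation of repeated characters.

```python
# sketch of the preprocessing described above; the regexes below are
# assumptions, while reduce_len=True caps character runs at length 3.
import re
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True,
                           strip_handles=True)

def preprocess(tweet: str) -> list:
    tweet = re.sub(r"https?://\S+", "", tweet)  # drop urls
    tweet = re.sub(r"#\w+", "", tweet)          # drop hashtags
    return tokenizer.tokenize(tweet)

print(preprocess("Soooooo stressed out!!! #covid19 https://t.co/x @user"))
```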
the first category contained words describing stress and was used to bootstrap other words semantically related to stress or carrying a similar context. the second category contained words that describe non-stressed or relaxed behavior. these two categories can be thought of as words that fall at the opposite ends of a stress-level spectrum. before the bootstrapping process, patterns were extracted from the unannotated corpus; this step extracted all noun phrases that appeared as the subject, the direct object, or part of a prepositional phrase. the noun phrases were extracted from the corpus using the stanford dependency parser [ ] . it is a natural language parsing program used to find grammatical structure in sentences, and it can be used to find relationships or dependencies between nouns and the actions or words that form a group and go together. the dependency parser was run on all the sentences in the corpus, and dependency relations were extracted for each word in the text (in the conll-u format [ ]). for each tweet, the following dependency information was extracted: the conll-u format of the extracted dependency pattern consists of the index, text, lemma, xpos, feats, governor, and dependency relations (table ). these extracted dependency relations were used to extract patterns that were used by the basilisk algorithm to generate seeds. these extraction patterns were created for each dependency relation obtained in the previous step. the extraction patterns consisted of noun phrases and their dependency on other related words in the sentence. this acted as the input to the bootstrapping method. after the input was generated, the next step was to generate the seeds using basilisk. the seed words from the initial pattern pool enlarge with every bootstrapping step. the extraction patterns were scored using the rlogf metric [ ] , which is commonly used for extraction pattern learning [ ] . the score for each pattern was computed as $rlogf(pattern_i) = \frac{f_i}{n_i} \log_2(f_i)$, where $f_i$ represents the number of category members extracted by $pattern_i$ and $n_i$ is the total number of nouns extracted by $pattern_i$. this formula favors patterns with high precision, or moderate precision but high recall. the high-scoring patterns were then placed in the pattern pool. after this process, all head nouns co-occurring with patterns in the pattern pool were added to the candidate word pool. at the end of each bootstrapping cycle, the best candidates were added to the lexicon, thus enlarging the lexicon set. the basilisk process, as proposed by thelen and riloff, can be described using the algorithm shown in table (for notation description see appendix a). it performs the categorization task of assigning nouns in an unannotated corpus to their corresponding semantic categories. using the words generated by the basilisk algorithm, we counted the total number of occurrences of the keywords in both categories. after the total count of stress and non-stress words in each tweet was obtained, we determined whether the tweet was in the stressed, non-stressed, or neutral category. this was done by finding the maximum of the stress and non-stress word counts under three conditions: (1) if there were more stress words than non-stress words, we annotated the tweet as expressing stress; (2) if the number of non-stress words was greater than the number of stress words, we annotated the tweet as expressing relaxed behavior; (3) if the count was zero for both stress and non-stress words, we did not annotate the tweet.
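a direct implementation of the rlogf score defined above, illustrating why it favors high-precision patterns and, via the log term, patterns with higher recall.

```python
# rlogf = (f/n) * log2(f), where f is the number of known category
# members a pattern extracts and n is the total nouns it extracts.
import math

def rlogf(category_hits: int, total_extractions: int) -> float:
    if category_hits == 0:
        return float("-inf")  # extracts no known category members
    return (category_hits / total_extractions) * math.log2(category_hits)

print(rlogf(8, 10))  # moderate precision, higher recall: 2.4
print(rlogf(2, 2))   # perfect precision but low recall: 1.0
```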
thus, the tweets and their corresponding labels generated using this process formed the initial training set, which was used to train a classifier to classify the other unannotated tweets. table . illustration of the basilisk algorithm [ ].
procedure:
lexicon = {seed words}
for i := 0, 1, 2, …:
1. score all extraction patterns with rlogf
2. pattern pool = top-ranked ( + i) patterns
3. candidate word pool = extractions of patterns in pattern pool
4. score candidate words in candidate word pool
5. add top five candidate words to lexicon
6. i := i + 1
7. go to step 1
the universal sentence encoder [ ] was used to generate word embeddings. these text embeddings convert tweets into numerical vectors, encoding tweet texts into the high-dimensional vectors that are required to find semantic similarity and perform the classification task. it takes a variable-length english text as input and outputs a -dimensional vector embedding. the encoder model was trained with a deep averaging network (dan) encoder [ ] . after the embeddings were obtained for each stressed and non-stressed category tweet, the two classes were balanced: we selected the category with fewer samples and reduced the other category to a similar size by removing samples. this ensured that the training process was not biased towards a particular class. before training the classifier, the data were split into training and validation sets. the data were randomly shuffled and put into the two datasets, with % used as the training dataset. to obtain the best performance, multiple classifiers were used, and their performance was compared using accuracy metrics. the classifiers used in the training process were svm [ ] , logistic regression [ ] , the naïve bayes classifier [ ] , and a simple neural network. svm handles nonlinear input spaces and separates data points using a hyperplane with the largest margin. as a discriminative classifier, svm found an optimal hyperplane for our data, which helped with classifying new unannotated data points. we used different kernels to train the svm. the hyperparameters were tuned, and the optimal values of regularization and gamma were recorded. the logistic regression classification algorithm can be used to predict the probability of a categorical dependent variable. the dependent variable is a binary variable that contains data coded as 1 (stressed) or 0 (non-stressed). the logistic regression model predicts p(y = 1) as a function of x. prior to training, it shuffles the data. it uses a logistic function to estimate probabilities in order to calculate the relationship between the independent variable(s) and the categorical dependent variable [ ] . naïve bayes is another probabilistic classifier, which makes classifications using bayes' rule. this classifier is simple and effective for text classification. a simple neural network consisting of three dense layers was used to train on our datasets. the loss function and optimizer used in the training were binary cross-entropy and rmsprop, respectively. training was done for epochs with a batch size of . table illustrates the performance evaluation of these classifiers. after the model was trained, it was run on the unannotated tweets to label them. to create the sentence embeddings for these tweets, the same procedure was used as for the training set.
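a sketch of the embedding-plus-classifier pipeline described above, using the publicly released universal sentence encoder from tensorflow hub; the toy tweets and labels are assumptions for illustration.

```python
# sketch: sentence embeddings from the public universal sentence
# encoder (which outputs 512-dimensional vectors) feed a linear svm.
import tensorflow_hub as hub
from sklearn.svm import SVC

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

tweets = ["i cannot sleep, so worried about my job",
          "lovely quiet walk in the park today"]
labels = [1, 0]  # 1 = stressed, 0 = non-stressed

embeddings = encoder(tweets).numpy()  # one vector per tweet
clf = SVC(kernel="linear", probability=True).fit(embeddings, labels)
new = encoder(["panicking about rent this month"]).numpy()
print(clf.predict_proba(new))
```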
the universal sentence encoder extracts features and creates vectors that are used to classify the tweets with the trained model. the svm classifier with a linear kernel was used to predict the probabilities of the tweets because it produced the best trained model (see table ). here, a threshold of . was set to determine whether a tweet belonged to a particular category: if the probability of the tweet was above . for that category, the tweet was classified with the corresponding label. the tweets and labels generated using the above process were then used to train another classifier to generate the final model for classification of the entire unannotated corpus. here, a logistic regression model was trained on the tweets and their corresponding labels generated using the above process, to ensure that the model was robust and was not overfitted on the initial set of tweets that were filtered out using the basilisk-generated keywords. the trained model had an accuracy of . % on the validation data. in this article, we propose a novel corexq algorithm to detect spatiotemporal patterns of covid- related stress. table illustrates the general structure of the corexq algorithm. the input of the algorithm was the stress-related tweets derived by applying the trained models (see sections . . and . . ) to all the processed covid- related tweets. we assessed the level of stress expressed in covid- related tweets by integrating a lexicon-based method derived from the established clinical assessment questionnaire phq- [ ] . table illustrates phq- lexicon examples and their corresponding mental stress symptoms. the corexq procedure (table ) is as follows:
procedure:
1. shallow parse each tweet into tweet_phrase using spacy
2. for each word_set in phq_lexicon do
3. calculate the average vectors of word_set and tweet_phrase using glove
4. match word_set with the tweet_phrase set using the cosine similarity measure
5. append each matched tweet_phrase to word_set
6. calculate tf-idf vectors for all the tweets and transform the calculated values into a sparse matrix x
7. iteratively run the corex function with initial random variables v_random
8. estimate marginals; calculate total correlation; update v_random
9. for each word_set in phq_lexicon
10. compare v_random and word_set with the bottleneck function
11. until convergence
the phq- lexicon contains about clinical words, which are difficult to understand and match with the spoken language that is often used on twitter. therefore, we used the following methods to transform the phq- lexicon into human-understandable language by appending matched tweets to their best-matching phq- categories. in the first step, each tweet was parsed into a set of phrases using the natural language processing toolkit spacy [ ] (see table , procedure ). after that, the tweets and the phq- lexicon were vectorized using the global vectors for word representation (glove) wikipedia and gigaword model (with -dimensional word vectors and four million unique tokens) [ ] . glove provides a quantitative way to distinguish the nuanced difference between two words (e.g., happy or unhappy), which is useful for matching phrase sets with the phq- lexicon. those pre-trained vectors were loaded into gensim [ ] to perform the average vector and cosine distance calculations (see equations ( ) and ( )). we appended all phrases that had a similarity rate higher than . to their corresponding phq- lexicon entry (see table , procedures - ).
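the phrase-to-lexicon matching just described can be sketched with averaged glove vectors loaded through gensim; the downloadable model name and the 0.7 cut-off below are assumptions, since the exact threshold value is not recoverable from the text.

```python
# sketch of matching a tweet phrase to a phq lexicon item via averaged
# glove vectors and cosine similarity; threshold is an assumption.
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-300")  # pre-trained glove vectors

def avg_vector(words):
    vecs = [glove[w] for w in words if w in glove]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

phrase = avg_vector(["cannot", "sleep", "at", "night"])
lexicon_entry = avg_vector(["trouble", "sleeping", "insomnia"])
if cosine(phrase, lexicon_entry) > 0.7:
    print("append phrase to this phq lexicon item")
```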
given any words in a phrase, the average vector was calculated as the sum of the word vectors divided by the number of words in the phrase:

$\bar{v} = \frac{1}{n} \sum_{i=1}^{n} v_i$

given any two average vectors $v_a$ and $v_b$ of two phrases, the cosine similarity, $\cos\theta$, is represented by

$\cos\theta = \frac{v_a \cdot v_b}{\lVert v_a \rVert \, \lVert v_b \rVert}$

next, a sparse matrix (e.g., a vocabulary dense matrix) for the stressed corpus was calculated by transforming the tokenized and vectorized tweets using term frequency-inverse document frequency (tfidf). the mathematical formula of tfidf is illustrated below:

$tfidf(t, d, D) = tf(t, d) \times idf(t, D)$

where t denotes the terms, d denotes each document, and D denotes the collection of documents. the first part of the formula, $tf(t, d)$, counts the number of times each word in the covid- corpus appears in each document. the second part, $idf(t, D)$, is made up of a numerator $D = \{d_1, d_2, \ldots, d_n\}$ and a denominator $|\{d \in D : t \in d\}|$. the numerator refers to the document space, which is all documents in our covid- stress corpus; the denominator counts the number of documents d in which term t appears. $idf(t, D)$ can be represented by

$idf(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$

we utilized the scikit-learn tfidfvectorizer to transform the preprocessed tweets into a sparse matrix [ ] (see table , procedure ). after that, the sparse matrix and lexicon are used by the anchored corex model to perform anchored topic modeling [ ] . the total correlation tc (for notation description see appendix a) of each topic is calculated by anchoring the corex model with the document sparse matrix. the total correlation in our phq- lexicon detection can be expressed using the kullback-leibler divergence as below:

$tc(x_g) = d_{kl}\!\left( p(x_g) \,\Big\Vert\, \prod_{i \in g} p(x_i) \right)$

where $p(x_g)$ represents the joint probability distribution; $tc(x_g)$ is non-negative, and it is zero only if $p(x_g)$ factorizes (see appendix a for more detail). in the context of phq- detection, $x_g$ represents the group of word types in the covid- corpus. note that each vector in the tfidf matrix is based on the distance between two probability distributions, which is expressed via the cross-entropy $entropy(x)$ [ , ] . when introducing a random variable y, the tc can explain the correlation reduction in x, which is a measure of the redundant information that the word types x carry about topic y [ ] . this total correlation can be represented by

$tc(x; y) = \sum_{i} i(x_i : y) - i(x : y)$

where $i(x : y) = entropy(x) + entropy(y) - entropy(x, y)$ (for notation description see appendix a). thus, the algorithm starts with randomly initialized variables $\alpha_{i,j}$ and $p(y_j \mid x_i)$, where $\alpha_{i,j}$ are indicator variables of tc that are assigned to 1 if the topic is detected, and $p(x_i)$ represents the approximate empirical distribution (see table , procedure ). then, correlation explanation updates both variables iteratively until the result achieves convergence. in each iteration, the estimated marginals

$p(y_j \mid x_i) = \sum_{\bar{x}} p(y_j \mid \bar{x}) \, p(\bar{x}) \, \delta_{\bar{x}_i, x_i} \,/\, p(x_i)$

and the mutual information terms of tc are calculated. next, the update for $\alpha^{t}_{i,j}$ at each step t is calculated via a soft-max over the mutual information terms $i(x_i : y_j)$, where λ conducts a smooth optimization of the soft-max function [ , ] . finally, the soft labeling of any x (for notation description see appendix a) can be computed by

$p(y_j \mid x) = \frac{1}{z_j(x)} \, p(y_j) \prod_{i} \left( \frac{p(y_j \mid x_i)}{p(y_j)} \right)^{\alpha_{i,j}}$

after the soft-max function α converges to the true solution at a particular step $\alpha_k$ in the limit λ → ∞, the mutual information terms can be ranked by informative order in each factor. to perform semi-supervised anchoring strategies, gallagher and reing proposed combining the bottleneck function with total correlation [ ] .
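the anchoring strategy described here can be sketched with the open-source corextopic package; the toy corpus, the anchor word lists, and the anchor_strength value below stand in for the real tweet corpus, the enriched phq lexicon, and the β parameter.

```python
# sketch of anchored corex topic modeling on a tf-idf document-word
# matrix; anchors and anchor_strength are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from corextopic import corextopic as ct

tweets = ["cannot sleep so tired and hopeless",
          "feeling down and depressed all week"]  # toy corpus
vectorizer = TfidfVectorizer(binary=True, max_features=20000)
X = vectorizer.fit_transform(tweets)        # sparse document-word matrix
words = list(vectorizer.get_feature_names_out())

anchors = [["sleep", "tired"],                 # e.g., phq sleep-trouble item
           ["hopeless", "depressed", "down"]]  # e.g., phq feeling-down item
model = ct.Corex(n_hidden=2, seed=0)
model.fit(X, words=words, anchors=anchors, anchor_strength=3)
for i, topic in enumerate(model.get_topics(n_words=3)):
    print(i, [w for w, *rest in topic])
```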
the bottleneck function can be represented by:

$$\min_{p(z \mid x)} \; I(X : Z) - \beta \, I(Z : Y)$$

the connection between corex and anchor words can be described by comparing equation ( ) with equation ( ). the same term $I(X : Y)$ in the two equations represents the latent factor, and the variable $z$ corresponds to $x_i$. it is worth noting that $z$ is typically labeled in a supervised learning task [ ], and $\beta$ is a constant parameter that constrains the supervision strength, so that $\alpha = \beta$ can imply a word type $x_i$ correlated with topic $y_j$. in this case, $z$ was represented by each variable generated by the enriched phq-9 lexicon. to seed the lexicon to detect topics, we can simply anchor the word type $x_i$ to topic $y_j$ by constraining $\beta$ (see table , procedures - ).

the symptoms of covid-19 related stress were visualized at the county level biweekly from january. here, we used the fuzzy accuracy assessment method to evaluate the uncertainty of the final phq stress level for each county [ , ]. we summarize the implementation of fuzzy accuracy assessment for a thematic map, as presented by gopal and woodcock, to explain our model evaluation for the phq map [ ]. let x be a finite universe of discourse, which is the set of county polygons in the study area. let ζ denote the finite set of attribute membership function (mf) topic categories assigned in x, and let m be the number of categories, |ζ| = m (e.g., nine phq categories). for each x ∈ x, we define χ(x) as the mf classes assigned to x; the set {(x, χ(x)) : x ∈ x} defines the data, and a subset s ⊂ x of n data points is used. a fuzzy set is associated with each class c ∈ ζ, where µ_c(x) is the characteristic mf of c; the fuzzy set can be represented as {(x, µ_c(x)) : x ∈ x}. to implement a decision-making system for fuzzy accuracy, the model uses a boolean function σ that returns 1 or 0 based on whether x belongs to the class c with respect to the matrix a. that is, σ(x, c) = 1 if x "belongs" to c, and σ(x, c) = 0 if x does not "belong" to c. then σ(x, c) is 1 if the numeric scale of the mf for x in category c, µ_c(x), is maximum among all map categories µ_c′(x), and we set the boolean function σ as the max function:

$$\sigma(x, c) = \begin{cases} 1 & \text{if } \mu_c(x) = \max_{c'} \mu_{c'}(x) \\ 0 & \text{otherwise} \end{cases}$$

according to the fuzzy set accuracy assessment, the final phq value for each county was selected based on the max function, meaning each county was colored based on the majority tweet phq value derived from the proposed corexq9 algorithm. the accuracy assessment was based on a comparison of the phq label assigned to each county with the evaluation given by the expert (e.g., in each county, the majority tweet phq label). the rating system can thus be expressed as linguistic variables that describe the uncertainty associated with the evaluation of the class label. here, the linguistic variables are described below:

- score : understandable: the answer is understandable but may contain high levels of uncertainty;
- score : reasonable: maybe not the best possible answer but acceptable;
- score : good: would be happy to find this answer given on the map;
- score : absolutely right: no doubt about the match; it is a perfect prediction.

figure illustrates the fuzzy mf created for the fuzzy accuracy assessment analysis. the x-axis represents the percentage of the tweets that belong to the assigned final phq category. the y-axis represents the value of the degree of the membership function corresponding to the linguistic score. for instance, if a county was assigned to a phq category, and % (e.g., x = .
in figure ) of the tweets within this county polygon were labeled as phq- using the corexq9 algorithm, the corresponding mf should be absolutely right, with membership value equal to . the accuracy assessment score was further visualized on the phq stress map to show the spatial uncertainty of the analysis results.

since corexq9 represents topics and potential symptoms as lexicon-based topic modeling, traditional measures such as regression correlation and log-likelihood are unnecessary for the semantic topics. therefore, to evaluate the baseline performance of the corexq9 model, we first applied semantic topic quality coherence measures alongside other common topic models. we compared corexq9 with lda and non-negative matrix factorization (nmf) [ , ]. in addition, we used frobenius-normalized nmf (nmf-f) and generalized kullback-leibler divergence nmf (nmf-lk) for a closer comparison with traditional topic modeling. all models were trained with a randomly selected covid-19 twitter dataset.
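for reference, a sketch of how such baseline comparison models can be trained with scikit-learn is shown below; the toy corpus and hyperparameters are placeholder assumptions, with nine components chosen only to mirror the nine phq categories rather than to reproduce the study's settings.

```python
# a sketch of the baseline comparison, assuming scikit-learn; the toy corpus
# and hyperparameters are placeholders, not the study's exact configuration.
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "cant sleep at night keep waking up",
    "no appetite since the lockdown began",
    "feeling hopeless and down about everything",
    "so tired no energy to do anything",
    "worried about money and losing my job",
    "hard to concentrate on work at home",
] * 2  # repeated so the toy corpus has enough samples for nine components

counts = CountVectorizer().fit_transform(docs)   # lda expects raw term counts
tfidf = TfidfVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=9, random_state=0).fit(counts)
nmf_f = NMF(n_components=9, init="nndsvda", beta_loss="frobenius",
            random_state=0).fit(tfidf)           # frobenius-normalized nmf (nmf-f)
nmf_kl = NMF(n_components=9, init="nndsvda", beta_loss="kullback-leibler",
             solver="mu", max_iter=500,
             random_state=0).fit(tfidf)          # generalized kl nmf (nmf-lk)
```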
the topics generated by those models were scored by topic coherence measures to identify the degree of semantic similarity between high-scoring words in each topic. a common coherence measure is umass, which calculates and scores the word co-occurrence over all documents [ ]:

$$C_{UMass} = \sum_{i<j} \log \frac{D(w_i, w_j) + c}{D(w_i)}$$

where $D(w_i, w_j)$ represents the number of documents containing both words $w_i$ and $w_j$, $D(w_i)$ counts the ones containing $w_i$, and $c$ represents a smoothing factor. the intrinsic umass [ ] coherence measure calculates these probabilities over the same training corpus. additionally, the extrinsic uci measure [ ] introduced by david newman uses a pairwise score function based on pointwise mutual information (pmi). it can be represented by:

$$C_{UCI} = \sum_{i<j} \log \frac{p(w_i, w_j)}{p(w_i) \, p(w_j)}$$

where $p(w_i)$ represents the probability of seeing $w_i$ in a random document, and $p(w_i, w_j)$ is the probability of seeing both $w_i$ and $w_j$ co-occurring in a random document. those probabilities are empirically estimated from an external dataset such as wikipedia. the higher the topic coherence score, the higher the quality of the topics. in our baseline evaluation, we calculated the coherence scores by setting the range of topic numbers from to . the abnormal and low-quality topics were removed, and the average coherence scores (table ) were calculated as the sum of all coherence scores divided by the number of topics. on average, the corexq9 algorithm has a better umass score than lda and nmf. even though the uci score was slightly lower than for the two types of nmf algorithms, we can treat the external estimation dataset as an uncertainty factor of this coherence model, because the result of the comparison was still meaningfully coherent, and corexq9 has the competitive functionality of the semi-supervised feature, which exceeds the usable range of nmf.

in our research, the methods described above were combined to generate the final thematic map. to summarize the processes for each detailed procedure, the workflow for the research is shown in figure . first, starting from data collection, we prepared a twitter dataset, the basilisk lexicon, and the phq-9 lexicon. then, we cleaned each tweet and extracted its location information using the method mentioned in section . . to enable time series analysis, the whole twitter dataset was formatted and sorted by unix timestamp before being sliced into two-week intervals. third, the two lexicons were separately supplied to the corexq9 and basilisk algorithms (mentioned in section . ) together with the prepared twitter dataset. in the end, we decomposed the result generated by the anchored corex model into a sparse matrix in order to group all tweets at the county level for visualization. note that each row of the results from the corex algorithm represents the correlation indices within an individual tweet explained by the nine phq levels, so that we can map each result back to its original tweet. the selected top symptoms and topics are presented in table .

the fuzzy accuracy assessment results of the study are illustrated in figure . on each map, each individual county is colored according to the assigned phq index using the proposed algorithm and the fuzzy accuracy assessment method. the numbers on the map represent the spatial uncertainty indices derived from the fuzzy accuracy assessment. each number represents the assessment score calculated in section . . . for most of the hot spot areas in figure , the values are greater than two, which indicates that middle to high accuracy results have been reached for those regions.
higher scores for an area indicate a larger percentage of the topics being present in this area at a specific time period. the results also present the spatiotemporal patterns from january to april (shown in figure a-g). table shows the detected stress symptoms and topics generated from corexq9. each map represents the spatial distribution of stress symptoms over a biweekly period. it indicates that most of the regions have low to medium phq values (topics - ) during january and february, since information about the u.s. covid-19 outbreak was not publicly available in the u.s. during that time. most counties that have a low phq level contain general covid-19 related topics that are tied to the cases in asia and general symptoms of covid-19 (e.g., "wenliang li" (a chinese doctor) [ ], "south korea confirms", "coughing", "sneezing"). from the end of january, a few hotspots appear in some major u.s. cities such as san francisco, denver, los angeles, and seattle, with topics related to "mistakenly released", "vaccine", "pandemic bus", and "china death" (see table , figure b,c). for instance, the keyword "mistakenly released" reflects a news story in february about the first u.s. evacuee from china known to be infected with the coronavirus being mistakenly released from a san diego hospital and returned to quarantine [ ]. people living in california reacted strongly to this news (figure d). later, in march (figure c,d), the phq level started to increase rapidly due to covid-19 test stations becoming available, the increased number of covid-19 death cases, and shelter-in-place orders in many states (see table , march). an interesting pattern was found in that the number of counties with a high phq value kept growing until april and started to decrease after the second week of april [ ]. figure illustrates the number of increased cases in the u.s. from january to may. results show that the phq stress level in our results matches well with the number of increased cases illustrated in the johns hopkins coronavirus resource center's statistical analysis results [ ]. this means the number of new cases was reduced due to the social distancing practice, and at the same time, the level of people's major concerns in many geographic regions reduced as well.
our results also show a meaningful explanation of the spatial pattern caused by people's risk perception of various media messages and news during the pandemic. in march, people in the united states had mild concerns about the uk prime minister boris johnson's talk of "herd immunity" [ ] and social distancing (see table , phq , march). on the other hand, the major stress came from topics such as cases of deaths (e.g., in washington state), lack of food and covid-19 protection equipment (e.g., panic buying), and the increasing number of confirmed and death cases in the united states. figure d,e shows that most of the hotspots were located in washington, california, new york, and florida, matching the march covid-19 increased cases map (see [ ]). in april, keywords such as "death camps", "living expenses", "white house", and "economy shrinks" (see table ) appeared most often in the high phq value categories, which indicated that people's major concerns shifted to financial worries due to businesses shutting down and the economic depression [ ]. our study was conducted to perform a spatiotemporal stress analysis of twitter users during the covid-19 pandemic via the corexq9 algorithm.
according to the model evaluation results, the proposed corexq9 had the best baseline performance among similar algorithms such as the lda, nmf-lk, and nmf-f models. in addition to the corexq9 algorithm, we applied a fuzzy accuracy assessment method to the corexq9 analysis results to visualize their spatial uncertainty. this enables expert knowledge (e.g., phq rating of tweets) to be integrated into the social media data mining process. the observed pattern reasonably matched the relevant events and epidemic trends. ideally, the analytic result of our collected twitter dataset is expected to support research on mental health for the entire u.s. population as a sample case. in our cleaned twitter dataset, the tweets were posted by , , users, which represent over . % of the u.s. population. however, a previous investigation found that the share of american adults who use twitter is not uniformly distributed across age [ , ]. another study found that twitter users are getting younger [ ], but the actual age, gender, and race of twitter users reported in those investigations have been controversial [ ]. to generalize the psychological health analysis to the whole u.s. population, further work related to user demographics is required to reduce the influence of sample bias.

the thematic maps we created for phq topic distributions were assessed based on fuzzy sets. the purpose of this commonly used method for categorical maps is to allow explicit accounting for the possible ambiguity regarding the appropriate map label [ - ]. a wide variety of statistical techniques have been proposed for the accuracy assessment of thematic maps [ ]. in the future, we could use a standard deviation approach to estimate the quantity derived from the distribution of the tweets as a count on a specific category, if the assessment is focused on how the number of labeled phq tweets is distributed in each category. even though our datasets were preprocessed and selected with entities on covid-19 related topics, some of the tweets might be off-topic or influenced by other factors. our future focus for uncertainty assessment of the thematic maps could extend to spatial uncertainty [ ], temporal uncertainty [ ], semantic uncertainty [ ], etc. our assessment task can be considered a criterion-referenced task that can focus on a selected phq level and can represent the majority level in any location. the fuzzy area estimation methods were extended based on previous research [ ]. category assessment based on fuzzy sets can estimate the accuracy of classes as a function of levels of class membership [ ].

here, we used biweekly data as the temporal scale for the analysis. our research group continues collecting twitter data for this project, so the analysis could be applied to more fine-grained temporal scales in the future. since covid-19 is a global pandemic, this project could be extended to a global scale to compare results across different countries. in the future, the model could be applied to other cases to detect related stress symptoms and provide real-time spatial decision support for addressing the problem. an end-to-end spatiotemporal analysis system could be built if all of the modules were integrated; this would increase the efficiency of determining the potential symptoms and causes of public mental health problems.
in this article, we proposed the corexq9 algorithm to analyze covid-19 related stress symptoms at a spatiotemporal scale. the corex algorithm, combined with a clinical stress measure index (phq-9), helped to minimize human intervention and human language ambiguity in social media data mining for stress detection, and provided accurate stress symptom measures of twitter users related to the covid-19 pandemic. there was a strong correlation between stress symptoms and the number of new covid-19 cases for some major u.s. cities such as chicago, san francisco, seattle, new york, and miami. people's risk perceptions were sensitive to the release of covid-19 related public news and media messages. many frequently appearing keywords in the high phq value categories represent the popular media and news publications at that time. before march, most regions had mild stress symptoms due to the low number of reported cases caused by the unavailability of test stations, creating a false sense of security among the public in the united states. the number of cases increased suddenly in march due to governmental confirmation of the seriousness of the pandemic in the united states and shelter-in-place orders in many states. from january to march, a major concern for people was being infected by the disease and there was panic-buying behavior, but this shifted to financial distress later in april along the coastal eastern and western united states.

our main contributions are as follows: first, we introduced a specialized stressed-tweet classifier, which narrows theoretical algorithms down to practical usage in the public health area and demonstrates more effectiveness than traditional sentiment index classifiers. second, we framed corexq9 as a topic detection model in our research. we explored the latent connection between social media activity and phq-9 depression symptoms and topics in the united states. finally, as a supplementary methodology for existing questionnaire-driven mental health research, our integrated system was used to glean depression topics in an unobtrusive way. the proposed algorithm provides an innovative way to analyze social media data to measure stress symptoms under the covid-19 pandemic at a spatiotemporal scale. by doing this, we were able to observe spatiotemporal patterns of stress symptoms and answer the questions of what the major concerns related to the pandemic were in different geographic regions at different time scales. in the future, this model could be applied to other cases to detect related stress symptoms and provide real-time spatial decision support for addressing arising issues. the authors declare no conflict of interest.

table a . notation table.
- pattern pool: a subset of the extraction patterns that tend to extract the seed words
- candidate word pool: the candidate nouns extracted by the pattern pool are placed in the candidate word pool
- tc: total correlation, also called multi-information; it quantifies the redundancy or dependency among a set of n random variables
- d_kl: kullback-leibler divergence, also called relative entropy; a measure of how one probability distribution differs from a second, reference probability distribution [ ]
- p(x_g): probability densities of x_g
- i(x : y): the mutual information between two random variables
- p(y|x): y's dependence on x can be written in terms of a linear number of parameters, which are just the estimated marginals
- δ: the kronecker delta, a function of two variables; the function is 1 if the variables are equal, and 0 otherwise
- (normalization constant): a constant used to ensure the normalization of p(y|x) for each x; it can be calculated by summing |y| = k, an initial parameter

references:
- situation report
- centers for disease control and prevention. mental health and coping during covid-19 | cdc. available online
- the impact of coronavirus on life in america
- the phq-9: a new depression diagnostic and severity measure
- the evolution of cognitive bias
- covid-19: challenges to gis with big data
- gis-based spatial modeling of covid-19 incidence rate in the continental united states
- using twitter and web news mining to predict covid-19 outbreak. asian pac.
- quantifying mental health signals in twitter
- predicting postpartum changes in emotion and behavior via social media
- twitter sentiment classification using distant supervision
- techniques and applications for sentiment analysis
- the impact of social and conventional media on firm equity value: a sentiment analysis approach
- sentiment analysis on tweets for social events
- personality classification based on twitter text using naive bayes, knn and svm
- twitter sentiment analysis via bi-sense emoji embedding and attention-based lstm
- robust sentiment detection on twitter from biased and noisy data
- twitter as a corpus for sentiment analysis and opinion mining
- large-scale machine learning on heterogeneous distributed systems
- sentihealth: creating health-related sentiment lexicon using hybrid approach
- end-to-end spoken language understanding: bootstrapping in low resource scenarios
- micro-blog short text clustering algorithm based on bootstrapping
- learning multilingual subjective language via cross-lingual projections
- latent dirichlet allocation
- partially labeled topic models for interpretable text mining
- correlations and anticorrelations in lda inference
- semi-supervised approach to monitoring clinical depressive symptoms in social media
- the natural language toolkit. arxiv.org
- discovering structure in high-dimensional data through correlation explanation
- maximally informative hierarchical representations of high-dimensional data
- anchored correlation explanation: topic modeling with minimal domain knowledge
- the natural language toolkit
- efficient interactive fuzzy keyword search
- offline reverse geocoder in python
- full list of us states and cities
- a bootstrapping method for learning semantic lexicons using extraction pattern contexts
- a fast and accurate dependency parser using neural networks
- proceedings of the conference on empirical methods in natural language processing: system demonstrations
- generalized linear models
- use and interpretation of logistic regression in habitat-selection studies
- natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing
- glove: global vectors for word representation
- software framework for topic modelling with large corpora
- information theoretical analysis of multivariate correlation
- on information and sufficiency
- an information-theoretic perspective of tf-idf measures
- redundancy, and independence in population codes
- greedy function approximation: a gradient boosting machine
- the information bottleneck method
- theory and methods for accuracy assessment of thematic maps using fuzzy sets. photogramm. eng. remote sens.
- a fuzzy multiple-attribute decision-making modelling for vulnerability analysis on the basis of population information for disaster management
- algorithms for non-negative matrix factorization
- automatic evaluation of topic coherence
- optimizing semantic coherence in topic models
- coronavirus kills chinese whistleblower doctor. bbc news. available online
- centers for disease control and prevention. covidview summary ending on
- what will be the economic impact of covid-19 in the us? rough estimates of disease scenarios
- induced economic uncertainty
- burdeau. boris johnson's talk of 'herd immunity' raises alarms
- how twitter users compare to the general public | pew research center
- who tweets? deriving the demographic characteristics of age, occupation and social class from twitter user meta-data
- understanding the demographics of twitter users
- tau coefficients for accuracy assessment of classification of remote sensing data. photogramm. eng. remote sens.
- conditional tau coefficient for assessment of producer's accuracy of classified remotely sensed data
- assessing landsat classification accuracy using discrete multivariate analysis statistical techniques
- technical note: statistical methods for accuracy assessment of classified thematic maps
- geostatistics: modeling spatial uncertainty, second edition
- spatio-temporal uncertainty in spatial decision support systems: a case study of changing land availability for bioenergy crops in mozambique
- managing uncertainty and vagueness in description logics for the semantic web
- using known map category marginal frequencies to improve estimates of thematic map accuracy
- fuzzy set theory and thematic maps: accuracy assessment and area estimation

this article is an open access article distributed under the terms and conditions of the creative commons attribution (cc by) license.

key: cord- -pwjnrgl authors: farrell, tracie; gorrell, genevieve; bontcheva, kalina title: vindication, virtue and vitriol: a study of online engagement and abuse toward british mps during the covid-19 pandemic date: - - journal: nan doi: nan sha: doc_id: cord_uid: pwjnrgl

covid-19 has given rise to malicious content online, including online abuse and hate toward british mps. in order to understand and contextualise the level of abuse mps receive, we consider how ministers use social media to communicate about the crisis, and the citizen engagement that this generates. the focus of the paper is a large-scale, mixed-methods study of abusive and antagonistic responses to uk politicians during the pandemic, from early february to late may 2020. we find that pressing subjects such as financial concerns attract high levels of engagement, but not necessarily abusive dialogue. rather, criticising authorities appears to attract higher levels of abuse. in particular, those who carry the flame for subjects like racism and inequality may be accused of virtue signalling or receive higher abuse levels due to the topics they are required by their role to address. this work contributes to the wider understanding of abusive language online, in particular that which is directed at public officials.

social media can offer a "temperature check" on which topics and issues are trending for certain cross-sections of the public, and how they feel about them [ ]. this temperature has run high during the covid-19 pandemic, with a number of incendiary and misleading claims [ ], as well as hateful and abusive content [ ], appearing online.
this content can interfere with both government and public responses to the pandemic. a recent survey of coronavirus conspiracy beliefs in england, for example, demonstrated that belief in conspiracy was associated with lower compliance with government guidelines. moreover, the authors found that in of their participants had a strong endorsement of conspiracy thinking [ ], indicating that this is not just a fringe issue. online verbal abuse contributes as well, being both cause and consequence of misinformation: the quality of information and debate is damaged as certain voices are silenced or driven out of the space, and escalation leads to angry and aggressive expressions [ ]. understanding the interplay between malicious online content and the public's relationships with authorities during a health crisis is necessary for an effective response to the covid-19 pandemic.

this work charts twitter abuse in replies to uk mps from before the start of the pandemic in the uk, in early february, until late may, in order to plot the health of the relationships of uk citizens with their elected representatives through the unprecedented challenges of the covid-19 pandemic. we consider reactions to different individuals and members of different political parties, and how they interact with events relating to the virus. we review the dominant hashtags on twitter as the country moves through different phases, as well as some dominant conspiracy theories. for these data, we show trends in abuse levels, for mps overall as well as for particular individuals and for parties. we also compare the prevalence of conspiracy theories, and contextualise them against other popular topics and concerns on twitter. in addition to our quantitative analysis, we present an in-depth qualitative analysis of tweets receiving more than abusive replies that constitute % or more of the total replies the tweet received. we use a set of qualitative codes derived from the literature on how authorities make use of social media during crises (and health crises in particular). referring to this as the social media activities of the mp who authored a tweet, we are able to label each tweet according to its most probable agenda (e.g. reaching out to constituents, communicating official information). this allows us to see the distribution of media activities across different parties and genders, for example. we then developed inductive codes for the covid-19 topic and potential controversial subject of the tweet (e.g. communicating policy about covid-19 or making criticisms of covid-related policy, and brexit or inequality, respectively), and noted any attached urls or images for reference. we also listed the abusive words found in each reply to the tweet. finally, we prepared an analysis of those labels by gender, party, sexual orientation and ethnicity. in our analysis, we consider how the social dimensions of antagonistic political discourse in the uk (ideology, political authority and affect), which have been visible in other recent key moments (such as brexit and successive general elections), influence the civility of discourse during covid-19. this study aims to answer the following research questions:
1. how has the context of covid-19 impacted the typical patterns that have been observed in previous work about hateful and abusive language toward uk mps?
2. how do the social dimensions that have impacted political discourse on brexit and successive general elections appear to impact how social media activities are perceived during the covid-19 pandemic?
3. which social media activities of uk mps during the covid-19 pandemic receive the most abusive replies? how can we contextualise these results?

the contribution of this study is to understand both the content and context of abusive and hateful communication, particularly toward governments and authorities during a health crisis. our focus, uk mps, adds to a growing longitudinal body of work that analyses online abuse at many key moments in british politics from to the present [ , , , ]. in the sections below, we begin with a description of the current context and a brief summary of related work. we then outline our methodology in detail before moving on to findings. finally, we summarise and conclude our manuscript with some suggestions for future work.

the dangers and perceived risks of covid-19 have fluctuated during the pandemic as a result of emerging knowledge. this feature of the pandemic creates an environment of uncertainty and ambivalence that feeds malicious content on social media. early epidemiological studies of covid-19 [ , ] implicated certain risk factors, such as age, gender and pre-existing conditions, in the transmission and severity of illness. as the pandemic progressed, researchers began to understand more about asymptomatic transmission [ , , ], discovering that there may be many more cases of covid-19 than once realised. this led some researchers to suggest that the morbidity rate of covid-19 is lower than initially presumed [ ], though these data are difficult to calculate. however, some communities were discovered to be at greater risk. research on health care professionals who contracted covid-19, for example, indicated that the most seriously ill individuals had multiple exposures to the virus, primarily at work [ ], as well as longer durations of (sometimes unprotected) exposure to the virus [ ]. as more information emerged about disproportionate cases of covid-19 and poor health outcomes in black and asian communities in england, researchers began to also investigate early warning signs that some social or ethnic communities were more vulnerable than others [ , ]. this, too, has led to challenging debates about social welfare, racism and healthcare during the pandemic.

tensions between competing political and social interests during times of uncertainty can lead to an increase in hateful and antagonistic discourse online. velasquez et al found that malicious content of various kinds, including misinformation, disinformation and hate speech, is proliferating online about covid-19. looking at clusters of online communities, the authors found that existing antagonistic groups appeared to mobilise covid-19 to spread hate and malicious content even into mainstream communities [ ].

organisations use social media during crisis events to correct rumours, prevent crisis escalation, provide facts or information, transmit proactiveness toward resolving the situation, and communicate directly with members of the public (without temporal or geographic constraints) [ ]. not using social media to address a crisis can incur reputational damage for the organisation [ ]. twitter and other forms of social media are popular tools used by organisations and governments to communicate with citizens during crisis events [ ]. the focus of the literature review below is to briefly examine how governments and authorities use such tools to communicate about health crises, particularly in the uk, and to explore how malicious content and abuse have been examined previously within this context.
before we address how politicians use social media in a health crisis, it is worth examining the perspectives of the public and what they expect from politicians when emergencies like covid-19 arise. evidence indicates that the public expect a swift, transparent response from the government to a crisis [ , ]. the public may also wish to engage with the government on its response: the greater the political interest of the user, the more likely they are to perceive and take advantage of the "connective affordances" that social media provides for politicians and their constituents to engage [ ]. third, and perhaps most importantly, during a health crisis, most citizens will be interested in government advice and support [ , ]. for example, vos and buckner examined tweets that were shared during the h n "bird flu" health crisis, and found that the majority of messages were about "collective sense-making responses" under conditions of uncertainty, rather than "efficacy responses" offering specific advice or information that would help the public to respond appropriately [ ]. a similar pattern was observed in response to the zika virus outbreak, with individuals using social media to form a personal risk assessment [ ]. llewellyn et al [ ] found that the public seeks advice from experts, and that the informal character of online communication can interfere with the public's ability to form good opinions about the expertise of individuals online, even public figures. if sense-making and risk assessment are the top public tasks for which people seek information on social media, government messages that do not respond to this need may miss the mark.

in their analysis of political communication on social media, stieglitz and dang-xuan [ ] show that politicians may use social media for communication and persuasion, to "meet" voters and engage them in discussion, and also to communicate policy or other important information to their constituents. political analysis of us congress members on twitter shows that self-promotion is also an activity in which politicians engage, using the opportunity to share personal information or stories, and to present themselves and their platforms in a good light [ ]. however, this is not always true. studies from the swedish electoral context showed that swedish politicians did not use twitter to engage with voters, but rather to provide information to them [ ]. in the uk, because internal party campaigns are based on individual candidates, politicians share some media behaviours with their counterparts in the us, where individual voter appeal is critical to campaign success [ ].

the covid-19 pandemic is a novel political situation in which ministers must respond to the crisis while continuing to function in their roles. though the situation may be new, the dialogue around covid-19 is influenced by existing social and political dimensions of british political discourse. in their work documenting positions around the european referendum, andreouli et al named three dimensions for understanding the emergent dialogue around brexit [ ] that we feel may be useful here. these are "political values, political authority, and the authority of affect". with regard to values, the authors reflect on how existing ideological themes impact how an issue is perceived and discussed, in particular where classical dichotomies do not hold up.
for example, while the left typically associates itself with anti-prejudice and tolerance, associating such qualities with voting to "remain" is inconsistent with other leftist ideas of being anti-establishment. the authors argue that this tension creates a "liminal hotspot" where cosmopolitanism and critiques of globalisation intersect. we propose this same step, in light of the current crisis, to understand the dominant political and social themes that influence abuse toward uk mps during covid-19. we are already seeing evidence of potential areas of tension in the current pandemic, such as the needs of older and younger people, the reliance on science and perception of risk [ , ], the division between the wealthy and the poor [ ], and the experiences of the urban and rural [ , ].

second, andreouli et al [ ] discuss the notion of political authority, and how the sovereignty of the uk within the eu became a backdrop for discourse on immigration during the european referendum. during the covid-19 pandemic, the sovereignty of local governments within the uk has been a consistent feature of debate, whether it involved avoiding beauty spots in wales during lockdown, comparing scotland's success in handling the virus, or the differential impact on the economy in northern ireland. media reports during the peak of the outbreak also indicated resistance toward lockdown, or toward wearing face-coverings. as we move toward the next phases of the crisis, conflicts about personal agency and choice may play out at individual or group levels in how the public respond to government guidance.

finally, andreouli et al [ ] discuss the role of affect in political discourse. they demonstrate how impassioned speech has become its own kind of credibility, in which the narrative, rather than being factual, is what matters. in september, parties signed a pledge to use moderate language after a series of heated and antagonistic debates in which boris johnson was criticised for framing the brexit conversation as "surrender" to the european union, and for invoking the name of british mp jo cox, who was murdered by a far-right extremist one week before the referendum. dawn butler, who called the prime minister reckless in his decision to send employees back to work, was accused of using hyperbolic language in describing how such a policy amounted to sending people to work to catch the virus. due to her vocal support of #blacklivesmatter, butler has now had to shut down her offices in response to racism, death threats and other threats of violence. one of the goals of this research is to analyse topics of discussion and responses within these three dimensions (see section ). this will allow us to contextualise the public's antagonistic responses to uk mps on twitter during covid-19 thus far.

as the conversation around online abuse develops, we need to differentiate the hostility that arises from increased visibility and engagement from that which is based on hate or hate speech toward a specific group of individuals or communities [ ]. in the uk, where legal frameworks tend to evolve, hate speech was defined through several legal statutes, including the public order act of . legal philosopher jeremy waldron [ ] has argued, however, that hate speech is that which sends a message to undervalued groups or communities that they are not welcome or wanted. it also sends a message to sympathisers who may devalue or feel solidarity with that community. in particular, hate speech is associated with power [ ].
governments and politicians communicate hateful messages, for example, through language about romani people in europe [ ] or mexican and other latin american immigrants in the united states [ , ]. governments have contributed to hate through politicising tribal identity in sub-saharan countries like kenya and rwanda [ , ], and through how they shape debates about free speech [ ] or migration [ ] more generally. politicians, therefore, can both be the targets of hate (as members of protected groups) and the perpetrators (as public authorities whose words matter). governments can also antagonise the public. in the context of covid-19, the ways in which politicians communicate about potentially volatile issues, such as re-opening the economy or avoiding social protest, add to the overall "health" of the discourse, or diminish it.

there is a considerable amount of historical data on the prevalence of online abuse directed at british mps, particularly on twitter. previous work [ , , , ] has shown rising levels of hostility towards uk politicians on twitter, particularly in the context of divisive issues, such as brexit or inequality. partisan operators have been implicated in fanning the flames with malicious content, such as misinformation or troll accounts [ ]. in their and papers, ward and mcloughlin examined online abuse received by british mps from november to january [ , ]. the major findings from this work were that the amount of online hate (rather than language that can be described as "abusive" or uncivil) is relatively low and, as such, men receive more online abuse than women. the authors also showed that increased name recognition and popularity have a positive relationship with levels of abuse. crucially, however, the authors note that women and those from a minority background are more likely to receive abusive replies that can be classified as hate speech. abuse toward specific parties was difficult to distinguish, as levels of abuse may be influenced by one party member who attracts a significant proportion of abuse. when controlling for this, the authors found that less visible mps had a very small percentage of hate and abuse. this means that women mps with visibility disproportionately attract hate speech, as do men with visibility other forms of abusive language. this work prompted questions about what visibility means for people of different genders and backgrounds. southern and harmer [ ] conducted a deeper content analysis of tweets received by mps during a period and found that while men received more incivility in terms of numbers of replies, women were more likely to receive an uncivil reply. women were more likely to be stereotyped by identity (men by party) and to be questioned in their position as an mp.

gorrell et al [ ] extended this work to define four visibility factors that appear to influence the amount of abuse uk mps receive online:
- prominence: individuals in the public eye will receive more abuse;
- event surges: events lead to spikes in abuse (such as participation in an event, or political activities);
- engagement: expressing strong opinions on social media can result in more personal abuse;
- identity: gender, ethnicity and other personal factors impact which opinions one is allowed to hold and express without receiving abuse.

gorrell et al also note that the impacts or consequences of abusive language are not manifesting in the same ways for male and female mps, or mps with intersectional identities of race and gender.
where some abuse is distressing, other abuse is personal, threatening, and limits women's participation in public office [ , , ]. from this review, a picture emerges of the precipitating activities, mediating factors and dimensions of online abuse toward uk mps during covid-19, which can be interrogated through our large-scale study.

in this work we apply a combination of computational and social science methods to evaluate abuse toward uk mps on twitter. we utilise a large tweet collection on which natural language processing analysis has been performed in order to identify abusive language. this methodology is presented in detail by gorrell et al [ ] and summarised here. we then follow braun and clarke's [ ] process of thematic analysis on a subset of tweets that received % or more abusive replies, in which at least abusive replies were received. this analysis is described in more detail below.

the corpus was created by collecting tweets in real time using twitter's streaming api. we used the api to follow the accounts of uk mps: this means we collected all the tweets sent by each current mp, any replies to those tweets, and any retweets either made by the mp or of the mp's own tweets. note that this approach does not collect all tweets which an individual would see in their timeline, as it does not include those in which they are merely mentioned. however, "direct replies" are included. we took this approach as the analysis results are more reliable, due to the fact that replies are directed at the politician who authored the tweet, and thus any abusive language is more likely to be directed at them. data were of a low enough volume not to be constrained by twitter rate limits. the study spans february th until may th inclusive, and discusses twitter replies to currently serving mps that have active twitter accounts ( mps in total). table gives the overall statistics for the corpus.

a rule-based approach was used to detect abusive language. an extensive vocabulary list of slurs (e.g. "idiot"), offensive words such as the "f" word, and potentially sensitive identity markers, such as "lesbian" or "muslim", forms the basis of the approach. the slur list contained abusive terms or short phrases in british and american english, comprising mostly an extensive collection of insults, racist and homophobic slurs, as well as terms that denigrate a person's appearance or intelligence, gathered from sources that include http://hatebase.org and farrell et al [ ]. offensive words were used, along with sensitive words. "bleeped" versions such as "f**k" are also included. on top of these word lists, rules are layered, specifying how they may be combined to form an abusive utterance, and including further specifications such as how to mark quoted abuse, how to type abuse as sexist or racist (including more complex cases such as "stupid jew hater"), and what phrases to veto, for example "polish a turd" and "witch hunt". making the approach more precise as to target (whether the abuse is aimed at the politician being replied to or some third party) was achieved by rules based on pronoun co-occurrence. where people make a lot of derogatory comments about a third party in their replies to a politician, however, for example racist remarks about others, there may be targeting errors leading to false positives. the abuse detection method underestimates, possibly by as much as a factor of two, finding more obvious verbal abuse but missing linguistically subtler examples.
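a heavily simplified sketch of this kind of rule-based detection is shown below. the word lists and the targeting heuristic are hypothetical placeholders, not the study's resources: the real system layers many more rules (quoted abuse, sexist/racist typing, veto phrases) over a much larger lexicon, and this toy version will both over- and under-flag.

```python
# a minimal sketch, assuming hypothetical word lists; not the study's system.
import re

SLURS = {"idiot", "moron"}                 # placeholder slur list
OFFENSIVE = {"f**k"}                       # placeholder offensive terms
VETO = [re.compile(r"witch hunt")]         # phrases that cancel a match
SECOND_PERSON = re.compile(r"\byou(r|rself)?\b")  # crude targeting cue

def is_abusive_reply(text: str) -> bool:
    """flag a reply as abuse aimed at the politician being replied to."""
    lowered = text.lower()
    if any(p.search(lowered) for p in VETO):
        return False
    has_abuse_term = any(w in lowered.split() for w in SLURS | OFFENSIVE)
    # crude targeting heuristic: an abuse term plus a second-person pronoun
    return has_abuse_term and bool(SECOND_PERSON.search(lowered))

print(is_abusive_reply("you absolute idiot"))   # True
print(is_abusive_reply("they called him an idiot"))  # False (no second person)
```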
this is useful for comparative findings, tracking abuse trends, and for approximating actual abuse levels. the method for detecting covid-19 related tweets is based on a list of related terms. this means that tweets that are implicitly about the epidemic but use no explicit covid terms, for example, "@borisjohnson you need to act now," are not flagged.

to understand more about the kinds of tweets attracting high levels of abuse, we considered several approaches for ranking them. ranking tweets by the most replies will surface prominent individuals, but perhaps not always polarising individuals or viewpoints. we decided on an initial criterion that a tweet had to have received % or more abusive replies, which is nearly twice the average level of abuse noted by [ , , ]. in addition, to filter out tweets that were attracting just a handful of abusive replies, we examined tweets that received at least abusive replies. all tweets were first examined and coded openly, as suggested by braun and clarke [ ], to see which patterns emerged. contentious subjects (such as brexit, racism, and even jeremy corbyn), as well as potential media agendas (such as reaching out to undervalued communities, or making criticisms of the party in government), emerged from this analysis. we then compared the open codes to themes from the literature regarding social media activities that politicians may undertake during a health crisis, as well as ideological themes and existing priors that may influence how uk mps are perceived. we created a final set of categories through processes of reduction and comparison across the codes, and re-coded the data according to the final annotation scheme in table . an example row from the annotation scheme is: tweet: "for god's sake. the man nearly died. he is going to chequers away from the glare of publicity to recuperate and be with his partner who is due to give birth in a matter of weeks. get a life idiot keyboard warrior! #hardhearted" (andrea jenkyns); code: direct rebuke; definition: mps critiquing someone who is not directly an authority. in addition to these codes, we also examined the topics mps referred to in their posts and grouped them inductively into categories of topics similar to the above, but with "escalation indicators" (described below) for topics considered controversial in uk politics. these were: home rule/nationalism with respect to northern ireland, scotland or wales; inequality; and brexit, alongside specific individuals and the ways in which they communicate. of course, the topic of covid-19 itself (the government response and the impacts) was a primary topic category as well. following the distribution of these codes and topics in our sample, we used descriptive statistics and further qualitative analysis to expand on trends and observations uncovered through our computational approaches.

in consideration of rigour, we have taken several steps to adjust for having used one annotator for the qualitative analysis. barbour has suggested focusing on alternative explanations in analysis, rather than a potentially superficial measure of inter-rater agreement through multiple coders [ ]. in addition, we provide a full justification of the coding scheme against each tweet in the sample, available at the url noted in section , so that other researchers can interrogate and interpret our findings accordingly. in the following section, we present our findings, organised by research question, including both the quantitative and qualitative analyses that answer each particular research question.
for rq , we rely primarily on our literature review of trends and a high-level analysis of abuse toward british mps, comparing this with our findings from the covid-19 period we studied. for rq , we consider how our findings fit the dimensions noted by andreouli et al [ ] as impacting contemporary british political discourse: ideology, authority and affect. we explore the four factors that contribute to how these dimensions are perceived, such as prominence, specific events, engagement habits and features of identity [ ]. finally, for rq , we present our qualitative analysis of the social media activities of uk mps during covid-19 thus far. we contextualise the abusive responses they receive for these activities, given the dimensions and contributing factors that may play a role.

our first research question asked: how has the context of covid-19 impacted the typical patterns that have been observed in previous work about hateful and abusive language toward uk mps? to answer this question, we begin with a review of the time period studied, namely february th until may th inclusive, placing it in historical context. gorrell et al [ ] use the same (cautious) abuse counting methodology as we use here to show that, aside from a blip around the general election, abuse toward mps on twitter has tended to rise, from a minimum of % of replies in , peaking mid- at over %, with a smaller peak of around . % around the general election. after the election, however, abuse toward mps fell to around . %. in the timeline in fig we zoom in on the study period, and show abuse levels overall, toward all mps, on a per-week basis since mid-february. this timeline shows a rise in abuse, back up to over % around the time of the introduction of social distancing, before dipping, and then gradually beginning to rise again later in the study period. we see that the macro-averaged abuse level (red line) remains relatively steady, suggesting that this fluctuation is confined to a small number of high-profile politicians (and is therefore more evident in the micro-averaged blue line).

fig shows abuse received as a percentage of all replies received by mps, for six distinct time periods discussed in more detail below. we see that on the whole, the response to the conservative party has been favourable. the exception is after may th, when the negative response to dominic cummings' decision to travel north with covid-19 symptoms came to the fore. responses to liberal democrat mps are more erratic due to their lower number. in previous studies, we have found conservatives receiving higher abuse levels, yet here we see labour politicians receiving more abuse in most periods. this was in evidence even in february, so it precedes the pandemic, although twitter has tended to be left-leaning in the uk [ ]. it remains to be seen if this is the beginning of a swing to the right or if it is specific to the times, e.g. arising from a desire to trust authority during times of crisis [ ]. there is a significant negative correlation between receiving a high level of covid-related attention and receiving abuse (- . , p < . , feb th to may th, spearman's pmcc). we see this clearly in prominent government figures below, who are receiving the lion's share of the covid-19 attention and lower levels of abuse than seen for them in pre-covid periods [ , ]. however, the correlation is significant across the sample of all mps.
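as an illustration, a correlation of this kind can be computed from two per-mp aggregates; the values below are invented placeholders, not the study's data.

```python
# a minimal sketch, assuming per-mp aggregates in two parallel lists;
# the numbers are illustrative placeholders, not the study's data.
from scipy.stats import spearmanr

covid_attention = [0.42, 0.10, 0.31, 0.05, 0.27]  # share of replies mentioning covid, per mp
abuse_level = [0.02, 0.05, 0.03, 0.06, 0.02]      # share of abusive replies, per mp

rho, p_value = spearmanr(covid_attention, abuse_level)
print(f"spearman rho = {rho:.2f}, p = {p_value:.3f}")
```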
the reaction of the public to the conservative party and the government's actions during covid- may be related to the conditions of a public health crisis as discussed in [ , ] , in which citizens may feel more motivated to trust authorities, although it may also follow from the crisis engaging a different group of people than usually respond to politicians on twitter. with a view to separating out different groups of twitter users, we tracked hashtags relating to dominant pro- vs anti-lockdown perspectives, as well as issues of concern, namely conspiracies and misinformation, and racism in conjunction with the pandemic. pro- and anti-lockdown hashtags were easily acquired, being dominant hashtags appearing in the dataset. they were then extended with minor linguistic variants. a report from moonshot cve was used as a guide to the overall conspiracy landscape within covid- . they provide some hashtags, and variants were then acquired, again, by looking down the list of hashtags appearing in the dataset for other variants, and including linguistic variations. the areas they highlight are anti-chinese feeling/conspiracy theory, theories that link the virus to a jewish plot, theories that link the virus to an american plot, generic "deep state" and 5g-based theories, and general theories that the virus is a plot or hoax. table shows substantial evidence of ill-feeling toward china. (table : mention count of viewpoint-related hashtags, in all replies to mps, feb th to may th inclusive. some further variants of the terms given, including non-hashtag mentions in text, are also included but not listed here for brevity; see gorrell et al [ ] for a more complete description.) in our analysis, mps using chinese data or referencing the chinese government's response to covid- in a positive context appear to attract abuse. one example, in terms of receiving a high percentage of abuse as well as a notable degree of attention, was the one below from richard burgon. https://twitter.com/richardburgon/status/ ( % abuse, or % of all abuse sent to mps in march post-lockdown): this is a trump-style attempt to divert blame from the uk government's failures. a world health organization report says china "rolled out perhaps the most ambitious, agile & aggressive disease containment effort in history". we haven't even sorted out enough tests for nhs staff. china's record on human rights and transparency is often offered as a counter-argument to such tweets. however, mixed in with this are also a number of sinophobic comments about china having "caused" or "started" the virus, for which at present there is no reliable scientific evidence. it is not clear how potentially useful critiques of the chinese government may be discussed without also provoking more sinister, racist commentary. classic conspiracy theories are in evidence but numbers of mentions are low (though note that most of the mentions of "nwo" ("new world order") are now covid- -related, suggesting opportunistic incorporation of covid- into existing mythologies). there is considerable evidence of some twitter users not believing in the virus, and numbers of mentions to this effect are within one order of magnitude of the popular "stay home save lives". yet all are surpassed by the theme of economic support for those not in established employment ("#newstarterfurlough"). so across the board, covid- has not led to higher proportions of abuse on twitter for mps compared with the high levels of abuse directed at them in .
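a minimal python sketch of the hashtag-tracking step, in the spirit of the mention counts in the table above, follows. the group names and variant lists are illustrative stand-ins, not the study's curated lists.

```python
from collections import Counter

# hypothetical viewpoint groups; each maps to a hand-curated seed tag plus variants
HASHTAG_GROUPS = {
    "pro-lockdown": ["#stayhomesavelives", "#stayathome"],
    "anti-lockdown": ["#endthelockdown", "#backtowork"],
    "virus-denial": ["#scamdemic", "#hoaxvirus", "#plandemic"],
}

def count_groups(reply_texts):
    """count replies mentioning at least one tag from each viewpoint group."""
    counts = Counter()
    for text in reply_texts:
        lowered = text.lower()
        for group, tags in HASHTAG_GROUPS.items():
            if any(tag in lowered for tag in tags):
                counts[group] += 1
    return counts

print(count_groups(["#StayHomeSaveLives everyone", "this is a #scamdemic"]))
```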
however, these findings might be partially explained by varying degrees of engagement by different societal groups, in addition to events affecting attitudes to authority. as we will see in section . , the comparatively positive response to boris johnson might be explained by more people replying to him than would normally do so; this extra attention was not abusive. the lower levels of abuse received by mps who receive more tweets mentioning covid- might also be explained by different people replying to politicians than usually would. a particularly striking illustration of this comes from tweets to mps using the hashtag #newstarterfurlough and variants. people who had recently started a new job "fell through the cracks" for financial support from the government. with , tweets to mps using #newstarterfurlough and variants (compared with only , using #stayhomesavelives and variants), #newstarterfurlough is the dominant hashtag campaign of the period. given that those individuals are in an unfortunate position, it is all the more surprising to find that only . % of those tweets contained abuse, as shown in table . a possible explanation is that the "new starters" are a broader, and more polite, cross-section of society than people who usually reply to politicians on twitter. in contrast, tweets containing #stayhomesavelives and variants contained . % abuse. tweets containing hashtags refuting the very existence of the virus, for example #scamdemic and #hoaxvirus, contained . % abuse. tweets describing covid- as "chinese", e.g. containing #chinesevirus, contained . %. tweets found containing anti-lockdown hashtags contained . % abuse. our second research question asked: what are the societal dimensions that appear to impact how media activities are perceived during the covid- pandemic, and how does this compare with those that impacted brexit or recent general elections? for this question, we drew on previous work described in section , more specifically in . , on levels of abuse toward uk mps. we then examined the time period covered by our study in the context of three dimensions drawn from andreouli et al [ ] : political authority, ideology and affect. we use the factors from gorrell et al [ ] to help further describe these dimensions in terms of: prominence, event surges, engagement and identity. as mentioned previously, brexit created a notion of political authority that presented sovereignty of the uk on one side or community with europe on the other [ ] . during the past three uk elections, partisanship has led to a splintering of political authority and an erosion of trust [ ] . during the covid- pandemic, we can see a similar effect, for example, in the wearing of face-coverings as a personal versus social choice, and in participation in protest versus public health (see https://www.southwalesargus.co.uk/news/ .wearing-face-covering-mask-walesnot-compulsory/ and https://www.forbes.com/sites/tarahaelle/ / / /risking-their-lives-to-save-theirlives-why-public-health-experts-support-black-lives-matter-protests/# b ac b ). increased name recognition and popularity have also been associated with higher levels of abuse in both the cases of brexit and general elections [ , , , ] . our covid- data shows similar trends. fig shows the number of replies to prominent politicians since early february, and shows that for the most part, attention during covid- has focused on boris johnson. he received a large peak in twitter attention on march th.
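the per-hashtag abuse rates reported above (e.g. for #newstarterfurlough versus #stayhomesavelives) could be derived with a sketch like the following, assuming each reply carries a binary abusive flag from the abuse classifier; the field names are illustrative placeholders.

```python
def abuse_share_by_group(replies, hashtag_groups):
    """replies: iterable of dicts like {"text": str, "abusive": bool};
    returns, per group, the fraction of matching replies flagged abusive."""
    stats = {group: [0, 0] for group in hashtag_groups}  # group -> [matched, abusive]
    for reply in replies:
        lowered = reply["text"].lower()
        for group, tags in hashtag_groups.items():
            if any(tag in lowered for tag in tags):
                stats[group][0] += 1
                stats[group][1] += int(reply["abusive"])
    return {group: (abusive / matched if matched else 0.0)
            for group, (matched, abusive) in stats.items()}
```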
replies were received in response to his tweet announcing that he had covid- . abuse was found in . % of these replies. this is low for a prominent minister, as we may discern from fig , indicating a generally supportive response to the prime minister's illness. further peaks on mr johnson's timeline correspond to the dates on which he was admitted to intensive care (april th), left hospital to recuperate at chequers (april th), and began to ease the lockdown (may th). the late burst of attention on other politicians arises from several tweets by ministers in support of dominic cummings, the senior government advisor who chose to travel north to his parents' home in the early stages of his illness with covid- . however, one tweet receiving a high level of abuse regarded the very first video address made by boris johnson in response to the pandemic: https://twitter.com/borisjohnson/status/ ( % of replies were abusive; the tweet received % of all abuse to mps in the period and includes a video statement): this country will get through this epidemic, just as it has got through many tougher experiences before. for those who trust in boris johnson's leadership and who like the way he communicates, this tweet may have provided some comfort. a review of replies to this tweet shows that supporters tweeted messages of appreciation and hope in response. for those who do not trust him and who believe he should have acted sooner, this tweet was perceived as a provocation. several replies that are critical but not abusive point to official sources of information from elsewhere in europe, or offer advice to the public about staying home and avoiding non-essential journeys. in this sense, covid- lends its own political authority to some arguments. however, invoking covid- as a member of the opposition, especially in the context of persistent debates, is often met with accusations of "playing party politics". the following tweet by david lammy received % abusive replies, representing % of all abuse sent to mps in the march pre-lockdown period: https://twitter.com/davidlammy/status/ no more government time, energy or resources should be wasted on brexit this year. boris johnson must ask for an extension to the transition period immediately. #covid is a global emergency. several mps made comments about the need to extend the brexit transition period. all received a considerable number of replies containing abusive language. more generally, there is disagreement about the priorities of government during a crisis. for example, there was a considerable amount of abuse directed at richard burgon for a tweet in which he discussed the accomplishments of his work as shadow justice secretary. this was regarded by some critics as mistimed, given the pm's health at the time, and received % abuse, constituting % of all abuse sent to mps between april st and th inclusive. in later sections, we will explore how the media activities of opposition parties are perceived by the public, in particular in connection with contentious subjects. events are somewhat in a category of their own, as an mp's past actions and words become part of the public's priors in understanding the position of that mp. to contextualise the level of attention to mps and the abusive replies they received in connection with specific events, we review the events of the period, both in terms of who is receiving abusive replies and in the themes that rise during the period, as demonstrated by the appearance of certain hashtags.
the hashtag cloud in fig shows that brexit remained the dominant topic in twitter political discourse during february, with the epidemic not yet having arrived in the uk. table gives a baseline for attention on mps as we go into the pandemic, showing that aside from boris johnson, attention and abuse are high for labour politicians. the column "authored" refers to the number of tweets originally posted from that account that were not retweets or replies. "replyto" refers to all of the replies received to the individual's twitter account in that period. the next column, "cov", is the number of replies received to that account containing an explicit mention of covid- , and the following column ("abusive") gives the number of replies in which verbal abuse was found. the last three columns present the data in a comparative fashion. firstly, we have the percentage of replies that the individual received that were abusive. next, we have the percentage of replies that were covid-related. the last column is the percentage of covid-related replies to that individual, in comparison with all covid-related replies received by all mps. the word cloud in fig shows all hashtags in tweets to mps in the earlier part of march, and unsurprisingly shows a complete topic shift to the subject of the epidemic, to the virtual exclusion of all else. we see from table that with the arrival of covid- in the uk, health secretary matt hancock became more prominent on twitter at this time, though attention was not more abusive. attention on chancellor rishi sunak also increased and was not abusive. this is consistent with previous research that indicates a public willingness and desire to trust authorities in a crisis [ , ] . we see a high level of attention on boris johnson, but the abuse level is lower than was seen for him in previous years (we found . % in the first half of as mentioned above; in as foreign secretary mr johnson received similarly high abuse levels in high volumes). negative attention on labour politicians is high, but note that this was also the case before the start of the epidemic in the uk. matt hancock received % abusive replies ( % of all abuse to mps in the march pre-lockdown period) to the following tweet, in which he released his telegraph article on the government's response to the virus. https://twitter.com/matthancock/status/ news: my telegraph article on the next stage of our #coronavirus plan: we must all do everything in our power to protect lives. critics were angry that mr. hancock would post important government information, during a time of extreme uncertainty, behind a paywall. this goes back to the public's information-seeking needs during a health crisis [ , ] . not only is information important for collective sense-making, it is also important for determining personal risk [ ] . with the commencement of lockdown, the rise of the hashtags #stayhomesavelives and #lockdownuknow shows a shift toward comment on the practical details. support for the lockdown appears to be high at this stage, with the top ten hashtags featuring only pro-lockdown or generic covid- tags. attention continues to focus on boris johnson (see gorrell et al [ ] for complete word clouds and tables as well as histograms for each period), and is even less abusive than previously, largely due to a surge in non-abusive attention in conjunction with his being diagnosed with covid- .
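the per-mp table columns described above could be assembled with a sketch like the following, assuming a pandas dataframe of replies with per-reply flags; the column names are illustrative placeholders, not the authors' pipeline.

```python
import pandas as pd

def mp_reply_table(replies: pd.DataFrame) -> pd.DataFrame:
    """replies columns: mp (str), is_covid (bool), is_abusive (bool)."""
    grouped = replies.groupby("mp")
    table = grouped.agg(cov=("is_covid", "sum"), abusive=("is_abusive", "sum"))
    table["replyto"] = grouped.size()  # all replies received in the period
    table["pct_abusive"] = 100 * table["abusive"] / table["replyto"]
    table["pct_covid"] = 100 * table["cov"] / table["replyto"]
    # share of all covid-related replies, across every mp, going to this mp
    table["pct_of_all_covid"] = 100 * table["cov"] / table["cov"].sum()
    return table.sort_values("replyto", ascending=False)
```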
by volume, the most abuse-generating tweet was boris johnson's illness announcement, but as a percentage this was remarkably un-abusive, as discussed above, with only . % abuse. the high abuse count follows only from the very high level of attention this tweet drew. moving into april , the rise of the hashtag #newstarterfurlough shows that prior to donald trump's "liberation" tweets and the visible emergence of an anti-lockdown backlash, attention had already begun to focus on the economic cost of the lockdown, as illustrated by the prominence of hashtags such as #newstarterfurlough and #wearethetaxpayers. boris johnson's abuse level continues to be low as his illness takes a serious turn. in the context of the pandemic, different influences from the public also have a measure of authority. for example, the high abuse level toward jack lopresti during this period relates to his controversial opinion that churches should open for easter. strong opinion in conjunction with a religious event is part of a pattern that we see in conjunction with eid in the next section. https://twitter.com/jacklopresti/status/ ( % abuse, % of all abuse sent to mps between april st and th inclusive): open the churches for easter and give people hope https://telegraph.co.uk/news/ / / /open-churches-easter-give-people-hope/?wt.mc_id=tmg_share_tw via @telegraphnews https://twitter.com/jacklopresti/status/ ( % abuse, % of all abuse sent to mps between april st and th inclusive): today i wrote to the secretary of state @mhclg and also sent a copy of this letter to secretary of state @dcms to ask the government to consider opening church doors on easter sunday for private prayer. from mid-april , a notable backlash against lockdown began to emerge. hashtags now appear to be critical, often economically focused but also including accusations of lying levelled against china, boris johnson and conservatives, and references to the shortage of personal protective equipment for medical workers. the distinct change in tone echoes events in the usa. in this context it is interesting, therefore, that the tweet receiving the most abusive response by volume (it also received a striking level of abuse by percentage) is this one by ed davey. https://twitter.com/edwardjdavey/status/ ( % abuse, % of all abuse toward mps for the period): a pre-dawn meal today preparing for my first ever fast in the holy month of ramadan. for muslims doing ramadan in isolation, you are not alone! #ramadanmubarak #libdemiftar the following tweet also attracted high levels of abuse by volume: https://twitter.com/jeremycorbyn/status/ : ramadan mubarak to all muslims in islington north, all across the uk and all over the world. this tweet received % abusive replies, % of abuse for the period - this was also st george's day, so it was perceived as evidence of anti-english sentiment, as in the following paraphrased replies, for example: "@jeremycorbyn so nothing about st george's day then? ah, that's because we are english, the country you wanted to run but hate with a vengeance. and you wonder why you suffered such a huge defeat at the election" and "@jeremycorbyn so no mention of st. george's day then? you utter cretin." these attempts to reach voters, and how they are perceived by those not within the same ideological framework, will be discussed in section . as lockdown begins to be eased in may , we see a return to a high level of focus on boris johnson, with , replies compared to matt hancock's , as the next most replied-to mp.
other senior conservatives are also prominent. high levels of abuse are received by ministers who defended dominic cummings' actions on twitter: matthew hancock ( . %), oliver dowden ( . %) and michael gove ( . %). boris johnson also receives more abuse than he did in the previous period ( . %). example tweets are given below of ministers defending mr cummings: https://twitter.com/matthancock/status/ ( % abuse, % of abuse for the period): i know how ill coronavirus makes you. it was entirely right for dom cummings to find childcare for his toddler, when both he and his wife were getting ill. https://twitter.com/matthancock/status/ ( % abuse, % of abuse for the period): dom cummings was right today to set out in full detail how he made his decisions in very difficult circumstances. now we must move on, fight this dreadful disease and get our country back on her feet. hashtags show a high degree of negative attention focused on the partial treatment of dominic cummings, whilst continued attention on the economic plight of new starters is also in evidence. this signals the beginnings of contention that blossomed in the periods after this study was concluded. the ideologies that influenced brexit, such as anti-prejudice and tolerance [ ] , are still apparent in the context of covid- . "virtue signalling" is a common complaint attached to nearly every tweet that addresses issues of inequality, and to some that are trying to engage voters (see section ). virtue signalling is defined as behaviour that indicates support for causes or sentiments that carry moral value, such as donating to charity [ ] , without much actual effort or care for the topic behind it. disagreement on what constitutes virtue signalling versus actually caring about a social issue creates a "liminal hotspot" where misunderstanding takes place [ ] . we go more deeply into the subject of virtue signalling in section . in many cases, identity and ideology are interrelated through experience. individuals from minority backgrounds speak about racism more often, also in the context of covid- , potentially because they experience it. when mps from under-valued or under-represented minorities speak to their voters about racism, they are not only speaking to voters who need to understand experiences of racism, but also to voters who experience it directly. if the mp has a track record of working toward racial justice, their election is a signal that they should keep doing this work. of the tweets shared by women of colour in our qualitative sample ( tweets receiving a high percentage and number of abusive replies), more than % are about engaging voters and being proactive toward issues of inequality (we see this in more detail in . and in table ). another % are direct rebukes of authority from women in opposition parties. it seems that women of colour are disproportionately carrying the flame for the highly abused topic of inequality, as we see in fig . this may have partly to do with the party they belong to, but it appears to be also partly about the topics and expertise these women bring to the table. for example, bell ribeiro-addy was elected for the first time in december after a long career of addressing inequality in migration. she addresses sinophobia in communications about coronavirus and its origins in this tweet: https://twitter.com/bellribeiroaddy/status/ as senior conservatives publish sinophobic screeds in the rw press to distract from their own governments lethal complacency, its clear racism wont stop for #coronavirus.
neither must we opposing it. join the fightback in an hour! http://ow.ly/ gbm qw jx this quote communicates proactiveness, for example, with the words "oppose" and "fightback". as such, this is the kind of statement that is not meant for those who don't accept racism as a fact of experience, or who do not think it is an important issue in the context of covid- . it is a call to action for those who do. this tweet, from mp rupa huq, addresses the lack of women ministers present at press briefings: https://twitter.com/rupahuq/status/ once again headed up by a man for the umpteenth time - who we only heard a week ago had tested #covid positive to boot. when will downing street allow a woman minister to front up one of these press shindigs? several responses to this tweet ask why the mp chooses to focus on gender, given that the roles most relevant during the covid- pandemic just happen to be held by men, or ask her to explain why she believes a woman could do a better job. some responses also refer to this as "virtue signalling", though the mp is a woman and her comment is about representation of women. in terms of the general elections in and , topics of concern were primarily around the economy, europe and the nhs. while europe and brexit are still present as important subjects, our topic analysis and the hashtags collected in each period show that the economy is now the greatest concern for users on twitter, along with various implications for public health, survival of businesses, unemployment, and social welfare. the significant financial support the government has provided during covid- has revived discussions about socialism and capitalism more generally as economic models: https://twitter.com/jeremycorbyn/status/ there's no statutory sick pay for part-time, low-paid or zero-hours contract workers. and the rate of sick pay isn't enough to live on. wrong at any time - but dangerous while people who might be ill are asked to stay home. the system is broken and now is the time to fix it. this quote from jeremy corbyn attracted supportive messages, including some that are abusive toward boris johnson or other members of the conservative party. however, there were also many criticisms of the tweet, which primarily express exasperation with mr. corbyn or chastise him for bringing this subject up in the middle of a crisis. if one agrees with mr. corbyn and feels that this crisis has only further exemplified the failings of the social welfare system in the united kingdom, his words will resonate. previous work has suggested that impassioned speech provides a measure of credibility on its own. mps have been asked to consider the tone of their messages to the public and to one another in parliament [ ] . however, during covid- , with the country embroiled in a major public health crisis, it is difficult to determine exactly how tone impacts the political discourse. when we were looking for escalations in our qualitative sample, we looked at critiques of mps' tweets to try to understand what someone might object to in a given statement. typically, hyperbolic language, sarcasm, insult, and making something personal are the escalations that are named. however, to avoid potentially classifying a tweet as escalating for using strong language around events that are urgent matters of public health, we required at least two measures of escalation to be present in order to classify a tweet as a potential escalation.
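the two-indicator escalation rule can be expressed compactly as below. note that in the study these indicators were assigned by a human annotator; the automatic-looking function here is only a sketch of the decision rule itself, with indicator names paraphrased from the coding scheme.

```python
# indicator names paraphrased from the annotation scheme described in this paper
ESCALATION_INDICATORS = ["hyperbole", "sarcasm_or_flippancy",
                         "insult_or_abuse", "personalisation", "us_and_them"]

def is_potential_escalation(indicator_flags: dict) -> bool:
    """indicator_flags maps each indicator name to a true/false annotator judgement."""
    present = sum(bool(indicator_flags.get(name)) for name in ESCALATION_INDICATORS)
    return present >= 2  # require at least two indicators, per the coding rule

print(is_potential_escalation({"sarcasm_or_flippancy": True, "hyperbole": True}))  # True
```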
when examining our sample, escalations were present in about a quarter of the tweets. as a percentage of replies, a notable tweet was the following: https://twitter.com/henrysmithuk/status/ ( % abuse, % of abuse for the period of may): not that i should be surprised by the lazy left but interesting how workshy socialist and nationalist mps tried to keep the remote parliament going beyond june. in the context of an increasingly uncertain economic situation for many individuals, respondents felt that mr. smith was accusing those who have been furloughed or who are shielding of avoiding work. several respondents also implied that work in communities to provide social support was not being valued. though mr. smith's comments (and his affect) may have been directed at his colleagues in parliament, he struck a nerve with left-leaning members of the british public as well. many other tweets (n= ) are about rebuking authorities, which is sometimes done using strong language. we have tweets from david lammy in the sample, which is more than % of the sample. a portion of these tweets include some sort of escalation indicator, such as hyperbole, sarcasm, or personal insults that critics tend to pick up on in their replies (see example in table ). once again, this definition of "escalation" was defined by those who criticised the tweets. in order to understand this further, we extended our analysis to look more deeply at the specific subject matter being discussed. the subjects mr. lammy discusses are urgent and controversial in british political discourse, such as racism, brexit and, more personally, the leadership of boris johnson. it is not clear whether he is targeted because of his communication style (as critics say), his party membership, or his ethnic background. affect and race are connected through how much anger and frustration a predominantly white society expects those from a minority background to express [ ] . the stereotypes of the "angry black woman" (or man) persist, in particular in connection with the topic of racism or inequality more generally. it has only been years this past june since black mps were first elected to parliament (https://labourlist.org/ / /watch-labour-celebrates- rdanniversary-of-first-black-mps-elected/). for this reason, it is worth considering how speech is understood through that lens. our third research question asked: which social media activities of uk mps during the covid- pandemic receive the most abusive replies? to answer this question, we applied our coding scheme described in . to a sample of tweets that received a substantial number of replies containing abusive language (see section ). the purpose of this was to identify qualitative differences in authors, content, or delivery that may help explain the negative discourse related to their tweets, and to highlight any other social factors at play. in total, we identified tweets meeting these criteria. mps authored the tweets that received the highest number and percentages of abusive replies. mps are women, which is approximately % of the sample. women make up approximately % of uk parliament. however, of the women, % are women of colour (n= ), though women of colour make up only a small percentage of an already small percentage of mps from a "minority" background. it is important to note that none of these women are members of the conservative party. in fact, labour politicians have authored of the tweets in this sample.
conservatives authored , liberal democrats , the scottish national party and the democratic unionists . to break this down further, we had conservative mps who are female (all white), and male mps. for the labour party, the split is more even, with women and men. table gives corpus statistics in terms of tweets authored. "tweets" is the number of tweets authored, "% of corp" is the percentage of the qualitative corpus that number constitutes, and "% repr." is the representation that demographic has among mps with twitter accounts, for comparison. "# replies" is the number of replies that tweets by that demographic in the qualitative corpus received, and "# abusive" is the number of those replies that were abusive (recall that a tweet is only included if it receives a high level of abuse). examining each tweet, we had categories of social media activities, plus one additional "unclear" media category (see table ). we added a modifier to the media activity of "escalation" if there were combinations of what we referred to as the five indicators of escalation: the presence of hyperbole (language that is perceived as having high valence), sarcasm or flippancy, insult and abusive language, making something personal (criticising the individual rather than their actions), or solidifying "us and them" narratives. these escalation indicators were derived from how critics and abusers of the tweets in our sample speak about escalating language and what they think is antagonistic. if a tweet only contained escalation indicators and no other content, it was categorised as "escalation" only. examples of each category are provided in table , and the full qualitative sample and coding notes are provided at the url given in section . escalations were particularly subjective and therefore difficult to classify. however, only tweets contained some measure of escalation and were coded as "escalation" only. here is an example of a tweet from labour party member ian lavery that contains an escalation, in addition to its main media activity of rebuking authorities: https://twitter.com/ianlaverymp/status/ so does this herd immunity @borisjohnson strategy mean accepting the end of life for many elderly & vulnerable people but others should be fine ? just asking for the elderly lady across the street. this tweet was categorised as an escalation because it contains both sarcasm and hyperbole. some tweets received criticisms and abuse related to a particular event or pattern of behaviour. we have examples of this in our sample. examples include dominic cummings' behaviour during lockdown in may (see section . . ), the birth of boris johnson's child, and sammy wilson's previous voting record on the nhs (when combined with a tweet promoting clapping for the nhs). fig shows the media activity category and the number of abusive replies per category. while tweets with escalations may attract the higher percentages of abusive replies, the most common activities receiving abusive replies are those ordinary to the job of an mp. the most common media activities receiving replies that contain abusive language are direct rebuke of authorities (n= ) and engaging voters (n= ). the following quote from lisa nandy (coded as a direct rebuke of authorities) received % replies including abusive language: https://twitter.com/lisanandy/status/ it is irresponsible and short-sighted from the government to rule out extending the post-brexit transition period.
we should be taking action now to provide certainty for business in the face of this global economic challenge. this is a fairly standard argument from a member of the opposition party who was a "remain" voter, preferring a "soft brexit". likewise, the following tweet from conservative jack lopresti is an attempt to speak to a core group of voters and champion their interests: https://twitter.com/jacklopresti/status/ if off-licences and takeaways are open, churches should be, tory mp claims https://t.co/aa cy xru this tweet was signposting an article in the telegraph in which mr. lopresti makes his views known. it attracted nearly % abusive replies. perhaps unsurprisingly, all of the parties receive replies that contain abusive language when they do the parts of their job that aggravate the other parties. for example, the parties in opposition criticise the party in power, which will defend itself. all parties attempt to reach voters. however, the left and left-centrist parties tend to reach out to voters that conservatives do not, such as religious minorities, migrants and people of colour. the subject of "virtue signalling" arises in criticisms of this type of social media activity. this is evident in the large number of abusive replies received by liberal democrats in response to their participation in ramadan (see section . ). as mentioned previously, virtue signalling is defined as communicating support for a specific issue with high moral value (such as fighting racism) without providing tangible support and effort [ ] . when the accusation of virtue signalling is levelled, the implicit assumption is that the gesture is empty or amounts to "moral grandstanding" [ ] without substance behind it. in this case, in the past years, evidence shows that the muslim community has shifted support from the labour party to the liberal democrats. the party has responded to this, attempting repeatedly to elect a muslim mp. the party has shown some attention to the social challenges of this group, suspending a candidate in for his online comments doubting the existence of islamophobia. they have also had a few gaffes, showing a lack of awareness of the culture, and have still not had a muslim mp, despite efforts. still, there is also evidence that the gesture of fasting and donating to charity as part of ramadan was perceived as showing solidarity with the muslim community during the covid- pandemic (see https://www.aljazeera.com/focus/britishelection/ / / .html ; https://metro.co.uk/ / / /lib-dem-candidate-suspended-comments-muslims- / ; https://www.telegraph.co.uk/politics/ / / /lib-dem-councillor-apologisestweeting-photo-bacon-solidarity/ ; http://muslimnews.co.uk/newspaper/top-stories/record- -muslim-mps-electedmajority-women/ ; https://www.easterneye.biz/lib-dem-mps-to-fast-during-ramadan-to-show-unity-formuslim-community/ ). virtue signalling needs to be considered within a framework of whose attention is being courted and whether or not that community views the attention as tokenistic or meaningful. our second round of coding dealt with the covid- subject referenced or alluded to in the tweet, which we determined inductively by going through the set of tweets using thematic analysis. we assigned tweets to one of categories, including one "non-covid" category (n = ), if the topic was not related to covid- . these non-covid topics include the floods that happened just prior to the pandemic, and some more general thoughts on platform issues that are continuously relevant, such as budget and migration. many non-covid related tweets were posted before covid- had reached the uk to such a significant extent. topics include the cabinet reshuffle, class issues, brexit, and migration.
after covid- began to take hold, those non-covid topics shift and are primarily related to specific issues or events, such as jeremy corbyn stepping down and keir starmer taking the lead of the labour party, renewed references to a one-uk policy/one parliament, boris johnson and the birth of his child, and the liberal democrats' celebration of ramadan. within the covid- topics, some more specific categories, such as "fatalities" or the nhs, were absorbed under the main category of health challenges and deaths to arrive at groupings including roughly the same number of examples from our data sample. the full list of categories can be found at the url provided in section . in fig , however, we show which topics around covid- were associated with receiving more abusive replies in our qualitative sample. leadership and communication (n= ), along with lockdown and social distancing issues (n= ), were the covid- topics with the best representation in the sample, with a high number of replies containing abusive language. these two categories include issues such as perceived government inaction and tone (leadership and communication), and guidance or impacts around lockdown or social distancing, such as wearing a face mask (lockdown and social distancing). the following tweet from labour mp yvette cooper attracted more than % abusive replies, addressing perceived confusion around guidance from the uk government: https://twitter.com/yvettecoopermp/status/ i watched the prime ministers press conference in despair. in a public health emergency communication and information saves lives. yet time & again the government keeps failing to push out a strong clear message to everyone. for all our sakes they urgently need to get a grip. the following tweet from jacob rees-mogg linked to an article about the queen isolating herself amidst covid- concerns. it attracted % abusive replies, mostly for invoking the name of the queen or comparing her experience to those of citizens who are struggling to meet their needs. https://twitter.com/jacob_rees_mogg/status/ as always an example to the nation: god save the queen https://t.co/n egxedzxd?amp= discussion of health challenges and deaths (n = ) received the greatest percentage of abusive replies. this includes conversations around uk fatalities in comparison to other nations, support for the nhs, and issues with testing. this type of comparison, as mentioned previously (especially from someone in a left-orientated party), is generally received as negative, and even unpatriotic in some critiques. women of colour tended to discuss topics that address the needs of minorities and undervalued groups during covid- . white women have a profile more similar to men, in which questioning leadership and communication tended to be the covid- subject for which they received more replies containing abusive language. clearly, questioning the government in power may lead to criticism. our literature review indicated that people tend to want to trust authorities during a crisis.
labour politicians' desire to keep the subject of racism and discrimination at the forefront during covid- attracts some abusive comments for what is called "playing politics", delivering "low blows" or "playing the race card" against the party in power. naturally, the opposition parties believe it is their job to rebuke authorities and suggest alternative policies. likewise, racism is not seen as a platform issue, but as a social issue that is continuously relevant. in this sense, what is relevant to covid- and should be prioritised is being negotiated in some of this dialogue. when we looked deeper at the controversy that might be latent in the topics above, we identified more than distinct subjects from our open coding, and had one category of "unclear". some categories were related, such as islam and muslims, or racism and immigration. we reduced the categories to predominant issues (see fig ): home rule/nationalist perspectives, inequality and perceptions of inequality, brexit (a continued issue with new relevance), covid- response and impact, and lastly, people and communication (which includes subjects like personal folly, tone, etc.). the proportions of each subject appearing in tweets from different demographic groups, as well as overall, are shown in fig . for example, the #stayalert slogan of the conservative government received considerable criticism for being confusing and potentially working against the goal of encouraging citizens to simply stay home. criticising the government's efforts, mp ian blackford tweeted: https://twitter.com/ianblackford_mp/status/ "#stayalert. what kind of buffoon thinks of this kind of nonsense. it is an invisible threat. staying alert is not the answer #stayhomesavelives is." (fig caption: the highly abused tweets in the qualitative sample, split by topic and by demographic. "ww" means white women, "woc" means women of colour, and similarly for men.) this tweet received both a number of abusive replies to mr. blackford for criticising the government's attempts to resolve complex challenges during the pandemic, and a number of supportive replies (some of which are also abusive toward the government and certain ministers). this polarisation indicates a liminal hotspot around trusting and critiquing authorities in crisis. in terms of party insights, the liberal democrats received a considerable amount of attention for speaking about ramadan and the muslim community. the snp authored tweets that appear to address long-standing conflicts about the union and its governance, including pro-scotland and anti-snp sentiment ( % of their sample), for example: really the most politically frustrating aspect of this crisis is not having an independent scotland to do the emergency guaranteed income, to do the testing - to take a different path. we are instead shackled to inept economic extremists at westminster. -angus macneil the snp, as one might expect, gathers some abusive replies when tweeting support for scottish independence or for promoting scottish excellence. hannah bardell received abusive and critical replies for posting the following tweet: https://twitter.com/hannahb livimp/status/ once again the pm following nicola sturgeons lead. the tweet was accompanied by an article from the guardian about the prime minister's decision to ban mass gatherings.
angus macneil was called "divisive" and "divisionist" for the following tweet in response to clapping for the nhs: it is "nhs scotland" https://twitter.com/guardiannews/status/ /photo/ likewise, making statements in favour of uniting the four uk countries, or making light of those countries' prerogative to handle the pandemic differently, attracts strong criticism and abuse. for example, stephen crabb received several abusive replies for the following tweet: quite the change in rhetoric from the days when welsh government were encouraging people to come to wales and drive their x 's right onto the beaches. another liminal hotspot is around who is expected to speak about social issues of injustice, which commonly attract abusive responses. women of colour were speaking mostly about inequality when they received replies containing abuse ( tweets, which is % of our sample of women of colour, and % of our total sample). men of colour more closely resembled the subject attention of both white women and white men in the more general discussion of covid- response and leadership through the crisis (for all men, issues with leadership or covid- response made up % of all topics; for white women, more than %; % of the topics discussed by men in their sample are about controversial people specifically, such as boris johnson, jeremy corbyn and dominic cummings). all activity online can be viewed as communication and persuasion: there are people on different sides of different issues, vying for public attention. this can attract positive and negative responses. however, overall, what our investigation has shown is that dimensions of political discourse are mediated by perceptions of power, potentially due to the uncertain situation created by covid- . in the sections below, we summarise our findings about the influence of ideology, political authority and affect on how the words of mps are communicated and interpreted by the public during the covid- pandemic. when parties on the left speak directly with and from undervalued communities or voters, this may be perceived as virtue signalling by the right. when a party is not in power, this is even more difficult to communicate. however, when the party in power shows a lack of tact toward excluded communities, this is also judged more harshly. when the party in power communicates policy about controversial issues in the name of the people, it will get push-back from people who did not vote for that party. when the left attempts to meet voters by discussing issues (racism, migration) that a portion of the public may feel are external to relevant british politics right now (covid- and brexit), they will get abuse. however, for the left, those are topics that exist everywhere all the time and are systemic. looking at successful interventions by opposition parties in the government's activities may constitute an interesting area for future research. more specifically, this research could help to answer questions about the origins of priorities and disagreements. what the answers to our three research questions indicate is that it matters who is "in charge" when looking at how the public and other ministers respond to the social media activities of uk mps during covid- . the party in power (along with its members) will have more responsibility to the public for mastering tone and explaining their actions. opposition parties will have more difficulty in a health crisis avoiding being perceived as unnecessarily antagonistic.
in addition, we found that it matters who a person is and what they represent in whether or not an individual will be perceived as a trustworthy authority. many tweets in our sample appeared to be speaking to core groups of voters and other parliamentarians. they are not necessarily an invitation for debate. in terms of affect, our data shows that it matters what you say and how you say it, particularly in connection with priors. if jeremy corbyn posts about racism, and has been continuously in the news for not handling antisemitism in his party, he will get some angry replies. if sammy wilson voted against a pay-rise for nurses in , and then posts a "clap for carers" post on twitter, those who remember his prior voting record will be angry. the more affect a tweet carries, the more it appears to aggravate. vitriol as a result of one's previous political statements or actions is one side of the story. hate is another. the question of what is abusive versus what is hate speech needs to be disentangled, as both need to be disentangled from racism. racism does not only involve hate speech. it also involves a) expecting people of colour to champion racial equality, as the breakdown of topics indicates, and b) framing racism as a fringe issue. this is evidenced in our dataset. abuse, though uncomfortable and uncivil, is a different type of speech whose study may be useful for any number of discussions, potentially on the subject of agonism or counter-speech. agonism argues that the contestations of the time can be used to renew democracy and strengthen public discourse [ ] . promising work on recognising and highlighting counter-speech in online communication is already on the horizon [ , ] . in this paper, we explored how uk mps contribute to the information and communication environment during covid- , and the abusive replies that they receive. contextualising these activities in terms of what the public expect during a health crisis, how ministers typically use social media to communicate in crisis, and which mitigating features of either the person or context interfere in those activities, we were able to advance the conversation about online abuse in some new directions, such as how to understand virtue signalling or what it means to play party politics. building on previous studies of abusive language toward british mps, we offered a large-scale, mixed-methods study of abusive and antagonistic responses to uk politicians during the covid- pandemic from early february to late may . we found that - similarly to other key moments in british contemporary politics - political ideology, authority and affect have played a role in how mps' social media posts were received by the public. in the context of covid- , we found that pressing subjects, like financial support or unemployment, may attract high levels of engagement, but do not necessarily lead to abusive dialogue. as with earlier findings, prominence and event surges impact the number of abusive replies mps receive. in addition, the topic of the tweet (in particular if it is divisive) and the individual bringing that topic into discussion (their gender, ethnicity or party, for example) impacted levels of abuse. women of colour appear to bring the topic of inequality to the table, and this attracts a variety of abuse. other mps may be discussing inequality and not receiving abuse (which this work did not cover). in conclusion, this work contributes to the wider understanding of abusive language online, in particular that which is directed at public officials.
issues of power, which are crystallised in terms of political power or social power, impact communication at all stages. this work was supported by the esrc under grant number es/t / , "responsible ai for inclusive, democratic societies: a cross-disciplinary approach to detecting and countering abusive language online." no conflicts of interest or competing interests. data is available here: https://gate-socmedia.group.shef.ac.uk/abuse-mps-covid/

references:
brexit and emergent politics: in search of a social psychology
the angry black woman: the impact of pejorative stereotypes on psychotherapy with black women
presumed asymptomatic carrier transmission of covid-
checklists for improving rigour in qualitative research: a case of the tail wagging the dog?
and they thought papers were rude
the politics of crisis management: public leadership under pressure
thematic analysis
belief in science influences physical distancing in response to covid- lockdown policies
characteristics of health care personnel with covid- , united states
hate unleashed: los angeles in the aftermath of proposition
conan - counter narratives through nichesourcing: a multilingual dataset of responses to fight online hate speech
the covid- social media infodemic
protecting organization reputations during a crisis: the development and application of situational crisis communication theory
covid- : identifying and isolating asymptomatic people helped eliminate virus in italian village
a large-scale crowdsourced analysis of abuse against women journalists and politicians on twitter
managing a health crisis on facebook: how the response strategies of apology, sympathy, and information influence public relations
twitter as a public relations tool
exploring misogyny across the manosphere in reddit
covid- : navigating the uncharted
preventing genocide by fighting against hate speech
coronavirus conspiracy beliefs, mistrust, and compliance with government guidelines in england
twitter use by the us congress
race and religion in online abuse towards uk politicians
online abuse toward candidates during the uk general election
which politicians receive abuse? four factors illuminated in the uk general election
partisanship, propaganda and post-truth politics: quantifying impact in online debate
local media and geo-situated responses to brexit: a quantitative analysis of twitter, news and survey data
mp twitter abuse in the age of covid- : white paper
twits, twats and twaddle: trends in online abuse towards uk politicians
managing uncertainty: using social media for risk assessment during a public health crisis
transmission of covid- to health care personnel during exposures to a hospitalized patient, solano county, california
covid- : deprived areas have the highest death rates in england and wales
the social media logic of political interaction: exploring citizens and politicians relationship on facebook and twitter
is ethnicity linked to incidence or outcomes of covid- ?
the dark power of words: stratagems of hate in the presidential campaign
who tweets? tracking microblogging use in the swedish election campaign
early transmission dynamics in wuhan, china, of novel coronavirus-infected pneumonia
social media in the uk election campaigns - : experimentation
covid- : how to be careful with trust and expertise on social media
democracy and agonism in the anthropocene: the challenges of knowledge, time and boundary
covid- vs social isolation: the impact technology can have on communities, social connections and citizens
thou shalt not hate: countering online hate speech
turds, traitors and tossers: the abuse of uk mps via twitter
estimation of the asymptomatic ratio of novel coronavirus infections (covid- )
ethnicity and covid- : an urgent public health research priority
in the name of hate: understanding hate crimes
public expectations of social media use by critical infrastructure operators in crisis communication
freedom of speech requires actions: exploring the discourse of politicians convicted of hate-speech against muslims
toxic for whom? examining the targets of uncivil and intolerant discourse in online political talk
anti-romani speech in europe's public space - the mechanism of hate speech
from incivility to outrage: political discourse in blogs, talk radio, and cable news
violence, hate speech and inflammatory broadcasting in kenya: the problems of definition and identification
twitter, incivility and everyday gendered othering: an analysis of tweets sent to uk members of parliament
social media and political communication: a social media analytics framework. social network analysis and mining
moral grandstanding
hate speech in spain against aquarius refugees in twitter
a work-in-process literature review: incorporating social media in risk and crisis communication
hate multiverse spreads malicious covid- content online beyond individual platform control
blind trust: large groups and their leaders in times of crisis and terror
social media messages in an emerging health crisis: tweeting bird flu
the harm in hate speech
the global impact of covid- and strategies for mitigation and suppression
consuming "good" on social media: what can conspicuous virtue signalling on facebook tell us about prosocial and unethical intentions
turds, traitors and tossers: the abuse of uk mps via twitter. ecpr joint sessions
clinical course and outcomes of critically ill patients with sars-cov- pneumonia in wuhan, china: a single-centered, retrospective, observational study

not applicable. acknowledgements: thanks to mehmet e. bakir for comments and suggestions.

key: cord- -nvi h t authors: dinh, ly; parulian, nikolaus title: covid‐ pandemic and information diffusion analysis on twitter date: - - journal: proc assoc inf sci technol doi: . /pra . sha: doc_id: cord_uid: nvi h t the covid‐ pandemic has impacted all aspects of our lives, including the information spread on social media. prior literature has found that information diffusion dynamics on social networks mirror that of a virus, but applying the epidemic susceptible-infected-removed (sir) model to examine how information spreads is not sufficient to claim that information spreads like a virus. in this study, we explore whether there are similarities in the simulated sir model (sirsim), observed sir model based on actual covid- cases (siremp), and observed information cascades on twitter about the virus (infocas) by using network analysis and diffusion modeling.
we propose three primary research questions: (a) what are the diffusion patterns of covid‐ virus spread, based on sirsim and siremp? (b) what are the diffusion patterns of information cascades on twitter (infocas), with respect to retweets, quote tweets, and replies? and (c) what are the major differences in diffusion patterns between sirsim, siremp, and infocas? our study makes a contribution to the information sciences community by showing how epidemic modeling of virus and information diffusion analysis of online social media are distinct but interrelated concepts. prior work asserts that the sir model can be applied to examine the decline of diffusion activities on the friendster online social network, and found that the decline started when popular users left friendster (labeled as r in the sir model). however, applying the epidemic sir model to examine how information spreads is not sufficient to claim that information spreads like a virus (lerman, ; wu, huberman, adamic, & tyler, ) . there are mechanisms that influence how information spreads from one user to another but do not influence how a virus spreads from one person to another, and vice versa (lerman & ghosh, ; mønsted, sapieżyński, ferrara, & lehmann, ) . in this study, we examine in parallel the epidemic and information diffusion processes, and the mechanisms by which both contribute to covid- 's spread. specifically, we compare covid- virus's (a) sir-modeled and (b) empirically observed diffusion patterns with (c) information cascades of retweeting, quote tweeting, and replying behaviors on the twitter social network to understand the relationships between information and virus diffusion. to do this, first, we create an sir simulation (we call this sirsim) of covid- 's diffusion with respect to empirically validated parameters such as reproductive rate (r0), incubation period, and symptom length range. secondly, we create an sir model from actual confirmed cases with data gathered from johns hopkins university (jhu-csse, ) (we call this siremp). thirdly, we construct information cascades from our collected twitter data (we call this infocas) based on three dimensions: retweets, quote tweets (the same as retweets but with a comment included), and replies to tweets. for the information cascades, we also categorize each piece of information as either susceptible (a new tweet about the virus), infected (retweets, quotes of tweets, or replies to tweets), or removed (a tweet not shared by others after a period of time). consistent with these aspects of the study, we propose three primary research questions: rq : what are the diffusion patterns of covid- virus spread, based on sirsim and siremp? rq : what are the diffusion patterns of information cascades on twitter (infocas), with respect to retweets, quote tweets, and replies? rq : what are the major differences in diffusion patterns between sirsim, siremp, and infocas? our study makes a contribution to the information sciences community by showing how epidemic modeling of virus and information diffusion analysis of online social media are distinct, but interrelated, concepts. with the advent of social networking sites and online microblogs such as twitter, individuals can create and exchange information with larger numbers of people in less time. these online social networks are thus instrumental for researchers to examine what types of information diffuse between individuals and what underlying mechanisms facilitate the diffusion.
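a minimal sketch of the infocas categorization described above follows, mapping each tweet to an sir-style state. the field names and the inactivity window are illustrative placeholders; the paper defines removal as "not shared by others after a period of time" without the exact window being shown here.

```python
from datetime import datetime, timedelta

REMOVAL_WINDOW = timedelta(days=1)  # hypothetical "no further sharing" period

def sir_state(tweet: dict, now: datetime) -> str:
    """tweet: {"is_interaction": bool, "last_activity": datetime}.
    is_interaction marks retweets, quote tweets, and replies."""
    if tweet["is_interaction"]:
        return "I"  # an interaction with an original covid-19 tweet
    if now - tweet["last_activity"] > REMOVAL_WINDOW:
        return "R"  # original tweet no longer shared after the window
    return "S"  # a new tweet about the virus, still open to being shared
```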
in the context of social networks, information diffusion is formally defined as a process by which a piece of information is passed from one node to another node through an edge (gruhl, guha, liben-nowell, & tomkins, ; guille, hacid, favre, & zighed, ). two seminal models have been widely adopted to examine diffusion dynamics with network structure considered, namely independent cascade models (goldenberg, libai, & muller, ) and linear threshold models (granovetter, ). independent cascade models assume that each node has a certain fixed probability to spread, or "infect", a piece of information to a neighboring node. on the other hand, linear threshold models posit that a node would be "infected" by a piece of information if a certain threshold of neighboring nodes have also been infected by that information. both models have been widely used to detect influential topics (gruhl et al., ) and influential users (yang & leskovec, ) in online social networks and the impacts they have on diffusion rate. (gruhl et al., ) focus on the spread of topics on blogs based on rss (rich site summary) feeds and found that topics were either consistently popular (called "chatter") or only popular for a short time (called "spikes"). the authors also observed that topics with high chatter also contained larger and more frequent spikes. (yang & leskovec, ) demonstrate that an influential node can be detected with respect to how many nodes have been influenced by that particular node before. in addition to the independent cascade model and linear threshold model, scholars studying information diffusion from a wide range of disciplines have also found utility in modeling diffusion as an epidemic process. in particular, the sir model has been frequently used to explain how information in an online social network becomes "infectious" and passes from one node to another. sir is known as a compartmental model because it categorizes an individual to be in one of three states at a certain point in time: susceptible (s), infected (i), or removed (r) (kermack & mckendrick, ). an individual may transition their state due to influence from another individual in the same network, in which the transition is linear (s→i, i→r). at the first transition point, s→i occurs because a susceptible individual was in contact with an infected individual and therefore got the virus. the infection assumed at this transition point is at a constant rate of β per time unit. at the second transition point, i→r occurs when an infected individual either recovered from the virus and got immunity from it, or has been removed (i.e., has died). at this transition, the model assumes that the recovery rate is fixed at γ per time unit. these assumptions are stated in the following set of equations for (s), (i), (r) at time (t): ds/dt = -β·s·i; di/dt = β·s·i - γ·i; dr/dt = γ·i. (abdullah & wu, ) examine how trending news spread on twitter by sorting users into three compartments: s for users who saw tweets from an infected user, i for users who tweet about a news topic, and r for users who no longer tweet about a topic after a predefined timeframe of h. the authors also assume a fixed infection rate β and recovery rate γ in their epidemic simulation and observed model with twitter data, and found a strong fit between the models.
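to make the transition dynamics above concrete, the following is a minimal r sketch of the sir updates (r being the analysis language used elsewhere in this corpus); the beta, gamma and initial-state values are illustrative placeholders, not parameters from any of the studies cited.

```r
# a minimal discrete-time sketch of the sir dynamics described above;
# beta (infection rate) and gamma (recovery rate) are illustrative values.
sir_step <- function(state, beta, gamma) {
  with(as.list(state), {
    new_infections <- beta * s * i   # s -> i at rate beta per time unit
    new_removals   <- gamma * i      # i -> r at rate gamma per time unit
    c(s = s - new_infections,
      i = i + new_infections - new_removals,
      r = r + new_removals)
  })
}

state <- c(s = 0.999, i = 0.001, r = 0)  # proportions of the population
trajectory <- matrix(NA, nrow = 100, ncol = 3,
                     dimnames = list(NULL, c("s", "i", "r")))
for (t in 1:100) {
  state <- sir_step(state, beta = 0.3, gamma = 0.1)
  trajectory[t, ] <- state
}
head(trajectory)  # s declines while i rises and later falls back as r grows
```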
in addition to news, scholars have also examined whether false rumors and disinformation diffuse on social networks in a manner similar to how an infectious disease spreads (jin, dougherty, saraf, cao, & ramakrishnan, ; nekovee, moreno, bianconi, & marsili, ). research by (nekovee et al., ) conceptualizes rumor spreading as an epidemic transition process between ignorants, spreaders, and stiflers. they found that the rumor spread rate is higher in scale-free networks than in random graphs. their finding is consistent with (lerman & ghosh, )'s observation that information cascades on twitter follow a power-law distribution. (jin et al., ) also refine the sir model to examine rumor diffusion by adding exposed (e) and skeptical (z) individuals, and found that the rate of rumor infection (i) increases as the rate of e decreases, and the susceptible (s) rate decreases as z increases. other works have also found sir models to be useful in explaining the diffusion of content on other social networking platforms such as flickr (cha, mislove, adams, & gummadi, ) and digg (ver steeg et al., ). on the other hand, several studies observe that there are clear differences between the sir epidemic model and the information diffusion process. (goel, munagala, sharma, & zhang, ) do not find a strong correlation between the sir model and observed retweet cascades, as the epidemic model does not take into account users' characteristics. similarly, (liu & zhang, ) point out that the information diffusion process includes variables not in the sir model, such as the content of the information, the strength of ties among individuals, and other social factors. in light of diverse findings on the extent to which sir models can explain information diffusion on social networks, we examine whether there are similarities in our simulated sir model (sirsim), the observed sir model based on actual covid- cases (siremp), and observed information cascades on twitter about the virus (infocas). we empirically test whether there are similarities between the information diffusion process on twitter about covid- topics and the diffusion of the virus itself between individuals. to do this, we develop three different networks. the first two networks are created to capture the diffusion of the covid- virus in the entire population, via an sir simulated model (sirsim) and an observed model based on reported data about infected (i) and removed (r) cases (siremp). the third network is constructed from information cascades on twitter (we call this infocas), where infected (i) are tweets that interacted with the original tweets about covid- by either retweeting, quoting, or replying, and removed (r) includes tweets that are no longer interacted with for a defined period. we describe the datasets used and the process of constructing each network in the following sections. all data collected and code used in this work are available on figshare (dinh & parulian, ). we implement an sir simulation model of covid- on netlogo, an open-source environment for agent-based modeling. we extended an existing model of virus spread on netlogo, and refined the model parameters based on official sources' information about covid- spread, as shown in table . we keep the parameters constant throughout the simulation, and set the duration of the simulation to days. we choose the duration of days to reflect the timeframe between december , and march , . we choose december , as opposed to december , as the first date of covid- to take into account the days (see table for virus symptom length) of symptoms leading up to the confirmation of the infected case. the initial population for our model includes the entire world population, at . billion people.
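as a rough illustration, the settings just described can be collected into a single configuration object before a run; every numeric value in this r sketch is a hypothetical placeholder, since the exact figures are elided in the text.

```r
# a minimal sketch of the simulation configuration described above;
# all values are hypothetical stand-ins for the elided parameters.
sim_params <- list(
  duration_days   = 90,                    # hypothetical simulation length
  population      = 7.5e9,                 # approximate world population
  r0              = 2.5,                   # hypothetical reproductive rate
  incubation_days = 5,                     # hypothetical incubation period
  symptom_days    = c(min = 7, max = 14),  # hypothetical symptom length range
  start_date      = as.Date("2019-12-01")  # hypothetical start of the run
)
str(sim_params)  # quick check of the assembled settings
```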
figure shows the netlogo interface of our sirsim model, with additional parameters included to simulate the transitions of agents from s→i and i→r. adhering to the sir model, s agents represent the carriers of the virus, i agents are those infected by the carriers, and r are agents who are removed due to death. due to computational limitations that make it difficult to represent each individual as an agent, we group million people in each agent (#-people-per-agent setting). thus, our model contains , agents interacting with one another. the first agent represents patient zero, and originates in the city of wuhan on our world map (x-axis: , y-axis: - ). we assign agents to move around major cities across the world (e.g., new york city, paris, tokyo, moscow) (see table a in appendix). all agents initially start in the s state, except for patient zero, who then spreads the disease by contact with agents from other cities through two modes of traveling: driving (parameter mode = "human") or flying (parameter mode = "plane"). we set these parameters through the use of the patches (pixel) feature, enabling each agent to move certain distances depending on the patch size. the circumference of our simulated "world" is pixels, and with the given circumference of , miles, each patch covers about miles in our model. to simulate driving, we calculate the average mileage driven per day ( . miles), and then derive a movement of . patches per day for each agent. to simulate flying, each agent has a random chance to create an airplane and fly to any other major city. while our model accounts for many parameters that are reflective of actual virus spread dynamics, we do not take into account any virus control strategies such as quarantine or social distancing. we repeat the simulation over iterations to ensure the reliability of the experimental results. each iteration result is presented as a network that contains multiple types of nodes: susceptible, infected, and removed. an edge can form between any two node types, and a node's type can change over time (e.g., from susceptible to infected if there is an edge between the two nodes), except for when a node has been labeled as removed.
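the driving-distance conversion described above reduces to two divisions; a minimal r sketch follows, with hypothetical figures standing in for the elided pixel, mileage and circumference values.

```r
# a minimal sketch of the patch-distance arithmetic described above;
# the numeric inputs are hypothetical placeholders.
world_pixels    <- 360     # hypothetical circumference of the simulated world, in patches
world_miles     <- 24901   # approximate circumference of the earth, in miles
miles_per_patch <- world_miles / world_pixels

avg_daily_miles <- 25      # hypothetical average mileage driven per day
patches_per_day <- avg_daily_miles / miles_per_patch
patches_per_day            # how far a 'driving' agent moves per simulated day
```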
we gather actual cumulative cases of covid- from the johns hopkins center for systems science and engineering (jhu csse) data repository. this repository contains global confirmed cases, death cases, and recovered cases from january to march , , for over countries (jhu-csse). to our knowledge, this data repository is the most comprehensive so far, with triangulation of case counts from sources (e.g., who, china cdc, italy ministry of health, worldometers). we analyze this dataset within the assumptions of the sir model, where s are individuals in the population that are not yet infected nor immune to the virus, i is equivalent to "confirmed cases" in the dataset, and r is equivalent to "death cases". we do not include the "recovered" cases in our model as the data does not indicate whether these cases are re-entered into the "confirmed cases" in later time-frames. in the original dataset, there is no inclusion of s, given that susceptible nodes include all members of the world population. the third dataset we use for this research is twitter data that contains information about covid- . we collect tweets during the period of december , to march , with a maximum of , samples (a limit set by firehose) for each day from the crimson hexagon firehose. we collect , tweets that include any or all of the hashtags #coronavirus, #covid , #ncov. we construct information cascades based on three primary behaviors that occur between tweets in our dataset: (1) retweet, (2) quote tweet, and (3) reply. we exclude all tweet content originating from european countries, in recognition of the general data protection regulation (gdpr). based on the sir model, we define the conditions for infected nodes and removed nodes below. our approach does not consider susceptible nodes because, in this context, susceptible tweets are all tweets that exist on twitter. an original tweet that has yet to be retweeted or interacted with is counted in this category; there are , tweets in this category. if a tweet interacts with an s tweet through either retweeting, quoting, or replying, the tweet is counted as infected. (a) retweet is an action of reposting an original tweet without changing the original tweet content. an original tweet is counted as removed if it has not been retweeted, quoted, or replied to by other tweets in a defined period. we used the average delta time between each activity on the original tweet as our incubation period. therefore, if there is no user interaction with the tweet within the average time frame since the latest spread, we consider the tweet removed. average delta time statistics for each type of cascade (retweet, quote tweet, and reply tweet) can be seen in table . there are , tweets in total that are in this category. an information cascade is determined by the period in which other tweets (i) interact with an original tweet (s) in this dataset. given an original tweet (t ) posted at time t , the cascade c at time t (c t ) comprises the tweets that have interacted with it by time t. for each type of information cascade, we analyze the cascade growth by aggregating the s, i, and r tweets for each day.
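a minimal r sketch of this cascade bookkeeping follows; the toy data frame and its columns (id, parent_id, created_at) are hypothetical stand-ins for the collected twitter fields, and the removal rule mirrors the average-delta-time heuristic described above.

```r
# a minimal sketch of the cascade bookkeeping described above, on toy data.
library(dplyr)

tweets <- data.frame(
  id         = 1:6,
  parent_id  = c(NA, 1, 1, NA, 4, 1),  # NA marks an original (s) tweet
  created_at = as.POSIXct("2020-03-01 12:00:00", tz = "UTC") +
               c(0, 1, 5, 2, 50, 30) * 3600
)

cascades <- tweets %>%
  filter(!is.na(parent_id)) %>%                 # infected: interactions only
  group_by(parent_id) %>%
  arrange(created_at, .by_group = TRUE) %>%
  summarise(
    size      = n(),
    # mean gap between successive interactions (NaN if only one interaction)
    mean_gap  = mean(as.numeric(diff(created_at), units = "hours")),
    last_seen = max(created_at)
  )

# an original tweet is 'removed' if no interaction occurred within the
# mean gap since its latest activity
now <- max(tweets$created_at)
cascades <- cascades %>%
  mutate(removed = as.numeric(now - last_seen, units = "hours") > mean_gap)
cascades
```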
our first research question asks about the diffusion patterns of covid- based on both a simulated sir model (sirsim) and the actual number of cases from empirically-validated sources (siremp). for sirsim, across iterations of our simulation, we find the average count of susceptible agents to be , . million, the average count of infected to be . million, and the average count of removed to be . million. thus, the proportion of healthy but susceptible agents in our model is . % (s). only % (i) of agents are infected by the virus, and only . % (r) are removed due to death. figure shows the distribution of infected (blue line, left) and removed (red line, left) agents per day, non-cumulatively; both trendlines show an increasing pattern. the proportion of removed cases is much lower than that of infected cases, as shown in the network visualization in figure . we then compare these results to siremp, which finds that as of march , , there were , infected cases and , removed cases (deaths only). as a proportion of the world population, therefore, infected cases amount to . %, and removed cases to a minimal percentage. by comparison, the empirically-validated results show substantially lower proportions of infected and removed agents and, in turn, a higher proportion of susceptible agents. we also analyze the distribution of infected (blue line, right) and removed cases (red line, right) for siremp, and find multiple spikes in the blue line but a flat distribution for the red line. the spikes in infected counts are due to the inclusion of cases from countries such as the u.s., south korea, and italy. in comparison to the distributions from sirsim, the distribution of removed cases in siremp is relatively static throughout.
figure: distribution of infected and removed agents for sirsim (left) and siremp (right) models.
figure: sirsim network. blue nodes = infected cases, red nodes = removed (death) cases.
table (network statistics) shows the sizes of the three network cascades within infocas: retweet, quote tweet, and reply tweet. we find that the retweet cascade is times larger in size than the quote tweet cascades, and times larger than the reply cascades. this finding is consistent with the notable differences in the number of cascades present in each network, in which the retweet network has times more cascades than the quote tweet network, and times more cascades than the reply tweet network. figure presents the rapid growth in tweet activities, with a stark increase in retweets, quote tweets, and reply tweets during mid-january. we find that the growth distributions for all three tweet types follow a logarithmic curve. in addition, the number of infected users, equivalent to individuals spreading the information, is much higher compared to the new information, consistently across the three observations. we also observe that the cascade growth for retweets is substantially higher than the growth for quote tweets and reply tweets.
figure: retweet, quote tweet, and reply tweet growth for each day during the covid- outbreak period. x-axes represent the day, y-axes represent the number of tweets. new information represents the original source of information, infected represents an interaction with another user, and removed represents the end of the information spread after a defined period.
table shows the coefficients and parameters for each linear fit of the number of tweets to the day-period. as we can see from the table, the slope for retweets is the highest, followed by quote tweets and reply tweets. the slope for removed information is the lowest compared to infected and new information, and this is consistent for all cascade types. this indicates that as new information is introduced each day, some portion of the information stops spreading. we aggregate the data from the sir simulation over iterations (sirsim) and csse's real-infection data (siremp) and analyze the correlation with twitter's information growth (infocas) for the same time period. table shows the correlation values, in terms of pearson's correlation, for each sir state. for the cascades of infected nodes, we find the highest correlation between sirsim and infocas retweets (r = . ). the second-highest correlation is between retweets and quote tweets (r = . ). another notable correlation is between sirsim and quote tweets (r = . ). siremp has low correlations with all other types of cascades, with correlations ranging from . to . . in terms of the removed node cascades, there is also a high correlation observed between infocas retweets and quote tweets (r = . ). retweets also have a high correlation with reply tweets (r = . ). these two correlations show that retweet cascades are most correlated with quote tweets and reply tweets with respect to tweets that are no longer interacted with, and thus can no longer spread that particular tweet content in the network. the correlation between infocas and sirsim is relatively lower (r = . - . ), showing a weaker relationship between the simulated model and twitter's observed removed cascades. similarly, there is a weak relationship between siremp and all infocas cascades, especially with reply tweets (r = . ).
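comparing the three models then reduces to correlating aligned daily series; the r sketch below fabricates illustrative counts purely to show the shape of the computation, not to reproduce the study's figures.

```r
# a minimal sketch of the correlation comparison described above; the
# daily counts are synthetic placeholders aligned on the same dates.
set.seed(1)
days <- 90
daily <- data.frame(
  sirsim_infected  = cumsum(rpois(days, 5)),   # hypothetical simulated counts
  siremp_infected  = cumsum(rpois(days, 4)),   # hypothetical reported counts
  retweet_infected = cumsum(rpois(days, 20)),  # hypothetical retweet interactions
  quote_infected   = cumsum(rpois(days, 6)),   # hypothetical quote interactions
  reply_infected   = cumsum(rpois(days, 3))    # hypothetical reply interactions
)
round(cor(daily, method = "pearson"), 2)       # pairwise pearson's r matrix
```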
our study focuses on the diffusion patterns of the covid- virus itself and the information shared online about the virus. to capture the diffusion patterns of the virus, we create an sir model (sirsim) based on empirically-validated transmission dynamics of covid- (e.g., reproductive ratio, incubation period), and then compare it with actual confirmed cases of covid- from january to march , (siremp). to examine the diffusion patterns of information discussed online about covid- , we construct three cascades (infocas) based on retweets, quote tweets, and reply tweets on twitter that mentioned covid- from the period of december st to march , . our first research question asks about the diffusion patterns of the covid- virus, based on the epidemiological assumptions of sir. from our sirsim model, we find the proportion of infected cases to be only % of the entire world population, and the proportion of removed (dead) cases to be only . % of the population. our model accounts for days since the first case of the virus, and the upward trajectory beyond linear growth suggests to us that the rate of infections and deaths may increase logarithmically. this is consistent with current findings on covid- that find the distributions of infected cases follow a logarithmic distribution (cao et al., ; maier & brockmann, ). (cao et al., ) find the logarithmic growth rate suitable considering that covid- is at a relatively early stage, and thus growth is slowly increasing. we also find notable differences between the simulated model and the actual confirmed cases of covid- (from siremp). in fact, the distribution of removed cases in siremp is flat, as opposed to the increasing distribution observed in sirsim. there are two reasons for the mismatch between the simulated and actual distributions of sir cases. the first is that our model does not take into account preventive measures such as social distancing, self-quarantine, and shelter-in-place, which are found to be effective in "flattening the curve" (lewnard & lo, ; parmet & sinha, ). the second reason may be that the quantification of infection and death rates needs further modification, specifically because there is still limited testing (ioannidis, ) and there are reporting delays (gardner, zlojutro, & rey, ). the second research question asks about the diffusion patterns of information cascades on twitter about covid- . we construct a retweet cascade, a quote tweet cascade, and a reply cascade (we call these infocas) to fully capture the different types of interactions between users on twitter. all three cascades show a strong fit with a linear-log distribution, suggesting a power-law decay in the diffusion of new information about covid- over time. with this finding, along with the cascade length of each tweet type, we expect that the retweet cascade decays at the fastest rate, given that its cascade length is only approximately hours. on the other hand, we find quote tweets' average cascade length to be about days, which means that each original tweet that has been interacted with via quotes has a longer duration of activity. this is also observed for reply tweets, where the average cascade length is about days. the third research question focuses on the correlation in diffusion patterns between sirsim, siremp, and infocas to address the connection between epidemic and information diffusion dynamics. based on the examination of infected cascades, we find stronger positive correlations between sirsim and infocas retweets (r = . ) and quote tweets (r = . ).
on the other hand, we observe low correlations between siremp and all three infocas types (r = . - . ). this shows that the distribution of infected agents is more strongly correlated between infocas and sirsim, and not so much with siremp. with the rapid spread dynamics seen in sirsim, this correlation shows that tweets about covid- get retweeted most quickly, followed by quote tweets, and then reply tweets. the correlation between sirsim and siremp is relatively low (r = . ), which may indicate either that the simulated model potentially overestimates the infection rate, or that the actual reported cases underestimate it. for the removed cascades, we find the strongest correlations among infocas cascades, specifically between retweets and quote tweets (r = . ), retweets and reply tweets (r = . ), and quote tweets and reply tweets (r = . ). we find weaker correlations between infocas and sirsim (r = . - . ), and the weakest correlations between infocas and siremp (r = . - . ). this result is consistent with our observation that the removed distribution in siremp is more uniform and flat compared to the other distributions. it is also expected that the removed distribution for infocas would differ from sirsim, given that the likelihood of tweets transitioning from infected to removed is notably higher. overall, we find complex relationships between the diffusion dynamics of covid- from the simulated virus spread model, the actual reported cases of the virus spread, and the information shared and discussed online. our study demonstrates how epidemic modeling, in combination with examining information cascades about the virus, can help capture the many activities surrounding the covid- pandemic. in future work, we hope to expand our data collection to more recent dates, given the constantly-changing nature of the pandemic. additionally, we aim to improve our simulated epidemic model (sirsim) to include additional control variables that reflect prevention strategies, namely social distancing, self-quarantine, and shelter-in-place.
references:
an epidemic model for news spreading on twitter
estimating the effective reproduction number of the -ncov in china. medrxiv
coronavirus disease
characterizing social cascades in flickr
covid datasets to examine diffusion patterns
modeling the spreading risk of -ncov
who director-general's opening remarks at the media briefing on covid-
a note on modeling retweet cascades on twitter
talk of the network: a complex systems look at the underlying process of word-of-mouth
threshold models of collective behavior
information diffusion through blogspace
information diffusion in online social networks: a survey
a fiasco in the making? as the coronavirus pandemic takes hold, we are making decisions without reliable data
coronavirus -ncov global cases by johns hopkins csse
epidemiological modeling of news and rumors on twitter
a contribution to the mathematical theory of epidemics
the incubation period of coronavirus disease (covid- ) from publicly reported confirmed cases: estimation and application
information is not a virus, and other consequences of human cognitive limits
information contagion: an empirical study of the spread of news on digg and twitter social networks
scientific and ethical basis for social-distancing interventions against covid-
information spreading on dynamic social networks
effective containment explains sub-exponential growth in confirmed cases of recent covid- outbreak in mainland china
evidence of complex contagion of information in social media: an experiment using twitter bots
theory of rumour spreading in complex social networks
covid- : the law and limits of quarantine
the collapse of the friendster network started from the center of the core
what stops social epidemics
information flow in social groups
characteristics of and important lessons from the coronavirus disease (covid- ) outbreak in china: summary of a report of cases from the chinese center for disease control and prevention
global health crises are also information crises: a call to action
modeling information diffusion in implicit networks
key: cord- - ju uu authors: nikolovska, manja; johnson, shane d.; ekblom, paul title: "show this thread": policing, disruption and mobilisation through twitter. an analysis of uk law enforcement tweeting practices during the covid- pandemic date: - - journal: crime sci doi: . /s - - - sha: doc_id: cord_uid: ju uu crisis and disruption are often unpredictable and can create opportunities for crime. during such times, policing may also need to meet additional challenges to handle the disruption. the use of social media by officials can be essential for crisis mitigation and crime reduction. in this paper, we study the use of twitter for crime mitigation and reduction by uk police (and associated) agencies in the early stages of the covid- pandemic. our findings suggest that whilst most of the tweets from our sample concerned issues that were not specifically about crime, especially during the first stages of the pandemic, there was a significant increase in tweets about fraud, cybercrime and domestic abuse. there was also an increase in retweeting activity as opposed to the creation of original messages. moreover, in terms of the impact of tweets, as measured by the rate at which they are retweeted, followers were more likely to 'spread the word' when the tweet was content-rich (discussed a crime-specific matter and contained media), and when account holders were themselves more active on twitter. considering the changing world we live in, criminal opportunity is likely to evolve. to help mitigate this, policy makers and researchers should consider more systematic approaches to developing social media communication strategies for the purpose of crime mitigation and reduction during disruption and change more generally. we suggest a framework for so doing. the covid- pandemic has had a profound effect on society worldwide, influencing how we work, interact with others, and travel. unsurprisingly, it has also had an impact on crime, with studies suggesting that lockdown restrictions have been associated with reductions in crimes reported to the police for offences including burglary (e.g.
ashby ; halford et al. ; felson et al. ), shoplifting (e.g. halford et al. ), and assault (e.g. halford et al. ). studies concerned with domestic abuse (usher et al. ; piquero et al. ; campbell ; chandan et al. ; boserup et al. ; pfitzner et al. ) have produced mixed results, with initial spikes being followed by reductions in calls for police service. with such studies it is unclear whether the reductions observed represent reductions in offending or in the rate at which offences are reported to the police. regardless, the patterns observed suggest an impact of the lockdown on these types of crime. increases in crime have also been reported for cybercrime (buil-gil ; hakak et al. ), including online fraud (e.g. naidoo ; cimpanu ), malware (brumfield ), and hacking and phishing (muncaster ; kumaran and lugani ), although hawdon, parti and dearden ( ) report that cybercrime remained unchanged despite the swift change in routine activities. however, data on such crimes is more elusive and analyses, at least in the academic and open-source literature, are less complete than for more traditional crimes such as those discussed above. interestingly, previous research on the impact of epidemics/pandemics on crime is limited. research (fong and chang ) conducted during the sars epidemic examined community collective efficacy in taiwan in communities that experienced sars and those that did not. however, the authors did not directly examine the effect of sars on crime. for this and other reasons, understanding the extent to which covid- has impacted crime is important and will doubtless feature strongly in academic research in future. providing a complete picture of what has happened and will happen will require access to police recorded crime data, but also to data reported to, or collected by, other organisations. this is because not all crimes (e.g. domestic abuse) are reported to the police and because patterns of reporting may have changed during the lockdown. additional insight may also be gained about patterns of offending, and concerns about this, from analysis of data posted to social media platforms such as twitter. in this paper, we analyse data from uk government and law enforcement twitter accounts with a view to understanding how law enforcement used this platform to inform the public about crime risk, and what to do about it, during the early stages of the pandemic. while our focus here is on the covid- pandemic, we consider this to be just one example of disruptions to society with the potential to impact crime opportunity and motivation, and security. as such, we view the research that follows as having implications for other future large-scale disruptions, and national and global emergencies, and how society prepares for them, including anticipating their consequences for crime and security. the use of social media to communicate in times of crisis or disaster has become essential for the mitigation, coordination and recovery of societies hit by disruptions (houston et al. ). for example, to deal with security and public safety during the pandemic, law enforcement and government agencies cannot act alone.
a web of influence has to spread out from these (and other) stakeholders to (for example) other agencies, private businesses and householders to enable them to play their part in mitigating both the pandemic and its knock-on effects on other aspects of life, such as crime and security. characterising that web and how it works is vital to targeting, assessing and improving the influence process. in the uk, 'communication policing', or open-source police communications that use the internet and social media, has been characterised as a new form of community policing by the open source communications analytics research [oscar] development centre. their research on the use of open-source communications by police suggests that social media communications should be routinely incorporated into police investigations, intelligence gathering and community engagement. in the present study, we focus on how uk law enforcement institutions have sought to communicate with and influence others to undertake, or desist from, a range of actions as required. in setting up the paper, we first discuss existing crime science approaches for describing and assessing how 'professional security influence' is spread. there will be many useful parallels in other policy areas, such as medicine (e.g. see michie et al. ) and generic influence processes such as the 'nudge' approach (halpern ), but our focus here is more limited. in studying the dissemination of influences on people's behaviour, and that of organisations, it is helpful to think about the roles to be played and, associated with these, the accompanying responsibilities. opportunity theories of crime (e.g. cohen and felson ) note that crime can only occur when a likely offender and victim converge at a particular place (on or offline) and time, absent a capable guardian. however, these are clearly not the only actors involved. the likelihood that such convergences will occur, and whether they are conducive to crime, is further influenced by the actions of place managers and 'handlers'. handlers are those who have an emotional attachment to a particular offender (e.g. parents, friends) and can exert some control over them (e.g. discouraging them from offending). place managers, on the other hand, are directly responsible for specific locations (e.g. shops, bars, hospitals), and can (for example) ensure the environment is designed to make crime more difficult (e.g. by placing expensive items behind a counter in shops), train their staff to act in particular ways, or employ specific tactics that deter crime or de-escalate situations as they arise. extending the conceptual framework further, sampson et al. ( ) note that the actions of guardians, place managers and handlers are influenced by 'supercontrollers', who can include formal organisations (e.g. regulators, government departments, police forces), diffuse collectives (e.g. the media), as well as more personal networks (e.g. families). while guardians, place managers and handlers can have a direct influence on the likelihood of a crime event taking place, supercontrollers exert their influence indirectly via the impact they have on these latter 'controllers'. other approaches are also relevant. mazerolle and ransley ( ) introduced the concept of 'third-party policing', describing a blurring of the boundary between law enforcement and civil action to tackle crime.
to all these 'crime preventer' roles, ekblom ( ) adds the concept of 'crime promoters': people or organisations that, inadvertently or deliberately, increase the risk of crime, and hence who must be influenced to desist. he also introduces the concept of involvement as a separate crime prevention task from the practical side of implementation, centering on the actions of alerting, informing, motivating, empowering and directing individuals and organisations to undertake particular crime prevention roles/responsibilities that have been identified and assigned. both these additions will be returned to in the discussion section of the paper, but for now it is important to note that twitter can be used as a medium to encourage (or discourage, as appropriate) individuals, or those with a responsibility to reduce crime, to act. in what follows, we examine how uk law enforcement used twitter during the early stages of the pandemic to alert the public and others about crime problems, inform them about how such crimes are committed, and empower them to reduce their risk, or to know the actions they could take if victimised. in the context of crime prevention, much has been learned about what works to reduce crime (e.g., weisburd et al. ). however, as far as we are aware, the evidence base regarding police use of social media to involve people and other agencies in implementing or supporting security interventions is under-developed (see below). as such, this study represents an attempt to catalyse activity in this area. while it is out of scope to examine if law enforcement use of twitter actually influenced the behaviour of the stakeholders listed above (including potential victims), we examine the following related questions: what is tweeted; whether messages are sufficiently retweeted for them to have the potential to have their desired effect; what factors, if any, are associated with whether or how frequently messages are retweeted; and whether messages provide advice that empowers citizens (or others) to act. in the next section, we briefly review research concerned with twitter use and the pandemic, before presenting our methodology and results. in , the micro-blogging platform twitter reported million active users and over million daily posts. as of july , statista reports that the uk ranks fifth in terms of twitter active users, with just over million. the popularity of the platform and its use has consequently attracted much data-driven research (miró-llinares et al. ; ashktorab et al. ; see also: cheong and lee ; kumar et al. ; mandel et al. ; imran et al. ). while the general public's engagement with the platform has raised its popularity, public bodies and government agencies across the world commonly employ twitter to communicate with the populace via their own verified user accounts. previous research (crump ; lee and mcgovern ; heverin and zach ; lieberman et al. ; walsh ) has shown that law enforcement agencies (leas) may use twitter and other social media platforms for operational purposes (e.g. sharing alerts, warnings, and up-to-date and verified information); for building community trust, involving and educating citizens in and on the governance of crime, risk and insecurity; and for sharing successful enforcement stories.
in relation to this, the use of twitter by leas has been saluted for enhancing "police-citizen encounters and the foundational goals of community policing-fostering non-adversarial relations through public participation, decentralised decision-making, and two-way communications" (walsh : ). the use of twitter by leas has also been found to increase transparency, which can (to some extent) increase police legitimacy (grimmelikhuijsen and meijer ). in the uk, leas started to use social media around , with north yorkshire and west midlands police taking the lead by using facebook and youtube to share information about local policing (crump ). it was anticipated that twitter would not become the main platform for police-citizen engagement, as it was difficult for general users to engage with twitter discussions (heverin and zach ; crump ; lieberman et al. ); but the platform has since evolved. the twitter of today has become a primary platform for the sharing of (media-rich) information and news during crises. denef et al. ( ) studied the tweeting practices of two uk police forces during the august riots, finding that one adopted a more formal, or depersonalized, approach, while the other adopted a highly personalized, informal and interactive style which also included interaction with users. they conclude that, as different communication strategies may influence public engagement with police content on social media, there is a need to adjust communication strategies and policies to the local context (see also: meijer and thaens ). police tweeting practices have now become popular, but dekker et al. ( ) suggest that police social media policies inadequately address the barriers, structural and cultural, that may arise, and will need to adapt. they note the benefits of user engagement that twitter affords, including learning from the public. while research on police use of twitter has received relatively limited attention (particularly in times of crisis, and in terms of the approach taken in this paper), research on twitter use during epidemics has received substantially more, mostly focusing on changes to public awareness and the reporting and spread of outbreaks (broniatowski et al. ; grover and aujla ; ji et al. ; smith et al. ; diaz-aviles and stewart ). the swine flu outbreak in / was the most recent pandemic to attract twitter-driven research. most of this research examined public perceptions, or involved the gathering and analysis of twitter data regarding the sharing of information about that pandemic (ahmed et al. ; chew and eysenbach ; kostkova et al. ; mcneill et al. ; ritterman et al. ; signorini et al. ). unsurprisingly, research on the covid- pandemic using twitter data is gathering pace. for example, cinelli et al. ( ) used twitter and other social media data to examine the diffusion of information regarding covid- for the period january to february . alshaabi et al. ( ) analysed the spread of the use of the word 'virus' among languages to track how the covid- pandemic was discussed on twitter through late march. further, dong et al. ( ) created an interactive web-based dashboard that tracks covid- in real time using twitter feeds, while chen et al. ( ) created the first public coronavirus twitter dataset (which is continuously updated). however, to the best of our knowledge to date, twitter data has not been used to examine the covid- -crime association, or law enforcement use of twitter during the pandemic.
for the purposes of this study, to answer the research questions outlined above, we concentrate on user-timeline twitter data concerned with covid- from public sector stakeholders involved in crime reduction across the uk. in what follows, we first describe the approach taken to sampling and data collection. next, we discuss our analytic approach and present our findings. we conclude the article with a discussion of our findings, what they might mean for policy and practice, and future research directions. we first identified each of the police forces (territorial and national) in england, wales, scotland and northern ireland, along with the other uk agencies with responsibilities for crime reduction (e.g. the home office, national police chiefs' council, the college of policing, action fraud and neighbourhood watch). action fraud is the uk's national reporting centre for fraud and cybercrime (see https://www.actionfraud.police.uk/what-is-action-fraud). the full list of ( ) stakeholders considered in this study can be found in additional file : stakeholder list. next, we manually searched for the primary verified twitter accounts for each of the stakeholders. we opted to analyse the activity of only the primary accounts for each stakeholder as, while other accounts exist, we reasoned that these would be the accounts that the general public typically engaged with. for example, the primary twitter account of the greater manchester police is "@gmpolice"; however, this police force also has specialized verified accounts such as "@gmpcitycentre", which concentrates on the policing of manchester city centre, "@gmpfraud", which provides updates from the greater manchester police economic crime unit, "@gmptraffic", which provides updates on greater manchester traffic, and so on. moreover, there currently exists no comprehensive repository of police twitter accounts, which makes the systematic identification of other accounts difficult (for us and the general public). on may , the r package 'rtweet' (kearney et al. ) was used to download the tweets posted and retweeted by these accounts. rtweet is a peer-reviewed r language package designed for implementing calls to collect and organize twitter data via twitter's rest and stream application program interfaces (api); documentation can be found at https://developer.twitter.com/en/docs. while we could not collect tweets for a specified period, the 'user_timeline' search function enabled us to download the previous tweets published by each stakeholder (up to the collection date). this resulted in the extraction of , tweets from all stakeholders. due to differences in the frequency with which stakeholders posted tweets, the date of the first tweet varied for each stakeholder. however, complete data were available for all stakeholders from september . as such, we analyse trends in the data from this period to may , which was the date on which the uk government published its plans for the easing of the lockdown and changed its messaging from 'stay home' to 'stay alert'. this equated to a total of , tweets. selecting the data for this period enabled us to analyse twitter data for the months prior to and since the onset of the covid- pandemic. for each tweet, we downloaded data for variables including the name of the twitter account, the date and time of the tweet, the text tweeted, and the number of likes and times the tweet was retweeted.
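a minimal sketch of the timeline collection step described above follows, using rtweet::get_timeline(), which wraps the user_timeline endpoint named in the text; the handles are illustrative examples only, and a configured api token is assumed, as rtweet requires.

```r
# a minimal sketch of the timeline collection described above; the two
# handles are illustrative, and authentication is assumed to be set up
# separately, as rtweet requires.
library(rtweet)

accounts  <- c("metpoliceuk", "actionfrauduk")  # hypothetical sample of handles
timelines <- lapply(accounts, function(a) {
  get_timeline(a, n = 3200)  # the endpoint caps retrieval at recent tweets
})
tweets <- do.call(rbind, timelines)
nrow(tweets)  # total tweets extracted across the sampled accounts
```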
automated approaches have been developed for the purposes of extracting and analysing large volumes of text data. these include sentiment analysis (pak and paroubek ; kouloumpis et al. ) and message polarity (lima et al. ). however, the reliance on such approaches has been criticized for missing the deeper context or meaning of communications (walsh ). this is particularly likely to apply to novel datasets for which such techniques may not work well. law enforcement tweets may include information on various sorts of crime, the publicising of policing actions, as well as interactive content and suggested crime prevention advice (walsh ). moreover, when we consider the novelty and disruption to social settings that covid- has engendered, such information can become inconsistent and highly variable. for example, many law enforcement agencies have been committed to raising awareness of social distancing and the policing of covid- restrictions. for these reasons, our initial analytic strategy involved the use of a qualitative approach, in this case a thematic analysis (see strauss and corbin ; walsh ; heverin and zach ; crump ; lieberman et al. ). this allowed us to immerse ourselves in the data and capture the richness of its content. our approach to coding is discussed next. we first filtered all , tweets to identify those concerned with covid- . to do this, we searched for all tweets that included terms such as 'coronavirus', 'covid- ', 'pandemic' and their variations. this identified covid- related tweets across all stakeholders. next, we randomly selected a sample of these and coded them manually to enable analysis of their content. this was an iterative process involving the identification of themes that emerged from the data and the development of a coding manual to inform subsequent (automated) coding. after coding about % of the tweets (n = ), it appeared that we had reached saturation in terms of the themes that emerged from the data, with each new tweet fitting one (or more) of the existing themes. we confirmed this by coding a further sample of tweets, ultimately manually coding a total of messages. as one of the aims of the paper was to inform understanding of the types of crime reported as being of concern during the pandemic, we manually coded the crime-related tweets according to the following categories: crime type (what type of crime a tweet focused on), modus operandi (information about how the crime discussed was perpetrated), vulnerability (information about behaviour that may make the public vulnerable to the modus operandi), and any advice offered (e.g. a phone number to report offences to, crime prevention advice, or links to additional file ). next, based on the most common themes and keywords that emerged from the qualitative analysis, we built a coding matrix to automate a content analysis of the tweets. this was implemented in microsoft excel. the coding matrix comprised a series of boolean search terms that took the tweet text as input and generated dummy codes for a total of themes as output. here we note that, as this coding matrix and the boolean terms were developed based on the emergent themes of our qualitative analysis, a different dataset of tweets (for example, from different stakeholders, or stakeholders from different countries) may require a modification of the themes, or of the boolean terms, for an automated content analysis suited to the corpus of tweets in question.
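the spirit of that excel boolean matrix can be sketched in r with grepl(); the keyword lists below are abbreviated, illustrative versions of the study's terms, not the full sets shown in the table that follows.

```r
# a minimal sketch of the automated theme coding described above; the
# patterns are abbreviated stand-ins for the study's boolean terms.
code_fraud <- function(text) {
  t <- tolower(text)
  direct <- grepl("fraud|scam", t)
  paired <- grepl("phish|counterfeit|fake|suspicious|unsolicited", t) &
            grepl("email|text|account|call|link|website|message", t)
  as.integer(direct | paired)  # dummy code: 1 if the fraud theme applies
}

code_domestic_abuse <- function(text) {
  t <- tolower(text)
  as.integer(grepl("domestic|intimate|partner|home", t) &
             grepl("abuse|violence", t))
}

# example usage on a vector of tweet texts
code_fraud(c("beware of this fake tax refund email", "road closed near town hall"))
```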
table provides examples of the boolean terms used, in this case those used to identify incidents of fraud and domestic abuse. as the table shows, some of the boolean terms were more extensive than others.
table: boolean terms used to code themes for fraud and domestic abuse (* is a wildcard operator, such that 'violen*' would identify terms such as 'violence', 'violent' and so on)
fraud: contains: [{"*fraud*", "*scam*", "*phish*", "counterfeit", "illegal", "fake", "pirate*", "forgery", "forged", "falsified", "suspicious", "unexpected", "unsolicited"} and contains: {"email*", "text*", "account*", "call", "attachment*", "link*", "ad*", "good*", "website*", "tax", "photo*", "message*", "impersonate", "pretend", "*takefive*", "actionfraud"}] or contains: [{"*fraud*", "*scam*"}]
domestic abuse: contains: {"domestic", "intimate", "partner", "home"} and contains: {"abuse", "violence"}
to test the reliability of the approach, we applied these functions to another sample of tweets (selected at random) that were not used to generate the keywords or identify themes in the data. doing so generated new 'dummy' values for each tweet for each of the themes discussed above. as an example, consider a tweet that warned that during the pandemic the selling of medical counterfeits on people's doorsteps was increasing and that incidents could be reported to action fraud. for this sort of tweet, the boolean logic would generate positive values for the tweet being: crime related, concerned with fraud, discussing an exploit that was an example of doorstep crime, that the crime involved covid- relief products, and that advice was provided about who to report this kind of incident to and how. for all other 'dummy' variables, zero values would be recorded. to test the accuracy of the automated coding, we also manually coded these tweets and computed a simple index of inter-rater reliability using cohen's kappa statistic (cohen ). the cohen's kappa score (k = . ) calculated indicated near-perfect agreement between the tweets that were manually coded and those that were automatically coded. however, where possible, we modified the original string search function to improve accuracy further. the automated coding was then applied to all , tweets. it is important to note that, as this is a qualitative analysis, the coded categories are not mutually exclusive (e.g. cybercrime and fraud); more than one type of crime could be discussed within a single tweet. to preserve the contextual richness of the data, we coded tweets as concerning all the crime types to which they referred.
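the inter-rater check above can be reproduced in a few lines of base r; the two 0/1 vectors are hypothetical codings for one theme, standing in for the manual and automated outputs.

```r
# a minimal sketch of cohen's kappa for the agreement check described
# above; `manual` and `automated` are hypothetical 0/1 codings.
manual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
automated <- c(1, 0, 1, 0, 0, 0, 1, 0)

tab <- table(manual, automated)
po  <- sum(diag(tab)) / sum(tab)                      # observed agreement
pe  <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # agreement expected by chance
kappa <- (po - pe) / (1 - pe)
kappa
```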
table shows the proportion of tweets that focused on crime or other issues for the period before and during the pandemic, as well as the proportion of tweets that focused specifically on covid- . for all stakeholders, it appears that for each period considered the majority of tweets focused on non-crime issues, but that this was particularly the case for the covid- period. such tweets focused on, for example, government guidance about public behaviour during the pandemic (e.g. regarding frequent handwashing, monitoring symptoms and self-isolating accordingly) and general policing (e.g. police community presence, traffic announcements, and so on). a similar pattern emerged when we focused on the twitter accounts of the territorial police forces only. table shows the proportion of tweets that were original messages, retweets, replies, or quotes. the figures shown are for all tweets, those that concerned covid- , those that concerned covid- and crime, and those that were sent by territorial police forces. it is apparent that the proportion of original tweets sent was about % of all tweets, regardless of whether they concerned covid- or not. however, relative to other tweets, for those that concerned covid- , a much larger proportion of messages were retweets. this was true regardless of whether the twitter account belonged to a territorial police force or another type of stakeholder. at least for this sample of twitter accounts, it seems that (like the virus itself) messages about covid- were more likely to spread than were other types of message. the increased percentage of retweets that were covid- related might be due to the urgency associated with spreading information regarding the pandemic, as retweets require only 'one click' to send, which is simpler than creating an original tweet. whether a tweet is retweeted or not is considered crucial for the dissemination of information, and is an important measure of the impact of the intended message and the visibility of the tweeting account (suh et al. ; boyd et al. ; hong et al. ; zaman et al. ; fernandez et al. ). the above descriptive statistics consider the proportion of tweets that were retweets, but not how frequently messages sent by the stakeholders were retweeted. we consider the latter here. for all tweets, we find that the mean number of retweets, regardless of who retweeted them, was (median = ). however, it is also evident from fig. that some tweets were more 'viral' than others. for example, nineteen percent of all tweets were never retweeted, sixty-six percent were retweeted less than times, whilst five percent were retweeted more than times (one percent more than times). given that the rate at which messages are retweeted is considered an important indicator of their impact, this raises questions about whether there are particular characteristics of tweets that are associated with the frequency with which they are retweeted. some of these might be considered when stakeholders post messages to try to increase the impact of tweets. previous analyses of twitter accounts (e.g. suh ; fernandez et al. ) have shown that characteristics of the account (e.g. the number of followers an account has), as well as the content of the tweet (e.g. whether it includes a url), are significantly associated with the likelihood that a tweet will be retweeted. as far as we are aware, no studies have conducted this kind of analysis for police twitter accounts during a pandemic (but for a general analysis of police twitter accounts, see fernandez et al. ). to examine this issue, we conducted a statistical analysis to examine which factors were associated with the frequency with which messages (original messages, not those that were retweets of existing material) were retweeted. given the skewed distribution of the data, and the fact that we have many zeros, we use a hurdle model to estimate the frequency with which tweets were retweeted. hurdle models (e.g. loeys et al. ; mcdowell ) are used where two data-generating processes are assumed to contribute to the generation of zeros and non-zero values in a dataset. a logit model is used to estimate the probability of observing non-zero values, and an appropriate (truncated at zero) count model is used to estimate the likelihood of observing particular non-zero values (e.g. 1, 2, 3, 4, ...).
in the case of the latter, we use a negative binomial model as this provided a much better fit to the data than did a poisson model. this was illustrated by an improvement (of , , ) in the akaike information criterion (aic), and by the inspection of hanging rootograms, which show the extent to which the model correctly predicts different counts of retweets (see appendix a). zero-inflated negative binomial (zinb) models offer an alternative to the hurdle model. zinb models also estimate the influence of two data-generating processes, but do so using a slightly different approach; one part of the model estimates excess zeros, while the other models non-zero counts and non-excess zeros. that is, both parts of the model estimate zeros, but different types of them (excess and non-excess). as discussed elsewhere (e.g. loeys et al. ; mcdowell ; zeileis et al. ), the two types of model often yield similar results, but the findings from hurdle models are easier to interpret. for this reason, we employ the latter here. for this analysis, we included variables constructed by extracting data from the content of the tweets as well as the metadata associated with the accounts. for the latter, we considered the effect of the number of followers an account had, the number of times the account had 'favourited' other tweets (a measure of account activity), and whether messages were posted by a territorial police force. for the former, we considered whether the tweet text was about covid- , whether the message was about crime, whether messages were about crime in general (as opposed to a specific offence type), whether tweets quoted other tweets, whether tweets were a reply, and whether tweets included a photo. these analyses were conducted using the hurdle() function in the r pscl library. table shows the results. it is apparent that whether a tweet was retweeted, and the number of times it was retweeted, was positively associated with the number of followers an account has, the activity of the account as measured by the number of tweets 'favourited' by the account owner, whether the tweet included reference to covid- , whether it covered a crime topic, and whether it included a photograph. for both parts of the model, the partial regression coefficients shown are exponentiated (i.e. they are odds ratios) and are consequently multiplicative. so, for example, if a tweet contained a photo, that message was almost twice as likely to be retweeted as a tweet that did not, all else equal. replies, quoted tweets, and tweets that discussed crime in general (as opposed to specific crime types) appeared less likely to be retweeted (and were retweeted less frequently) than other types of messages. tweets sent from territorial police force accounts were more likely to be retweeted than messages sent by other account holders, but when they were retweeted, they appear to have been retweeted less frequently.
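a minimal sketch of this specification follows, using pscl::hurdle() as named above; the data are simulated and the predictor set abbreviated, so the code illustrates the model form rather than reproducing the study's estimates.

```r
# a minimal sketch of the hurdle regression described above, fitted to
# synthetic data; the predictor names are illustrative stand-ins.
library(pscl)

set.seed(42)
n <- 500
tweets_df <- data.frame(
  followers   = rlnorm(n, 9),       # hypothetical account metadata
  favourites  = rpois(n, 200),
  covid_topic = rbinom(n, 1, 0.3),  # hypothetical content dummies
  crime_topic = rbinom(n, 1, 0.4),
  has_photo   = rbinom(n, 1, 0.5),
  is_reply    = rbinom(n, 1, 0.2)
)
tweets_df$retweet_count <- rnbinom(n, mu = 3, size = 0.8) *
  rbinom(n, 1, 0.8)                 # zeros plus overdispersed counts

m <- hurdle(retweet_count ~ followers + favourites + covid_topic +
              crime_topic + has_photo + is_reply,
            data = tweets_df,
            dist = "negbin",        # truncated negative binomial counts
            zero.dist = "binomial") # logit hurdle for zero vs non-zero
exp(coef(m))  # exponentiated partial coefficients, as reported in the table
```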
and, while there was an association between which crime types received most coverage across the two periods, there were differences. to highlight these, for the covid- period we also estimated the expected values (and % confidence intervals), assuming that the proportion of tweets concerned with a particular crime theme during this period would be the same as that for the pre-covid- period. relative to the pre-covid- period, for the covid- period we see higher than expected frequencies of tweets concerned with fraud, domestic abuse, cybercrime, child abuse and stalking, and drops in (for example) those concerned with general crime, violence, burglary, terrorism and knife crime. looking closely at the covid- themed crime tweets in particular (fig. ), the majority concerned fraud ( . %), followed by cybercrime ( . %), general crime ( . %) and domestic abuse ( . %). for this paper, we subsequently concentrate on these four crime themes. table provides example tweets (reported verbatim) for each crime type to illustrate the kinds of issues covered. as noted above, in some cases (e.g. the example for fraud), tweets may refer to two of the crime themes that emerged in our content analysis (in this example, cybercrime and fraud). in terms of how the crimes discussed were perpetrated, the majority of covid- fraud tweets concerned tax matters, covid- relief materials and scams associated with working from home. covid- cybercrime tweets also tended to focus on offences related to covid- relief or working from home. covid- general crime tweets mostly included warnings about criminals exploiting the pandemic (in general terms) and victimisation. tweets concerned with domestic abuse tended to concentrate on the impact of the lockdown (i.e. changes to mobility and time spent at home) on this form of offending. most of the tweets that covered these crime-specific themes also offered some form of advice on how to avoid victimisation, or web links where readers could find further information on the topic (via a url embedded within the tweet to an external source of information). however, very rarely were details provided (within a tweet) about how victims could report offences. for example, among the covid- tweets, a reporting number was provided for only . % of those concerned with fraud (n = , ), . % of those concerned with cybercrime (n = ), . % of those concerned with general crime (n = ), and . % of those concerned with domestic abuse (n = ). next, we consider changes in the pattern of tweets over the course of the pandemic. figure shows weekly time series data for the frequency of tweets concerned with fraud, cybercrime, domestic abuse and general crime. while the frequency of tweets concerned with non-specific crime matters (i.e. general crime and offending) remained relatively steady throughout the entire period considered, there was an increase in tweets concerned with fraud, domestic abuse and cybercrime from march onwards. two of the verbatim example tweets illustrate this messaging: "if you're in an emergency situation, but can't talk, here's how to let us know you need help: (link)" and "people facing violence/controlling behaviour at home should still report their experiences to police or seek advice & support from local domestic abuse services. officers will attend calls for help and arrest perpetrators despite the additional pressures on the service. #covid". initially, these tweets explicitly referenced the pandemic (see the frequency of covid- tweets), but the frequency with which this was the case appeared to decline over time.
the aim of this paper was to analyse the content of tweets posted by uk law enforcement and associated agencies during a time of global disruption. in this case, the disruption was due to the covid- pandemic, but the findings of the research also have implications for handling other disruptions and for the use of social media by law enforcement stakeholders more generally. the analysis of , tweets and their metadata indicates that (a) most of the tweets focused on issues that were not specifically about crime; (b) during the time of crisis the stakeholders in question tended to increase their retweeting activity rather than creating original tweets; (c) the visibility of an account (number of followers and favouriting habits) and the richness of the content (discussing covid- , crime-specific issues and including media such as images) were associated with the likelihood of messages spreading (both in terms of whether they were retweeted and the frequency with which this was so); and (d) relative to the preceding months, during the first months of the pandemic tweets on fraud, cybercrime and domestic abuse increased significantly. our finding that most tweets were not crime-focused, but centred instead on encouraging the public to comply with government guidance about behaviour during the pandemic or concerned general policing, is broadly in line with walsh's study of the tweeting practices of migration policing actors, which found that . per cent of tweets sent by policing agencies were informational and intended to raise awareness about policing and operational activities and capacity. in our case, this was even more so when we considered the covid- tweets. it seems that the stakeholders in our sample were 'lending' their tweeting capacity to spread public health-oriented information to raise awareness about the pandemic and its prevention. while the pandemic has proven to be a call for 'all hands on deck', straying from a crime reduction focus may prove counterproductive in some respects. for example, as noted, and in agreement with previous studies (e.g. fernandez et al.; heverin and zach; velde et al.), users tend to retweet law enforcement tweets that contain crime-specific content and are content-rich, with media such as photos, videos and urls. other research also suggests that users favour retweeting messages that contain time-sensitive material (boyd et al.), which may be particularly relevant in times of crisis. therefore, while it is crucial to spread the message about the 'general picture' and urgent issues connected to the disruption in question (in this case, social distancing and lockdown measures), law enforcement stakeholders should consider whether it is better to maintain a focus on the dissemination of crime-specific prevention tweets that are within their mandate. stakeholders should also consider prioritizing information that is time-sensitive (and in doing so boost its impact), and ensure that, as well as discussing such content, they use adjectives that convey its urgency, such as 'urgent', or phrases like 'time-sensitive' (as perhaps they would in an email). such messaging (in fact all messaging) would need special care to check the validity of the content prior to dissemination, as urgency messaging can be fertile soil for spreading fake news or misinformation. moreover, care would need to be taken not to overuse such phrasing, which would likely dilute its potency.
the detected increase in tweets on fraud, cybercrime and domestic abuse is in line with preliminary reports of these crimes being on the rise during the pandemic. the surge in the frequency of (all) tweets concerned with fraud (fig. ) is clearly also explained by the occurrence of covid- specific tweets that mention this type of crime. while there is some evidence of a similar pattern for cybercrime and domestic abuse, this is less clear: tweets concerned with these crime types remain elevated throughout the covid- period, but those that explicitly mention covid- account for a much smaller fraction. one reason for this could be that the particular modus operandi employed to commit these types of crimes may not have changed due to covid- , even though the opportunity or motivation to commit them did. for example, with more people staying at home, the opportunity for domestic abuse may increase. likewise, with more people staying at home and using the internet to work remotely, the risk (per unit of calendar time) of cybercrime would be increased. these are indirect effects of the virus. in contrast, fraudsters have been adapting their modus operandi to create and exploit the specific opportunities that the restrictions associated with covid- present. for example, fraudsters have been selling fake coronavirus testing kits or impersonating governmental bodies involved in the coronavirus crisis response in order to defraud people. at the same time, the fact that people may be increasingly vulnerable to fraud and cybercrime during the pandemic may be explained by how we react when we feel threatened, scared and exposed to uncertainty. for example, experimental research on protection motivation theory (pmt: rogers), which considers how people view suggested actions when they perceive a threat, suggests that when people perceive a high expectation of threat exposure, they are easier to persuade using any information that offers a possibility of threat evasion. moreover, research by floyd et al. suggests that fear-stimulating communications increase the adoption of proposed adaptive behaviours. these findings have informed a number of 'public health'-type programmes intended (for example) to encourage smoking cessation (greening) or to promote cyber-secure behaviours (vance et al.); pmt was also recently used to encourage social distancing and protective measures for hospital staff against the virus (kemp; barati et al.). however, in the case of fraud, it may be that criminals are exploiting the fear associated with the pandemic and the consistent messaging about the need for positive protective action. this may create the conditions for them to trick members of the public into paying for counterfeit (or non-existent) goods (e.g. a vaccine, testing kits, protective equipment and so on) or services (e.g. tax relief schemes). this is an unintended consequence of well-intended messaging. to counter it, our recommendation would be that stakeholders should be mindful when sharing information that may trigger hyper-defensive behaviour and, where possible, provide clear advice, recommendations, or links to trusted sources that can do so; recall that a reporting number was provided in only . % of the covid- fraud tweets we analysed. another point worth noting concerns the precise timing of tweets in relation to that of the lockdown. in all cases, some covid- tweets concerned with crime started to be posted prior to the lockdown.
however, for covid- related fraud, cybercrime and crime in general, twitter activity commenced sooner and increased more rapidly than it did for domestic abuse. in the case of domestic abuse, the peak in twitter activity was observed several weeks after the lockdown had started. given the potential for the lockdown to make this crime more likely, and because victims/survivors may be less able to report offences under such conditions (as they may be more closely monitored by offenders), this is unfortunate. it is easy to say this in hindsight, but it would have been better to communicate about this type of offending when there was more opportunity for victims to contact support services and for their support networks to meet or contact them. for the avoidance of doubt, the above is not a criticism of the communication strategies of the leas examined here, as the conditions were unprecedented and there was much uncertainty about the government's strategy, including the timing of the lockdown. however, lessons should be learned. with respect to future communications strategies, it would be sensible for agencies to engage in short-term foresight activities to review which crimes are most likely to be affected by a disruption (such as a pandemic), and for which crimes the window of opportunity to do something about the problem is collapsing most quickly. most of the guides for police use of social media (at least, those available to the public) emphasise the need for freedom of information and offer advice on privacy and confidentiality best practices (see, for example, the guidelines on the safe use of the internet and social media by ministry of defence police officers). or, as discussed in fernandez et al., they provide general engagement guidelines, such as the need to use simple language and clear and focused messaging. however, to get the most out of it, social media mobilisation in times of disruption may require a more systematic and strategic approach. as discussed, the crimes about which information is to be disseminated could be prioritised according to the emergent or anticipated disruption scenario. but leas may also wish to consider adopting a more coordinated and structured approach. for example, ekblom suggests that mobilisation involves at least seven tasks, encapsulated by the acronym claimed: clarify the specific crime prevention roles, responsibilities and tasks that need to be undertaken in relation to a given crime problem (in the present case, to address the crime risks associated with covid- ), or the inadvertent crime promotion actions that should be ceased (e.g. insecure procedures for tracking and tracing that provide opportunities for fraudsters or distraction burglars); locate the individuals and organisations best placed, as dutyholders or wider stakeholders, to undertake these roles, in terms of, say, expertise, local knowledge, legitimacy, and coverage on the ground; and, having achieved these steps, alert them to the existence and scale of the problem; inform them about the nature of the problem, its causes and consequences, who the offenders are, and so on; motivate them, e.g. by incentives, regulations and laws, naming and shaming, or 'the right thing to do'; empower them with appropriate know-how, legal powers, tools, funds and so forth; and, if appropriate, direct them through audits, commitment to objectives, performance standards and so on. while this framework was initially developed for thinking about crime reduction actors (e.g.
place managers), in times of disruption the above elements of the framework will apply to the public too. note also that while some of the tasks, roles and responsibilities in question will be direct preventive interventions intended to reduce crime opportunities (e.g. how to avoid succumbing to covid- related fraud), others will relate to supporting activities (e.g. providing training for interventions) or to disseminating influence further down the chain or to other stakeholders. in a similar vein, fielding and caddick (n.d.) suggest that there are six communication purposes associated with police use of social media: to publicise, advise, inform, warn, appeal and engage. such frameworks could be used at operational or strategic levels, both to construct individual tweets ('have we considered motivation?', etc.) and to coordinate twitter campaigns (e.g. does the messaging have clear implications for all relevant stakeholders?). they could thus be used to think systematically about what messages are intended to achieve and to tailor the messages to address these goals. more broadly, the sending and receiving of tweets can be considered from a 'system of influence' perspective. as an illustration in the covid- scenario, applying the claimed framework, an inform tweet containing the information that wearing masks is now mandatory should also contain an alert that this can be exploited by criminals through personal protective equipment scams. on the basis of our regression analysis, stakeholders could additionally boost the impact of alert tweets by uploading a photo or other media relevant to the alert. conscious of twitter's character limit, law enforcement actors may wish to consider adding such information through the 'thread' creation option, which allows the insertion of additional tweets as an attachment. of course, this study is not without limitations. chief amongst these is the fact that our findings are for a sample of uk police organisations. different findings may be observed for police organisations in other countries, for other uk stakeholders not sampled here, or for the personal accounts of police officers in these jurisdictions. however, while extending the sample would be beneficial, we believe that the insights provided here achieve our intended aims. nevertheless, creating an open source, cohesive repository of all verified twitter accounts run by uk police forces would be highly beneficial, for users as well as for research. in closing, we emphasise three points. the first is that the pandemic has made it very clear to all that we live in a changing world. additional waves of the pandemic may lead to further changes. however, the pandemic is only one dimension of change. for example, changes to technology, including rapid advances in artificial intelligence (e.g. caldwell et al.), internet connectivity (e.g. blythe and johnson) and biotechnology (elgabry et al.), as well as to society (e.g. brexit) and the environment (e.g. climate change), all have potential implications for crime that require attention (johnson et al.; topalli and nikolovska). for example, like the pandemic, they have the potential to create uncertainty or changes to people's routine activities that criminals might exploit. doing something about these, including communicating about the potential impact of these changes on crime, will be important, and stakeholders and researchers should now be thinking about when and how best to do this. not doing so may mean missing windows of opportunity.
the second point is that there currently exists relatively little research on the use of twitter by law enforcement agencies (for an exception, see fernandez et al.). current guidance tends to focus on the composition of messages, but with an emphasis on compliance with regulation and the avoidance of (say) reputational damage rather than a consideration of what is effective in terms of reducing crime or encouraging crime reduction activity. as such, we encourage other researchers to look at police use of twitter with a view to developing a literature on 'what works'. finally, in the current study, we observed increases in twitter activity about particular forms of crime (fraud, cybercrime and domestic abuse) during the pandemic. future work might examine the extent to which twitter data serves as an open source 'leading indicator' that anticipates, in real time, changes to crime problems of this nature. supplementary information accompanies this paper at https://doi.org/ . /s - - - . additional file: stakeholder list.
references
novel insights into views towards h n during the pandemic: a thematic analysis of twitter data
how the world's collective attention is being paid to a pandemic
initial evidence on the relationship between the coronavirus pandemic and crime in the united states
tweedr: mining twitter to inform disaster response
factors associated with preventive behaviours of covid- among hospital staff in iran: an application of the protection motivation theory
a systematic review of crime facilitated by the consumer internet of things
alarming trends in us domestic violence during the covid- pandemic
tweet, tweet, retweet: conversational aspects of retweeting on twitter. hawaii international conference on system sciences
national and local influenza surveillance through twitter: an analysis of the - influenza epidemic
beware malware-laden emails offering covid- information, us secret service warns
cybercrime and shifts in opportunities during covid- : a preliminary analysis in the uk
ai-enabled future crime
an increasing risk of family violence during the covid- pandemic: strengthening community collaborations to save lives
covid- : a public health approach to manage domestic violence is needed
using social media for actionable disease surveillance and outbreak management: a systematic literature review
tracking social media discourse about the covid- pandemic: development of a public coronavirus twitter data set
a microblogging-based approach to terrorism informatics: exploration and chronicling civilian sentiment and response to terrorism events via twitter
pandemics in the age of twitter: content analysis of tweets during the h n outbreak
fbi says cybercrime reports quadrupled during covid- pandemic
the covid- social media infodemic. arxiv preprint
a coefficient of agreement for nominal scales
social change and crime rate trends: a routine activity approach
what are the police doing on twitter? social media, the police and the public
social media adoption in the police: barriers and strategies
social media and the police: tweeting practices of british police forces during the
tracking twitter for epidemic intelligence: case study: ehec/hus outbreak in germany
an interactive web-based dashboard to track covid- in real time. the lancet infectious diseases
crime prevention, security and community safety using the 5is framework
a systematic review protocol for crime trends facilitated by synthetic biology
routine activity effects of the covid- pandemic on burglary in detroit
an analysis of uk policing engagement via social media
police communications and social media
a meta-analysis of research on protection motivation theory
community under stress: trust, reciprocity, and community collective efficacy during the sars outbreak
adolescents' cognitive appraisals of cigarette smoking: an application of the protection motivation theory
does twitter increase perceived police legitimacy?
twitter data based prediction model for influenza epidemic
have you been a victim of covid- -related cyber incidents? survey, taxonomy, and mitigation strategies
crime and coronavirus: social distancing, lockdown, and the mobility elasticity of crime
inside the nudge unit: how small changes can make a big difference
cybercrime in america amid covid- : the initial results from a natural experiment
twitter for city police department information sharing
predicting popular messages in twitter
social media and disasters: a functional framework for social media use in disaster planning, response, and research
extracting information nuggets from disaster-related messages in social media
rtweet: collecting twitter data. r package version
covid- , protection motivation theory and social distancing: the inefficiency of coronavirus warnings in the uk and spain (spanish network of early career researchers in criminology)
#swineflu: the use of twitter as an early warning and risk communication tool in the swine flu pandemic
twitter sentiment analysis: the good the bad and the omg!
tweettracker: an analysis tool for humanitarian and disaster relief
identity and security: protecting businesses against cyber threats during covid- and beyond
policing and media: public relations, simulations and communications
police departments' use of facebook: patterns and policy issues
a polarity analysis framework for twitter messages
the analysis of zero-inflated count data: beyond zero-inflated poisson regression
a demographic analysis of online sentiment during hurricane irene
third party policing
from the help desk: hurdle models
twitter influence on uk vaccination and antiviral uptake during the h n pandemic
social media strategies: understanding the differences between north american police departments
the behaviour change wheel: a new method for characterising and designing behaviour change interventions
hate is in the air! but where? introducing an algorithm to detect hate speech in digital microenvironments
cyber-attacks up % over past month as #covid bites. infosecurity magazine
a multi-level influence model of covid- themed cybercrime
twitter as a corpus for sentiment analysis and opinion mining
responding to the 'shadow pandemic': practitioner views on the nature of and responses to violence against women in victoria, australia during the covid- restrictions
staying home, staying safe? a short-term analysis of covid- on dallas domestic violence
using prediction markets and twitter to predict a swine flu pandemic
a protection motivation theory of fear appeals and attitude change
handbook of health behaviour research: personal and social determinants
super controllers and crime prevention: a routine activity explanation of crime prevention success and failure
the use of twitter to track levels of disease activity and public concern in the us during the influenza a h n pandemic
towards real-time measurement of public epidemic awareness: monitoring influenza awareness through twitter
epidemic intelligence: for the crowd, by the crowd
basics of qualitative research techniques. thousand oaks
want to be retweeted? large scale analytics on factors impacting retweet in twitter network
the future of crime: how crime exponentiation will change our field
family violence and covid- : increased vulnerability and reduced options for support
police message diffusion on twitter: analysing the reach of social media communications
motivating is security compliance: insights from habit and protection motivation theory
social media and border security: twitter use by migration policing agencies
what works in crime prevention and rehabilitation: lessons from systematic reviews
predicting information spreading in twitter
regression models for count data in r
we thank the anonymous reviewers for their helpful comments. the authors declare that they have no competing interests.
appendix a: hanging rootograms were generated using the rootogram() command in the r countreg library. the poisson hurdle model dramatically underpredicts low counts and overpredicts those in a middle range; for the negative binomial hurdle model, the fit is clearly much better.
key: cord- - ah w o authors: sakurai, mihoko; adu-gyamfi, bismark title: disaster-resilient communication ecosystem in an inclusive society – a case of foreigners in japan date: - - journal: int j disaster risk reduct doi: . /j.ijdrr. . sha: doc_id: cord_uid: ah w o
the number of foreign residents and tourists in japan has been increasing dramatically in recent years. despite the fact that japan is prone to natural disasters, with climate-related events such as record rainfalls, floods and mudslides turning into emergencies almost every year, non-japanese communication infrastructure and everyday disaster drills for foreigners have received little attention. this study aims to understand how a resilient communication ecosystem forms in various disaster contexts involving foreigners. within a framework of information ecology, we review the communication ecosystem in the literature and outline its structure and trends in social media use. our empirical case study uses the twitter api and the r programming software to extract and analyze tweets in english during typhoon hagibis in october 2019. it reveals that many information sources transmit warnings and evacuation orders through social media but do not convey a sense of locality or precise instructions on how to act.
for future disaster preparedness, we argue that the municipal government, as a responsible agent, should (1) make instructional information available in foreign languages on social media, (2) transfer such information through collaboration with transmitters, and (3) examine the use of local hashtags in social media to strengthen non-japanese speakers' capacity to adapt. the geographic characteristics of japan make the country highly vulnerable to disasters such as earthquakes, typhoons, volcanic eruptions, flash floods and landslides [ ]. therefore, disaster risk reduction and resilience measures are enshrined in the everyday activities of the japanese people, including in school curricula, building regulations and design, and corporate organization setup [ ]. disaster risk reduction drills often take place in schools, workplaces and homes, and detailed evacuation plans and procedures are published in all local communities [ ]. however, these efforts at building residents' capacity to face emergencies may fall short in terms of coverage and results due to the influx of foreigners who may neither be enrolled in schools nor engaged with japanese establishments, or who are staying only for a short period of time [ ]. the population of foreign residents in japan has been increasing exponentially in recent years according to a report by mizuho research institute [ ], with their number reaching a record high of . million people by january , . the report reveals an increase of , residents from january to december . additionally, the japan national tourism organization reports that the number of foreign tourists visiting japan is rapidly increasing. it recorded million visitors in , four times as many as in , and six times the number visiting in (https://www.jnto.go.jp/jpn/statistics/visitor_trends/, last accessed march , ). this trend of increased foreign tourists may partly stem from the influence of numerous sporting and other events across the country, including the formula one grand prix, the fivb volleyball women's championship and the japan tennis open [ ]. the tokyo olympic and paralympic games, which have been postponed to 2021, as well as the push by government stakeholders to amend bills relaxing immigration laws, can also be expected to boost the number of foreign residents and tourists in the country [ , ]. even though some foreigners become permanent residents, qualification for permanent residence status requires a ten-year continuous stay in japan, including five years of work, but does not necessarily require the acquisition of japanese language skills [ ], although language ability may matter at some points. living in japan briefly or as a permanent resident does not necessarily mean that a person speaks japanese or is accustomed to disaster risk reduction culture and procedures in japan. although the composition and dynamics of the japanese population are gradually changing in terms of the non-japanese population, existing infrastructure and systems that support non-japanese speakers and foreigners as a whole during disasters seem to be inadequate or lacking. the current japanese disaster management system is composed of both vertical and horizontal coordination mechanisms and, depending on the scale of the disaster, activities flow vertically from the national level through the prefectural to the municipal level before reaching the communities themselves [ ].
the municipal disaster management council, together with other parties, is closest to disaster victims and responsible for municipal disaster management plans and information [ ]. when necessary, the municipality is also responsible for issuing evacuation orders to residents during disasters [ ]. again, when it comes to emergency alerts or risk information, there is an existing mechanism whereby warning messages proceed from the national government (the cabinet, japan meteorological agency (jma)) to local administrations (prefectures, municipalities) and from there to the people. messages may also be transmitted directly top-down through base stations and received on cell phones. however, reports by news outlets suggest that language barriers, coupled with the inexperience of many foreigners with regard to japanese disaster procedure protocols, create a huge sense of panic and confusion during disasters [ , ]. a lack of appropriate risk information and of procedural evacuation or alert actions creates this confusion among foreigners. therefore, most foreigners access news outlets in their respective countries or rely on social media for disaster risk particulars, alert instructions, or evacuation information [ ]. the use of social media is seen as a significant trend in accessing swift, precise and easy feedback information in critical situations [ ][ ][ ]. social media has become the most sourced avenue for information during disasters [ ]. this applies to both japanese and non-japanese users. irrespective of the chaotic nature of, and confusion arising from, limited risk information accessibility and delivery, a certain level of resilience is achieved through instinctive user responses. thus, in extreme events, a spontaneous system of resilience is often formed within the context of all the actors involved in the situation [ ]. this process usually develops because the participating actors share a common interest or find themselves in a situation that requires urgent solutions. the characteristics of this phenomenon create a state of ecosystem which becomes unique to the affected area and to the nature of the event; it generates interactions and communication procedures or methods to reduce vulnerabilities to the disaster [ , ]. the interactions within this ecosystem may be formal or informal, depending on the situation and the different communication structures, modes and tools employed. the high frequency of disaster occurrences in japan, in conjunction with a booming foreign population up to covid- and an anticipated, fast-rising population again thereafter, provides an ideal case for trying to understand how this ecosystem is established within the context of resilience and communication. resilience in this study is defined as a system's ability to absorb disturbances [ ]. to analyze this ability, we define the system in terms of the information ecology framework. we regard disaster resilience in the information ecology framework as encompassing the efforts of collaboration and the communication dependencies that exist amongst stakeholders engaged in the situation within a local context. we want to investigate how foreigners in japan obtain disaster-related information while facing a language barrier and inexperience with disaster management procedures. this will guide us in understanding the characteristics of the communication ecosystem for a foreign population.
this paper is divided into the following parts: first, we introduce the information ecology framework as the theoretical premise of a resilient communication ecosystem. a literature review gives us insights into current studies on the topic and their deployment. a key question is how a resilient communication ecosystem is formed in different disaster contexts. to better understand this question, we seek cases in the literature which highlight disaster resilience through collaboration, communication and stakeholder participation. cases with such attributes are selected, reviewed and discussed from the viewpoint of information ecology. following the case review, we use the twitter api (application programming interface) and the r programming language to collect and analyze english-language tweets shared on the twitter platform during typhoon hagibis, which hit japan in 2019. this gives us an empirical understanding of a resilient communication ecosystem. it is assumed that information shared on twitter in english is meant for consumption by non-japanese. again, media coverage of the typhoon was also monitored to serve as supplementary information for our analysis. based on insights from the literature review and findings from the tweet analysis, we describe a structure which guides our understanding of a communication system applicable to foreigners during a disaster. we conclude the paper with observed limitations and future research directions. information ecology is a framework defined as "a system of people, practices, values, and technologies in a particular local environment" [ ]. previous research uses the notion of ecosystem to describe the nature of resilience in a societal context [ ]. the information ecology framework, which is an extended notion of the ecosystem, helps us to examine the capabilities of each element within the system. the framework contains five components: system, diversity, coevolution, keystone species and locality. table summarizes the elements of information ecology. not shown but also included is the perspective of a technology role, which focuses on human activities that are enabled by a given technology implementation. keystone species: the presence of keystone species is crucial to the survival of the ecology itself, e.g., skilled people/groups whose presence is necessary to support the effective use of technology. locality: local settings or attributes that give people the meaning of the ecology.
coevolution is a driving force for forming the resilient ecosystem in times of disaster and is produced by collaboration of systems with j o u r n a l p r e -p r o o f diverse species that provide local context or knowledge [ ] . to this we need to add the role of technology in helping human activities and support the formation of coevolution. we notice a similar framework exists as communication ecology [ ] . it refers to individual or socio-demographic group connections that strengthen neighborhood communication infrastructure. it helps identify communication patterns of local communities or groups of people. in a disaster context, communication ecology includes community organizations, local media and disaster-specific communication sources [ ] . such communication fosters community resilience [ ] , which can be described as the ability of a neighborhood or geographically defined area to collectively optimize their connected interactions for their benefit [ ] . collective ability is required to deal with stressors and resume daily life through cooperation, following unforeseen shocks. hence, community resilience and its collective ability aim to empower individuals in a given community [ ] . resilience is enhanced through economic development, social capital, information and communication, and community competence [ ] . provision of or ensuring access to vital information through proper communication systems is essential to strengthen community capacity towards the unexpected. in this regard, a resilient communication ecosystem that this paper tries to get an understanding of, encompasses the dynamically evolving processes that empower the collective ability of information gathering and provision, as well as interactions and collective communication structures among individuals, communities, and local organizations. a literature review in the following section will help us extract the essential elements of a resilient communication ecosystem based on information ecology. this study follows a systematic mapping approach [ ] as it has the flexibility, robustness and ability to chronologically categorize and classify related materials in research contexts. the process was based on papers which applied the same approach, i.e., [ ] [ ] [ ] . we set the following research question (rq); rq: how is a resilient communication ecosystem formed in time of a disaster? we want to understand how resilience mechanisms are able to evolve spontaneously during disasters through communication, collaboration and the roles of different j o u r n a l p r e -p r o o f stakeholders affected by them. to accomplish this, this study categorized the process into three main activities [ ] . they are ( ) search for relevant publications, ( ) definition of a classification scheme, and ( ) review of relevant journal papers. the search for relevant publications was undertaken in two steps. first, an online database search was conducted with the keywords "collaboration in disaster", "communication in disaster" and "institutional role in disaster" across a number of journal databases such as sciencedirect (elsevier), springerlink, and emerald this procedure resulted in the selection of papers from all the sampled papers. the classification used in this research was to read, clearly identify and arrange the contents of the potential papers to extract the following content; . research titles, . source (type of journal database), . research question of the article . purpose (which keyword is dominant), . type of disaster, . 
country or region of disaster, . identified stakeholders, . communication tools used, and . communication structure of the case. this stage identified the contents relevant to this study. the communication ecosystem found to be prevalent within the reviewed cases highlights the presence of institutions, individuals and other actors who become involved in disaster events and perform various tasks or activities aimed at reducing either their own risk or that of the stakeholders. in most cases, a system evolved consisting of actors receiving and sharing information when a disaster or crisis took place [ , ] . information ecology promotes sharing and learning; particularly about the use of new technologies, and to reduce given levels of confusion, frustration and ambiguities [ ] . in this review, social media emerges as a new trend in technology and rather becomes the medium for sharing information with the aim to reduce anxiety about a disaster situation that could negatively affect the people involved [ ] . actors include government agencies, non-government agencies, and other actors who evolve to become part of the resilient structure of the framework, matching the nature of each situation. actors can be clearly distinguished based on the roles they play within the system. they include ( ) remote actors, ( ) responsible agents and ( ) transmitters, and ( ) austria [ ] . the world health organization also supports local communities by giving constant guidance and recommendations as a remote actor during ebola outbreak in west africa [ ] . responsible agents have the highest authority and are responsible for issuing needed guidance for people to stay out of risk zones, advising rescue efforts, and providing information on how to reduce risk [ ] . they are the national agencies such as at the level of ministry or governmental organization [ ] , including security agencies and fire brigades [ ] . and information flow between them and other parties [ ] . keystone species identified in the described disasters represent the group or actors whose presence is crucial for all resilience activities. although they are usually described as highly skilled actors, evidence from the reviewed cases also proves that their activities would not be successful without the cooperation and support from other entities [ ] . for instance, staff of the oslo university hospital can be described as "keystone species" in the case of terrorist attacks in norway [ ] . however, actors such as blood donors to the hospital and patients, news agencies who covered all events, individual volunteers who help spread "blood request" alerts by the hospital on social j o u r n a l p r e -p r o o f media, as well as others who sent time situational information on social media to update others can be said to be the auxiliary actors who complemented the "keystone species" effort. they give rise to the notion that resilience efforts are often created by a system of actors or groups who are connected through information sharing and play specific roles in a cohesive and coordinated manner during events or disasters [ ] . furthermore, the key to all these systems are the space or context in which their actions take place [ , ] . in information ecology, it is the "locality" or the setting that initiates all processes. this gives the specification of the context and guides the actions that follow. 
in some cases, factors such as the geographical location, the messaging or information content, the characteristics of actors, the medium of information delivery, the duration of the event, and the actions taken give the "local identity" of the system [ , , ]. during the ebola crisis, this meant the names of the affected countries [ ], while for earthquakes and disasters such as fires it refers to their exact locations. for the purpose of this study, locality refers to a system of individuals, agencies and local communities that refer to local information and the actions to be taken in order to reduce risk within a given area. table summarizes insights from our case review which illustrate the structure of a resilient communication ecosystem. as stated previously, the japanese governance system is composed of three tiers: national, prefectural, and municipal. among these, municipal government is closest to people at large and is therefore in charge of issuing evacuation orders, opening and operating evacuation centers, and managing relief efforts [ ]. when an evacuation order is issued by a local municipality, residents who live in the specified zones are supposed to evacuate. data collection was done using the twitter api in the r package to scout tweets in the english language. to extract the tweets, the following steps were used:
a. registration with the twitter api to secure api credentials.
b. integration of the twitteR cran package into the r statistical environment.
c. searching for tweets. this was conducted by searching the hashtags "#hagibis" and "#typhoon_19". to restrict results to english-language tweets only, the code "en" was added. typhoon hagibis made landfall on october 12, 2019; therefore, the tweet search was restricted to tweets made on that day. the full search call was: hagibis <- searchTwitter("#hagibis OR #typhoon_19", n = 5000, lang = "en", since = "2019-10-12", until = "2019-10-13"). five thousand tweets were collected from this search.
d. export and further analysis: the results were exported to microsoft excel for further analysis.
data analysis was based on the framework of a resilient communication ecosystem. the first aim was to find keystone species supporting the communication ecosystem. therefore, we focused on the number of tweets and on twitter accounts with a high number of retweets. from the retweet number analysis, we established the formation of coevolution within the ecosystem. we also aimed to explore the kind of information content exchanged in the ecosystem, and hence conducted thematic analysis [ ], which allowed us to apply the key elements of a communication ecosystem. it shows how the communication ecosystem for the english language was formed when the typhoon hit japan. in order to extract a sense of locality, we picked out the names of regions or cities and searched for local information, such as evacuation orders or emergency alerts, within a given tweet. diversity in this case reflects the number of tweets and the different account ids that posted or retweeted with #hagibis or #typhoon_19 during the sampled time frame.
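the collection step can be reproduced with the twitteR package; the sketch below is a minimal, hedged reconstruction in which the api credential strings are placeholders, the query assumes the second hashtag was #typhoon_19 (hagibis was typhoon no. 19 of 2019; the digits are elided in the source), and the output file name is hypothetical:

```r
# minimal sketch of the collection pipeline described in steps a-d
# credential strings are placeholders; replace with real api keys
library(twitteR)

setup_twitter_oauth(consumer_key    = "YOUR_CONSUMER_KEY",
                    consumer_secret = "YOUR_CONSUMER_SECRET",
                    access_token    = "YOUR_ACCESS_TOKEN",
                    access_secret   = "YOUR_ACCESS_SECRET")

# english-language tweets posted on the day of landfall (12 october 2019)
hagibis <- searchTwitter("#hagibis OR #typhoon_19",
                         n     = 5000,
                         lang  = "en",
                         since = "2019-10-12",
                         until = "2019-10-13")

# convert the list of status objects to a data frame and export for
# further analysis (the paper used microsoft excel; csv opens there too)
hagibis_df <- twListToDF(hagibis)
write.csv(hagibis_df, "hagibis_tweets.csv", row.names = FALSE)
```

note that twitteR passes a single query string to the search endpoint, so the two hashtags are combined with an OR operator rather than supplied as separate arguments.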
this was followed by tweets created through a twitter handle name @earthuncuttv; which is leading online portal, documenting extreme natural events or disasters in the asia pacific region and around the world. around half the tweets within the original tweets were made by individuals. this was determined based on their handle ids. the total aggregated number of retweets we found from our time frame was , . amongst them, % were retweets which originated from @nhkworld_news ( figure ). @earthuncuttv covered % of total retweets while % came from other origins. the results show that within our sample framework, @nhkworld_news is the most dominant tweeter account. the ranking of retweets exceeding one hundred per source is shown in figure . the most retweeted tweet recorded , retweets. the original tweet was posted by @nhkworld_news saying "hagibis is expected to bring violent winds to some areas. a maximum wind speed of kilometers for the kanto area, and kilometers for the tohoku region." the second most retweeted one was posted by an individual account saying "ghost town //shibuya at pm before a typhoon" with a photo taken in shibuya city. the top ten retweeted tweets are shown in table . names in italics stand for region or city name. table table table table the th and th tweets mentioned evacuation orders issued by local municipalities while other tweets shared the event as it happened in different places around japan, or they reported on its effects. we found only three among original tweets that captured the emergency alert issued by jma (figure ). those three were tweeted by individuals and generated only a few retweets. the emergency alert was described in japanese which is translated as "special alert at level for heavy rain has been issued to tokyo area. level alert means the highest level of warning. take action for saving your life. some areas have been newly added under the special alert. check tv, radio, or local municipalities' information," as of october , : pm. the alert requested people to refer to information from local municipalities but all of it was in japanese. our constant monitoring of the event on the internet also reveals that jma was giving constant updates of the real-time weather status and alerts on its website. this information was provided in english. however, most of it did not appear in our sample frame as it did not utilize twitter in english. as the japanese disaster management system also grants direct access of emergency information from jma to the population, an emergency alert was sent to people's cell phones from jma, but this was written in japanese ( figure ) . hence, people who do not speak japanese may not have understood this, and the real time information could only be solicited by a few who originally knew about such a website or the information provided. on the other hand, much evacuation information is issued by municipal government. we found five tweets mentioning evacuation orders. two of those were the j o u r n a l p r e -p r o o f @nhkworld_news (italics refers to location), as follows. as to take [ ] . for example, a dutch man, who had been living in japan for a half year, received an emergency message but he could not find information where to evacuate. another american woman was supported by her friend in translating disaster information. emergency situations generate exceptional communication features which contravene the norm [ ] . therefore, a communication ecosystem forms and breaks up every time a disaster occurs. 
we observed dynamic characteristics of crisis communication structure with the involvement of various organizations [ ] and tools [ ] . the tools greatly hinge on social media which have become essential for digital crisis communication [ ] . however, more research is required to better capture the nature of social media use in crisis communication [ ] . several studies reported that j o u r n a l p r e -p r o o f local governments especially in thailand and the u.s. are reluctant to utilize social media in disaster situations [ , ] . they prefer to use traditional media such as tv and radio though those mass media tend to allow only one-way communication during a crisis [ ] . as our study shows, information through mass media becomes limited or challenged to meet specific needs due to its broader audience base [ ] compared to the multiple-interaction communication style that characterizes social media [ ] . government, as a responsible agency, needs to understand the characteristics of mass media and social media communication and create an appropriate communication strategy or collaboration scheme when preparing for future catastrophes. in the recent covid- crisis, the need for more targeted health information within a community and the importance of strong partnerships across authorities and trusted organizations are being discussed [ ] . we hold that more consideration should be paid to the provision of targeted information to those who do not share a common context or background of disaster preparedness. on daily-basis practices in japan, almost all local municipal governments prepare a so-called "hazard map" and distribute it to residents. ordinarily, municipalities only issue evacuation orders and do not provide residents with specific information where to evacuate to. from a regular everyday disaster drill, trained residents are expected to know where the evacuation centers are in their area. as people interpret new information based on previously acquired knowledge [ ] , a responsible agent should also consider how to provide information to people who haven't had any disaster training. beside the feature of interactive communication characterizing social media, the pull of multiple information sources and agents creates a sense of assurance and security in coping with or adjusting to events. this was experienced in the haiti earthquake, when social media contributed swiftly to creating a common operating picture of the situation through the collection of information from individuals rather than from a hierarchical communication structure [ ] . in addition, what is important in crisis communication is who distributes a message and how it is mediated in a population [ ] . our study shows that news sources who already seem to have gained social trust played a role as a keystone species in foreign language communication minorities prioritize personal information sources over the media [ ] . these cases suggest that social media promote effective resilience in communication, and that the delivery of information to foreigners in japan from different language backgrounds and cultures further creates traits where personal connection contributes to information accessibility choices. it is also true that social media-based communication raises situation awareness [ ] by receiving warning alerts as well as emotionally-oriented messages such as from friends and relatives [ ] . 
language barrier is likely to have been an obstacle that actually helped to shape the ecosystem identified in this study. during typhoon hagibis, responsible agents sent warning messages only in japanese language as shown in figure , so this could have been a major factor in harnessing the full potential of social media for disaster communication. results from this study reveal a lack of local knowledge or reference points across twitter communication, which shapes discussion within the ecosystem. for instance, in the case of the chennai flooding in india, tweets covered subjects such as road updates, shelters, family searches and emergency contact information [ ] , none of which were much referred to in our sampled tweets. while in the and philippine flooding, the most tweeted topic was prayer and rescue, followed by traffic updates and weather reports with more than half the tweets written in english [ ] . moreover, around % of those tweets were secondary information, i.e., retweets and information from other sources [ ] . our study recognizes the importance of secondary information. information exchange between agencies and affected residents and tourists assists in reducing risks such as, for example, during hurricane sandy in , where the top five twitter accounts with a high number of followers were storm related organizations posting relevant news and the status of the hurricane [ ] . as a remote actor becomes a source of secondary information, however, such information should guide people in what they should do next, rather than just point to the disaster situation. in this context, we note the importance of instructional messages whose content guides risk reduction behaviors [ ] . an example of such a message is "do not drive a car" which was tweeted by police during a snowstorm in lexington [ ] . it tells what specific life-saving action should be taken [ ] . generally, however, there appears to be little investigation of instructional information in crisis communication [ ] . the findings of this study imply that there could be two types of disaster information: a) risk information that refers to the potential effect of the disaster, i.e., emergency alert or warning, and b) action-oriented information that carries instructions for reducing the risk, i.e., an evacuation order and itinerary to be followed. risk information is published by remote actors while action-oriented information is provided by responsible agents. both types of information must contain local knowledge and allow easy access when people search for that information [ ] . in order to make sure people can reach instructional messages, it is not enough to just point to the information on organizational websites [ ] , as jma did during typhoon hagibis. instead, "localized hashtags" [ , , ] can support people to find life-saving information as quickly as possible. a previously cited study argues that social media can be useful when sending a request for help [ ] . "typhoon damage of nagano" is an example of a localized hashtag, assisting a fire brigade in nagano prefecture utilizing twitter to respond to calls for help from residents during typhoon hagibis . a localized hashtag can also be used as a point of reference where people can find relevant information [ ] . last accessed on july , , http://www.nhk.or.jp/politics/articles/lastweek/ .html. in future disaster communication, local information should be distributed using geographical information with instructions as to what to do next [ ] . 
responsible agencies should take this into account, while also analyzing which information was shared across social media in previous disaster cases [ ]. furthermore, cultural differences can also be considered, as western culture emphasizes individual action [ ] while asian cultures prioritize collective action or community commitment. as this may generate behavioral differences among japanese and foreign residents or tourists, further investigation will be beneficial. based on the information ecology framework and a literature review, the structure of a resilient communication ecosystem was proposed and verified through empirical data analysis. the resilient communication ecosystem is structured with heterogeneous actors who could be a driving force for collaboration or coevolution. the empirical case revealed that a media source might transmit warnings and evacuation orders through social media, but that such information does not contain points of reference. limited delivery of such information, particularly in the english language, results in confusion among the non-japanese communities who need it most. based on the study results and discussion, we suggest that in any disaster a form of ecosystem is spontaneously generated. municipalities, which are often the responsible agents, should ( ) produce instructional information in foreign languages on social media, ( ) transfer such information through collaboration with transmitters who have a strong base on social media and can assist in translating it to reach a wider audience, and ( ) examine, in association with the mass media and weather forecast agencies, the use of localized hashtags in social media communication for future disaster preparation. a localized hashtag brings various actors together, creates a sense of locality, and supplements individual efforts in risk reduction. our empirical data are limited to some extent, as we observed twitter for only one day, october , . we intended to extract tweets in the english language just before and after typhoon hagibis hit japan; hence, we chose the hashtags #hagibis and "#tyhoon_ " when crawling tweets. we are also aware that our data are based on only a single disaster case and require further data sets from other disaster events for corroboration. we may consider different characteristics of information and actors in the resilient communication ecosystem in future research. nevertheless, our findings provide interesting insights into disaster communication that may guide the direction of subsequent studies. as instructional information sometimes raises ethics and privacy issues [ ], rules for dealing with specific local or personal information should be taken into account. behavioral differences are also worth investigating because of increasing human diversity in japanese society. the case of japan reveals the importance and potential of communication as a main mechanism for offering information and response activities to promote resilience and reduce the risk of disasters. social media space serves as a major platform that brings numerous stakeholders together. this platform creates the main avenue for information sharing on events and offers feedback mechanisms for assessment and improvement. for future disaster preparedness, then, we may benefit from a closer understanding of the nature of a resilient communication ecosystem, its structure, and the knowledge that it allows us to share.
resiliency in tourism transportation: case studies of japanese railway companies preparing for the tokyo olympics
a study of the disaster management framework of japan
effect of tsunami drill experience on evacuation behavior after the onset of the great east japan earthquake
a framework for regional disaster risk reduction for foreign residents (written in japanese)
japan's foreign population hitting a record high
the tourism nation promotion basic plan
japan's immigration policies put to the test, in nippon
japan's immigration chief optimistic asylum and visa woes will improve in
immigration services agency of japan, guidelines for permission for permanent residence
sustaining life during the early stages of disaster relief with a frugal information system: learning from the great east japan earthquake. communications magazine, ieee
who would be willing to lend their public servants to disaster-impacted local governments? an empirical investigation into public attitudes in
disaster message not getting through to foreign residents
hokkaido quake reveals japan is woefully unprepared to help foreign tourists in times of disaster
media preference, information needs, and the language proficiency of foreigners in japan after the great east japan earthquake
social media communication during disease outbreaks: findings and recommendations, in social media use in crisis and risk communication, h. harald and b. klas, editors
social media use in crises and risks: an introduction to the collection, in social media use in crisis and risk communication, h. harald and b. klas, editors
social media, trust, and disaster: does trust in public and nonprofit organizations explain social media use during a disaster?
social media use during disasters: a review of the knowledge base and gaps. national consortium for the study of terrorism and responses to terrorism
building resilience through effective disaster management: an information ecology perspective
information ecologies: using technology with heart
resilience and stability of ecological systems
social-ecological resilience to coastal disasters
understanding communication ecologies to bridge communication research and community action
disaster communication ecology and community resilience perceptions following the central illinois tornadoes
the centrality of communication and media in fostering community resilience: a framework for assessment and intervention
building resilience: social capital in post-disaster recovery
community disaster resilience and the rural resilience index
community resilience as a metaphor, theory, set of capacities, and strategy for disaster readiness
guidelines for conducting systematic mapping studies in software engineering: an update. information and software technology
software product line testing - a systematic mapping study. information and software technology
systematic literature reviews in software engineering - a systematic literature review. information and software technology
microblogging during two natural hazards events: what twitter may contribute to situational awareness
an analysis of the norwegian twitter-sphere during and in the aftermath of the
information ecology of a university department
understanding social media data for disaster management. natural hazards
flows of water and information: reconstructing online communication during the european floods in austria
lessons from ebola affected communities: being prepared for future health crises
information and communication technology for disaster risk management in japan: how digital solutions are leveraged to increase resilience through improving early warnings and disaster information sharing
government-communities collaboration in disaster management activity: investigation in the current flood disaster management policy in thailand
twitter in the cross fire - the use of social media in the westgate mall terror attack in kenya
municipal government communications: the case of local government communications. strategic communications management
media coverage of the ebola virus disease: a content analytical study of the guardian and daily trust newspapers, in the power of the media in health communication
blood and security during the norway attacks: authorities' twitter activity and silence, in social media use in crisis and risk communication, h. harald and b. klas, editors
social media and disaster communication: a case study of cyclone winston
challenges and obstacles in sharing and coordinating information during multi-agency disaster response: propositions from field exercises. information systems frontiers
indigenous institutions and their role in disaster risk reduction and resilience: evidence from the tsunami in american samoa
emergent use of social media: a new age of opportunity for disaster resilience
the role of data and information exchanges in transport system disaster recovery: a new zealand case study
institutional vs. non-institutional use of social media during emergency response: a case of twitter in australian bush fire
social movements as information ecologies: exploring the coevolution of multiple internet technologies for activism
crowdsourced mapping in crisis zones: collaboration, organisation and impact
role of women as risk communicators to enhance disaster resilience of bandung, indonesia. natural hazards
providing real-time assistance in disaster relief by leveraging crowdsourcing power. personal and ubiquitous computing
digitally enabled disaster response: the emergence of social media as boundary objects in a flooding disaster
why we twitter: understanding microblogging usage and communities
social media usage patterns during natural hazards
securing communication channels in severe disaster situations - lessons from a japanese earthquake. in information systems for crisis response and management
using thematic analysis in psychology
emergency warnings and expat confusion in typhoon hagibis, in nhk
the design of a dynamic emergency response management. journal of information technology theory and application
social media in disaster risk reduction and crisis management
social media for knowledge-sharing: a systematic literature review
a community-based approach to sharing knowledge before, during, and after crisis events: a case study from thailand
thor visits lexington: exploration of the knowledge-sharing gap and risk management learning in social media during multiple winter storms
the role of media in crisis management: a case study of azarbayejan earthquake
social media and disasters: a functional framework for social media use in disaster planning, response, and research
using social and behavioural science to support covid- pandemic response
crisis communication, race, and natural disasters
emergency knowledge management and social media technologies: a case study of the haitian earthquake
exploring the use of social media during the flood in malaysia
understanding the efficiency of social media based crisis communication during hurricane sandy
the role of social media for collective behavior development in response to natural disasters
understanding the behavior of filipino twitter users during disaster
communicating on twitter during a disaster: an analysis of tweets during typhoon haiyan in the philippines
the instructional dynamic of risk and crisis communication: distinguishing instructional messages from dialogue. review of communication
conceptualizing crisis communication, in handbook of risk and crisis communication
social media and disaster management: case of the north and south kivu regions in the democratic republic of the congo
social media and crisis management: cerc, search strategies, and twitter content

this work was supported by jsps kakenhi grant number jp k .

key: cord- -e jb sex authors: fourcade, marion; johns, fleur title: loops, ladders and links: the recursivity of social and machine learning date: - - journal: theory soc doi: . /s - - -x sha: doc_id: cord_uid: e jb sex

machine learning algorithms reshape how people communicate, exchange, and associate; how institutions sort them and slot them into social positions; and how they experience life, down to the most ordinary and intimate aspects. in this article, we draw on examples from the field of social media to review the commonalities, interactions, and contradictions between the dispositions of people and those of machines as they learn from and make sense of each other.

a fundamental intuition of actor-network theory holds that what we call "the social" is assembled from heterogeneous collections of human and non-human "actants." this may include human-made physical objects (e.g., a seat belt), mathematical formulas (e.g., financial derivatives), or elements from the natural world-such as plants, microbes, or scallops (callon ; latour , ). in the words of bruno latour ( , p. ), sociology is nothing but the "tracing of associations." "tracing," however, is a rather capacious concept: socio-technical associations, including those involving non-human "actants," always crystallize in concrete places, structural positions, or social collectives. for instance, men are more likely to "associate" with video games than women (bulut ).
furthermore, since the connection between men and video games is known, men, women and institutions might develop strategies around, against, and through it. in other words, techno-social mediations are always both objective and subjective. they "exist … in things and in minds … outside and inside of agents" (wacquant , p. ). this is why people think, relate, and fight over them, with them, and through them. all of this makes digital technologies a particularly rich terrain for sociologists to study. what, we may wonder, is the glue that holds things together at the automated interface of online and offline lives? what kind of subjectivities and relations manifest on and around social network sites, for instance? and how do the specific mediations these sites rely upon-be it hardware, software, or human labor-concretely matter for the nature and shape of associations, including the most mundane? in this article, we are especially concerned with one particular kind of associative practice: a branch of artificial intelligence called machine learning. machine learning is ubiquitous on social media platforms and applications, where it is routinely deployed to automate, predict, and intervene in human and non-human behavior. generally speaking, machine learning refers to the practice of automating the discovery of rules and patterns from data, however dispersed and heterogeneous it may be, and drawing inferences from those patterns, without explicit programming. using examples drawn from social media, we seek to understand the kinds of social dispositions that machine learning techniques tend to elicit or reinforce; how these social dispositions, in turn, help to support machine learning implementations; and what kinds of social formations these interactions give rise to-all of these, indicatively rather than exhaustively.

according to pedro domingos's account, approaches to machine learning may be broken down into five "tribes." symbolists proceed through inverse deduction, starting with received premises or known facts and working backwards from those to identify rules that would allow those premises or facts to be inferred. the algorithm of choice for the symbolist is the decision tree. connectionists model machine learning on the brain, devising multilayered neural networks. their preferred algorithm is backpropagation, or the iterative adjustment of network parameters (initially set randomly) to try to bring that network's output closer and closer to a desired result (that is, towards satisfactory performance of an assigned task). evolutionaries canvass entire "populations" of hypotheses and devise computer programs to combine and swap these randomly, repeatedly assessing these combinations' "fitness" by comparing output to training data. their preferred kind of algorithm is the so-called genetic algorithm, designed to simulate the biological process of evolution. bayesians are concerned with navigating uncertainty, which they do through probabilistic inference. bayesian models start with an estimate of the probability of certain outcomes (or a series of such estimates comprising one or more hypothetical bayesian network(s)) and then update these estimates as they encounter and process more data. analogizers focus on recognizing similarities within data and inferring other similarities on that basis. two of their go-to algorithms are the nearest-neighbor classifier and the support vector machine. the first makes predictions about how to classify unseen data by finding labeled data most similar to that unseen data (pattern matching). the second classifies unseen data into sets by plotting the coordinates of available or observed data according to their similarity to one another and inferring a decision boundary that would enable their distinction.
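to give a concrete flavor of the analogizers' first go-to algorithm, the following is a minimal sketch of a one-nearest-neighbor classifier in python. the features, labels, and data points are invented purely for illustration and do not come from any platform or study discussed in this article.

```python
import math

# invented toy examples: (hours online per day, posts per week) -> label
training_data = [
    ((1.0, 2.0), "light user"),
    ((2.0, 1.0), "light user"),
    ((6.0, 20.0), "heavy user"),
    ((7.5, 25.0), "heavy user"),
]

def euclidean(a, b):
    """distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(query):
    """label unseen data with the label of the most similar seen example."""
    nearest = min(training_data, key=lambda example: euclidean(example[0], query))
    return nearest[1]

print(classify((6.5, 18.0)))  # -> "heavy user": pattern matching, not explicit rules
```

the design choice worth noticing is that no rule about what makes a "heavy user" is ever written down; the classification is borrowed, wholesale, from whichever accumulated example happens to sit closest.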
our arguments are fourfold. in the first two sections below, we argue that the accretive effects of social and machine learning are fostering an ever-more-prevalent hunger for data, and searching dispositions responsive to this hunger-"loops" in this paper's title. we then show how interactions between those so disposed and machine learning systems are producing new orders of stratification and association, or "ladders" and "links," and new stakes in the struggle in and around these orders. the penultimate section contends that such interactions, through social and mechanical infrastructures of machine learning, tend to engineer competition and psycho-social and economic dependencies conducive to ever more intensive data production, and hence to the redoubling of machine-learned stratification. finally, before concluding, we argue that machine learning implementations are inclined, in many respects, towards the degradation of sociality. consequently, new implementations are being called upon to judge and test the kinds of solidaristic associations that machine-learned systems have themselves produced, as a sort of second-order learning process. our conclusion is a call to action: to renew, at the social and machine learning interface, fundamental questions of how to live and act together. the things that feel natural to us are not natural at all. they are the result of long processes of inculcation, exposure, and training that fall under the broad concept of "socialization" or "social learning." because the term "social learning" helps us better draw the parallel with "machine learning," we use it here to refer to the range of processes by which societies and their constituent elements (individuals, institutions, and so on) iteratively and interactively take on certain characteristics, and exhibit change-or not-over time. historically, the concept is perhaps most strongly associated with theories of how individuals, and specifically children, learn to feel, act, think, and relate to the world and to each other. theories of social learning and socialization have explained how people come to assume behaviors and attitudes in ways not well captured by a focus on internal motivation or conscious deliberation (miller and dollard ; bandura ; mauss ; elias ). empirical studies have explored, for instance, how children learn speech and social grammar through a combination of direct experience (trying things out and experiencing rewarding or punishing consequences) and modeling (observing and imitating others, especially primary associates) (gopnik ). berger and luckmann ( ), relying on the work of george herbert mead, discuss the learning process of socialization as one involving two stages: in the primary stage, children form a self by internalizing the attitudes of those others with whom they entertain an emotional relationship (typically their parents); in the secondary stage, persons-in-becoming learn to play appropriate roles in institutionalized subworlds, such as work or school.
as a system of dispositions that "generates meaningful practices and meaning-giving perceptions," habitus takes shape through at least two types of social learning: "early, imperceptible learning" (as in the family) and "scholastic...methodical learning" (within educational and other institutions) (bourdieu , pp. , ) . organizations and collective entities also learn. for instance, scholars have used the concept of social learning to understand how states, institutions, and communities (at various scales) acquire distinguishing characteristics and assemble what appear to be convictions-in-common. ludwik fleck ( fleck ( [ ) and later thomas kuhn ( ) famously argued that science normally works through adherence to common ways of thinking about and puzzling over problems. relying explicitly on kuhn, hall ( ) makes a similar argument about elites and experts being socialized into long lasting political and policy positions. collective socialization into policy paradigms is one of the main drivers of institutional path dependency, as it makes it difficult for people to imagine alternatives. for our purposes, social learning encapsulates all those social processes-material, institutional, embodied, and symbolic-through which particular ways of knowing, acting, and relating to one another as aggregate and individuated actants are encoded and reproduced, or by which "[e]ach society [gains and sustains] its own special habits" (mauss , p. ) . "learning" in this context implies much more than the acquisition of skills and knowledge. it extends to adoption through imitation, stylistic borrowing, riffing, meme-making, sampling, acculturation, identification, modeling, prioritization, valuation, and the propagation and practice of informal pedagogies of many kinds. understood in this way, "learning" does not hinge decisively upon the embodied capacities and needs of human individuals because those capacities and needs are only ever realized relationally or through "ecological interaction," including through interaction with machines (foster ) . it is not hard to see why digital domains, online interactions, and social media networks have become a privileged site of observation for such processes (e.g., greenhow and robelia ), all the more so since socialization there often starts in childhood. this suggests that (contra dreyfus ) social and machine learning must be analyzed as co-productive of, rather than antithetical to, one another. machine learning is, similarly, a catch-all term-one encompassing a range of ways of programming computers or computing systems to undertake certain tasks (and satisfy certain performance thresholds) without explicitly directing the machines in question how to do so. instead, machine learning is aimed at having computers learn (more or less autonomously) from preexisting data, including the data output from prior attempts to undertake the tasks in question, and devise their own ways of both tackling those tasks and iteratively improving at them (alpaydin ) . implementations of machine learning now span all areas of social and economic life. machine learning "has been turning up everywhere, driven by exponentially growing mountains of [digital] data" (domingos ) . in this article, we take social media as one domain in which machine learning has been widely implemented. we do so recognizing that not all data analysis in which social media platforms engage is automated, and that those aspects that are automated do not necessarily involve machine learning. 
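a minimal sketch, with invented data and parameters, of what "learning without explicit programming" can mean in practice: the program below is never told that y is roughly twice x; it merely adjusts a parameter, repeatedly, to reduce its error on preexisting data.

```python
# fit y = w * x to invented (x, y) pairs by gradient descent:
# start from an arbitrary parameter and iteratively improve it
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

w = 0.0              # arbitrary starting guess
learning_rate = 0.01

for _ in range(1000):
    # gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad   # nudge w to reduce the error

print(round(w, 2))  # ~2.04: a "rule" recovered from data, not programmed in
```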
two points are important for our purposes: most "machines" must be trained, cleaned, and tested by humans in order to "learn." in implementations of machine learning on social media platforms, for instance, humans are everywhere "in the loop"-an immense, poorly paid, and crowdsourced workforce that relentlessly labels, rates, and expunges the "content" to be consumed (gillespie ; gray and suri ). and yet, both supervised and unsupervised machines generate new patterns of interpretation, new ways of reading the social world and of intervening in it. any reference to machine learning throughout this article should be taken to encapsulate these "more-than-human" and "more-than-machine" qualities of machine learning.

cybernetic feedback, data hunger, and meaning accretion

analogies between human (or social) learning and machine-based learning are at least as old as artificial intelligence itself. the transdisciplinary search for common properties among physical systems, biological systems, and social systems, for instance, was an impetus for the macy foundation conferences on "circular causal and feedback mechanisms in biology and social systems" in the early days of cybernetics ( - ). in the analytical model developed by norbert wiener, "the concept of feedback provides the basis for the theoretical elimination of the frontier between the living and the non-living" (lafontaine , p. ). just as the knowing and feeling person is dynamically produced through communication and interactions with others, the ideal cybernetic system continuously enriches itself from the reactions it causes. in both cases, life is irrelevant: what matters, for both living and inanimate objects, is that information circulates in an ever-renewed loop. put another way, information/computation are "substrate independent" (tegmark , p. ). wiener's ambitions (and even more, the exaggerated claims of his posthumanist descendants, see, e.g., kurzweil ) were immediately met with criticism. starting in the s, philosopher hubert dreyfus emerged as one of the main critics of the claim that artificial intelligence would ever approach its human equivalent. likening the field to "alchemy" ( ), he argued that machines would never be able to replicate the unconscious processes necessary for the understanding of context and the acquisition of tacit skills ( , )-the fact that, to quote michael polanyi ( ), "we know more than we can tell." in other words, machines cannot develop anything like the embodied intuition that characterizes humans. furthermore, machines are poorly equipped to deal with the fact that all human learning is cultural, that is, anchored not in individual psyches but in collective systems of meaning and in sedimented, relational histories (vygotsky ; bourdieu ; durkheim ; hasse ). is this starting to change today, when machines successfully recognize images, translate texts, answer the phone, and write news briefs? some social and computational scientists believe that we are on the verge of a real revolution, where machine learning tools will help decode tacit knowledge, make sense of cultural repertoires, and understand micro-dynamics at the individual level (foster ). our concern is not, however, with confirming or refuting predictive claims about what computation can and cannot do to advance scholars' understanding of social life. rather, we are interested in how social and computational learning already interact.
not only may social and machine learning usefully be compared, but they are reinforcing and shaping one another in practice. in those jurisdictions in which a large proportion of the population is interacting, communicating, and transacting ubiquitously online, social learning and machine learning share certain tendencies and dependencies. both practices rely upon and reinforce a pervasive appetite for digital input or feedback that we characterize as "data hunger." they also share a propensity to assemble insight and make meaning accretively-a propensity that we denote here as "world or meaning accretion." throughout this article, we probe the dynamic interaction of social and machine learning by drawing examples from one genre of online social contention and connection in which the pervasive influence of machine learning is evident: namely, that which occurs across social media channels and platforms. below we explain first how data hunger is fostered by both social and computing systems and techniques, and then how world or meaning accretion manifests in social and machine learning practices. these explanations set the stage for our subsequent discussion of how these interlocking dynamics operate to constitute and distribute power.

data hunger: searching as a natural attitude

as suggested earlier, the human person is the product of a long, dynamic, and never settled process of socialization. it is through this process of sustained exposure that the self (or the habitus, in pierre bourdieu's vocabulary) becomes adjusted to its specific social world. as bourdieu puts it, "when habitus encounters a social world of which it is the product, it is like a 'fish in water': it does not feel the weight of the water, and it takes the world about itself for granted" (bourdieu in wacquant , p. ). the socialized self is a constantly learning self. the richer the process-the more varied and intense the interactions-the more "information" about different parts of the social world will be internalized, and the more socially versatile-and socially effective, possibly-the outcome. (this is why, for instance, parents with means often seek to offer "all-round" training to their offspring (lareau ).) machine learning, like social learning, is data hungry. "learning" in this context entails a computing system acquiring capacity to generalize beyond the range of data with which it has been presented in the training phase. "learning" is therefore contingent upon continuous access to data-which, in the kinds of cases that preoccupy us, means continuous access to output from individuals, groups, and "bots" designed to mimic individuals and groups. at the outset, access to data in enough volume and variety must be ensured to enable a particular learner-model combination to attain desired accuracy and confidence levels. thereafter, data of even greater volume and variety is typically (though not universally) required if machine learning is to deliver continuous improvement, or at least maintain performance, on assigned tasks. the data hunger of machine learning interacts with that of social learning in important ways. engineers, particularly in the social media sector, have structured machine learning technologies not only to take advantage of vast quantities of behavioral traces that people leave behind when they interact with digital artefacts, but also to solicit more through playful or addictive designs and cybernetic feedback loops.
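the point about volume can be sketched minimally (all numbers invented): estimating even a single behavioral rate-here, a hypothetical 30% click propensity-is unreliable from a trickle of traces and stabilizes only as the volume of ingested data grows, which is one banal reason why learners stay hungry.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

def estimated_click_rate(n_traces, true_rate=0.30):
    """simulate n behavioral traces and estimate the underlying rate."""
    clicks = sum(1 for _ in range(n_traces) if random.random() < true_rate)
    return clicks / n_traces

for volume in (10, 100, 1_000, 10_000):
    print(volume, round(estimated_click_rate(volume), 3))
# small samples swing widely; the estimate settles near 0.30 only at volume
```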
the machine-learning self is not only encouraged to respond more, interact more, and volunteer more, but also primed to develop a new attitude toward the acquisition of information (andrejevic , p. ). with the world's knowledge at her fingertips, she understands that she must "do her own research" about everything-be it religion, politics, vaccines, or cooking. her responsibility as a citizen is not only to learn the collective norms, but also to know how to search and learn so as to make her own opinion "for herself," or figure out where she belongs, or gain new skills. the development of searching as a "natural attitude" (schutz ) is an eminently social process of course: it often means finding the right people to follow or emulate (pariser ) , using the right keywords so that the search process yields results consistent with expectations (tripodi ) , or implicitly soliciting feedback from others in the form of likes and comments. the social media user also must extend this searching disposition to her own person: through cybernetic feedback, algorithms habituate her to search for herself in the data. this involves looking reflexively at her own past behavior so as to inform her future behavior. surrounded by digital devices, some of which she owns, she internalizes the all-seeing eye and learns to watch herself and respond to algorithmic demands (brubaker ). data hunger transmutes into self-hunger: an imperative to be digitally discernible in order to be present as a subject. this, of course, exacts a kind of self-producing discipline that may be eerily familiar to those populations that have always been under heavy institutional surveillance, such as the poor, felons, migrants, racial minorities (browne ; benjamin ) , or the citizens of authoritarian countries. it may also be increasingly familiar to users of health or car insurance, people living in a smart home, or anyone being "tracked" by their employer or school by virtue of simply using institutionally licensed it infrastructure. but the productive nature of the process is not a simple extension of what michel foucault called "disciplinary power" nor of the self-governance characteristic of "governmentality." rather than simply adjusting herself to algorithmic demands, the user internalizes the injunction to produce herself through the machine-learning-driven process itself. in that sense the machine-learnable self is altogether different from the socially learning, self-surveilling, or self-improving self. the point for her is not simply to track herself so she can conform or become a better version of herself; it is, instead, about the productive reorganization of her own experience and self-understanding. as such, it is generative of a new sense of selfhood-a sense of discovering and crafting oneself through digital means that is quite different from the "analog" means of self-cultivation through training and introspection. when one is learning from a machine, and in the process making oneself learnable by it, mundane activities undergo a subtle redefinition. hydrating regularly or taking a stroll are not only imperatives to be followed or coerced into. their actual phenomenology morphs into the practice of feeding or assembling longitudinal databases and keeping track of one's performance: "step counting" and its counterparts (schüll ; adams ) . likewise, what makes friendships real and defines their true nature is what the machine sees: usually, frequency of online interaction. 
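snapchat's actual relationship scoring is proprietary, so the following is a deliberately naive sketch, with an invented event log, of the general mechanism just described: dyadic ties reduced to interaction counts and sorted from most to least "important."

```python
from collections import Counter

# invented log of whom a user interacted with, one entry per interaction
interaction_log = [
    "amira", "ben", "amira", "chen", "amira", "ben",
    "dana", "chen", "amira", "ben", "ben", "amira",
]

# friendship, as the machine sees it: a sortable frequency count
ranking = Counter(interaction_log).most_common()

for friend, count in ranking:
    print(friend, count)   # amira 5, ben 4, chen 2, dana 1
```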
for instance, snapchat has perfected the art of classifying-and ranking-relationships that way, so people are constantly presented with an ever-changing picture of their own dyadic connections, ranked from most to least important. no longer, contra foucault ( ) , is "permanent self-examination" crucial to self-crafting so much as attention to data-productive practices capable of making the self learnable and sustaining its searching process. to ensure one's learnability-and thereby one's selfhood-one must both feed and reproduce a hunger for data on and around the self. human learning is not only about constant, dynamic social exposure and world hunger, it is also about what we might call world or meaning accretion. the self is constantly both unsettled (by new experiences) and settling (as a result of past experiences). people take on well institutionalized social roles (berger and luckmann ) . they develop habits, styles, personalities-a "system of dispositions" in bourdieu's vocabulary-by which they become adjusted to their social world. this system is made accretively, through the conscious and unconscious sedimentation of social experiences and interactions that are specific to the individual, and variable in quality and form. accretion here refers to a process, like the incremental build-up of sediment on a riverbank, involving the gradual accumulation of additional layers or matter. even when change occurs rapidly and unexpectedly, the ongoing process of learning how to constitute and comport oneself and perform as a social agent requires one to grapple with and mobilize social legacies, social memory, and pre-established social norms (goffman ) . the habitus, bourdieu would say, is both structured and structuring, historical and generative. social processes of impression formation offer a good illustration of how social learning depends upon accreting data at volume, irrespective of the value of any particular datum. the popular insight that first impressions matter and tend to endure is broadly supported by research in social psychology and social cognition (uleman and kressel ) . it is clear that impressions are formed cumulatively and that early-acquired information tends to structure and inform the interpretation of information later acquired about persons and groups encountered in social life (hamilton and sherman ) . this has also been shown to be the case in online environments (marlow et al. ) . in other words, social impressions are constituted by the incremental build-up of a variegated mass of data. machine learning produces insight in a somewhat comparable way-that is, accretively. insofar as machine learning yields outputs that may be regarded as meaningful (which is often taken to mean "useful" for the task assigned), then that "meaning" is assembled through the accumulation of "experience" or from iterative exposure to available data in sufficient volume, whether in the form of a stream or in a succession of batches. machine learning, like social learning, never produces insight entirely ab initio or independently of preexisting data. to say that meaning is made accretively in machine learning is not to say that machine learning programs are inflexible or inattentive to the unpredictable; far from it. all machine learning provides for the handling of the unforeseen; indeed, capacity to extend from the known to the unknown is what qualifies machine learning as "learning." 
moreover, a number of techniques are available to make machine learning systems robust in the face of "unknown unknowns" (that is, rare events not manifest in training data). nonetheless, machine learning does entail giving far greater weight to experience than to the event. the more data that has been ingested by a machine learning system, the less revolutionary, reconfigurative force might be borne by any adventitious datum that it encounters. if, paraphrasing marx, one considers that people make their own history, but not in circumstances they choose for themselves, but rather in present circumstances given and inherited, then the social-machine learning interface emphasizes the preponderance of the "given and inherited" in present circumstances, far more than the potentiality for "mak[ing]" that may lie within them (marx [ ]). one example of the compound effect of social and automated meaning accretion in the exemplary setting to which we return throughout this article-social media-is the durability of negative reputation across interlocking platforms. for instance, people experience considerable difficulty in countering the effects of "revenge porn" online, reversing the harms of identity theft, or managing spoiled identities once they are digitally archived (lageson and maruna ). as langlois and slane have observed, "[w]hen somebody is publicly shamed online, that shaming becomes a live archive, stored on servers and circulating through information networks via search, instant messaging, sharing, liking, copying, and pasting" (langlois and slane ). in such settings, the data accretion upon which machine learning depends for the development of granular insights-and, on social media platforms, associated auctioning and targeting of advertising-compounds the cumulative, sedimentary effect of social data, making negative impressions generated by "revenge porn," or by one's online identity having been fraudulently coopted, hard to displace or renew. the truth value of later, positive data may be irrelevant if enough negative data has accumulated in the meantime. data hunger and the accretive making of meaning are two aspects of the embedded sociality of machine learning and of the "mechanical" dimensions of social learning. together, they suggest modes of social relation, conflict, and action that machine learning systems may nourish among people on whom those systems bear, knowingly or unknowingly. this has significant implications for social and economic inequality, as we explore below. what are the social consequences of machine learning's signature hunger for diverse, continuous, ever more detailed and "meaningful" data and the tendency of many automated systems to hoard historic data from which to learn? in this section, we discuss three observable consequences of data hunger and meaning accretion. we show how these establish certain non-negotiable preconditions for social inclusion; we highlight how they fuel the production of digitally-based forms of social stratification and association; and we specify some recurrent modes of relation fostered thereby. all three ordering effects entail the uneven distribution of power and resources and all three play a role in sustaining intersecting hierarchies of race, class, gender, and other modes of domination and axes of inequality.
machine learning's data appetite and the "digestive" or computational abilities that attend it are often sold as tools for the increased organizational efficiency, responsiveness, and inclusiveness of societies and social institutions. with the help of machine learning, the argument goes, governments and non-governmental organizations develop an ability to render visible and classify populations that are traditionally unseen by standard data infrastructures. moreover, those who have historically been seen may be seen at a greater resolution, or in a more finely grained, timely, and difference-attentive way. among international organizations, too, there is much hope that enhanced learning along these lines might result from the further utilization of machine learning capacities (johns ). for instance, machine learning, deployed in fingerprint, iris, or facial recognition, or to nourish sophisticated forms of online identification, is increasingly replacing older, document-based forms of identification (torpey )-and transforming the very concept of citizenship in the process (cheney-lippold ). whatever the pluses and minuses of "inclusiveness" in this mode, it entails a major infrastructural shift in the way that social learning takes place at the state and inter-state level, or how governments come to "know" their polities. governments around the world are exploring possibilities for gathering and analysing digital data algorithmically, to supplement-and eventually, perhaps, supersede-household surveys, telephone surveys, field site visits, and other traditional data collection methods. this devolves the process of assembling and representing a polity, and understanding its social and economic condition, down to agents outside the scope of public administration: commercial satellite operators (capturing satellite image data being used to assess a range of conditions, including agricultural yield and poverty), supermarkets (gathering scanner data, now widely used in cpi generation), and social media platforms. if official statistics (and associated data gathering infrastructures and labor forces) have been key to producing the modern polity, governmental embrace of machine learning capacities signals a change in ownership of that means of production. social media has become a key site for public and private parties-police departments, immigration agencies, schools, employers and insurers among others-to gather intelligence about the social networks of individuals, their health habits, their propensity to take risks or the danger they might represent to the public, to an organization's bottom line or to its reputation (trottier ; omand ; bousquet ; amoore ; stark ). informational and power asymmetries characteristic of these institutions are often intensified in the process. this is notwithstanding the fact that automated systems' effects may be tempered by manual work-arounds and other modes of resistance within bureaucracies, such as the practices of frontline welfare workers intervening in automated systems in the interests of their clients, and strategies of foot-dragging and data obfuscation by legal professionals confronting predictive technologies in criminal justice (raso ; brayne and christin ). the deployment of machine learning to the ends outlined in the foregoing paragraph furthers the centrality of data hungry social media platforms to the distribution of all sorts of economic and social opportunities and scarce public resources.
at every scale, machine-learning-powered corporations are becoming indispensable mediators of relations between the governing and the governed (a transition process sharply accelerated by the covid- pandemic). this invests them with power of a specific sort: the power of "translating the images and concerns of one world into that of another, and then disciplining or maintaining that translation in order to stabilize a powerful network" and their own influential position within it (star , p. ) . the "powerful network" in question is society, but it is heterogeneous, comprising living and non-living, automated and organic elements: a composite to which we can give the name "society" only with impropriety (that is, without adherence to conventional, anthropocentric understandings of the term). for all practical purposes, much of social life already is digital. this insertion of new translators, or repositioning of old translators, within the circuits of society is an important socio-economic transformation in its own right. and the social consequences of this new "inclusion" are uneven in ways commonly conceived in terms of bias, but not well captured by that term. socially disadvantaged populations are most at risk of being surveilled in this way and profiled into new kinds of "measurable types" (cheney-lippold ). in addition, social media user samples are known to be non-representative, which might further unbalance the burden of surveillant attention. (twitter users, for instance, are skewed towards young, urban, minority individuals (murthy et al. ).) consequently, satisfaction of data hunger and practices of automated meaning accretion may come at the cost of increased social distrust, fostering strategies of posturing, evasion, and resistance among those targeted by such practices. these reactions, in turn, may undermine the capacity of state agents to tap into social data-gathering practices, further compounding existing power and information asymmetries (harkin ) . for instance, sarah brayne ( ) finds that government surveillance via social media and other means encourages marginalized communities to engage in "system avoidance," jeopardizing their access to valuable social services in the process. finally, people accustomed to being surveilled will not hesitate to instrumentalize social media to reverse monitor their relationships with surveilling institutions, for instance by taping public interactions with police officers or with social workers and sharing them online (byrne et al. ) . while this kind of resistance might further draw a wedge between vulnerable populations and those formally in charge of assisting and protecting them, it has also become a powerful aspect of grassroots mobilization in and around machine learning and techno-social approaches to institutional reform (benjamin ) . in all the foregoing settings, aspirations for greater inclusiveness, timeliness, and accuracy of data representation-upon which machine learning is predicated and which underlie its data hunger-produce newly actionable social divisions. the remainder of this article analyzes some recurrent types of social division that machine learning generates, and types of social action and experience elicited thereby. there is, of course, no society without ordering-and no computing either. social order, like computing order, comes in many shapes and varieties but generally "the gap between computation and human problem solving may be much smaller than we think" (foster , p. ) . 
in what follows, we cut through the complexity of this social-computational interface by distinguishing between two main ideal types of classification: ordinal (organized by judgments of positionality, priority, probability or value along one particular dimension) and nominal (organized by judgments of difference and similarity) (fourcade ). social processes of ordinalization in the analog world might include exams, tests, or sports competitions: every level allows one to compete for the next level and be ranked accordingly. in the digital world, ordinal scoring might take the form of predictive analytics-which, in the case of social media, typically means the algorithmic optimization of online verification and visibility. by contrast, processes of nominalization include, in the analog world, various forms of homophily (the tendency of people to associate with others who are similar to them in various ways) and institutional sorting by category. translated for the digital world, these find an echo in clustering technologies-for instance a recommendation algorithm that works by finding the "nearest neighbors" whose taste is similar to one's own, or one that matches people based on some physical characteristic or career trajectory (a minimal sketch of this nearest-neighbor logic follows below). the difference between ordinal systems and nominal systems maps well onto the difference between bayesian and analogical approaches to machine learning, to reference pedro domingos's ( ) useful typology. it is, however, only at the output or interface stage that these socially ubiquitous machine learning orderings become accessible to experience. what does it mean, and what does it feel like, to live in a society that is regulated through machine learning systems-or rather, where machine learning systems are interacting productively with social ordering systems of an ordinal and nominal kind? in this section, we identify some new, or newly manifest, drivers of social structure that emerge in machine learning-dominated environments. let us begin with the ordinal effects of these technologies (remembering that machine learning systems comprise human as well as non-human elements). as machine learning systems become more universal, the benefits of inclusion now depend less on access itself, and more on one's performance within each system and according to its rules. for instance, visibility on social media depends on "engagement," or how important each individual is to the activity of the platform. if one does not post frequently and consistently, comment or message others on facebook or instagram, or if others do not interact with one's posts, one's visibility to them diminishes quickly. if one is not active on the dating app tinder, one cannot expect one's profile to be shown to prospective suitors. similarly, uber drivers and riders rank one another on punctuality, friendliness, and the like, but uber (the company) ranks both drivers and riders on their behavior within the system, from canceling too many rides to failing to provide feedback. uber egypt states on its website: "the rating system is designed to give mutual feedback. if you never rate your drivers, you may see your own rating fall." even for those willing to incur the social costs of disengagement, opting out of machine learning may not be an option.
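the "nearest neighbors" recommendation logic invoked above can be sketched minimally as user-to-user matching (all user names and ratings invented): find the most similar other user and suggest what they liked.

```python
# invented taste profiles: user -> {item: rating on a 1-5 scale}
ratings = {
    "ana": {"item_a": 5, "item_b": 4, "item_c": 1},
    "bo":  {"item_a": 4, "item_b": 5, "item_d": 5},
    "cai": {"item_c": 5, "item_d": 1, "item_e": 4},
}

def similarity(u, v):
    """crude taste agreement over co-rated items (higher = more alike)."""
    shared = set(ratings[u]) & set(ratings[v])
    return sum(4 - abs(ratings[u][i] - ratings[v][i]) for i in shared)

def recommend(user):
    """suggest items liked by the user's nearest taste neighbor."""
    neighbor = max((v for v in ratings if v != user),
                   key=lambda v: similarity(user, v))
    return [item for item, score in ratings[neighbor].items()
            if score >= 4 and item not in ratings[user]]

print(recommend("ana"))  # -> ['item_d'], inherited from neighbor "bo"
```

note the sociological upshot embedded in the design: whatever the nearest neighbor liked becomes what one is shown, which is precisely how such systems can consolidate homogeneous networks of taste.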
failure to respond to someone's tag, or to like their photo, or otherwise maintain data productivity, and one might be dropped from their network, consciously or unconsciously, a dangerous proposition in a world where self-worth has become closely associated with measures of network centrality or social influence. as bucher has observed, "abstaining from using a digital device for one week does not result in disconnection, or less data production, but more digital data points … to an algorithm, … absence provides important pieces of information" (bucher , p. ). engagement can also be forced on non-participants by the actions of other users-through tagging, rating, commenting, and endorsing, for instance (casemajor et al. ). note that none of this is a scandal or a gross misuse of the technology. on the contrary, this is what any system looking for efficiency and relevance is bound to look like. but any ordering system that acts on people will generate social learning, including action directed at itself in return. engagement, to feed data hunger and enable the accretion of "meaningful" data from noise, is not neutral, socially or psychologically. the constant monitoring and management of one's social connections, interactions, and interpellations places a nontrivial burden on one's life. the first strategy of engagement is simply massive time investment, to manage the seemingly ever-growing myriad of online relationships (boyd ). to help with the process, social media platforms now bombard their users constantly with notifications, making it difficult to stay away and orienting users' behavior toward mindless and unproductive "grinding" (for instance, repetitively "liking" every post in their feed). but even this intensive "nudging" is often not enough. otherwise, how can we explain the fact that a whole industry of social media derivatives has popped up, to help people optimize their behavior vis-a-vis the algorithm, manage their following, and gain an edge so that they can climb the priority order over other, less savvy users? now users need to manage two systems (if not more): the primary one and the (often multiple) analytics apps that help improve and adjust their conduct in it. in these ways, interaction with machine learning systems tends to encourage continuous effort towards ordinal self-optimization. however, efforts of ordinal optimization, too, may soon become useless: as marilyn strathern (citing british economist charles goodhart) put it, "when a measure becomes a target, it ceases to be a good measure" (strathern , p. ). machine learning systems do not reward time spent on engagement without regard to the impact of that engagement across the network as a whole. now, in desperation, those with the disposable income to do so may turn to money as the next saving grace to satisfy the imperative to produce "good" data at volume and without interruption, and reap social rewards for doing so. the demand for maximizing one's data productivity and machine learning measurability is there, so the market is happy to oblige. with a monthly subscription to a social media platform, or even a social media marketing service, users can render themselves more visible. this possibility, and the payoffs of visibility, are learned socially, both through the observation and mimicry of models (influencers, for instance) and through explicit instruction (from the numerous online and offline guides to maximizing "personal brand"). one can buy oneself instagram or twitter followers.
social media scheduling tools, such as tweetdeck and post planner, help one to plan ahead to try to maximize engagement with one's postings, including by strategically managing their release across time zones. a paying account on linkedin dramatically improves a user's chance of being seen by other users. the same is true of tinder. if a user cannot afford the premium subscription, the site still offers them one-off "boosts" for $ . that will send their profile near the top of their potential matches' swiping queue for min. finally, wealthier users can completely outsource the process of online profile management to someone else (perhaps recruiting a freelance social media manager through an online platform like upwork, the interface of which exhibits ordinal features like client ratings and job success scores). in all the foregoing ways, the inclusionary promise of machine learning has shifted toward more familiar sociological terrain, where money and other vectors of domination determine outcomes. in addition to economic capital, distributions of social and cultural capital, as well as traditional ascriptive characteristics, such as race or gender, play an outsized role in determining likeability and other outcomes of socially learned modes of engagement with machine learning systems. for instance, experiments with mechanical turkers have shown that being attractive increases the likelihood of appearing trustworthy on twitter, but being black creates a contrarian negative effect (groggel et al. ). in another example, empirical studies of social media use among those bilingual in hindi and english have observed that positive modes of social media engagement tend to be expressed in english, with negative emotions and profanity more commonly voiced in hindi. one speculative explanation for this is that english is the language of "aspiration" in india or offers greater prospects for accumulating social and cultural capital on social media than hindi (rudra et al. ). in short, well-established off-platform distinctions and social hierarchies shape the extent to which on-platform identities and forms of materialized labor will be defined as valuable and value-generating in the field of social media.
in this light, jeff bezos may be the perfect illustration of the intertwining of real-world and virtual-world asymmetries of power: the founder and ceo of amazon and currently the richest man in the world has . million followers on twitter, but follows only one person: his ex-wife. ordinalization has implications not just for hierarchical positioning, but also for belonging-an important dimension of all social systems (simmel ). ordinal stigma (the shame of being perceived as inferior) often translates into nominal stigma, or the shame of non-belonging. not obtaining recognition (in the form of "likes" or "followers"), in return for one's appreciation of other people, can be a painful experience, all the more since it is public. concern to lessen the sting of this kind of algorithmic cruelty is presumably why tinder has moved from a simple elo or desirability score (which depends on who has swiped to indicate liking for the person in question, and on their own scores: an ordinal measure) to a system that relies more heavily on type matching (a nominal logic), where people are connected based on taste similarity as expressed through swiping, sound, and image features (carman ). in addition to employing machine learning to rank users, most social media platforms also use forms of clustering and type matching, which allow them to group users according to some underlying similarity (analogical machine learning, in domingos's terms). this kind of computing is just as hungry for data as those we discuss above, but its social consequences are different. now the aim is to figure a person out, or at least to amplify and reinforce a version of that person that appears in some confluence of data exhaust within the system in question. that is, in part, the aim of the algorithm (or rather, of the socio-technical system from which the algorithm emanates) behind facebook's news feed (cooper ). typically, the more data one feeds the algorithm, the better its prediction, the more focused the offering, and the more homogeneous the network of associations forged through receipt and onward sharing of similar offerings. homogeneous networks may, in turn, nourish better, and more saleable, machine learning programs. the more predictable one is, the better the chances that one will be seen-and engaged-by relevant audiences. being inconsistent, or too frequently turning against type in data-generative behaviors, can make it harder for a machine learning system to place and connect a person associatively. in both offline and online social worlds (not that the two can easily be disentangled), deviations from those expectations that data correlations tend to yield are often harshly punished by verbal abuse, dis-association, or both. experiences of being so punished, alongside experiences of being rewarded by a machine learning interface for having found a comfortable group (or a group within which one has strong correlations), can lead to some form of social closure, a desire to "play to type." as one heavy social media user told us, "you want to mimic the behavior [and the style] of the people who are worthy of your likes" in the hope that they will like you in return. that is why social media have been variously accused of generating "online echo chambers" and "filter bubbles," and of fueling polarization (e.g., pariser ). on the other hand, being visible to the wrong group is often a recipe for being ostracized, "woke-shamed," "called out," or even "canceled" (yar and bromwich ).
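the contrast between ordinal scoring and nominal type matching that the tinder example illustrates can be made concrete in code. the sketch below is purely illustrative and hypothetical: it places a generic elo-style rating update (the ordinal logic of a single "desirability" scale) next to a simple taste-similarity measure over swipe histories (the nominal logic of matching by type). it is not a reconstruction of tinder's actual system; every name and number here is invented.

```python
import math

def elo_update(rating_a, rating_b, a_liked, k=32):
    """one step of an elo-style 'desirability' update (ordinal logic).
    a like received by profile a from profile b counts as a 'win' for a:
    likes from highly rated profiles move a's score more than likes from
    low-rated ones. hypothetical, not tinder's actual formula."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    return rating_a + k * ((1.0 if a_liked else 0.0) - expected_a)

def taste_similarity(swipes_a, swipes_b):
    """cosine similarity of two users' swipe histories (nominal logic):
    users are grouped by how alike their expressed tastes are, rather
    than ranked on a single scale."""
    common = set(swipes_a) & set(swipes_b)
    dot = sum(swipes_a[p] * swipes_b[p] for p in common)
    norm_a = math.sqrt(sum(v * v for v in swipes_a.values()))
    norm_b = math.sqrt(sum(v * v for v in swipes_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# toy usage: +1 = right swipe, -1 = left swipe on shared profiles
alice = {"p1": 1, "p2": -1, "p3": 1}
bob = {"p1": 1, "p2": -1, "p4": 1}
print(elo_update(1500, 1700, a_liked=True))  # ~1524: boosted by a like from a higher-rated user
print(taste_similarity(alice, bob))          # ~0.67: alice and bob read as a similar 'type'
```

the design difference is the sociological point: the first function positions everyone on one hierarchy, while the second only asks who resembles whom.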
in these and other ways, implementations of machine learning in social media complement and reinforce certain predilections widely learned socially. in many physical, familial, political, legal, cultural, and institutional environments, people learn socially to feel suspicious of those they experience as unfamiliar or fundamentally different from themselves. there is an extensive body of scholarly work investigating social rules and procedures through which people learn to recognize, deal with, and distance themselves from bodies that they read as strange and ultimately align themselves with and against pre-existing nominal social groupings and identities (ahmed ; goffman ) . this is vital to the operation of the genre of algorithm known as a recommendation algorithm, a feature of all social media platforms. on facebook, such an algorithm generates a list of "people you may know" and on twitter, a "who to follow" list. recommendation algorithms derive value from this social learning of homophily (mcpherson et al. ) . for one, it makes reactions to automated recommendations more predictable. recommendation algorithms also reinforce this social learning by minimizing social media encounters with identities likely to be read as strange or nonassimilable, which in turn improves the likelihood of their recommendations being actioned. accordingly, it has been observed that the profile pictures of accounts recommended on tiktok tend to exhibit similarities-physical and racial-to the profile image of the initial account holder to whom those recommendations are presented (heilweil ) . in that sense, part of what digital technologies do is organize the online migration of existing offline associations. but it would be an error to think that machine learning only reinforces patterns that exist otherwise in the social world. first, growing awareness that extreme type consistency may lead to online boredom, claustrophobia, and insularity (crawford ) has led platforms to experiment with and implement various kinds of exploratory features. second, people willfully sort themselves online in all sorts of non-overlapping ways: through twitter hashtags, group signups, click and purchasing behavior, social networks, and much more. the abundance of data, which is a product of the sheer compulsion that people feel to self-index and classify others (harcourt ; brubaker ) , might be repurposed to revisit common off-line classifications. categories like marriage or citizenship can now be algorithmically parsed and tested in ways that wield power over people. for instance, advertisers' appetite for information about major life events has spurred the application of predictive analytics to personal relationships. speech recognition, browsing patterns, and email and text messages can be mined for information about, for instance, the likelihood of relationships enduring or breaking up (dickson ) . similarly, the us national security agency measures people's national allegiance from how they search on the internet, redefining rights in the process (cheney-lippold ). even age-virtual rather than chronological-can be calculated according to standards of mental and physical fitness and vary widely depending on daily performance (cheney-lippold , p. ). quantitatively measured identities-algorithmic gender, ethnicity, or sexuality-do not have to correspond to discrete nominal types anymore. they can be fully ordinalized along a continuum of intensity (fourcade ) . 
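a minimal sketch can show how the recommendation logic just described mechanically reproduces homophily. the code below is a generic, hypothetical "people you may know" scorer, not facebook's or tiktok's actual algorithm: candidates are ranked purely by the similarity of their feature vectors (interests, network overlap, or even profile-image embeddings, all invented here) to the target user's, so the most "alike" accounts surface first.

```python
import numpy as np

def recommend(user_vec, candidate_vecs, k=3):
    """rank candidate profiles by cosine similarity to the target user.
    because the scorer optimizes for similarity, it surfaces the most
    homophilous candidates first. hypothetical, for illustration only."""
    u = user_vec / np.linalg.norm(user_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = c @ u
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# toy feature vectors (e.g., interest or image embeddings)
user = np.array([0.9, 0.1, 0.4])
candidates = np.array([
    [0.8, 0.2, 0.5],  # very similar -> recommended first
    [0.1, 0.9, 0.2],  # dissimilar -> buried
    [0.7, 0.0, 0.6],
])
print(recommend(user, candidates))
```

note that the scorer's output is not a discrete label but a continuous similarity score: the "continuum of intensity" just described.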
the question now is: how much of a us citizen are you, really? how latinx? how gay? in a machine learning world, where each individual can be represented as a bundle of vectors, everyone is ultimately a unique combination, a category of one, however "precisely inaccurate" that category's digital content may be (mcfarland and mcfarland ). changes in market research from the s to the s, aimed at tracking consumer mobility and aspiration through attention to "psychographic variables," constitute a pre-history, of sorts, for contemporary machine learning practices in commercial settings (arvidsson ; gandy ; fourcade and healy ; lauer ). however, the volume and variety of variables now digitally discernible mean that the latter have outstripped the former exponentially. machine learning techniques have the potential to reveal unlikely associations, no matter how small, that may have been invisible, or muted, in the physically constraining geography of the offline world. repurposed for intervention, disparate data can be assembled to form new, meaningful types and social entities. paraphrasing donald mackenzie ( ), machine learning is an "engine, not a camera." christopher wylie, a former lead scientist at the defunct firm cambridge analytica-which famously matched fraudulently obtained facebook data with consumer data bought from us data brokers and weaponized them in the context of the us presidential election-recalls the experience of searching for, and discovering, incongruous social universes: "[we] spent hours exploring random and weird combinations of attributes. … one day we found ourselves wondering whether there were donors to anti-gay churches who also shopped at organic food stores. we did a search of the consumer data sets we had acquired for the pilot and i found a handful of people whose data showed that they did both. i instantly wanted to meet one of these mythical creatures." after identifying a potential target in fairfax county, he discovered a real person who wore yoga pants, drank kombucha, and held fire-and-brimstone views on religion and sexuality. "how the hell would a pollster classify this woman?" only with the benefit of machine learning-and associated predictive analytics-could wylie and his colleagues claim the capacity to microtarget such anomalous, alloyed types, and monetize that capacity (wylie , pp. - ). to summarize, optimization makes social hierarchies, including new ones, and pattern recognition makes measurable types and social groupings, including new ones. in practice, ordinality and nominality often work in concert, both in the offline and in the online worlds (fourcade ). as we have seen, old categories (e.g., race and gender) may reassert themselves through new, machine-learned hierarchies, and new, machine-learned categories may gain purchase in all sorts of offline hierarchies (micheli et al. ; madden et al. ). this is why people strive to raise their digital profiles and to belong to the categories that are most valued (for instance, through "verified" badges or recognition as a social media "influencer"). conversely, pattern-matching can be a strategy of optimization, too: people will carefully manage their affiliations, for instance, so as to raise their score, aligning themselves with the visible and disassociating themselves from the underperforming. we examine these complex interconnections below and discuss the dispositions and sentiments that they foster and nourish.
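wylie's "weird combinations of attributes" are, computationally, nothing more exotic than boolean conjunctions over merged consumer datasets. the toy query below illustrates the operation; the dataframe, its columns, and its values are entirely invented and merely stand in for the kind of brokered records described above.

```python
import pandas as pd

# hypothetical merged donor/consumer records of the kind data brokers sell
df = pd.DataFrame({
    "person_id": [1, 2, 3, 4, 5],
    "church_donor": [True, False, True, False, True],
    "organic_shopper": [False, False, True, True, True],
    "county": ["fairfax", "loudoun", "fairfax", "fairfax", "arlington"],
})

# the 'weird combination': a conjunction of seemingly unrelated attributes
# carves out a new, machine-addressable micro-segment
segment = df[df["church_donor"] & df["organic_shopper"]]
print(segment)  # persons 3 and 5: an 'alloyed type' ready for microtargeting
```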
it should be clear by now that, paraphrasing latour ( , p. ), we can expect little from the "social explanation" of machine learning; machine learning is "its own explanation." the social does not lie "behind" it, any more than machine learning algorithms lie "behind" contemporary social life. social relations fostered by the automated instantiation of stratification and association-including in social media-are diverse, algorithmic predictability notwithstanding. also, they are continually shifting and unfolding. just as latour ( , p. ) reminds us not to confuse technology with the objects it leaves in its wake, it is important not to presume the "social" of social media to be fixed by its automated operations. we can, nevertheless, observe certain modes of social relation and patterns of experience that tend to be engineered into the ordinal and nominal orders that machine learning (re)produces. in this section, we specify some of these modes of relation, before showing how machine learning can both reify and ramify them. our argument here is with accounts of machine learning that envisage social and political stakes and conflicts as exogenous to the practice-considerations to be addressed through ex ante ethics-by-design initiatives or ex post audits or certifications-rather than fundamental to machine learning structures and operations. machine learning is social learning, as we highlighted above. in what follows, we examine further the kinds of sociality that machine learning makes-specifically those of competitive struggle and dependency-before turning to prospects for their change. social scientists' accounts of modes of sociality online are often rendered in terms of the antagonism between competition and cooperation immanent in capitalism (e.g., fuchs ). this is not without justification. after all, social media platforms are sites of social struggle, where people seek recognition: to be seen, first and foremost, but also to see-to be a voyeur of themselves and of others (harcourt ; brubaker ). in that sense, platforms may be likened to fields in the bourdieusian sense, where people who invest in platform-specific stakes and rules of the game are best positioned to accumulate platform-specific forms of capital (e.g., likes, followers, views, retweets, etc.) (levina and arriaga ). some of this capital may transfer to other platforms through built-in technological bridges (e.g., between facebook and instagram), or undergo a process of "conversion" when made efficacious and profitable in other fields (bourdieu ; fourcade and healy ). for instance, as social status built online becomes a path to economic accumulation in its own right (by allowing payment in the form of advertising, sponsorships, or fans' gifts), new career aspirations attach to social media platforms. according to a recent and well-publicized survey, "vlogger/youtuber" has replaced "astronaut" as the most enviable job for american and british children (berger ). in a more mundane manner, college admissions offices and prospective employers increasingly expect one's presentation of self to include the careful management of one's online personality-often referred to as one's "brand" (e.g., sweetwood ). similarly, private services will aggregate and score any potentially relevant information (and highlight "red flags") about individuals across platforms and throughout the web, for a fee.
in this real-life competition, digitally produced ordinal positions (e.g., popularity, visibility, influence, social network location) and nominal associations (e.g., matches to advertised products, educational institutions, jobs) may be relevant. machine learning algorithms within social media both depend on and reinforce competitive striving within ordinal registers of the kind highlighted above-or, in bourdieu's terms, competitive struggles over field-specific forms of capital. as georg simmel observed, the practice of competing socializes people to compete; it "compels the competitor" (simmel ). socially learned habits of competition are essential to maintain data-productive engagement with social media platforms. for instance, empirical studies suggest that motives for "friending" and following others on social media include upward and downward social comparison (ouwerkerk and johnson ; vogel et al. ). social media platforms' interfaces then reinforce these social habits of comparison by making visible and comparable public tallies of the affirmative attention that particular profiles and posts have garnered: "[b]eing social in social media means accumulating accolades: likes, comments, and above all, friends or followers" (gehl , p. ). in this competitive "[l]ike economy," "user interactions are instantly transformed into comparable forms of data and presented to other users in a way that generates more traffic and engagement" (gerlitz and helmond , p. )-engagement from which algorithms can continuously learn in order to enhance their own predictive capacity and its monetization through sales of advertising. at the same time, the distributed structure of social media (that is, its multinodal and cumulative composition) also fosters forms of cooperation, gift exchange, redistribution, and reciprocity (for instance, lewis ( , p. ) reports that "how-to manuals for building influence on youtube often list collaborations as one of the most effective strategies"). redistributive behavior on social media platforms manifests primarily in a philanthropic mode rather than in the equity-promoting mode characteristic of, for instance, progressive taxation (an exception would be social media campaigns directed at equitable goals, such as campaigns to increase the prominence and influence of previously under-represented groups-the womenalsoknowstuff and pocalsoknowstuff twitter handles, hashtags, and feeds, for example). examples include practices like the #followfriday or #ff hashtag on twitter, a spontaneous form of redistributive behavior that emerged in , whereby "micro-influencers" started actively encouraging their own followers to follow others; recommendation in this mode has been shown to increase recommended users' chance of being followed by a factor of roughly two or three, compared to a recommendation-free scenario (garcia gavilanes et al. ). insofar as those so recommended are themselves able to monetize their growing follower base through product endorsement and content creation for advertisers, this redistribution of social capital serves, at least potentially, as a redistribution of economic capital. even so, to the extent that purportedly "free" gifts, in the digital economy and elsewhere, tend to be reciprocated (fourcade and kluttz ), such generosity might amount to little more than an effective strategy for burnishing one's social media "brand," enlarging one's follower base, and thereby increasing one's store of accumulated social (and potentially economic) capital. far from being antithetical to competitive relations on social media, redistributive practices in a gift-giving mode often complement them (mauss ). social media cooperation can also be explicitly anti-social, even violent (e.g., patton et al. ). in these and other ways, digitized sociality is often at once competitive and cooperative, connective and divisive (zukin and papadantonakis ). whether it is enacted in competitive, redistributive, or other modes, sociality on social media is nonetheless emergent and dynamic. no wonder that bruno latour was the social theorist of choice when we started this investigation. but, as latour ( ) himself pointed out, gabriel tarde might have been a better choice. what makes social forms cohere are behaviors of imitation, counter-imitation, and influence (tarde ). social media, powered by trends and virality, mimicry and applause, parody and mockery, mindless "grinding" and tagging, looks quintessentially tardian. even so, social media does not simply transfer online the imitative practices that occur naturally offline. the properties of machine learning highlighted above-cybernetic feedback; data hunger; accretive meaning-making; ordinal and nominal ordering-lend social media platforms and interfaces a distinctive, compulsive, and calculating quality, engineering a relentlessly "participatory subjectivity" (bucher , p. ; boyd ). how one feels and how one acts when on social media is not just an effect of subjective perceptions and predispositions. it is also an effect of the software and hardware that mediate the imitative (or counter-imitative) process itself-and of the economic rationale behind their implementation. we cannot understand the structural features and phenomenological nature of digital technologies in general, and of social media in particular, if we do not understand the purposes for which they were designed. the simple answer, of course, is that data hunger and meaning accretion are essential to the generation of profit (zuboff ), whether profit accrues from a saleable power to target advertising, from commercializable developments in artificial intelligence, or by other comparable means. strategies for producing continuous and usable data flows to profit-making ends vary, but tend to leverage precisely the social-machine learning interface that we highlighted above. social media interfaces tend to exhibit design features, at both the back- and front-end, that support user dependency and enable its monetization. for example, the "infinite scroll," which allows users to swipe down a page endlessly (without clicking or refreshing), rapidly became a staple of social media apps after its invention in , giving them an almost hypnotic feel and maximizing "time on device" and hence users' availability to advertisers (andersson ). similarly, youtube's recommendation algorithm was famously optimized to maximize users' time on site, so as to serve them more advertisements (levin ; roose ). social media platforms also employ psycho-social strategies to this end, including campaigns to draw people in by drumming up reciprocity and participation-the notifications, the singling out of trends, the introduction of "challenges"-and, more generally, the formation of habits through gamification.
prominent critics of social media, such as tristan harris (originally from google) and sandy parakilas (originally from facebook), have denounced apps that look like "slot machines" and use a wide range of intermittent rewards to keep users hooked and in the (instagram, tiktok, facebook, …) zone, addicted "by design" (schüll ; fourcade ). importantly, this dependency has broader social ramifications than may be captured by a focus on individual unfreedom. worries about the "psychic numbing" of the liberal subject (zuboff ), or the demise of the sovereign consumer, do not preoccupy us so much as the ongoing immiseration of the many who "toil on the invisible margins of the social factory" (morozov ) or whose data traces make them the targets of particularly punitive extractive processes. dependencies engineered into social media interfaces help, in combination with a range of other structural factors, to sustain broader economic dependencies, the burdens and benefits of which land very differently across the globe (see, e.g., taylor and broeders ). in this light, the question of how amenable these dynamics may be to social change becomes salient for many. recent advances in digital technology are often characterized as revolutionary. however, as well as being addictive, the combined effect of machine learning and social learning may be as conducive to social inertia as it is to social change. data hunger on the part of mechanisms of both social learning and machine learning, together with their dependence on data accretion to make meaning, encourages replication of interface features and usage practices known to foster continuous, data-productive engagement. significant shifts in interface design-and in the social learning that has accreted around use of a particular interface-risk negatively impacting data-productive engagement. one study of users' reactions to changes in the facebook timeline suggested that "major interface changes induce psychological stress as well as technology-related stress" (wisniewski et al. ). in recognition of these sensitivities, those responsible for social media platforms' interfaces tend to approach their redesign incrementally, so as to promote continuity rather than discontinuity in user behavior. the emphasis placed on continuity in social media platform design may foster tentativeness in other respects as well, as we discuss in the next section. at the same time, social learning and machine learning, in combination, are not necessarily inimical to social change. machine learning's associative design and propensity to virality have the potential to loosen or unsettle social orders rapidly. and much as the built environment of the new urban economy can be structured to foster otherwise unlikely encounters (hanson and hillier ; zukin ), so digital space can be structured to similar effect. for example, the popular chinese social media platform wechat has three features, enabled by machine learning, that encourage open-ended, opportunistic interactions between random users-shake, drift bottle, and people nearby-albeit, in the case of people nearby, random users within one's immediate geographic vicinity. (these are distinct from the narrower, instrumental range of encounters among strangers occasioned by platforms like tinder, the sexual tenor of which is clearly established in advance, with machine learning parameters set accordingly.)
qualitative investigation of wechat use and its impact on chinese social practices has suggested that wechat challenges some existing social practices, while reinforcing others. it may also foster the establishment of new social practices, some defiant of the prevailing social order. for instance, people report interacting with strangers via wechat in ways they normally would not, including shifting to horizontally structured interactions atypical of chinese social structures offline (wang et al. ). this is not necessarily unique to wechat. the kinds of ruptures and reorderings engineered through machine learning do not, however, create equal opportunities for value creation and accumulation, any more than they are inherently liberating or democratizing. social media channels have been shown to serve autocratic goals of "regime entrenchment" quite effectively (gunitsky ). (with regard to wechat in china and vkontakte in russia, as well as to government initiatives in egypt, ukraine, and elsewhere, seva gunitsky ( ) highlights a number of reasons why, and means by which, nondemocratic regimes have proactively sought, with mixed success, to co-opt social media rather than simply trying to suppress it, in order to ensure central government regimes' durability.) likewise, they serve economic goals of data accumulation and concentration (zuboff ). machine-learned sociality lives on corporate servers and must be meticulously "programmed" (bucher ) to meet specific economic objectives. as such, it is both an extremely lucrative proposition for some and (as we have seen) a socially dangerous one for many. it favors certain companies, their shareholders, and their executives, while compounding conditions of social dependency and economic precarity for most other people. finally, with its content sanitized by underground armies of ghost workers (gray and suri ), it is artificial in both a technical and a literal sense-"artificially artificial," in the words of jeff bezos (casilli and posada ). we have already suggested that machine-learned sociality, as it manifests on social media, tends to be competitive and individualizing (in its ordinal dimension) and algorithmic and emergent (in its nominal dimension). although resistance to algorithms is growing, those who are classified in ways they find detrimental (on either dimension) may be more likely to try to work on themselves, or to navigate algorithmic workarounds, than to contest the classificatory instrument itself (ziewitz ). furthermore, we know that people who work under distributed, algorithmically managed conditions (e.g., mechanical turk workers, uber drivers) find it difficult to communicate amongst themselves and organize (irani and silberman ; lehdonvirta ; dubal ). these features of the growing entanglement of social and machine learning may imply dire prospects for collective action-and, beyond it, for the achievement of any sort of broad-based, solidaristic project. in this section, we tentatively review possibilities for solidarity and mobilization as they present themselves in the field of social media. machine learning systems' capacity to ingest and represent immense quantities of data does increase the chances that those with common experiences will find one another, at least insofar as those experiences are shared online. machine-learned types thereby become potentially important determinants of solidarity, displacing or supplementing the traditional forces of geography, ascribed identities, and voluntary association.
those dimensions of social life that social media algorithms have determined people really care about often help give rise to, or supercharge, amorphous but effective forms of offline action, if only because the broadcasting costs are close to zero. examples include the viral amplification of videos and messages, the spontaneity of flash mobs (molnár ), the leaderless, networked protests of the arab spring (tufekci ) and of the french gilets jaunes (haynes ), and the #metoo movement's reliance on public disclosures on social media platforms. nonetheless, the thinness, fleeting character, and relative randomness of the affiliations summoned in those ways (based on segmented versions of the self, which may or may not overlap) might make social recognition and commonality of purpose difficult to sustain in the long run. more significant, perhaps, is the emergence of modes of collective action that are specifically designed not only to fit the online medium, but also to capitalize on its technical features. many of these strategies were first implemented to stigmatize or sow division, although nothing fates them to that use alone. examples include the anti-semitic (((echo))) tagging on twitter-originally devised to facilitate trolling by online mobs (weisman ) but later repurposed by non-jews as an expression of solidarity; the in-the-wild training of a microsoft chatter bot, literally "taught" by well-organized users to tweet inflammatory comments; the artificial manipulation of conversations and trends through robotic accounts; or the effective delegation, by the trump campaign, of the management of its ad-buying activities to facebook's algorithms, optimized on the likelihood that users would take certain campaign-relevant actions-"signing up for a rally, buying a hat, giving up a phone number" (bogost and madrigal ). the exploitation of algorithms for divisive purposes often spurs its own reactions, from organized counter-mobilizations to institutional interventions by the platforms themselves. during the black lives matter protests, for instance, k-pop fans flooded right-wing hashtags on instagram and twitter with fancams and memes in order to overwhelm racist messaging. even so, the work of "civilizing" the social media public sphere is often left to algorithms, supported by human decision-makers working through rules and protocols (and replacing them in especially sensitive cases). social media companies ban millions of accounts every month for inappropriate language or astroturfing (coordinated operations on social media that masquerade as a grassroots movement): algorithms have been trained to detect and exclude certain types of coalitions on the basis of a combination of social structure and content. in , the british far-right movement "britain first" moved to tiktok after being expelled from facebook, twitter, instagram, and youtube-and then over to vkontakte or vk, a russian platform, after being banned from tiktok (usa news ). chastised in the offline world for stirring discord and hate, the movement found that the economic engines that had given it a megaphone relegated it to their margins with embarrassment. the episode goes to show that there is nothing inherently inclusive about the kind of group solidarity that machine learning enables, and thus it has to be constantly put to the (machine learning) test. in the end, platforms' ideal of collective action may resemble the tardean, imitative but atomized crowd: nimble, but lacking in endurance and capacity (tufekci ).
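the platform-side detection of astroturfing mentioned above can be sketched, in heavily simplified form, as a classifier over features that mix social structure with content. everything below is hypothetical: the fields, weights, and threshold merely illustrate the general approach of combining network signals (follow ratios, cluster density) with content signals (duplicated text), not any platform's actual model.

```python
def coordination_features(account):
    """toy feature extractor mixing structural and content signals.
    all fields are invented, for illustration only."""
    return [
        account["followers"] / max(account["following"], 1),  # follow ratio
        account["posts_per_day"],                             # posting burstiness
        account["duplicate_text_rate"],                       # copy-pasted content
        account["cluster_density"],                           # tightly coordinated subgraph
    ]

def looks_coordinated(account, weights=(-0.5, 0.3, 2.0, 1.5), bias=-1.0):
    """a linear scoring rule standing in for a trained classifier."""
    score = bias + sum(w * f for w, f in zip(weights, coordination_features(account)))
    return score > 0

bot = {"followers": 10, "following": 5000, "posts_per_day": 120,
       "duplicate_text_rate": 0.9, "cluster_density": 0.8}
print(looks_coordinated(bot))  # True: flagged for human review
```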
mimetic expressions of solidarity, such as photo filters (e.g., rainbow), the "blacking out" of one's newsfeed, or the much-bemoaned superficiality of "clicktivism," may be effective at raising consciousness or the profile of an issue, but they may be insufficient to support broader-based social and political transformations. in fact, social media might actually crowd out other solidaristic institutions by also serving as an (often feeble) palliative for their failures. for example, crowdsourced campaigns, now commonly used to finance healthcare costs, loss of employment, or educational expenses, perform a privatized solidarity that is a far cry from the universal logic of public welfare institutions. up to this point, our emphasis has been on the kinds of sociality that machine learning implementations tend to engender on the social media field, in both vertical (ordinal) and horizontal (nominal) configurations. we have, in a sense, been "reassembling the social" afresh, with an eye, especially, to its computational components and chains of reference (latour ). throughout, we have stressed, nonetheless, that machine learning and other applications of artificial intelligence must be understood as forces internal to social life-both subject to and integral to its contingent properties-not forces external to it or determinative of it. accordingly, it is just as important to engage in efforts to reassemble "the machine"-that is, to revisit and put once more into contention the associative preconditions for machine learning taking the form that it currently does, in social media platforms for instance. and if we seek to reassemble the machine, paraphrasing latour ( , p. ), "it's necessary, aside from the circulation and formatting of traditionally conceived [socio-technical] ties, to detect other circulating entities." so what could be some "other circulating entities" within the socio-technical complex of machine learning, and how could we envisage its elements circulating, and associating, otherwise? on some level, our analysis suggests that the world has changed very little. like every society, machine-learned society is powered by two fundamental, sometimes contradictory forces: stratification and association, vertical and horizontal difference. to be sure, preexisting social divisions and inequalities are still very much part of its operations. but the forces of ordinality and nominality have also been materialized and formatted in new ways, of which for-profit social media offer a particularly stark illustration. the machine-learnable manifestations of these forces in social media are among the "other circulating entities" now traceable. recursive dynamics between social and machine learning arise where social structures, economic relations, and computational systems intersect. central to these dynamics in the social media field are the development of a searching disposition to match the searchability of the environment, the learnability of the self through quantified measurement, the role of scores in the processing of social positions and hierarchies, the decategorization and recategorization of associational identities, automated feedback that fosters compulsive habits and competitive social dispositions, and strategic interactions between users and platforms around the manipulation of algorithms. what, then, of prospects for reassembly of existing configurations?
notwithstanding the lofty claims of the it industry, there is nothing inherently democratizing or solidaristic about the kinds of social inclusiveness that machine learning brings about. the effects of individuals' and groups' social lives being rendered algorithmically learnable are ambivalent and uneven. in fact, they may be as divisive and hierarchizing as they may be connective and flattening. moreover, the conditions for entry into struggle in the social media field are set by a remarkably small number of corporate entities and "great men of tech" with global reach and influence (grewal ). a level playing field this most definitely is not. rather, it has been carved up and crenellated by those who happen to have accumulated the greatest access to the data processing and storage capacity that machine learning systems require, together with the real property, intellectual property, and personal property rights, and the network of political and regulatory lobbyists, that ensure that exclusivity of access is maintained (cohen ). power in this field is, accordingly, unlikely to be reconfigured or redistributed organically, or through generalized exhortations to commit to equity or ethics (many versions of which are self-serving on the part of major players). instead, political action aimed at building or rebuilding social solidarities across such hierarchies and among such clusters must work with and through them, in ways attentive to the specifics of their instantiation in particular techno-social settings. to open to meaningful political negotiation those allocations and configurations of power that machine learning systems help to inscribe in public and private life demands more than encompassing a greater proportion of people within existing practices of ruling and being ruled, and more than tinkering around the edges of existing rules. the greater the change in sociality and social relations-and machine learning is transforming both, as we have recounted-the more arrant and urgent the need for social, political, and regulatory action specifically attuned to that change and to the possibility of further changes. social and political action must be organized around the inequalities and nominal embattlements axiomatic to the field of social media, and to all fields shaped in large part by machine learning. and these inequalities and embattlements must be approached not as minor deviations from a prevailing norm of equality (that is, something that can be corrected after the fact or addressed through incremental, technical fixes), but as constitutive of the field itself. this cannot, moreover, be left up to the few whose interests and investments have most shaped the field to date. it is not our aim to set out a program for this here so much as to elucidate some of the social and automated conditions under which such action may be advanced. that, we must recognize, is a task for society, in all its heterogeneity. it is up to society, in other words, to reassemble the machine.

references

how the reification of merit breeds inequality: theory and experimental evidence
step-counting in the "health-society"
strange encounters: embodied others in post-coloniality
introduction to machine learning
cloud ethics: algorithms and the attributes of ourselves and others
social media apps are "deliberately" addictive to users
on the 'pre-history of the panoptic sort': mobility in market research
social learning through imitation
race after technology: abolitionist tools for the new jim code. cambridge: polity
american kids would much rather be youtubers than astronauts. ars technica
the social construction of reality: a treatise in the sociology of knowledge
how facebook works for trump
distinction: a social critique of the judgement of taste
the logic of practice
the field of cultural production
the forms of capital
mining social media data for policing, the ethical way. government technology
surveillance and system avoidance: criminal justice contact and institutional attachment
technologies of crime prediction: the reception of algorithms in policing and criminal courts
dark matters: on the surveillance of blackness
digital hyperconnectivity and the self
if ... then: algorithmic power and politics
nothing to disconnect from? being singular plural in an age of machine learning
a precarious game: the illusion of dream jobs in the video game industry
social media surveillance in social work: practice realities and ethical implications
some elements of a sociology of translation: domestication of the scallops and the fishermen of st brieuc bay
tinder says it no longer uses a "desirability" score to rank people. the verge
non-participation in digital media: toward a framework of mediated political action. media, culture & society
the platformization of labor and society
jus algoritmi: how the national security agency remade citizenship
we are data: algorithms and the making of our digital selves
between truth and power: the legal constructions of informational capitalism
how the facebook algorithm works in and how to work with it
following you: disciplines of listening in social media
can alexa and facebook predict the end of your relationship
the master algorithm: how the quest for the ultimate learning machine will remake our world
alchemy and artificial intelligence. rand corporation
artificial intelligence
the drive to precarity: a political history of work, regulation, & labor advocacy in san francisco's taxi & uber economies
the elementary forms of religious life
the civilizing process: sociogenetic and psychogenetic investigations
engines of anxiety: academic rankings, reputation, and accountability
genesis and development of a scientific fact
culture and computation: steps to a probably approximately correct theory of culture
technologies of the self: lectures at university of vermont
the fly and the cookie: alignment and unhingement in st-century capitalism
seeing like a market
a maussian bargain: accumulation by gift in the digital economy
internet and society: social theory in the information age
the panoptic sort: a political economy of personal information
follow my friends this friday! an analysis of human-generated friendship recommendations
the case for alternative social media
the like economy: social buttons and the data-intensive web
custodians of the internet: platforms, content moderation, and the hidden decisions that shape social media
stigma: notes on the management of spoiled identity
the interaction order
the philosophical baby: what children's minds tell us about truth, love, and the meaning of life
ghost work: how to stop silicon valley from building a new underclass
old communication, new literacies: social network sites as social learning resources
network power: the social dynamics of globalization
race and the beauty premium: mechanical turk workers' evaluations of twitter accounts. information, communication & society
corrupting the cyber-commons: social media as a tool of autocratic stability
policy paradigms, social learning, and the state: the case of economic policymaking in britain
perceiving persons and groups
the architecture of community: some new proposals on the social consequences of architectural and planning decisions
exposed: desire and disobedience in the digital age
simmel, the police form and the limits of democratic policing
posthuman learning: ai from novice to expert?
gilets jaunes and the two faces of facebook
our weird behavior during the pandemic is messing with ai models
there's something strange about tiktok recommendations
turkopticon: interrupting worker invisibility in amazon mechanical turk
from planning to prototypes: new ways of seeing like a state
the structure of scientific revolutions
the age of spiritual machines: when computers exceed human intelligence
the cybernetic matrix of 'french theory'
digital degradation: stigma management in the internet age
economies of reputation: the case of revenge porn
unequal childhoods: class, race, and family life
the moral dilemmas of a safety belt
the pasteurization of france
reassembling the social: an introduction to actor-network-theory
gabriel tarde and the end of the social
an inquiry into modes of existence: an anthropology of the moderns
creditworthy: a history of consumer surveillance and financial identity in america
algorithms that divide and unite: delocalisation, identity and collective action in 'microwork'
google to hire thousands of moderators after outcry over youtube abuse videos. the guardian
distinction and status production on user-generated content platforms: using bourdieu's theory of cultural production to understand social dynamics in online fields
alternative influence: broadcasting the reactionary right on youtube
an engine, not a camera: how financial models shape markets
privacy, poverty, and big data: a matrix of vulnerabilities for poor americans
impression formation in online peer production: activity traces and personal profiles in github
the eighteenth brumaire of louis bonaparte
the metric society: on the quantification of the social
techniques of the body
the gift: the form and reason of exchange in archaic societies
big data and the danger of being precisely inaccurate
birds of a feather: homophily in social networks
digital footprints: an emerging dimension of digital inequality
social learning and imitation
reframing public space through digital mobilization: flash mobs and contemporary urban youth culture
capitalism's new clothes. the baffler
urban social media demographics: an exploration of twitter use in major american cities
the palgrave handbook of security, risk and intelligence
motives for online friending and following: the dark side of social network site connections
the filter bubble: what the internet is hiding from you
when twitter fingers turn to trigger fingers: a qualitative study of social media-related gang violence
the tacit dimension
displacement as regulation: new regulatory technologies and front-line decision-making in ontario works
the making of a youtube radical
understanding language preference for expression of opinion and sentiment: what do hindi-english speakers do on twitter?
addiction by design: machine gambling in las vegas
data for life: wearable technology and the design of self-care
alfred schutz on phenomenology and social relations
how is society possible?
sociology of competition
power, technology and the phenomenology of conventions: on being allergic to onions
testing and being tested in pandemic times
'improving ratings': audit in the british university system
social media tips for students to improve their college admission chances
the laws of imitation
in the name of development: power, profit and the datafication of the global south
life . : being human in the age of artificial intelligence
the invention of the passport: surveillance, citizenship, and the state
analyzing scriptural inference in conservative news practices
policing social media
twitter and tear gas: the power and fragility of networked protest
a brief history of theory and research on impression formation
far-right activists tommy robinson and britain first turn to russia's vk after being banned from tiktok and every big social platform
social comparison, social media, and self-esteem
mind in society: the development of higher psychological processes
towards a reflexive sociology: a workshop with pierre bourdieu. sociological theory
space collapse: reinforcing, reconfiguring and enhancing chinese social practices through wechat. conference on web and social media (icwsm )
(((semitism))): being jewish in america in the age of trump
understanding user adaptation strategies for the launching of facebook timeline
mindf*ck: cambridge analytica and the plot to break america
tales from the teenage cancel culture. the new york times
rethinking gaming: the ethical work of optimization in web search engines
the age of surveillance capitalism: the fight for a human future at the new frontier of power
the innovation complex: cities, tech and the new economy
hackathons as co-optation ritual: socializing workers and institutionalizing innovation in the "new" economy

acknowledgments: we are grateful to kieran healy, etienne ollion, john torpey, wayne wobcke, and sharon zukin for helpful comments and suggestions. we also thank the institute for advanced study for institutional support. an earlier version of this article was presented at the "social and ethical challenges of machine learning" workshop at the institute for advanced study, princeton, november .