key: cord-0632123-qtmie8jv authors: Zannettou, Savvas title: "I Won the Election!": An Empirical Analysis of Soft Moderation Interventions on Twitter date: 2021-01-18 journal: nan DOI: nan sha: 844c6fea98157a79f178e7260643068cb25466d0 doc_id: 632123 cord_uid: qtmie8jv
Over the past few years, there has been a heated debate and serious public concern regarding online content moderation, censorship, and the basic principle of free speech on the Web. To ease some of these concerns, mainstream social media platforms like Twitter and Facebook refined their content moderation systems to support soft moderation interventions. Soft moderation interventions refer to warning labels that are attached to potentially questionable or harmful content with the goal of informing other users about the content and its nature, while the content remains accessible, hence alleviating concerns related to censorship and free speech. In this work, we perform one of the first empirical studies of soft moderation interventions on Twitter. Using a mixed-methods approach, we study the users that share tweets with warning labels on Twitter and their political leaning, the engagement that these tweets receive, and how users interact with tweets that have warning labels. Among other things, we find that 72% of the tweets with warning labels are shared by Republicans, while only 11% are shared by Democrats. By analyzing content engagement, we find that tweets with warning labels tend to receive more engagement. Also, we qualitatively analyze how users interact with content that has warning labels, finding that the most popular interactions are related to further debunking false claims, mocking the author or content of the disputed tweet, and further reinforcing or resharing false claims. Finally, we describe concrete examples of inconsistencies, such as warning labels that are incorrectly added or warning labels that are not added to tweets even though they share questionable and potentially harmful information.
Social media platforms like Twitter and Facebook are under pressure from the public to address issues related to the spread of harmful content like hate speech [9] and online misinformation [30], in particular during major events like elections. To ease the public's concerns and mitigate the effects of these important issues, platforms are continuously refining their guidelines and improving their content moderation systems [10]. Designing and implementing an ideal content moderation system is not straightforward, as there are many challenges and aspects to be considered [8]. First, content moderation should be performed in a timely manner to ensure that harmful content is removed quickly and only a small number of users are exposed to it. This is a particularly hard challenge given the scale of modern social media platforms like Twitter and Facebook. Second, content moderation should be consistent and fair across the user base. Finally, content moderation should be in accordance with basic principles of our society, like freedom of speech. To ease concerns related to freedom of speech and censorship, Facebook and Twitter recently introduced a new feature in their content moderation systems: a type of soft moderation intervention that attaches warning labels and relevant information to content that is questionable, potentially harmful, or misleading [25, 24]. An example of a soft moderation intervention is depicted in Fig. 1, where Twitter moderators attached a warning label to a tweet from President Trump related to the outcome of the 2020 US elections.
These warning labels are designed to "correct" the content of the tweet and provide necessary related information, while ensuring that the freedom of speech principle is not violated. Previous work investigated how users perceive these warning labels [13, 6, 27, 28], assessed their effectiveness and how their design can affect it [1, 11, 16], and examined possible unintended consequences from the use of warning labels [20, 19]. Despite this rich body of research, the majority of it investigates these warning labels in artificial environments, either through interviews, surveys, or crowdsourcing studies. While these studies are useful and important, they do not consider platform-specific affordances such as user interactions with posts that have warning labels (e.g., retweets, likes, etc.). As a research community, we lack empirical evidence to understand how these warning labels are used on social media platforms like Twitter and how users interact and engage with them. In this work, we aim to bridge this research gap by performing an empirical analysis of soft moderation interventions on Twitter. We focus on answering the following research questions:
• RQ1: What are the types of warning labels on Twitter and what kind of users have their tweets flagged more frequently? Are there differences across political leanings?
• RQ2: Is the engagement of content that includes warning labels significantly different compared to content without warning labels?
• RQ3: How do users on Twitter interact with content that includes warning labels?
To answer these research questions, we collect a dataset of tweets, shared between March 2020 and December 2020, that include soft moderation interventions (i.e., warning labels). To do this, we use Twitter's API to collect the timelines of popular verified users. We mainly focus on verified users as they usually have a large audience and their content can receive considerable engagement. Overall, we collect a set of 18K tweets with warning labels, shared by 8.1K users between March 2020 and December 2020. Then, we follow a mixed-methods approach to analyze the engagement of tweets with warning labels and the users that share them (quantitative analysis), as well as how users interact with tweets and warning labels (qualitative analysis).
Findings. Our main findings are:
• We find that 72.8% of the tweets that include warning labels were shared by Republicans, while only 11.6% of the tweets were shared by Democrats. This likely indicates that Republicans tend to disseminate more questionable or potentially harmful information that is eventually flagged by Twitter. Another possible explanation is that, due to the result of the 2020 US elections and claims about election fraud, Twitter's moderation team devotes more resources to moderating politics-related content coming from Republican users (RQ1).
• By analyzing the engagement of tweets, we find that tweets that have warning labels receive more engagement compared to tweets without warning labels. Also, by looking into the users that have increased engagement on tweets with warning labels, we find that most of the users that have high engagement in general also have increased engagement on tweets with warning labels (RQ2).
• Our qualitative analysis indicates that a lot of users interact with content that has warning labels by further debunking false claims, mocking or sharing emotions about the author/content of the questionable tweet, or by reinforcing the false claims that are included in tweets with warning labels. Also, we shed light on some of the challenges and issues that exist when designing and developing large-scale soft moderation intervention systems. We find instances where the warning labels were incorrectly added (e.g., see Fig. 7) and cases where the moderation system is inconsistent (i.e., content should be flagged but it is not). Some of these cases are likely due to the dissemination of similar information across different languages (e.g., see Fig. 8) and across various formats of information like text and videos (RQ3).
Contributions. The contributions of this work are three-fold. First, to the best of our knowledge, we perform one of the first characterizations of soft moderation interventions based on empirical data from Twitter. Also, we plan to make our dataset publicly available (upon request), hence assisting the research community in conducting further studies on soft moderation interventions based on empirical data. Second, our quantitative analysis quantifies the effectiveness of soft moderation interventions on Twitter through the lens of the engagement they receive (e.g., likes, retweets, etc.). This analysis encapsulates engagement from real users interacting with timely content on Twitter, hence it complements and strengthens the findings from studies undertaken in controlled experiments (e.g., via surveys). Finally, our qualitative analysis sheds light on how users interact with content that includes warning labels and helps us understand some of the real-world challenges that exist when designing soft moderation intervention systems.
Moderation interventions on social media platforms can be applied on various levels. First, there are interventions that are applied on the post level (e.g., post removal). Second, there are interventions that exist on the user level [17, 14], like user bans or shadow banning (i.e., limiting the visibility of their activity). Finally, community-wide moderation interventions exist, where platforms moderate specific sub-communities within their platforms (e.g., banning Facebook groups or subreddits) [2, 3, 18, 22, 26]. For each of the above-mentioned levels, there are two different types of interventions: hard and soft interventions. Hard moderation interventions refer to moderation actions that remove content or entities from social media platforms (posts, users, or communities). On the other hand, soft moderation interventions do not remove any content; they aim to inform other users about potential issues with the content (e.g., by adding warning labels) or to limit the visibility of questionable content (shadow banning). Below, we review relevant previous work that studies post-level soft moderation interventions as they are the most relevant to our work. A rich body of previous work investigates soft moderation interventions mainly through interviews, surveys, and crowdsourcing studies. Specifically, Mena [13] performs an experiment using Amazon Mechanical Turk (AMT) workers to understand user perceptions of content that includes warning labels.
By recruiting Facebook users and performing crowdsourcing studies, they find that the warning label had a significant effect on users' sharing intentions; that is, participants were less willing to share content with warning labels. Geeng et al. [6] focus on warning labels that are added on Twitter, Facebook, and Instagram, related to COVID-19 misinformation. Through surveys, they find that users have a positive attitude towards warning labels; however, they highlight that users also verify misinformation through other means, such as searching the Web for relevant information. Saltz et al. [27] focus on warning labels added on visual misinformation related to COVID-19. By conducting in-depth interviews, they find that participants had different opinions regarding warning labels, with many participants perceiving them as politically biased and an act of censorship by the platforms. Kaiser et al. [11] use methods from information security research to evaluate the effectiveness and the design of warning labels. Through controlled experiments, they find that despite the existence of warning labels, users seek information via other means, thus confirming the findings from [6]. Also, by performing crowdsourcing studies and asking users about 8 warning label designs, they conclude that users' information-seeking behavior is significantly affected by the design of the warning label. Seo et al. [28] investigate user perceptions when users are exposed to fact-checking and machine-learning-generated warning labels. Through experiments on AMT, they find that users trust fact-checking warning labels more than machine-learning-generated ones. Moravec et al. [16] highlight that the design of warning labels (i.e., how warnings are presented to users) can significantly change their effectiveness. Also, they emphasize that clearly explaining the warning labels to users can lead to increased effectiveness. Bode et al. [1] study the related stories functionality on Facebook as a means to detect or debunk misinformation. By conducting surveys, they find that when related stories debunk a misinformation story, it significantly reduces the participants' misperceptions (beliefs that are not supported by evidence or expert opinion [19]). Other previous work demonstrates some unintentional consequences from the use of warning labels. Specifically, Pennycook et al. [20] conduct Amazon Mechanical Turk studies and show an implied truth effect, where posts that include misinformation and are not accompanied by a warning label are considered credible. Also, Nyhan and Reifler [19] conduct controlled experiments to assess the effectiveness of warning labels on political false information. They highlight that there is a backfire effect, where participants strengthen their support for false political stories after seeing the warning label that includes a correction. Pennycook et al. [21] emphasize the existence of the illusory truth effect, where users tend to believe false information after getting exposed to it multiple times or for an extended time period, despite the fact that the false information is accompanied by a warning label.
Remarks. Previous work investigated soft moderation interventions in artificial testing environments like interviews, surveys, and crowdsourcing studies. This previous work is particularly important as it helps us understand how people intend to interact and engage with content that includes warning labels or corrections.
However, they do not capture platform-specific peculiarities and they do not adequately capture how people interact and engage with warning labels in realistic scenarios (e.g., when reading a tweet from the US President). In this work, we address these limitations by performing, to the best of our knowledge, one of the first empirical analyses of soft moderation interventions on Twitter.
We start our data collection on Twitter and in particular on verified users, which are users who have an "especially large audience or are notable in government, news, entertainment, or another designated category." We mainly focus on verified users as they usually have a large audience and can have substantial impact on online discussions, hence moderating content from these users is important. We collect the dataset of Twitter verified users from Pushshift. The dataset includes Twitter account metadata for 351,655 verified users. Then, for each user, we use Twitter's API to obtain recent tweets/retweets shared by these users (i.e., their timeline). We also collect soft-moderation-specific metadata for each tweet: these include whether a tweet is accompanied by a warning label and relevant metadata (e.g., label text, landing URL, etc.). Note that, due to rate limiting of the Twitter API, we tried to collect activity only from the top 170,506 users based on the number of their followers (corresponding to 48.4% of all the Twitter verified accounts in the Pushshift dataset). We managed to collect data for 168,126 users, as the rest were either deleted, suspended, or set to private. Our data collection process was conducted between December 7, 2020 and December 31, 2020. Overall, we collect 79,361,081 tweets shared during 2020 from 168,126 users. Next, we filter all tweets that had soft moderation interventions (i.e., warning labels) from our dataset; we find 29,232 tweets from 9,334 verified users. This dataset also includes retweets of tweets with warning labels as well as tweets that quote a tweet with a warning label. Due to this, we rehydrate, using the Twitter API, all quoted and retweeted tweets that had a warning label; we get an additional 3,106 tweets from 1,888 users. Note that this procedure resulted in the acquisition of tweets from unverified users. This is because verified users in our dataset retweeted or quoted tweets from unverified users. Given that this content is shown to the followers of verified users, we keep tweets from unverified users in our dataset as well. After excluding all retweets, our final dataset includes 18,765 tweets that include warning labels (either on the tweet itself or on referenced tweets like quoted tweets) from 8,143 users (see Table 1). In Table 1, tweets with warning labels on them (e.g., Fig. 1) account for 2,244 tweets from 853 users, and quoted tweets where the warning is on the quoted tweet (e.g., Fig. 9) account for 16,571 tweets from 7,651 users; the remaining rows cover quoted tweets where the warning is on the comment above (e.g., Fig. 8) or on both tweets. We split our dataset into two parts: 1) tweets that have warning labels attached to them (first row in Table 1); and 2) tweets that quote other tweets and any (or both) of the tweets have warning labels (see second-fourth row in Table 1). For the remainder of this paper, we call the first part of our dataset tweets with warning labels and the second part of our dataset quoted tweets.
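To make the collection and filtering steps above concrete, the following is a minimal Python sketch. It is not the paper's actual pipeline: it assumes tweepy with the v1.1 user_timeline endpoint, a list verified_ids derived from the Pushshift dump, and a hypothetical per-tweet "warning_label" field, since the paper does not specify the exact client or payload fields used.

```python
# Sketch of the timeline collection and warning-label filtering step.
# Assumptions (not specified in the paper): tweepy with API v1.1 user_timeline,
# and a hypothetical per-tweet field holding the soft-moderation metadata.
import json
import tweepy

auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)  # respect rate limits during collection

def collect_timeline(user_id, max_tweets=3200):
    """Fetch up to ~3,200 recent tweets for one verified user."""
    tweets = []
    for page in tweepy.Cursor(api.user_timeline, user_id=user_id,
                              count=200, tweet_mode="extended").pages():
        tweets.extend(t._json for t in page)
        if len(tweets) >= max_tweets:
            break
    return tweets

def has_warning_label(tweet):
    """Hypothetical check: assumes flagged tweets carry a soft-moderation field.
    The field name is an assumption; adjust to the payload your client returns."""
    return tweet.get("warning_label") is not None

# verified_ids: the top ~170K verified accounts by follower count (from the Pushshift dump);
# building this list is out of scope for the sketch.
flagged = []
for uid in verified_ids:
    flagged.extend(t for t in collect_timeline(uid) if has_warning_label(t))

with open("tweets_with_warning_labels.json", "w") as f:
    json.dump(flagged, f)
```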
Ethical considerations and data availability. We emphasize that we collect and work entirely with publicly available data, as we do not collect any data from users who have a private account. Overall, we follow standard research ethics standards [23], like refraining from tracking users across sites and compromising user privacy. Also, to help advance empirical research related to soft moderation interventions on Twitter, we will make publicly available (upon request) the tweet IDs and their corresponding warning labels.
In this section, we analyze the different types of warning labels and how they are shared over time. Also, we perform a user-based analysis of users who shared tweets with warning labels or quoted tweets, aiming to uncover differences across users that have opposing political leanings. We start by looking into the different types of warning labels that exist in our dataset. To do this, we focus on tweets that include warning labels (see first row in Table 1), specifically, 2,244 tweets posted by 853 users between March 7, 2020 and December 30, 2020. Table 2 shows all warning labels in our dataset along with their respective frequency and percentage over all the tweets. Overall, we find 13 different warning labels, with the majority of them being related to the 2020 US elections. For instance, the most popular warning label in our dataset is "This claim about election fraud is disputed" with 58% of all tweets. Other 2020 US election warning labels are related to the security of the elections, like "Learn about US 2020 election security efforts" (12%) and "Learn how voting by mail is safe and secure" (5.8%), as well as to the outcome of the elections, like "Multiple sources called this election differently" (4.2%) and "Election officials have certified Joe Biden as the winner of the U.S. Presidential election" (2.8%). Interestingly, we also find warning labels referring to the 2020 US elections written in other languages (i.e., Portuguese). We find that 0.49% of the tweets include "Esta reivindicação de fraude é contestada" (translates to "This fraud claim is disputed") and "Saiba por que urnas eletrônicas são seguras" (translates to "Find out why electronic voting machines are safe"). Apart from politics-related warning labels, we find a general-purpose warning label that aims to inform users about manipulated media (e.g., images or videos), with 8.7% of all tweets in our dataset. Finally, we find a COVID-19-specific warning label, "Get the facts about COVID-19" (1.15%), that aims to inform users about health-related issues and in particular the COVID-19 pandemic.
Next, we analyze how these warning labels are shared over time. Note that 92.8% of the tweets are shared between November 1, 2020 and December 30, 2020. Fig. 2 shows how the top 10 most popular warning labels in our dataset are shared over time (we focus on the period between November 1, 2020 and December 30, 2020 for readability purposes). We plot the frequency of warning labels over time and find two different temporal patterns. First, we find warning labels that are short-lived, as the majority of their appearances on tweets happen within a short period of time. Concretely, both "Learn about US 2020 election security efforts" and "Official sources may not have called the race when this was Tweeted" are exclusively used during the first week of November 2020. On the other hand, we find warning labels that are long-lived. E.g., the label "This claim about election fraud is disputed" is used for the entirety of the period between November and December 2020.
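As a rough illustration of the per-label temporal counts behind Fig. 2, the pandas sketch below groups labeled tweets by week; the column names ("created_at", "label_text") are illustrative assumptions, not the paper's actual schema.

```python
# Minimal sketch of the weekly label counts plotted in Fig. 2.
# Assumes a dataframe with one row per labeled tweet and columns
# "created_at" and "label_text" (names are illustrative).
import pandas as pd

df = pd.read_json("tweets_with_warning_labels.json")
df["created_at"] = pd.to_datetime(df["created_at"])

# Restrict to the November-December 2020 window used in the figure.
window = df[(df["created_at"] >= "2020-11-01") & (df["created_at"] <= "2020-12-30")]

top10 = window["label_text"].value_counts().head(10).index
weekly = (window[window["label_text"].isin(top10)]
          .groupby([pd.Grouper(key="created_at", freq="W"), "label_text"])
          .size()
          .unstack(fill_value=0))
print(weekly)
```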
Overall, these results indicate that warning labels are time and context dependent, with some of them being short-lived (a few days) and some of them being long-lived (several months).
Here, we look into the users who share tweets with warning labels. Recall that our data collection involves 168K users and only 853 of them share tweets that have warning labels, indicating that only a small percentage (0.5%) of Twitter users have warning labels attached to their content. As per Fig. 3, out of the 853 users, 70% of them had only one tweet with a warning label, while only 3.6% of these users had at least 10 tweets with warning labels. Overall, only a small percentage of users have warning labels on multiple tweets.
Users' political leaning. As we described above, our dataset has a strong political nature and the majority of the warning labels refer to claims about the 2020 US elections (e.g., claims about election fraud, see Table 2). Motivated by this, we augment our dataset with information about the political leaning of each user that shared tweets with warning labels. To infer users' political leaning, we use the methodology presented in [12] and in particular the Political Bias Inference API that is made publicly available by [15]. The API generates a vector with the topical interests of each user and their frequency. To do this, the API collects all the friends of the user (i.e., people that the user follows), infers the topics for each friend using the methodology in [7, 29], and calculates a vector with all the topics and their frequencies. Finally, by comparing the topical vectors to a ground-truth dataset of Republican and Democrat Twitter users, the API infers whether a Twitter user has a Republican, Democrat, or Neutral political leaning. In this work, we use the Political Bias Inference API, between January 3 and January 10, 2021, to infer the political leaning of the 8,142 Twitter users in our dataset. Table 3 reports the number of tweets and users per inferred political leaning for the entire dataset, broken down into the tweets that had warning labels and the quoted tweets. We observe that for the entire dataset, 51% of the users are Democrats, 13.4% are Republicans, almost 32% are inferred as neutral, while for the remaining 1.4% we were unable to infer their political leaning. This is because some users were either suspended or made their accounts private by the time we were collecting their friend list, hence the Political Bias Inference API was unable to make an inference. Interestingly, when looking at the tweets with warning labels in Table 3, we find that the majority of the tweets with warning labels are shared by Republicans (72% of all tweets vs 11% for Democrats). This likely indicates that, due to the context and developments related to the 2020 US elections, Republicans tend to share more questionable content that is more likely to receive warning labels from Twitter. Another possible explanation is that Twitter devotes more resources to moderating content coming from Republican users. For the quoted tweets, we observe that Democrats tend to comment on tweets with warning labels more often than Republicans (56% vs 16.5% for Republicans).
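The sketch below illustrates only the final comparison step described above, i.e., matching a user's topic-frequency vector against ground-truth Republican and Democrat profiles. It is a simplified stand-in and not the Political Bias Inference API itself; the centroid representation, the cosine-similarity formulation, and the margin threshold are assumptions made for illustration.

```python
# Simplified illustration of the topical-vector comparison described above.
# NOT the Political Bias Inference API: centroids, cosine similarity, and the
# margin threshold are assumptions for this sketch.
import numpy as np

def topic_vector(topic_counts, vocab):
    """Turn a {topic: frequency} dict into a normalized vector over a fixed vocabulary."""
    v = np.array([topic_counts.get(t, 0) for t in vocab], dtype=float)
    return v / v.sum() if v.sum() > 0 else v

def infer_leaning(user_topics, rep_centroid, dem_centroid, vocab, margin=0.05):
    """Compare a user's topic vector to Republican/Democrat ground-truth centroids."""
    u = topic_vector(user_topics, vocab)

    def cos(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0

    rep, dem = cos(u, rep_centroid), cos(u, dem_centroid)
    if abs(rep - dem) < margin:   # too close to call -> Neutral
        return "Neutral"
    return "Republican" if rep > dem else "Democrat"
```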
Top users. But who are the most "prolific" users with regard to tweets that include warning labels or quoted tweets? Table 4 and Table 5 show the top 20 users in our dataset based on the number of tweets that had warning labels and the quoted tweets, respectively. (Table 4: Top 20 users who had the most warning labels on their tweets; (U) refers to unverified users who exist in our dataset because verified users retweeted or quoted their tweets that had warning labels. Table 5: Top 20 users who quoted tweets that had warning labels. Both tables also report the account status of each user as of January 9, 2021.) For each user, we report the inferred political leaning and whether the account was active or suspended on January 9, 2021. We make several observations. First, in both cases, the most prolific user is President Trump, with 14.3% of all tweets that had warning labels and 0.4% of all quoted tweets. The account of President Trump was permanently suspended by Twitter on January 8, 2021, due to the risk of further incitement of violence [31], after his supporters attacked the US Capitol, causing the death of five people [5]. Second, we observe that the majority of the top 20 users who shared tweets with warning labels are inferred as Republicans (see Table 4). This is not the case for the quoted dataset (see Table 5). Third, we note the existence of three unverified accounts among the top 20 users who shared tweets with warning labels. This indicates that Twitter's moderation mechanism is not only limited to verified users. Finally, we note that 6 out of the top 20 users with tweets that had warning labels were suspended by Twitter (as of January 9, 2021). This highlights that the continuous dissemination of questionable content that leads to the addition of warning labels is likely to result in hard moderation interventions (i.e., user suspensions).
Take-aways. The main take-away points from our analysis of warning labels and Twitter users are:
1. Most of the warning labels on Twitter, between November 2020 and December 2020, were related to the 2020 US elections. Also, we find different temporal patterns in the use of warning labels, with a few of them being short-lived (less than a week) and some of them being long-lived (across several months).
2. We find warning labels used to inform users about manipulated multimedia, while some warning labels are in languages other than English (i.e., Portuguese). This highlights the effort put into soft moderation interventions and some of the challenges that exist (e.g., tracking claims across multiple information formats or languages).
3. The majority of tweets with warning labels (72%) are shared by Republicans, while Democrats are more likely to comment on tweets with warning labels using Twitter's quoting functionality (56% of the tweets compared to 16% for Republicans). These results likely indicate that Republicans are sharing more questionable content that is eventually flagged, or that Twitter devotes more resources to moderating content shared by Republicans, likely due to claims about the safety and result of the 2020 US election.
4. The continuous dissemination of potentially harmful information that is annotated with warning labels can lead to hard moderation interventions like permanent user suspensions. We find that 6 out of the 20 top users, in terms of sharing tweets with warning labels, were permanently suspended by Twitter as of January 9, 2021.
The goal of warning labels is to provide adequate information on tweets that include questionable content and might be harmful for users or society.
Thus, we expect that users who see content annotated with warning labels will be less willing to engage with or reshare such content [13]. In this section, we aim to quantify the differences in engagement between tweets that include warning labels and tweets that do not. Our empirical analysis can quantify how effective the warning labels on Twitter are, through the lens of engagement. For each user in our dataset, we extract two sets of tweets: 1) tweets that have warning labels; and 2) a control dataset of tweets that do not have warning labels. Note that we limit our analysis to the 115 users that had at least three tweets with warning labels, to make sure that our user analysis is not influenced by one or two tweets. Then, for each engagement signal, we calculate the mean value for each group of tweets (warning-label tweets and control) per user. Our analysis takes into account four engagement signals: 1) Likes (how many times the tweet was liked by other users); 2) Retweets (how many times the tweet was retweeted by other users); 3) Quotes (number of other tweets that retweeted the tweet with a comment); and 4) Replies (number of replies that the tweet received). Fig. 4 shows the CDF of the average number of likes/retweets/quotes/replies of tweets with and without warning labels per user. For each engagement signal, we perform two-sample Kolmogorov-Smirnov statistical significance tests, finding that in all cases the engagement of tweets with warning labels is significantly different compared to tweets without warning labels (p < 0.01). We observe that, for all four engagement signals, users receive increased engagement on tweets that have warning labels. For likes (see Fig. 4(a)), we find a median value of 10,303.9 average likes per user for tweets with warning labels, whereas for the control dataset we find a median value of 3,834.3 (2.6x less than warning labels). For retweets (see Fig. 4(b)), we find a median value of 3,533 average retweets per user for tweets with warning labels, while for the control dataset the median value is only 1,129.2 (3.1x less than tweets with warning labels). For replies (see Fig. 4(c)), we find a median value of 235.7 replies for the control dataset, while for warning labels the median value increases to 494 (2.1x increase over the control dataset). For quotes (see Fig. 4(d)), we find a median value of 350.6 average quotes per user for the warning labels dataset, whereas for the control dataset we find a median value of 122.9 quotes (2.8x less than tweets with warning labels). Also, from Fig. 4, we can observe that there is a small proportion of users who have less engagement on the warning labels dataset. To quantify the proportion of users who have more engagement on control tweets than on the tweets that had warning labels, we plot the fraction of the mean number of each engagement metric on tweets with warning labels over the control dataset (see Fig. 5). When this fraction is below 1, it means that the user's control dataset had more engagement compared to the user's warning labels dataset. We find that 26%, 23%, 21%, 35% of the users had more engagement on their control tweets than on the ones with warning labels for likes, retweets, quotes, and replies, respectively.
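The following is a minimal sketch of the per-user engagement comparison and the two-sample Kolmogorov-Smirnov tests reported above; the dataframe layout (user_id, has_warning_label, and per-tweet engagement columns) and file name are illustrative assumptions.

```python
# Sketch of the per-user engagement comparison (warning-label vs. control tweets)
# and the two-sample KS tests reported above. Column names are assumptions.
import pandas as pd
from scipy.stats import ks_2samp

tweets = pd.read_json("all_tweets.json")   # all tweets of the 115 analyzed users
signals = ["likes", "retweets", "quotes", "replies"]

# Mean engagement per user, separately for labeled and control tweets
# (has_warning_label is assumed to be a boolean column).
means = (tweets.groupby(["user_id", "has_warning_label"])[signals]
               .mean()
               .reset_index())
labeled = means[means["has_warning_label"]]
control = means[~means["has_warning_label"]]

for s in signals:
    stat, p = ks_2samp(labeled[s], control[s])
    print(f"{s}: KS={stat:.3f}, p={p:.4f}")

# Fraction of mean engagement on labeled tweets over control (as in Fig. 5):
# values below 1 mean the user's control tweets received more engagement.
merged = labeled.merge(control, on="user_id", suffixes=("_label", "_ctrl"))
for s in signals:
    frac = merged[f"{s}_label"] / merged[f"{s}_ctrl"]
    print(s, "share of users with fraction < 1:", (frac < 1).mean())
```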
From our analysis thus far, it is unclear which users have increased vs. decreased engagement on tweets with warning labels over the control dataset. To assess whether there is a correlation between the overall engagement that a user receives and whether a user will receive increased or decreased engagement on tweets with warning labels, we plot the overall engagement (i.e., the mean engagement metric over all the user's tweets) against the fraction of engagement on warning labels over the control dataset (see Fig. 6). We observe that for all engagement metrics, most of the users that have on average high engagement on their content (i.e., over 1K likes, over 100 retweets, over 100 quotes, and over 100 replies) also receive increased engagement on tweets with warning labels over the control (note that for these users the fraction is, in most cases, between 1 and 10).
Take-aways. The key take-away points from our engagement analysis are:
1. Tweets with warning labels tend to receive more engagement compared to tweets without warning labels.
2. We find that 65%-79% (depending on the engagement metric) of the users receive increased engagement on their tweets that have warning labels compared to tweets without warning labels.
3. By looking at the users that have increased vs. decreased engagement on tweets with warning labels compared to the control dataset, we find that most users that in general have high engagement also have increased engagement on tweets with warning labels.
In this section, we study how users interact with tweets that have warning labels. To do this, we use Twitter's quote functionality, where users can retweet a tweet with a comment. Specifically, we qualitatively analyze three sets of tweets: 1) the 50 tweets that quote other tweets and Twitter includes warning labels on both tweets; 2) 122 tweets (out of the 169) that quote other tweets and Twitter includes a warning label only on the top tweet (i.e., the user's comment); the 47 other tweets had a quoted tweet that was deleted when we tried to qualitatively assess them; and 3) 150 randomly selected tweets that quote another tweet that includes a warning label. We qualitatively analyze all three sets of tweets to understand how users interact with people that share content that is annotated with warning labels, how users interact with questionable content (e.g., false claims), and how users discuss or perceive the existence of warning labels on Twitter.
6.1 Quoted tweets where both tweets include warning labels.
Intuitively, when both the quoted tweet and the comment tweet above include warning labels (e.g., Fig. 7), one expects that both tweets include information that is questionable or potentially harmful. Here, we qualitatively analyze the tweets in our dataset to verify whether this is true and in what other cases both the quoted and the comment tweet include warning labels.
Reinforcing false claims. The majority of the comments above the quoted tweets aim to retweet and reinforce the false claim that is included in the quoted tweet (86%, 43 out of the 50). Two of them achieve this using a single word ("this" or "true"), two of them use videos, five of them achieve it by tweeting a single hashtag (#stopthesteal and #ExposeDominion, which both refer to election fraud claims), while the rest of the comment tweets use text to reinforce the claim. The fact that some of the comments consist of only a single word shows that adding warning labels to tweets requires considering the context and other quoted tweets, and not focusing on the tweet in isolation.
Also, in 2 out of the 43 comments that reinforce the claim of the quoted tweet, the users share their anti-censorship opinions or dispute that the content should be labeled (i.e., "Say NO to Big Tech censorship!" and "Twitter labeled this tweet as disputed.... What exactly is Twitter disputing here?"). These results further corroborate the findings from [27].
Testing warning labels. We find one tweet where the user commented with exactly the same content as the quoted tweet, likely to verify whether his comment would eventually get a warning label.
Incorrect warning labels. We find one specific case where the warning labels were seemingly incorrectly added (see Fig. 7). Both the comment and the quoted tweet had the warning label "Get the facts about COVID-19" and both included the terms oxygen and frequency/frequently. This likely indicates that Twitter employs automated means to attach warning labels and that, in some cases, warning labels are incorrectly added to some content.
6.2 Warning labels on the comment above the quoted tweet
Next, we investigate cases where users quote a tweet that has no warning label and subsequently their comment tweet receives a warning label (e.g., Fig. 8).
Commenting on news or real-world events and making false claims about the 2020 US elections. We find 45 tweets (36%) that comment on real-world events, news, or facts about the election, and make false claims about the election (e.g., claims about election fraud).
Reinforcing questionable content. In 18 tweets (14%) the comment above reinforces questionable content that is included in the quoted tweet and makes the claim even more questionable or harmful, hence getting flagged by Twitter.
Inconsistencies in warning labels. We find several cases where there are inconsistencies with the inclusion of warning labels. Specifically, we find 28 cases (23%) where both the quoted tweet and the comment hint at election fraud during the 2020 US elections, yet only the quoted tweet includes a warning label. In 7 of these cases, the comment makes a similar claim to the quoted tweet, with the difference that it uses a video instead of text. This highlights the challenges in flagging content on social media platforms and in particular flagging the same information across multiple diverse formats (i.e., text, images, videos). Also, we find another case with inconsistencies related to the use of language. In this case, the quoted tweet and the comment above share the same information but in different languages (quoted tweet in French and comment above in English), yet only the English comment includes a warning label (see Fig. 8).
Updates on warning labels. During our qualitative analysis, we observed that Twitter occasionally updates the warning labels on some tweets. In particular, we find many instances where Twitter changed the warning label from "Multiple sources called this election differently" to "Election officials have certified Joe Biden as the winner of the U.S. Presidential election". This highlights that Twitter continuously refines the use of warning labels and that it is likely that warning labels on content are updated to make them clearer or stronger.
Figure 9: Example of a tweet that mocks the author or the content of the quoted tweet.
Here, we aim to understand how users interact with content that includes warning labels by looking into tweets that quote content that has warning labels (e.g., Fig. 9).
We find various behaviors, including mocking the author/content of the quoted tweet, debunking false claims in the quoted tweet, reinforcing the false claims, and sharing opinions on Twitter's warning labels. We provide more details below.
Mocking or sharing emotions about the author/content of the questionable or false claim. We find 37 tweets that mock the content or the author of the tweet that includes a warning label. For instance, when Trump tweeted the tweet in Fig. 1, several users quoted that tweet and made absurd claims about themselves like "I WON THE NOBEL PRIZE !" (see Fig. 9) and "Let me try... I AM BEYONCE!!". Other users quoted tweets with warning labels to express their emotions about the content or the author of the tweet: 4 tweets calling the quoted tweet author a liar, 4 tweets calling the author a loser, 6 tweets expressing their disgust at the content of the tweet, and 1 tweet expressing embarrassment.
Debunking false claims. We find 19 tweets that debunk false claims that are in quoted tweets. For instance, a user quoted a tweet shared by President Trump and wrote: "President Trump just tweeted again about claims of "secretly dumped ballots" for Biden in Michigan. This is false. These claims are based on screenshots of a mistaken unofficial tally on one site's election map that was caused by a typo that was corrected in about 30 minutes."
Reinforcing false claims. Similar to the tweets where both the quoted tweet and the comment above had warning labels, we find 6 tweets that reinforce false claims that exist in the quoted tweets.
Sharing opinions on warning labels. We find 6 tweets that share users' opinions on warning labels and how effective they are. Specifically, one tweet just indicates that the quoted tweet includes a warning label, while two tweets question how effective the warning labels are and request stronger and more straightforward labels. Also, we find three tweets that call for hard moderation interventions (i.e., user bans), in particular asking Jack Dorsey (Twitter's CEO) or Twitter Support to ban the account of President Trump due to the spread of false claims (e.g., ".@jack @Twitter make this lying stop! Your warnings of him lying just are not enough. #BanTrump"). Interestingly, we find one tweet where the comment reinforces the false claim included in the quoted tweet by claiming that Twitter tries to cover up election fraud by using warning labels.
Other. The rest of the tweets we qualitatively analyzed are tweets where users shared their personal or political opinion on the content of the quoted tweet, or cases where users reshared the content of the quoted tweet either by paraphrasing or by translating the content to other languages.
The main take-away points from our qualitative analysis are:
1. We find various user interactions with tweets that have warning labels, such as debunking false claims, mocking users that tweeted questionable content, or reinforcing false claims despite the inclusion of warning labels.
2. Soft moderation intervention systems are not always consistent, as we find several cases where content should have warning labels but does not. E.g., we find cases where videos share the same information as textual tweets that include warning labels; however, the tweet with the video does not include a warning. Another example is with content across various languages. These cases show the challenges that exist in large-scale soft moderation systems.
3. We find a case where warning labels were incorrectly added, likely due to the use of automated means. This shows the need to devise systems that rely on human moderators who receive signals from automated means (i.e., the human makes the final decision), hence decreasing the likelihood of such cases.
In this work, we performed one of the first characterizations, based on empirical data, of soft moderation interventions on Twitter. Using a mixed-methods approach, we analyzed the warning labels, the users that share tweets that have warning labels, and the engagement that this content receives. Also, we investigated how users interact with such content and what challenges and inconsistencies exist in large-scale soft moderation systems. Our user analysis showed that 72% of the tweets with warning labels were shared by Republicans. This likely indicates that Republicans were sharing more questionable content during the 2020 US elections or that Twitter devoted more resources to moderating content from Republicans. Nevertheless, this finding prompts the need for greater transparency by social media platforms to ease concerns related to censorship and possible moderation biases towards a specific political party [4]. Our engagement analysis showed that tweets with warning labels tend to receive more engagement compared to tweets without warning labels. This indicates that warning labels might not be very effective on politics-related content, hence reinforcing the results from [19]. This highlights the need to design stricter soft moderation interventions for content that is more harmful than others, with the goal of reducing its spread. Finally, our qualitative analysis showed that users further debunk false claims using Twitter's quoting mechanism, mock the user/content of the tweet with the warning label, and reinforce false claims (despite the existence of warning labels). Also, we found some inconsistencies in content that should be flagged across multiple information formats or languages. This highlights the need to further study such moderation systems to fully understand how they work and what their caveats are, with the goal of increasing their effectiveness, consistency, fairness, and transparency.
Limitations. Our work has some limitations. First, we analyzed mainly politics-related content, shared during a short period of time (two months), on a single platform (Twitter). Thus, it is unclear whether our results hold in contexts not related to politics or for soft moderation systems on other platforms like Facebook (which has different platform affordances and a different design of soft moderation interventions). Also, our engagement analysis does not account for the content of tweets, hence we do not investigate whether the increased engagement on tweets with warning labels is due to the dissemination of more controversial or sensationalistic content that is likely to attract more users. Finally, since we do not know exactly when a soft moderation intervention happened and how the engagement changes over time, we do not analyze whether the warning labels were added because the tweets had already received large engagement.
References
[1] In related news, that was wrong: The correction of misinformation through related stories functionality in social media
[2] Quarantined! Examining the Effects of a Community-Wide Moderation Intervention on Reddit
[3] You can't stay here: The efficacy of Reddit's 2015 ban examined through hate speech
[4] Social media: Is it really biased against US Republicans?
[5] Capitol attack: the five people who died
[6] Social Media COVID-19 Misinformation Interventions Viewed Positively
[7] Cognos: Crowdsourcing search for topic experts in microblogs
[8] Custodians of the Internet: Platforms, content moderation, and the hidden decisions that shape social media
[9] "Massive rise" in hate speech on Twitter during presidential election
[10] Twitter Updates Hate Speech Policy to Include Links to
[11] Adapting Security Warnings to Counter Misinformation
[12] Quantifying search bias: Investigating sources of bias for political searches in social media
[13] Cleaning up social media: The effect of warning labels on likelihood of sharing false news on Facebook
[14] Setting the record straighter on shadow banning
[15] Political Bias Inference API
[16] Appealing to sense and sensibility: System 1 and System 2 interventions for fake news on social media
[17] Censored, suspended, shadowbanned: User interpretations of content moderation on social media platforms
[18] User Migration in Online Social Networks: A Case Study on Reddit During a Period of Community Unrest
[19] When corrections fail: The persistence of political misperceptions. Political Behavior
[20] The implied truth effect: Attaching warnings to a subset of fake news headlines increases perceived accuracy of headlines without warnings
[21] Prior exposure increases perceived accuracy of fake news
[22] Does Platform Migration Compromise Content Moderation? Evidence from r/The_Donald and r/Incels
[23] Ethical research standards in a world of big data
[24] An Update on Our Work to Keep People Informed and Limit Misinformation About COVID-19
[25] Updating our approach to misleading information
[26] The Aftermath of Disbanding an Online Hateful Community
[27] Encounters with Visual Misinformation and Labels Across Platforms: An Interview and Diary Study to Inform Ecosystem Approaches to Misinformation Interventions
[28] Trust It or Not: Effects of Machine-Learning Warnings in Helping Individuals Mitigate Misinformation
[29] Inferring who-is-who in the Twitter social network
[30] Facebook's failure: did fake news and polarized politics get Trump elected?
[31] Permanent suspension of realDonaldTrump
Acknowledgments. We thank Jeremy Blackburn, Oana Goga, Krishna Gummadi, Shagun Jhaver, and Manoel Horta Ribeiro for fruitful discussions and feedback during this work.