Healthy Twitter discussions? Time will tell

Dmitry Gnatyshak, Dario Garcia-Gasulla, Sergio Alvarez-Napagao, Jamie Arjona, Tommaso Venturini

2022-03-21

Studying misinformation and how to deal with unhealthy behaviours within online discussions has recently become an important field of research within social studies. With the rapid development of social media, and the increasing amount of available information and sources, rigorous manual analysis of such discourses has become unfeasible. Many approaches tackle the issue by studying the semantic and syntactic properties of discussions following a supervised approach, for example using natural language processing on a dataset labeled for abusive, fake or bot-generated content. Solutions that rely on the existence of a ground truth are limited to the domains where such a ground truth is available. However, within the context of misinformation, it may be difficult or even impossible to assign labels to instances. In this context, we consider the use of temporal dynamic patterns as an indicator of discussion health. Working in a domain for which ground truth was unavailable at the time (early COVID-19 pandemic discussions), we explore the characterization of discussions based on the volume and time of contributions. First we explore the types of discussions in an unsupervised manner, and then characterize these types using the concept of ephemerality, which we formalize. In the end, we discuss the potential use of our ephemerality definition for labeling online discourses based on how desirable, healthy and constructive they are.

As the volume of online content and discussions grows, the amount of misinformation grows with it. The most extreme type of misinformation (content created with malicious intent), which includes fabricated or manipulated data, can be automatically identified in certain domains (e.g., bot detection, image deep-fake analysis), and is the target of extensive research. On the other hand, less explicit types of misinformation (harmful content not necessarily produced for that purpose), such as misleading, biased or incomplete content, are much harder to characterise, and their characterization remains an open challenge. Moreover, this sort of misinformation is hard to quantify, as ground truth labels are typically unavailable. The SARS-CoV-2 pandemic illustrated how complicated it can be to identify informative debates in a context of uncertainty, and how necessary tools are for helping consumers manage the overflow of information they are subject to. This context raises the following questions: What discussions can be considered healthy? Do healthy discussions have any specific patterns or properties? Is it possible to define formal criteria for the healthiness of online discussions? To answer these questions, we collected a large number of tweets from the early COVID-19 pandemic period. Within this collection, we gather tweets belonging to a given thematic conversation (a topic) by defining a set of keywords specific to that theme. We characterize these topics by looking solely at the volume and time of their activity, as shown in Figure 1. Given these topic representations, we cluster them based on their similarity, trying to identify the main characteristics that make some topics different from others.
Next, under the hypothesis that the ephemerality of conversations is related to their quality, we formalize different measures of ephemerality, and then compute their values for the gathered topics. Lastly, we explore the relation between the ephemerality measures and the unsupervised clusters. Our results indicate that ephemeral topics represent a large family of short-lived, burst-like discussions (unlikely to be informative), while non-ephemeral topics correspond to sustained, persistent and argued discussions that last over time (with the potential to be informative and healthy).

Analyzing and increasing the quality of discussions has been a topic of research since the popularization of online platforms. At its most basic, analysis may come down to simply aggregating the events of interest and counting relevant intrinsic measures (like the number of views or reads). For instance, the altmetrics approach to analyzing the coverage of scientific publications on social media involves collecting and calculating different metrics across various platforms [14]. The metrics used in this case come directly from the studied platforms, i.e., the number of tweets and retweets for Twitter, the number of comments for Facebook, etc. One possible way to analyze online discussions in more depth is to analyze their participants. There are a number of works in this area, but it appears that although some global dynamics of discussions can show discernible patterns, the behaviour of individual users is more or less arbitrary. Moreover, a number of bot-detection techniques rely on this property, as bots, on the contrary, have detectable regularities in their behavior [10, 4, 6, 8]. Restricting the scope to the distribution and discussion of news articles also allows for more sophisticated analyses. One of the most important characteristics of online content studied in the literature is how much attention it attracts and what its temporal dynamics are. It appears that the amount of attention an online user can distribute between different pieces of content is a limited resource for which content creators compete, showing specific dynamics [11, 2]. A number of studies have focused on defining the types of news and the laws governing attention generation (for instance, baiting a high initial response, or building up an audience by providing in-depth insights on relevant topics) [3, 7, 5, 12, 9, 1]. The different temporal dynamics of discussion topics raise the question of whether we can say something about the content, its types and its quality (e.g., the aforementioned healthiness) based purely on its temporal dynamics [13]. The first stage in this direction is finding ways to mathematically compare different topics based purely on the shape of their attention curves (i.e., based on the dynamics of their number of tweets, views, reads, etc.). Afterwards, a vast array of methods may be used to analyze the topics' similarities and differences, such as time series analysis or clustering techniques. One of the main focuses of this paper is the latter: comparing the temporal distributions of various discussion topics and trying different clustering approaches to find common patterns and groups among them. Moreover, instead of analyzing topics only as a group, we might benefit from designing measures that estimate the characteristics of the interest or attention they attract.
In this paper we will focus on the ephemerality characteristic, originally proposed for YouTube videos [11, 2], which represents how long the video in question can keep viewers' attention.

The data collection process used in this work can be divided into three sequential stages. The first and second stages are designed to obtain a large, representative set of online discussions, one that includes a varied set of social behaviours in times of uncertainty. The third stage creates subsets of tweets belonging to a given topic, typically one that generates a discussion sustained through time. Let us define these stages in further detail:

1. Accessing and gathering online tweets through the Twitter API, using a query made up of 30 keywords related to COVID-19 (see Table 1). We used a MongoDB database for storage and a Python script accessing the Twitter stream for the gathering.

2. Indexing tweets to enable quick text search. This is necessary considering our dataset contains 829 million tweets spanning seven months (between August 15, 2020 and March 24, 2021). The global distribution of gathered tweets on a day-by-day basis is presented in Figure 2, with a mean of 3,735,678 tweets per day and a standard deviation of 667,488. For this we used a Solr database.

3. Extracting subsets of tweets corresponding to topics from the general collection of tweets. This is detailed in 3.1.

Besides the bodies of tweets, we also gathered all existing metadata. This includes the tweet author information and the number of times a tweet has been retweeted, quoted or replied to.

Once raw tweets have been gathered and stored, we proceed to extract sets of tweets corresponding to specific topics. We identify COVID-19-related controversies, misinformation topics and rumours from Poynter's database. To represent each of these topics, we manually composed a Solr query for each of them based on its description. The goal is to retrieve the tweets relevant to a topic using its most characteristic keywords. One such Solr query, for example, was written to extract the discussion topic around the combination of ibuprofen and COVID-19 (a sketch of the query format is given below). In total, we characterized 68 different misinformation topics within a seven-month period. See Table 2 for a sample of topics and their statistics. We represented each of those topics as an integer vector of 222 positions, where each element contains the number of tweets for that topic on the corresponding day. We call them topic distribution vectors (TDV), and intuitively they represent the attention dynamics of society for the corresponding topics.

According to the official Twitter documentation, the real-time filtered stream API provides data matching our original COVID-19 query continuously (each second), with the limitation that the amount of data returned by the query cannot be greater than 1% of the total number of tweets at that moment. In that case, Twitter sends a notification with the number of exceeding tweets. While the gathered amount is still significant in volume (see Figure 2), this may introduce a certain amount of arbitrary variance into our data. Furthermore, since we use one day as our atomic unit of measure, holidays, weekends and other special dates may add noise to the TDV. To mitigate the impact of variance and noise while minimizing the loss of precision, we apply a smoothing technique on the vectors: a three-day sliding-window average, which reduces the impact of weekly and weekend patterns. To equalize the volume of all vectors (e.g., some discussions may last longer and engage a larger number of contributions), we also normalize them so that the content of each vector sums up to one.
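As an illustration of the extraction and preprocessing steps, the following minimal Python sketch shows a Solr-style query string and the smoothing plus normalization of a TDV. The query string, the "text" field name and the keyword choices are hypothetical (the original example query did not survive extraction), and the counts are synthetic.

```python
import numpy as np

# Hypothetical Solr query in the spirit of the elided ibuprofen example;
# the field name and keywords are assumptions, not the authors' query.
ibuprofen_query = 'text:(ibuprofen AND (covid OR coronavirus OR "covid-19"))'

def smooth_and_normalize(daily_counts, window=3):
    """Three-day sliding-window average over daily tweet counts,
    followed by normalization so the vector sums to one."""
    counts = np.asarray(daily_counts, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(counts, kernel, mode="same")  # keeps the 222-day length
    return smoothed / smoothed.sum()

# Synthetic usage: daily counts for one topic over the 222-day period.
tdv = smooth_and_normalize(np.random.poisson(lam=50, size=222))
assert np.isclose(tdv.sum(), 1.0)
```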
Figure 3 shows some TDV after applying smoothing and normalization. The dotted purple line indicates the date with a significant change in the volume of tweets for the related topic. In the case of the (cure or vaccine) and covid topic, the volume change in November coincides with Pfizer's announcement of its vaccine. In the case of the flu and covid topic, there is a clear seasonal influence on the volume of tweets, as the flu is more common in winter. Lastly, the ivermectin and covid topic shows a change in volume in December, which may be related to a video clip posted online in which a medical doctor described ivermectin as a "wonder drug".

According to [11], the evolution of discussion volume over time is related to the reliability, quality and constructiveness of online debates. We explore this hypothesis by building a set of TDV which represent specific debates during the COVID-19 pandemic. Since we work in a domain without a reliable ground truth, and we wish to apply our findings to any dataset (regardless of size or expert review), we focus on an unsupervised approach, with the goal of understanding how temporal dynamics characterize the healthiness of topics. The first step in the process is to explore the notion of distance between pairs of TDV. In 4.1 we consider different metrics, which illustrates the need for a topic alignment policy; this is discussed in 4.2. Finally, we conduct a series of clustering experiments in 4.3, with which we seek to identify sets of discussions with distinct temporal behaviors.

As mentioned in §3, the nature of the data (the number of tweets related to a topic for each day within a period of 222 days) means that the minimum value of each element of a TDV is 0 and, because of normalization, its elements sum up to 1. This representation of the data is equivalent to the definition of a probability distribution function (i.e., the minimal probability of an outcome is 0, all values lie between 0 and 1, and the probabilities of all outcomes sum up to 1). Considering these similarities, we have adapted distance measures typically used for probability distributions in order to compare the temporal dynamics of TDV.

Sum of absolute differences (SAD). A straightforward method to define distances between topic vectors is to calculate the sum of absolute element-wise differences. Its lower bound is naturally 0 (i.e., two topics which perfectly overlap), and if we scale it by a coefficient of 0.5, its upper bound becomes 1, since the maximal difference between two vectors that each sum up to 1 is 2 (i.e., two topics with no overlap). Let t_i and t_j be two topic vectors of length M, and let t_{im} denote the m-th element of vector t_i. Then we can define the SAD distance as:

SAD(t_i, t_j) = \frac{1}{2} \sum_{m=1}^{M} |t_{im} - t_{jm}|

The main drawback of this distance metric is its strong connection to exact dates. Even if two topics have identical tweet-count curves, the distance may be large if they are shifted temporally. This happens naturally in our dataset, as discussion topics start or end on independent dates. Thus, before SAD can be used, we need to align the topic vectors, as discussed in §4.2. Furthermore, this metric is very sensitive to noise and perturbations in the shapes of topic distributions (e.g., isolated outliers carry a lot of weight). For this reason, we also consider other distances which are more resilient to spurious variations.
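A direct implementation of SAD as defined above, as a minimal sketch:

```python
import numpy as np

def sad(t_i, t_j):
    """Sum of absolute differences between two normalized, aligned TDVs,
    scaled by 0.5 so the result is bounded to [0, 1]."""
    return 0.5 * float(np.sum(np.abs(np.asarray(t_i) - np.asarray(t_j))))

# Two non-overlapping single-burst topics are at maximal distance 1.
assert sad([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]) == 1.0
```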
Kolmogorov-Smirnov statistic adaptation (KS). In its original form, the Kolmogorov-Smirnov (KS) statistic is calculated for two empirical distribution functions (EDF), or for an EDF and a cumulative distribution function (CDF), to determine whether the underlying distributions are the same. Essentially, the KS statistic equals the maximal absolute difference between the components of two EDFs, and is bounded to [0, 1]. Since the normalized topic vectors closely resemble empirical probabilities, we may perform the same calculation on them. In order to get an adapted KS distance we need to turn the topic vectors t_i and t_j into cumulative vectors \hat{t}_i and \hat{t}_j. Then, KS is calculated as follows:

KS(t_i, t_j) = \max_{1 \le m \le M} |\hat{t}_{im} - \hat{t}_{jm}|

KS is less sensitive to noise and small perturbations than SAD because it focuses on the differences between the maximal accumulated tweet masses of the topics. Like SAD, KS suffers from the temporal misalignment of topics.

Hellinger distance adaptation (HDA). The Hellinger distance is a metric used to evaluate the difference between two probability distributions by computing the difference between the square roots of the probabilities:

HDA(t_i, t_j) = \frac{1}{\sqrt{2}} \sqrt{\sum_{m=1}^{M} \left(\sqrt{t_{im}} - \sqrt{t_{jm}}\right)^2}

This metric is bounded within the [0, 1] interval, where 0 means that the probability distributions are equal and 1 that they are totally different. The value 1 is achieved when, for every positive probability in one of the distributions, the other has probability 0. One problem with this metric is that the use of square roots makes it sensitive to values close to 0, making the days with the least activity the most relevant for the distance.

Norm of difference of squares (NDS). Just as the Hellinger distance increases the importance of the low-value elements of the topic vectors by using square roots, we consider the opposite approach: lowering the significance of low values by using powers. This way the values of the vectors' elements get pushed towards 0, and the lower the value, the more it shrinks:

NDS(t_i, t_j) = \sqrt{\sum_{m=1}^{M} \left(t_{im}^2 - t_{jm}^2\right)^2}

In this case the metric focuses more on areas with high peaks and less on areas with low perturbations. This metric is bounded to the [0, 1] interval and, like the previous metrics, depends on vector alignment, but it serves our purposes well because low values in the TDV can be related to noise and add little information about the characteristics of the discussion.
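The three alternative distances can be sketched as follows; the L2 norm in NDS is our assumption, consistent with the stated [0, 1] bound.

```python
import numpy as np

def ks(t_i, t_j):
    """Adapted Kolmogorov-Smirnov statistic: maximal absolute difference
    between the cumulative versions of two normalized TDVs."""
    return float(np.max(np.abs(np.cumsum(t_i) - np.cumsum(t_j))))

def hda(t_i, t_j):
    """Hellinger distance adaptation, bounded to [0, 1]."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(t_i) - np.sqrt(t_j)) ** 2)))

def nds(t_i, t_j):
    """Norm of the element-wise difference of squares; damps low-activity
    (noisy) days and emphasizes high peaks. The L2 norm is assumed."""
    return float(np.linalg.norm(np.asarray(t_i) ** 2 - np.asarray(t_j) ** 2))
```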
As has been observed, most of the proposed distance measures require the topics to be aligned temporally. Otherwise, the distance between vectors will be mostly derived from their relative time of occurrence, which is unrelated to our goal (i.e., characterizing the healthiness of discussions regardless of when they happen in time). To overcome this issue, we next propose different methods to align the TDV.

Highest peak (max-alignment). One of the simplest ways to align topics is to shift them, with zero-padding, so that the maximal element of each topic (i.e., the day with most activity) corresponds to the same index. The main issue with max-alignment is its arbitrary nature. Since we can make no assumptions about the data distribution, the highest peak can easily shift from one extreme of the temporal range to the other (consider a 'U'-shaped topic distribution, and how drastically the metric would be affected by aligning on either the first or the last peak).

Center of mass (mean-alignment). Another approach used to align different probability mass functions is to compute the center of mass of each TDV and align on those. The center of mass of a TDV is the mean of the dates weighted by their tweet frequency values, i.e., the sum of each element (the frequency for a date) multiplied by its position in the vector, normalized by the total mass of the vector. This is a more robust solution than the highest peak, since the center of mass does not change as dramatically when a few data samples are added to or removed from the topic. Nonetheless, misalignment is still possible, as the same center of mass can correspond to two vectors with very different shapes: for instance, a TDV whose center of mass is defined by a single burst of tweets in the middle of the vector and a TDV with two bursts at equal distances from that point will have the same center of mass, yet remain misaligned.

Pairwise exhaustive alignment. Finally, it is possible, albeit computationally demanding, to find the alignment that minimizes the distance between each pair of topics. This requires exhaustively computing the distance between a pair of vectors for every possible alignment. Considering the simplicity of our data (i.e., 222-element vectors), computing a pairwise exhaustive alignment is feasible. Notice that this solution may use a different alignment for each distinct pair of vectors. An example of the different alignments is shown in Figure 4 for two topics whose bursts happen on different days.

Due to the limitations of max-alignment and mean-alignment, and since pairwise exhaustive alignment with vectors of length 222 is computationally feasible, we have chosen the latter for the analysis. This means that pairwise exhaustive alignment is used with the NDS distance. The resulting minimal distances are then used to perform clustering, with the goal of obtaining clusters that can be representative of healthy and unhealthy conversations.
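A minimal sketch of pairwise exhaustive alignment; the zero-padding of shifted-out days is an assumption of this sketch, not a detail confirmed by the text:

```python
import numpy as np

def aligned_distance(t_i, t_j, dist):
    """Minimal distance between two TDVs over every relative shift of t_j,
    zero-padding the shifted-out days so each shift is a pure translation."""
    t_i = np.asarray(t_i, dtype=float)
    t_j = np.asarray(t_j, dtype=float)
    m = len(t_i)
    best = np.inf
    for shift in range(-(m - 1), m):
        shifted = np.roll(t_j, shift)
        if shift > 0:
            shifted[:shift] = 0.0   # remove the wrapped-around part
        elif shift < 0:
            shifted[shift:] = 0.0
        best = min(best, dist(t_i, shifted))
    return best
```

For 222-day vectors this evaluates 443 shifts per pair; over 68 topics (2,278 pairs) that stays in the order of a million distance evaluations, which is easily feasible.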
In order to characterise the different topics based on their temporal dynamics, and without any assumption regarding their validity, we perform clustering using the computed distances. As mentioned, we use the NDS distance on the exhaustively aligned vectors, after normalizing and smoothing them with a three-day sliding window. The chosen clustering method is HDBSCAN, a density-based, hierarchical method that uses a precomputed distance matrix to find the best number of clusters. This algorithm was chosen because it permits the use of a distance matrix instead of a raw dataset and provides the optimal number of clusters, with the only additional requirement, apart from the distance matrix, of choosing a minimal number of elements per cluster (which we set to 3). HDBSCAN internally computes the minimum spanning tree and converts it into a hierarchy of connected components, sorting the distances of the tree in increasing order from individual points up to a single cluster. With the obtained hierarchy, and using the minimal number of elements required to form a cluster, the hierarchy tree becomes much smaller. The last step is to extract the clusters. HDBSCAN internally defines the stability of each cluster as the sum, over its points, of the time each point belongs to the cluster; this is computed as the difference between the inverse of the distance at which a point stops being part of the cluster and the inverse of the distance at which it started to belong to it. If the stability of a cluster is greater than that of its child clusters, it becomes a candidate for selection. Once the root is reached, we obtain the clusters of our distance matrix as those with maximum stability.
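With a precomputed distance matrix, the clustering step reduces to a few lines using the hdbscan package; the file name below is hypothetical.

```python
import numpy as np
import hdbscan  # the hdbscan PyPI package

# Hypothetical file holding the 68x68 symmetric matrix of minimal
# pairwise NDS distances between exhaustively aligned TDVs.
distances = np.load("topic_nds_distances.npy")

clusterer = hdbscan.HDBSCAN(metric="precomputed", min_cluster_size=3)
labels = clusterer.fit_predict(distances.astype(np.float64))
# labels holds one cluster id per topic; -1 marks outlier topics.
```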
Results show three communities: two with distinct visual features and sizes (51 and 6 topics), and one containing outlier topic vectors. The two distinct clusters correspond to topics with one large peak or burst of activity (the small one) and topics that are generally more uniform in their distribution of activity (the large one). The outlier cluster includes everything that does not fit these two definitions. The clusters obtained by HDBSCAN are shown in Figure 5, along with some additional statistics. The color of each topic denotes its normalized distance to the cluster centroid (0 being the cluster center and 1 the most distant cluster member). Visually, burst topics seem to be characterized by a large, central peak of activity surrounded by an area of diminished contributions. On the other hand, uniform topics show a more balanced distribution. Notice how, within the uniform cluster, the topics with a bigger peak are those furthest from the cluster centroid (i.e., those in a clearer color, deeper into the figure). These borderline cases further reinforce the intuitive difference between the burst and uniform clusters.

After having identified two distinct clusters of online discussions based on the topic distribution vectors, we visually hypothesize that these clusters are composed of burst-like topics on one hand and more uniform topics on the other. To reinforce this notion, while exploring the definition of the term, we now consider the ephemerality of topics and its relation to the found clusters. Among the works exploring online discussions, a few have focused on their temporal aspects, i.e., on how discussions evolve over time in volume and quality. In this context, a concept of particular relevance is ephemerality [11, 2]. Originally proposed for the analysis of attention in online videos, ephemerality describes how much time it takes to accumulate most (empirically set to 80%) of the attention or interactions, with respect to the total attention gathered. Considering the volume of our database (counting millions of tweets per day), ephemerality analysis, which relies on voluminous data, is an appropriate approach. In our context, ephemerality is based on the number of days between the first appearance of tweets for a topic and the day when that topic accumulates 80% of all its tweets. This interpretation of ephemerality measures the temporal distribution of the tweets generated. Since there is no formal definition of ephemerality for our context, we first propose some alternatives.

Let us now formally define our ephemerality measures. Let t_i denote the normalized frequency vector of topic i, of length M; let t_{im} denote its m-th element. Following the original definition of ephemerality [11, 2], for our data it would be formalized as follows:

\varepsilon_1(t_i) = 1 - \frac{\min\{n : \sum_{m=a_i}^{n} t_{im} \ge 0.8\} - a_i + 1}{b_i - a_i + 1}

where a_i and b_i denote the first and last days with activity for topic i. Here we compute the proportion of time taken by a topic to reach 80% of its activity, with respect to the topic's period of activity. We then subtract the result from 1 to compute ephemerality (the closer to 1, the more ephemeral). This definition allows us to compare the ephemeralities of topics of different durations. The main drawback of this approach is its sensitivity to outliers. A single early tweet matching the topic query keywords will set the starting date of the topic and heavily influence the proportion of time taken to reach 80% of activity. Furthermore, this approach may assign different ephemerality scores to two burst topics if their corresponding bursts happen close to the beginning or close to the end of the topic lifespan.

One way to limit the impact of both factors is to filter out data at both ends of the topic distribution vector, for example by removing 10% of tweets from either side and then analyzing the (relative) length of the "middle" section of the discussion. According to this definition, ephemerality values are limited between 0.2 (least ephemeral) and 1 (most ephemeral). Zero ephemerality is unreachable, since we are discarding 20% of the activity in total (10% at each end), making \varepsilon_2 = 1 - 0.8/1 = 0.2 the minimum. While this solution is more resistant to the relative position of bursts (i.e., burst topics with the peak near the beginning or the end will score similarly) and to the arbitrary occurrence of the first and last activity, it may be affected by multiple bursts (i.e., having an early and a late burst will result in low ephemerality). To fix that, we consider a definition of ephemerality which gets rid of the temporal dimension, sorting the topic frequency vector in descending order of activity and calculating the ephemerality on that sorted vector:

\varepsilon_4(t_i) = 1 - \frac{\min\{n : \sum_{m=1}^{n} \bar{t}_{im} \ge 0.8\}}{M}

where \bar{t}_i denotes the array of tweet frequencies for topic i sorted in descending order. Using this form of ephemerality breaks the connection between the frequencies and specific days, and allows us to see whether the discussion contained enough days with enough tweets. The limitation of this approach is that it considers only the proportion of days with most activity, regardless of their relative position (i.e., ephemerality is the same for a topic where all activity happens in four burst days, regardless of whether these days are close or far from one another in time).

Considering the properties and limitations of ε_2 and ε_4, and how they complement each other, we decide to use both. A correlation analysis shows that the two metrics provide distinct information (see Figure 6). While ε_2 focuses on the length of the central part of the distribution, it is insensitive to the shape of that middle section. On the other hand, ε_4 counts in how many days the topic concentrated its activity, but it does so at the cost of properly detecting the temporal length of the discussion. Table 3 shows how the combinations of ephemerality values can be interpreted, and Figure 7 shows samples of topics falling within these categories.

Table 3: The shape of a topic's frequency vector based on its ephemerality values

              | ε_4 is low                  | ε_4 is high
ε_2 is low    | Uniform and sustained topic | Rollercoaster topic
ε_2 is high   | -                           | Single burst/peak topic

Here, by burst topics we mean topics with a single high peak lasting for one or several days, but comparatively short with respect to the topic duration; and by rollercoaster topics, those with at least two bursts separated by periods of low activity.
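A minimal sketch of the three measures; the exact trimming and normalization details of ε_2 (and hence ε_4) are our reading of the text, chosen to reproduce the stated [0.2, 1] range.

```python
import numpy as np

def eps1(t):
    """Original ephemerality: one minus the fraction of the topic's active
    period needed to accumulate 80% of its activity."""
    t = np.asarray(t, dtype=float)
    active = np.nonzero(t)[0]
    first, last = active[0], active[-1]
    days_to_80 = np.searchsorted(np.cumsum(t[first:last + 1]), 0.8) + 1
    return 1.0 - days_to_80 / (last - first + 1)

def eps2(t, trim=0.1):
    """Trimmed variant: discard 10% of the activity mass at each end and
    measure the relative length of the remaining middle section."""
    cum = np.cumsum(np.asarray(t, dtype=float))
    lo = np.searchsorted(cum, trim)        # day reaching 10% of the mass
    hi = np.searchsorted(cum, 1.0 - trim)  # day reaching 90% of the mass
    return 1.0 - (1.0 - 2 * trim) * (hi - lo + 1) / len(cum)

def eps4(t):
    """Sorted variant: the same computation on the activity vector sorted
    in descending order, ignoring when the active days occur."""
    return eps2(np.sort(np.asarray(t, dtype=float))[::-1])

# A one-day burst is far more ephemeral than a uniform topic.
burst = np.zeros(222); burst[100] = 1.0
uniform = np.full(222, 1.0 / 222)
assert eps2(burst) > eps2(uniform)
```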
We explore the relation between the ephemeralities defined in this section and the clusters found in 4.3. Statistics can be seen in Figure 5. Cluster 1 includes topics with lower ε_4 values (mean 0.318, std 0.119) than Cluster 2 topics (mean 0.432, std 0.117). Meanwhile, ε_2 values seem rather similar between both clusters (mean 0.293 vs. mean 0.282), although Cluster 1 shows a higher variance (std 0.101 vs. std 0.045). According to these statistics, Cluster 2 contains the most burst-like topics, and Cluster 1 the most uniform ones. However, the difference between single-burst and rollercoaster topics is not properly captured by these two clusters.

In this paper we aimed at analyzing online discussions on various topics using only non-semantic temporal information about the user attention they receive. Our objective was to determine which discussions can be considered healthy, whether it is possible to extract this information from the aforementioned attention distributions, and, if so, what formal measures could estimate it. The work resulted in two proposals. First of all, our experiments show that, for the purpose of clustering with the HDBSCAN method, the best quality of the produced clusters is achieved with the proposed NDS distance measure applied to pairwise-aligned distribution vectors. Normalization is needed so that the topic distribution vectors can be compared using metrics designed for probability distributions. Smoothing is optional, although one has to be careful, as it might decrease the magnitude of peaks, which in turn would move some bursty topics into "uniform" clusters. The NDS distance measure was chosen because it is insensitive to the low frequency values that can be related to noise. Because an alignment is needed, of the three proposed alignments, pairwise exhaustive alignment was chosen, as it provides the minimal distance between topics under the chosen metric. Secondly, we have formalized the notion of ephemerality, which estimates the ability of a discussion topic to maintain users' attention. We have studied different types of ephemerality measures, aimed at different aspects of what might make a topic ephemeral; ε_3 showed the best results as a standalone measure, with the combination of ε_2 and ε_4 (the former focusing on topics that last longer, the latter on topics with a higher number of high-attention days) being a close second. Although the current results are promising, there are a number of future research lines:

• First of all, it may be beneficial to ensure that the discussion chains of tweets are kept as intact as possible (which will require additional parsing of Twitter data).
• Secondly, more clustering methods can be tested on this use case.
• Thirdly, the ephemerality threshold need not be set specifically to 80%. It might prove useful to try other values or to implement some form of dynamic threshold.
References

[1] The pulse of news in social media: Forecasting popularity
[2] Junk News Bubbles: Modelling the Rise and Fall of Attention in Online Arenas
[3] Characterizing the Life Cycle of Online News Stories Using Social Media Reactions
[4] DeBot: Twitter bot detection via warped correlation
[5] Robust dynamic classes revealed by measuring the response function of a social system
[6] Social fingerprinting: detection of spambot groups through DNA-inspired behavioral modeling
[7] Meme-Tracking and the Dynamics of the News Cycle
[8] HoloScope: Topology-and-spike aware fraud detection
[9] Accelerating dynamics of collective attention
[10] RTbust: Exploiting Temporal Patterns for Botnet Detection on Twitter
[11] From Fake to Junk News, the Data Politics of Online Virality
[12] Novelty and collective attention
[13] Patterns of Temporal Variation in Online Media
[14] General discussion of data quality challenges in social media metrics: Extensive comparison of four major altmetric data aggregators