title: Political audience diversity and news reliability in algorithmic ranking
authors: Bhadani, Saumya; Yamaya, Shun; Flammini, Alessandro; Menczer, Filippo; Ciampaglia, Giovanni Luca; Nyhan, Brendan
date: 2020-07-16
journal: Nature Human Behaviour
DOI: 10.1038/s41562-021-01276-5

Abstract: Newsfeed algorithms frequently amplify misinformation and other low-quality content. How can social media platforms more effectively promote reliable information? Existing approaches are difficult to scale and vulnerable to manipulation. In this paper, we propose using the political diversity of a website's audience as a quality signal. Using news source reliability ratings from domain experts and web browsing data from a diverse sample of 6,890 U.S. citizens, we first show that websites with more extreme and less politically diverse audiences have lower journalistic standards. We then incorporate audience diversity into a standard collaborative filtering framework and show that our improved algorithm increases the trustworthiness of websites suggested to users -- especially those who most frequently consume misinformation -- while keeping recommendations relevant. These findings suggest that partisan audience diversity is a valuable signal of higher journalistic standards that should be incorporated into algorithmic ranking decisions.

[...] of checking each individual piece of content. Unfortunately, many of these methods are hard to scale to large groups and/or depend on context-specific information about the type of content being generated. For example, methods for assessing the credibility of content on Wikipedia often assume content is organized as a wiki. As a result, they are not easily applied to news content recommendations on social media platforms. Another approach is to evaluate the quality of articles directly [54], but scaling such an approach would likely be costly and cause lags in the evaluation of novel content. Similarly, while crowdsourced website evaluations have been shown to be generally reliable in distinguishing between high- and low-quality news sources [39], the robustness of such signals to manipulation has yet to be demonstrated. Building on the literature about the benefits of diversity at the group level [25, 46], we propose using the partisan diversity of the audience of a news source as a signal of its quality. This approach has two key advantages. First, audience partisan diversity can be computed at scale, provided that information about the partisanship of users is available or can be inferred in a reliable manner. Second, because diversity is a property of the audience and not of its level of engagement, it is less susceptible to manipulation if one can detect inauthentic partisan accounts [44, 49, 52, 53]. These two conditions (inferring partisanship reliably and preventing abuse by automated amplification/deception) could easily be met by the major social media platforms, which have routine access to a wealth of signals about their users and their authenticity. We evaluate the merits of our proposed approach using data from two sources: a comprehensive data set of web traffic history from 6,890 Americans, collected along with surveys of self-reported partisan information from respondents in the YouGov Pulse survey panel, and a data set of 3,765 news source reliability scores compiled by trained experts in journalism and provided by NewsGuard [37].
We first establish that domain pageviews are not associated with overall news reliability, highlighting the potential problem with algorithmic recommendation systems that rely on popularity and related metrics of engagement. We next define measures of audience partisan diversity and show that these measures correlate with news reliability better than popularity does. Finally, we study the effect of incorporating audience partisan diversity into algorithmic ranking decisions. When we create a variant of the standard collaborative filtering algorithm that explicitly takes audience partisan diversity into account, our new algorithm provides more trustworthy recommendations than the standard approach with only a small loss of relevance, suggesting that reliable sources can be recommended without the risk of jeopardizing user experience. These results demonstrate that diversity in audience partisanship can serve as a useful signal of news reliability at the domain level, a finding that has important implications for the design of content recommendation algorithms used by online platforms. Although the news recommendation technologies deployed by platforms are more sophisticated than the approach tested here, our results highlight a fundamental weakness of algorithmic ranking methods that prioritize content that generates engagement and suggest a new metric that could help improve the reliability of the recommendations provided to users. To motivate our study, we first demonstrate that the popular news content that algorithmic recommendations often highlight is not necessarily reliable. To do so, we assess the relationship between source popularity and news reliability. We measure source popularity using the YouGov Pulse traffic data. Due to skew in audience size among domains, we transform these data to a logarithmic scale. In practice, we measure the popularity of a source in two ways: as the (log) number of users and as the (log) number of visits, or pageviews. News reliability is instead measured using NewsGuard scores (see Methods A). Fig. 1 shows that the popularity of a news source is at best weakly associated with its reliability. (In Fig. 1, domains for which we have NewsGuard reliability scores [37] are shaded in blue, with darker shades indicating lower scores; domains with no available score are plotted in gray.) At the user level (left pane), the overall Pearson correlation is r = 0.03 (two-sided p = 0.36). At the pageview level (right pane), r = 0.05 (two-sided p = 0.12). The association between the two variables remains weak even if we divide sources based on their partisanship. When measuring popularity at the user level, websites that have a predominantly Democratic audience show a significant positive association (r = 0.09, two-sided p = 0.02), but for websites with a Republican audience the correlation is negative and not significant at conventional standards (r = −0.12, two-sided p = 0.06). A similar pattern holds at the pageview level: a weak positive association for websites with predominantly Democratic audiences (r = 0.08, two-sided p = 0.02) and a negative but not significant association for those with predominantly Republican audiences (r = −0.06, two-sided p = 0.34). Overall, these results suggest the strength of association between the two variables is quite weak.
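To make this analysis concrete, the following sketch computes domain-level popularity from a browsing log and correlates its logarithm with NewsGuard scores. It is a minimal illustration rather than the code used for the paper; the DataFrame layout and column names (user_id, domain, newsguard_score) are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def popularity_reliability_correlation(visits: pd.DataFrame,
                                        scores: pd.DataFrame) -> dict:
    """Correlate log-popularity with NewsGuard reliability scores.

    `visits` is assumed to have one row per pageview with columns
    ['user_id', 'domain']; `scores` maps 'domain' to 'newsguard_score'.
    """
    # Domain-level popularity: number of unique visitors and total pageviews.
    pop = visits.groupby("domain").agg(
        users=("user_id", "nunique"),
        pageviews=("user_id", "size"),
    ).reset_index()

    # Popularity is heavily skewed, so work on a log scale.
    pop["log_users"] = np.log(pop["users"])
    pop["log_pageviews"] = np.log(pop["pageviews"])

    # Keep only domains with an expert reliability rating.
    merged = pop.merge(scores, on="domain", how="inner")

    out = {}
    for col in ("log_users", "log_pageviews"):
        r, p = pearsonr(merged[col], merged["newsguard_score"])
        out[col] = {"pearson_r": r, "p_value": p}
    return out
```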
In contrast, we observe that sites with greater audience partisan diversity tend to have higher NewsGuard scores, while those with lower levels of diversity, and correspondingly more homogeneous partisan audiences, tend to have lower reliability scores. As our primary metric of diversity, we selected from a range of alternative definitions (see Methods B) the variance of the partisanship distribution. As Fig. 2 indicates, unreliable websites with very low NewsGuard scores are concentrated in the tails of the distribution, where partisanship is most extreme and audience partisan diversity is, by necessity, very low. This relationship is not symmetrical: low-reliability websites (whose markers are darker shades of blue in the figure) are especially concentrated in the right tail, which corresponds to websites with largely Republican audiences. The data in Fig. 2 also suggest that the reliability of a website may be associated not just with the variance of the distribution of audience partisanship slants, but also with its mean. To account for this, we first compute the coefficient of partial correlation between NewsGuard reliability scores and the variance of audience partisanship given the mean audience partisanship of each website. Compared with popularity, we find a stronger (and significant) correlation regardless of whether mean partisanship and audience partisan diversity are calculated by weighting individual audience members equally (user level, left panel: partial correlation r = 0.38, two-sided p < 10^-4) or by how often they visited a given site (pageview level, right panel: partial correlation r = 0.22, two-sided p < 10^-4). Aside from mean partisanship, a related but potentially distinct confounding factor is the extremity of the partisanship slant distribution (i.e., the distance of the average partisanship of a website's visitors on the 1-7 scale from the midpoint of 4, which represents a true independent). We thus computed partial correlation coefficients again, this time holding the ideological extremity of website audiences constant instead of the mean. Our results are consistent using this approach (user level: r = 0.26, p < 10^-4; pageview level: r = 0.15, p < 10^-4; both tests are two-sided). We study the diversity-reliability relationship in more detail in Fig. 3, which differentiates between websites with audiences that are mostly Republican and those with audiences that are mostly Democratic. Consistent with what we report above, Fig. 3 shows that audience partisan diversity is positively associated with news reliability. Again, this relationship holds both when individual audience members are weighted equally (user level, left panel) and when they are weighted by their number of accesses (pageview level, right panel), though the association is stronger at the user level (standardized OLS coefficient: β = 6.67 (0.58) at the user level; β = 3.91 (0.71) at the pageview level).
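The partial correlations reported above can be illustrated with a short sketch that computes, per domain, the mean and variance of visitors' partisanship and then correlates the variance with reliability after residualizing on mean partisanship. The residualization approach is a textbook formulation of partial correlation, assumed here for illustration; the column names are again hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def audience_stats(visits: pd.DataFrame) -> pd.DataFrame:
    """Per-domain mean and variance of visitors' partisanship (1-7 scale).

    `visits` is assumed to hold one row per (user, domain) pair with a
    'partisanship' column; weighting every visitor equally gives the
    user-level measures described in the text.
    """
    return visits.groupby("domain")["partisanship"].agg(
        mean_partisanship="mean",
        diversity="var",          # variance is the primary diversity metric
        n_visitors="size",
    ).reset_index()

def partial_corr(x, y, control):
    """Partial correlation of x and y given a control variable.

    Implemented by regressing x and y on the control and correlating
    the residuals (a standard formulation, used here for illustration).
    """
    z = np.column_stack([np.ones(len(control)), control])
    rx = x - z @ np.linalg.lstsq(z, x, rcond=None)[0]
    ry = y - z @ np.linalg.lstsq(z, y, rcond=None)[0]
    return pearsonr(rx, ry)

# Example usage (assumed inputs):
# stats = audience_stats(visits).merge(scores, on="domain")
# stats = stats[stats["n_visitors"] >= 30]
# r, p = partial_corr(stats["diversity"].to_numpy(dtype=float),
#                     stats["newsguard_score"].to_numpy(dtype=float),
#                     stats["mean_partisanship"].to_numpy(dtype=float))
```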
In addition, we find that the relationship is stronger for sites whose average visitor identifies as a Republican (standardized OLS coefficient of Republican domains: β = 13.1 (1.59) at the user level; β = 9.61 (2.20) at the pageview level) than for those whose average visitor identifies as a Democrat. In fact, the association between diversity and NewsGuard reliability scores is consistent even when controlling for popularity (user level: r = 0.34, two-sided p < 10^-4; pageview level: r = 0.17, two-sided p < 10^-4), suggesting that diversity could contribute to detecting quality over and above the more typical popularity metrics used by social media algorithms. However, the previous analysis of Fig. 3 shows that the overall relationship masks significant heterogeneity between websites with mostly Republican or Democratic audiences. To tease apart the contributions of popularity from those of partisanship, we estimate a full multivariate regression model. After controlling for both popularity and political orientation, we find qualitatively similar results. Full regression tables can be found in the Supplementary Materials. As mentioned before, variance in audience partisanship is not the only possible way to define audience partisan diversity; alternative definitions can be used (e.g., entropy; see Methods B). As a robustness check, we therefore consider a range of alternative definitions of audience partisan diversity and obtain results that are qualitatively similar to the ones presented here, though results are strongest for variance (see Supplementary Materials). To understand the potential effects of incorporating audience partisan diversity into algorithmic recommendations, we next consider how recommendations from a standard user-based collaborative filtering (CF) algorithm [29, 41] change if we include audience partisan diversity as an additional signal. We call this modified version of the algorithm CF+D, which stands for Collaborative Filtering + Diversity (see Methods C for a formal definition). In classic CF, users are presented with recommendations drawn from a set of items (in this case, web domains) that have been "rated" highly by those other users whose tastes are most similar to theirs. Lacking explicit data about how a user would "rate" a given web domain, we use a quantity derived from the number of user pageviews to a domain (based on TF-IDF; see also Methods C) as the rating. To evaluate our method, we follow a standard supervised learning workflow. We first divide the web traffic data for each user in the YouGov Pulse panel into training and testing sets by domain (see Methods D). We then compute similarities in traffic patterns between users for all domains in the training set (not just news websites) and use the computed similarities to predict the aforementioned domain-level pageview metric on the test set. The domains that receive the highest predicted ratings (i.e., expected TF-IDF-transformed pageviews) are then selected as recommendations. As a robustness check, we obtain consistent results if we split the data longitudinally instead of randomly (i.e., as a forecasting exercise; see Supplementary Materials for details). Note that if a user has not visited a domain, then the number of visits for that domain will be zero. In general, due to the long tail in user interests [13], we cannot infer that the user has a negative preference toward a website just because they have not visited it. The user may simply be unaware of the site. We therefore follow standard practice in the machine learning literature in only evaluating recommendations for content for which we have ratings (i.e., visits in the test set), though in practice actual newsfeed algorithms rank items from a broader set of inputs, which typically includes content the user may not have seen (for example, content shared by friends [5]).
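As an illustration of the per-user split just described, a minimal sketch follows; the column names and the seed are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
import pandas as pd

def split_by_domain(visits: pd.DataFrame, train_frac: float = 0.7,
                    seed: int = 0):
    """Randomly split each user's visited domains into train/test sets.

    `visits` is assumed to have columns ['user_id', 'domain', 'pageviews']
    with one row per (user, domain) pair. The split is done independently
    for every user, so a domain can be in one user's training set and in
    another user's test set.
    """
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for _, user_rows in visits.groupby("user_id"):
        shuffled = user_rows.sample(
            frac=1.0, random_state=int(rng.integers(0, 2**31 - 1)))
        n_train = int(round(train_frac * len(shuffled)))
        train_parts.append(shuffled.iloc[:n_train])
        test_parts.append(shuffled.iloc[n_train:])
    return pd.concat(train_parts), pd.concat(test_parts)
```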
To produce recommendations for a given user, we consider all the domains visited by the user in the test set for which ratings are available from one or more respondents in a neighborhood of most similar users (domains with no neighborhood rating are discarded since neither CF nor CF+D can make a prediction for them; see Methods C) and for which we have a NewsGuard reliability score. We then rank those domains by their rating computed using either CF or CF+D. This process produces a ranked list of news domains and reliability scores from both the standard CF algorithm and the CF+D algorithm, which has been modified to incorporate the audience partisan diversity signal. We evaluate these lists using two different measures of trustworthiness, which are computed for the top k domains in each list: the mean score (a number in the 0-100 range) and the proportion of domains with a score of 60 or higher, which NewsGuard classifies as indicating that a site "generally adheres to basic standards of credibility and transparency" [37]. As an additional point of comparison, we also rank domains by their global popularity. This baseline does not include any local information about user-user similarities, and thus can be seen as a "global" measure of popularity with no contribution from user personalization (see Methods). We observe in Fig. 4 that the trustworthiness of recommendations produced by CF+D is significantly better than standard CF recommendations, global popularity recommendations, and baseline statistics from user behavior. In particular, CF produces less trustworthy rankings than both the recommendations based on global popularity and those based on user visits (for small values of k the difference is within the margin of error). In contrast, CF+D produces rankings that are more trustworthy than CF and either baseline (global popularity or actual visits) across different levels of k. These results suggest that audience partisan diversity can provide a valuable signal to improve the reliability of algorithmic recommendations. Of course, the above exercise would be meaningless if our proposed algorithm recommended websites that do not interest users. Because CF+D alters the set of recommended domains to prioritize those visited by more diverse partisan audiences, it may be suggesting sources that offer counter-attitudinal information or that users do not find relevant. In this sense, CF+D could represent an audience-based analogue of the topic diversification strategy from the recommender systems literature [55]. If so, a loss of predictive ability would be expected. To provide intuition about the contribution of popularity in recommendations, the left panel of Fig. 5 also shows the precision of the naïve baseline obtained by ranking items by their global popularity. This baseline outperforms CF and CF+D, but at the price of providing the same set of recommendations to all users (i.e., the results are not personalized) and of providing recommendations of lower trustworthiness (Fig. 4). Note that the RMSE cannot be computed for this baseline because this metric requires knowledge of the rating of a domain, not just of its relative ranking. Our results are generally encouraging. In both cases, precision is low and RMSE is high for low values of k, but error levels start to stabilize around k = 10, which suggests that making correct recommendations for shorter lists (i.e., k < 10) is more challenging than for longer ones. Moreover, when we compare CF+D with CF, accuracy declines slightly for CF+D relative to CF, but the difference is not statistically significant for all but small values of k, suggesting that CF+D is still capable of producing relevant recommendations.
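For concreteness, the sketch below mirrors the Fig. 4-style evaluation: for each user, rank the test-set domains by predicted rating and compute the mean NewsGuard score and the share of "Green" (score ≥ 60) domains among the top k, then average across users. The data structures (dictionaries of predicted ratings and of scores) are assumptions made for illustration.

```python
import numpy as np

def trustworthiness_at_k(predicted: dict, scores: dict, k: int):
    """Mean NewsGuard score and share of 'Green' (>= 60) domains in the
    top-k recommendations, averaged over users.

    `predicted` is assumed to map user -> {domain: predicted rating} for
    domains in that user's test set that also have a NewsGuard score;
    `scores` maps domain -> NewsGuard score (0-100).
    """
    mean_scores, green_shares = [], []
    for user, ratings in predicted.items():
        if len(ratings) < k:
            continue  # only users with >= k recommendations, as in the text
        top_k = sorted(ratings, key=ratings.get, reverse=True)[:k]
        s = np.array([scores[d] for d in top_k], dtype=float)
        mean_scores.append(s.mean())
        green_shares.append((s >= 60).mean())
    return float(np.mean(mean_scores)), float(np.mean(green_shares))
```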
Re-ranking items by diversity has minimal effects on predictive accuracy, but how does it affect user satisfaction? The recommendations produced by CF+D would be useless if users did not find them engaging. Unfortunately, we lack data about user satisfaction in the YouGov panel: our primary metric (the log number of website visits) cannot be interpreted as a pure measure of satisfaction, since other factors also shape the decision by users in the YouGov panel to visit a website, including social media recommendations themselves. However, it is possible that more accurate recommendations will result in higher user satisfaction. We therefore quantify the significance of the observed drop in accuracy due to re-ranking by diversity. More specifically, we simulated the sampling distribution of the precision of recommendations after re-ranking by re-shuffling domain labels in the list of ratings produced by CF+D. This procedure allows us to calculate the probability that a drop in precision as small as the observed one is due to random chance alone. Compared with this null model, we find that our results yield significantly higher precision: most random re-rankings of the same magnitude as the one produced by CF+D would result in lower precision than what we observe. We report the results of this additional analysis in the Supplementary Materials (Fig. S8). The results above demonstrate that incorporating audience partisan diversity can increase the trustworthiness of recommended domains while still providing users with relevant recommendations. However, we know that exposure to unreliable news outlets varies dramatically across the population. For instance, exposure to untrustworthy content is highly concentrated among a narrow subset of highly active news consumers with heavily slanted information diets [16, 20]. We therefore take advantage of the survey and behavioral data available on participants in the Pulse panel to consider how CF+D effects vary by individual partisanship (self-reported via survey), behavioral measures such as volume of news consumption activity and information diet slant, and contextual factors that are relevant to algorithm performance such as similarity with other users. In this section, we again produce recommendations using either CF or CF+D and measure their difference in trustworthiness with respect to a baseline based on user visits (specifically, the ranking by the TF-IDF-normalized number of visits v; see Methods C). However, we analyze the results differently than those reported above. Rather than considering recommendations for lists of varying length k, we create recommendations for different subgroups based on the factors of interest and compare how the effects of the CF+D approach vary between those groups. To facilitate comparisons in performance between subgroups that do not depend on list length k, we define a new metric to summarize the overall trustworthiness of the ranked lists obtained with CF and CF+D over all possible values of k.
Since users tend to pay less attention to items ranked lower in the list [28], it is reasonable to assume that lower-ranked items ought to contribute less to the overall trustworthiness of a given ranking. Let us now consider probabilistic selections from two different rankings, represented by random variables X and X′, where X is the selection from the ranking produced by one of the two recommendation algorithms (either CF or CF+D) and X′ is the selection from the baseline ranking based on user visits. Using a probabilistic discounting method (see Eq. 8 in Methods H), we compute the expected change in trustworthiness from switching the selection from X′ to X, where the expectations of Q(X) and Q(X′) are taken with respect to the corresponding rankings (Eq. 1): ΔQ = E[Q(X)] − E[Q(X′)]. Applying Eq. 1, we find that CF+D substantially increases trustworthiness for users who tend to visit sources that lean conservative (Fig. 6(a)) and for those who have the most polarized information diets (in either direction; see Fig. 6(c)), two segments of users who are especially likely to be exposed to unreliable information [2, 16, 20]. In both cases, CF+D achieves the greatest improvement among the groups where CF reduces the trustworthiness of recommendations the most, which highlights the pitfalls of algorithmic recommendations for vulnerable audiences and the benefits of prioritizing sources with diverse audiences in making recommendations to those users. Note that the YouGov sample includes self-reported information on both party ID and partisanship of respondents; we use only the former (Fig. 6(b)) for stratification to avoid circularity given the definition of CF+D, which relies on the latter. In Figs. 6(a) and 6(c), we instead stratify on an external measure of news diet slant (calculated from a large sample of social media users; see Methods I). We also observe that CF+D has strong positive effects for users who identify as Republicans or lean Republican (Fig. 6(b)) and for those who are the most active news consumers, in terms of both total consumption and the number of distinct domains visited (Fig. 6(d) and 6(e)). (In Fig. 6, bars represent the standard error of the mean of each stratum; changes in trustworthiness ΔQ are based on scores from NewsGuard [37].) When we group users by their nearest-neighbor similarities, we find that CF+D results in improvements for the users whose browsing behavior is most similar to others in their neighborhood and who might thus be most at risk of "echo chamber" effects (Fig. 6(f)). Finally, when we group users by the trustworthiness of the domains they visit, we find that the greatest improvements from the CF+D algorithm occur for users who are exposed to the least trustworthy information (Fig. 6(g)). By contrast, the standard CF algorithm often recommends websites that are less trustworthy than those that respondents actually visit (ΔQ < 0). The findings presented here suggest that the ideological diversity of the audience of a news source is a reliable indicator of its journalistic quality. To obtain these findings, we combined source reliability ratings compiled by expert journalists with traffic data from the YouGov Pulse panel. Of course, we are not the first to study the information diets of Internet users. Prior work has leveraged web traffic data to pursue related topics such as identifying potential dimensions of bias of news sources [38, 42], designing methods to present diverse political opinions [34, 35], and measuring the prevalence of filter bubbles [10]. Unlike these studies, however, we focus on how to promote exposure to trustworthy information rather than seeking to quantify or reduce different sources of bias.
A number of limitations must be acknowledged. First, our current methodology, which is based on reliability ratings compiled at the level of individual sources, does not allow us to evaluate the quality of the specific articles that participants saw. However, even a coarse signal about source quality could still be useful for ranking a newsfeed given that information about reliability is more widely available at the publisher level than at the article level. Another limitation is that our data lack information about actual engagement. Though we show that our re-ranking procedure is associated with a minimal loss in predictive accuracy, it remains an open question whether diversity-based rankings lead not just to higher exposure to trustworthy content, but also to more engagement with it. More research is needed to tease apart the causal links between political attitudes, readership, engagement, and information quality. Our work has a number of implications for the integrity of the online information ecosystem. First, our findings suggest that search engines and social media platforms should consider adding audience diversity to their existing set of news quality signals. Such a change could be especially valuable for domains that lack other quality signals, such as source reliability ratings compiled by experts. Media rating systems such as NewsGuard could also benefit from adopting our diversity metric, for example to help screen and prioritize domains for manual evaluation. Critics may raise concerns that such a change in ranking criteria would result in unfair outcomes, for example by reducing exposure to content from certain partisan groups but not others. To see whether ranking by diversity leads to any differential treatment of partisan news sources, we compute the rate of false positives due to re-ranking by diversity. Here the false positive rate is defined as the conditional probability that CF+D does not rank a trustworthy domain among the top k recommendations while CF does, computed separately for left- and right-leaning domains. To determine whether a domain is trustworthy, we rely on the classification provided by NewsGuard (i.e., the domain has a reliability score ≥ 60). Fig. 7 shows the rate of false positives as a function of k for both left- and right-leaning domains, averaged over all users. Despite some small differences, especially for low values of k, we find no consistent evidence that this change would produce systematically differential treatment across partisan groups. Another concern is the possibility of abuse. For example, an attacker could employ a number of automated accounts to collectively engage with an ideologically diverse set of sources. This inauthentic, ideologically diverse audience could then be used to push specific content the attacker wants to promote to the top of a recommender system's rankings. Similarly, an attacker who wanted to demote particular content could craft an inauthentic audience with low diversity. Fortunately, there is a vast literature on how to defend recommender systems against such "shilling" attacks [21, 30], and platforms already collect a wealth of signals to detect and remove inauthentic coordinated behavior of this kind.
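Returning to the fairness check above, one plausible implementation of the false-positive rate is sketched below: for each user, count the trustworthy domains of a given partisan lean that CF places in the top k but CF+D does not. The exact conditioning used in the paper may differ; the data structures are assumed.

```python
def false_positive_rate(cf_top_k, cfd_top_k, trustworthy, lean, side):
    """Share of trustworthy domains of a given partisan lean that CF ranks
    in the top k but CF+D does not (one plausible reading of the check
    described in the text).

    cf_top_k, cfd_top_k: dicts mapping user -> collection of top-k domains.
    trustworthy: set of domains with NewsGuard score >= 60.
    lean: dict mapping domain -> 'left' or 'right' (audience slant).
    side: which lean to evaluate ('left' or 'right').
    """
    hits, misses = 0, 0
    for user in cf_top_k:
        for d in cf_top_k[user]:
            if d in trustworthy and lean.get(d) == side:
                if d in cfd_top_k.get(user, ()):
                    hits += 1
                else:
                    misses += 1
    total = hits + misses
    return misses / total if total else float("nan")
```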
Future work should investigate the feasibility of creating trusted social media audiences modeled on existing efforts in marketing research using panels of consumers. We hope that our results stimulate further research in this area. Our analysis combines two sources of data. The first is the NewsGuard News Website Reliability Index [37], a list of web domain reliability ratings compiled by a team of professional journalists and news editors. The data that we licensed for research purposes include scores for 3,765 web domains on a 100-point scale based on a number of journalistic criteria such as editorial responsibility, accountability, and financial transparency. NewsGuard categorizes web domains into four main groups: "Green" domains, which have a score of 60 or more points and are considered reliable; "Red" domains, which score less than 60 points and are considered unreliable; "Satire" domains, which should not be regarded as news sources regardless of their score; and "Platform" domains like Facebook or YouTube that primarily host content generated by users. The mean reliability score for domains in the data is 69.6; the distribution of scores is shown in Fig. 8. The second data source is the YouGov Pulse panel, a sample of U.S.-based Internet users whose web traffic was collected in anonymized form with their prior consent. This traffic data was collected during seven periods between October 2016 and March 2019 (see Table I). We perform a number of pre-processing steps on these data. We combine all waves into a single sample. We pool web traffic for each domain that received thirty or more unique visitors. Finally, we use the self-reported partisanship of the visitors (on a seven-point scale from an online survey) to estimate mean audience partisanship and audience partisan diversity. To measure audience partisan diversity, first define N_j as the count of participants who visited a web domain and reported their political affiliation to be equal to j, for j = 1, ..., 7 (where 1 = strong Democrat and 7 = strong Republican). The total number of participants who visited the domain is thus N = Σ_j N_j, and the fraction of participants with a partisanship value of j is p_j = N_j / N. Denote the partisanship of the i-th individual as s_i. We calculate several metrics to measure audience partisan diversity, including the variance of the distribution of the s_i and the entropy of the distribution p_j. These metrics all capture the idea that the partisan diversity of the audience of a web domain should be reflected in the distribution of its traffic across different partisan groups. Each weighs the contribution of every individual person who visits the domain equally; they can thus be regarded as user-level measures of audience partisan diversity. However, the volume and content of web browsing activity are highly heterogeneous across Internet users [19, 33], with different users recording different numbers of pageviews to the same website. To account for this imbalance, we also compute pageview-level, weighted variants of the above audience partisan diversity metrics in which, instead of treating all visitors equally, each individual visitor is weighted by the number of pageviews they made to the given domain. As a robustness check, we compare the strength of association of each of these metrics with news reliability in the Supplementary Materials. We find that all variants correlate with news reliability, but the relationship is strongest for variance.
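The sketch below computes two of the diversity metrics named in the text, the variance of the partisanship values and the entropy of the partisan distribution, in both the unweighted (user-level) and pageview-weighted variants. The full set of metrics compared in the Supplementary Materials may include additional definitions.

```python
import numpy as np

def partisan_diversity(partisanship, pageviews=None):
    """Variance- and entropy-based diversity of a domain's audience.

    partisanship: array of self-reported partisanship values s_i in 1..7
                  (one entry per visitor of the domain).
    pageviews:    optional array of per-visitor pageview counts; if given,
                  visitors are weighted by pageviews (the pageview-level
                  variant described in the text), otherwise equally.
    """
    s = np.asarray(partisanship, dtype=float)
    w = np.ones_like(s) if pageviews is None else np.asarray(pageviews, float)
    w = w / w.sum()

    mean = np.sum(w * s)
    variance = np.sum(w * (s - mean) ** 2)

    # Fraction of the (weighted) audience in each partisan bin j = 1..7.
    p = np.array([w[s == j].sum() for j in range(1, 8)])
    nonzero = p[p > 0]
    entropy = -np.sum(nonzero * np.log(nonzero))

    return {"mean": mean, "variance": variance, "entropy": entropy}
```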
In general, a recommendation algorithm takes a set of users U and a set of items D and learns a function f : U × D → R that assigns a real value to each user-item pair (u, d) representing the interest of user u in item d. This value denotes the estimated rating that user u will give to item d. In the context of the present study, D is a set of news sources identified by their web domains (e.g., nytimes.com, wsj.com), so from now on we refer to d ∈ D interchangeably as either a web domain or a generic item. Collaborative filtering is a classic recommendation algorithm in which some ratings are provided as input and unknown ratings are predicted based on those known input ratings. In particular, the user-based CF algorithm, which we employ here, seeks to provide the best recommendations for users by learning from others with similar preferences. CF therefore requires a user-domain matrix in which each entry is either known or needs to be predicted by the algorithm. Once the ratings are predicted, the algorithm creates, for each user, a ranked list of domains sorted in descending order by their predicted ratings. To test the standard CF algorithm and our modified CF+D algorithm, we first construct a user-domain matrix V from the YouGov Pulse panel. The YouGov Pulse dataset does not provide user ratings of domains, so we instead count the number of times π_{u,d} ∈ Z⁺ that a user u has visited a domain d (i.e., pageviews) and use this variable as a proxy [28]. Because this quantity is known to follow a very skewed distribution, we compute the rating v_{u,d} as the TF-IDF of the pageview counts (Eq. 2), where π = Σ_u Σ_d π_{u,d} is the total number of visits. Note that if a user has never visited a particular domain, then v_{u,d} = 0. Therefore, if we arrange all the ratings into a user-domain matrix V ∈ R^{|U|×|D|}, such that (V)_{u,d} = v_{u,d}, we obtain a sparse matrix. The goal of any recommendation task is to complete the user-domain matrix by predicting the missing ratings, which in turn allows us to recommend new web domains that users may not have seen. In this case, however, we lack data on completely unseen domains. To test the validity of our methods, we therefore follow the customary practice in machine learning of setting aside some data to be used purely for testing (see Methods D). Having defined V, the next step of the algorithm is to estimate the similarity between each pair of users. To do so, we use either the Pearson correlation coefficient or the Kendall rank correlation of their user vectors, i.e., their corresponding row vectors in V (zeroes included). For example, if τ(·, ·) ∈ [−1, 1] denotes the Kendall rank correlation coefficient between two sets of observations, then the corresponding coefficient of similarity between u ∈ U and u′ ∈ U can be defined as sim(u, u′) = τ(V_u, V_{u′}) (Eq. 3), where V_u, V_{u′} ∈ R^{1×|D|} are the row vectors of u and u′, respectively. A similar definition can be used with Pearson's correlation coefficient in place of τ. These similarity coefficients are in turn used to calculate the predicted ratings. In standard user-based CF, the predicted rating of user u for domain d is calculated as

v̂_{u,d} = v̄_u + ( Σ_{u′ ∈ N^u_d} sim(u, u′) (v_{u′,d} − v̄_{u′}) ) / ( Σ_{u′ ∈ N^u_d} |sim(u, u′)| )   (Eq. 4)

where N^u_d ⊆ U is the set of the n = 10 most similar users to u who have also rated d (i.e., the neighbors of u), v_{u′,d} is the observed rating (computed with Eq. 2) that neighboring user u′ has given to domain d, v̄_u and v̄_{u′} are the average ratings of u and u′ across all domains they visited, respectively, and sim(u, u′) is the similarity coefficient (computed with Eq. 3) between users u and u′ based on either the Pearson or the Kendall correlation coefficient.
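A compact sketch of this user-based CF step is shown below, using Kendall's τ for similarity and the standard mean-centred weighted prediction consistent with the description of Eq. 4. It operates on a dense rating matrix for simplicity and is illustrative rather than the paper's implementation.

```python
import numpy as np
from scipy.stats import kendalltau

def similarity(v_u: np.ndarray, v_w: np.ndarray) -> float:
    """Kendall-tau similarity between two users' rating row vectors
    (zeros included, as described in the text)."""
    tau, _ = kendalltau(v_u, v_w)
    return 0.0 if np.isnan(tau) else tau

def predict_rating(V: np.ndarray, u: int, d: int, n_neighbors: int = 10):
    """Predict user u's rating for domain d with user-based CF.

    V is the user-by-domain rating matrix of TF-IDF-transformed pageviews
    (dense here for simplicity). The prediction uses the standard
    mean-centred weighted form: neighbours' deviations from their own mean
    rating, weighted by similarity.
    """
    raters = [w for w in range(V.shape[0]) if w != u and V[w, d] > 0]
    if not raters:
        return None  # neither CF nor CF+D can score this domain
    sims = {w: similarity(V[u], V[w]) for w in raters}
    neigh = sorted(raters, key=sims.get, reverse=True)[:n_neighbors]

    def mean_rating(w):
        rated = V[w][V[w] > 0]
        return rated.mean() if rated.size else 0.0

    num = sum(sims[w] * (V[w, d] - mean_rating(w)) for w in neigh)
    den = sum(abs(sims[w]) for w in neigh)
    return mean_rating(u) + (num / den if den else 0.0)
```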
Having defined the standard CF in Eq. 4, we now define our variant CF+D, which incorporates the audience partisan diversity of domain d ∈ D as a re-ranking signal:

v̂^{CF+D}_{u,d} = v̂^{CF}_{u,d} + g(δ_d)   (Eq. 5)

where g(δ_d) is the re-ranking term of domain d, obtained by plugging the audience partisan diversity δ_d (in our case, the variance of the distribution of self-reported partisan slants of its visitors) into a standard logistic function:

g(δ) = a / (1 + exp(−(δ − t)/ψ))   (Eq. 6)

In Eq. 6, the parameters a, ψ, and t generalize the upper asymptote, inverse growth rate, and location of the standard logistic function, respectively. For the results reported in this study, we empirically estimate the location as t = δ̄, the average audience partisan diversity across all domains, which corresponds to δ̄ = 4.25 since we measure diversity as the variance of the distribution of self-reported partisan slants. For the remaining parameters, we choose a = 1 and ψ = 1. As a robustness check, we re-ran all analyses with a larger value of a and obtained qualitatively similar results (available upon reasonable request). To evaluate both recommendation algorithms, we follow a standard supervised learning workflow. We use precision and root mean squared error (RMSE), two standard metrics used to measure the relevance and accuracy of predicted ratings in supervised learning settings. We define these two metrics elsewhere (see Methods G). Here, we instead describe the workflow we followed to evaluate the recommendation methods. Since our approach is based on supervision, we need to designate some of the user ratings (i.e., the number of visits to each domain, computed using Eq. 2) as ground truth to compute performance metrics. For each user, we randomly split the domains they visited into a training set (70%) and a testing set (30%). This splitting varies by user: the same domain could be included in the training set of one user and in the testing set of another. Then, given any two users, their training-set ratings are used to compute user-user similarities using Eq. 3 (which is based on Kendall's rank correlation coefficient; a similar formula can be defined using Pearson's correlation). If, in computing user-user similarities with Eq. 3, a domain is present for one user but not for the other, then the latter rating is assumed to be zero regardless of whether the domain is present in the test set. This assumption, which follows standard practice for collaborative filtering algorithms, ensures that there is no leakage of information between the training and test sets. Finally, using either Eq. 4 or Eq. 5, we predict ratings for domains in the test set and compare them with the TF-IDF of the actual visit counts in the data. We also generate ranked lists for users based on global domain popularity (user level) as an additional baseline recommendation technique. All domains are initially assigned a rank (global popularity rank) according to their user-level popularity, which is calculated from the training-set views. Then, the domains in the test set of each user are ranked according to their global popularity ranks to generate the recommendations. This method does not include any personalization, as the rank of a domain for a particular user does not depend on other similar users but on the whole population. In particular, if two users share the same two domains in testing, their relative ranking is preserved, even if the two users visited different domains in training.
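Referring back to Eqs. 5-6, the following sketch implements the diversity adjustment as we read it: a generalized logistic transform of the domain's audience diversity added to the CF prediction. The additive combination and the exact logistic parameterization reflect our reading of the text, not a verbatim reproduction of the paper's code.

```python
import numpy as np

def diversity_term(delta: float, a: float = 1.0, psi: float = 1.0,
                   t: float = 4.25) -> float:
    """Generalised logistic re-ranking term g(delta).

    a = upper asymptote, psi = inverse growth rate, t = location
    (set to the mean audience diversity across domains, 4.25 in the text).
    The exact parameterisation is our reading of Eq. 6.
    """
    return a / (1.0 + np.exp(-(delta - t) / psi))

def cfd_rating(cf_rating: float, delta: float, **kwargs) -> float:
    """CF+D rating: the CF prediction shifted by the diversity term,
    following the description of 'adding a term that depends on
    diversity' (our reading of Eq. 5)."""
    return cf_rating + diversity_term(delta, **kwargs)

# Re-rank a user's candidate domains, assuming `cf_scores` maps domain ->
# CF-predicted rating and `diversity` maps domain -> audience diversity:
# ranked = sorted(cf_scores,
#                 key=lambda d: cfd_rating(cf_scores[d], diversity[d]),
#                 reverse=True)
```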
In addition to standard metrics of accuracy (precision and RMSE; see Methods G), we define a new metric called trustworthiness to measure the news reliability of the recommended domains. It is calculated using NewsGuard scores in two ways: either using the numerical scores or using binary indicators for whether a site meets or exceeds the threshold score of 60 that NewsGuard defines as indicating that a site is generally trustworthy [37]. Let d_1, d_2, ..., d_k be a ranked list of domains. Using numerical scores, the trustworthiness is the average

(1/k) Σ_{i=1}^{k} Q(d_i)   (Eq. 7)

where Q(d) ∈ [0, 100] denotes the NewsGuard reliability score of d ∈ D. If instead we use the binary indicator of trustworthiness provided by NewsGuard, then the trustworthiness of a list is defined as the fraction of its domains that meet or exceed the threshold score. Note that, unlike precision and RMSE, the trustworthiness of a list of recommendations does not use information on the actual ratings v_{u,d}. Instead, using Eq. 7, we compute the trustworthiness of the domains in the test set ranked in decreasing order of user visits v_{u,d}. We then compare the trustworthiness of the rankings obtained with either CF or CF+D against the trustworthiness of this baseline. Given a user u, let us consider a set D of web domains with |D| = D. For each domain d ∈ D, we have three pieces of information: the two predicted ratings v̂^{CF}_{u,d} and v̂^{CF+D}_{u,d} produced by CF and CF+D, and the actual rating v_{u,d} (defined elsewhere; see Methods C). In the following, we omit the subscript u of the user, which is fixed throughout, and the CF/CF+D superscript unless it is not obvious from context. Let us consider a given recommendation method (either CF or CF+D) and denote with r(d) (respectively, r′(d)) the rank of d when the domains are sorted in decreasing order of recommended ratings (respectively, actual ratings). Given a recommendation list length 0 < k ≤ D, let us define the set of predicted domains as D_k = {d ∈ D : r(d) ≤ k} and the set of actual domains as D′_k = {d ∈ D : r′(d) ≤ k}. Then the precision for a given value of k is the fraction of correctly predicted domains:

precision(k) = |D_k ∩ D′_k| / k

Similarly, the root mean squared error for a given value of k between the two ranked lists of ratings is computed as

RMSE(k) = sqrt( (1/k) Σ_{r=1}^{k} ( v̂_{ρ(r)} − v_{ρ′(r)} )² )

where ρ : [D] → D (respectively, ρ′) is the inverse function of r(·) (respectively, r′(·)); that is, the function that maps ranks back to their domain under the recommendation method (respectively, under actual visits). Note that, in the summation, ρ(r) and ρ′(r) do not generally refer to the same web domain: the averaging is over the two ranked lists of ratings, not over the set of domains common to the two lists.
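The two accuracy metrics just defined can be written compactly as follows; the dictionaries mapping domains to predicted and actual ratings are assumed inputs.

```python
import numpy as np

def precision_at_k(predicted: dict, actual: dict, k: int) -> float:
    """Fraction of the predicted top-k domains that are also in the
    actual top-k (ranking both dictionaries of ratings by value)."""
    pred_top = set(sorted(predicted, key=predicted.get, reverse=True)[:k])
    true_top = set(sorted(actual, key=actual.get, reverse=True)[:k])
    return len(pred_top & true_top) / k

def rmse_at_k(predicted: dict, actual: dict, k: int) -> float:
    """RMSE between the top-k predicted ratings and the top-k actual
    ratings, compared rank by rank (the r-th predicted rating against the
    r-th actual rating, which need not belong to the same domain).
    Assumes both dictionaries contain at least k domains."""
    pred_sorted = sorted(predicted.values(), reverse=True)[:k]
    true_sorted = sorted(actual.values(), reverse=True)[:k]
    diff = np.array(pred_sorted) - np.array(true_sorted)
    return float(np.sqrt(np.mean(diff ** 2)))
```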
To measure the effect of CF+D on the trustworthiness of rankings, we must select a particular list length k. Although Fig. 4 shows improvements for all values of k, one potential problem when stratifying on different groups of users is that the results could depend on the particular choice of k. To avoid dependence on k, we consider a probabilistic model of a hypothetical user visiting web domains from a ranked list of recommendations and define overall trustworthiness as the expected value of the trustworthiness of domains selected from that list (i.e., discounted by the probability of selection). Let us consider a universe of domains D as the set of items to rank. Inspired by prior approaches on stochastic processes based on ranking [11], we consider a discounting method that posits that the probability of selecting domain d ∈ D from a given ranked recommendation list decays as a power law of its rank in the list:

P(X = d) = r_d^{−α} / Σ_{d′ ∈ D} r_{d′}^{−α}   (Eq. 8)

where X ∈ D is a random variable denoting the probabilistic outcome of the selection from the ranked list, r_d ∈ N is the rank of a generic d ∈ D, and α ≥ 0 is the exponent of power-law decay (when α = 0, all domains are equally likely; when α > 0, top-ranked domains are more likely to be selected). This model allows us to compute, for any given user, the effect of a recommendation method (either CF or CF+D) simply as the difference between its expected trustworthiness and the trustworthiness of the ranking obtained by sorting the domains visited by the user in decreasing order of pageviews (see Eq. 1). In practice, to compute Eq. 1, let d_1, d_2, ..., d_k and d′_1, d′_2, ..., d′_k be two ranked lists of domains, with d_r, d′_r ∈ D for all r = 1, ..., k, generated by a recommendation algorithm and by actual user pageviews, respectively, and let us denote with Q(d) the NewsGuard reliability score of d ∈ D (see Methods F). Recall that Eq. 8 specifies the probability of selecting a given domain d ∈ D from a particular ranked list as a function of its rank. Even though any pair of equally ranked domains will generally differ across the two lists (that is, d_r ≠ d′_r in general), their selection probability will be the same because Eq. 8 depends only on r. We can thus calculate the expected improvement in trustworthiness as

ΔQ = Σ_{r=1}^{k} P(r) [ Q(d_r) − Q(d′_r) ]

where P(r) is the probability of selecting a domain with rank r from Eq. 8, which we compute setting α = 1.
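A short sketch of the discounting model defined by Eq. 8 and the resulting expected change in trustworthiness (our reading of Eqs. 1 and 9), under assumed inputs:

```python
import numpy as np

def rank_probabilities(k: int, alpha: float = 1.0) -> np.ndarray:
    """Probability of selecting the item at rank r = 1..k, decaying as a
    power law of rank: P(r) proportional to r**(-alpha)."""
    weights = np.arange(1, k + 1, dtype=float) ** (-alpha)
    return weights / weights.sum()

def expected_trust_gain(algo_ranked, baseline_ranked, scores,
                        alpha: float = 1.0) -> float:
    """Expected change in trustworthiness (Delta Q) when a user draws a
    domain from the algorithm's ranking instead of the ranking of their
    actual visits, with rank-discounted selection probabilities.

    algo_ranked, baseline_ranked: lists of domains, best first.
    scores: dict mapping domain -> NewsGuard reliability score.
    """
    k = min(len(algo_ranked), len(baseline_ranked))
    p = rank_probabilities(k, alpha)
    q_algo = np.array([scores[d] for d in algo_ranked[:k]], dtype=float)
    q_base = np.array([scores[d] for d in baseline_ranked[:k]], dtype=float)
    return float(np.sum(p * (q_algo - q_base)))
```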
The raw data that support the findings of this study are available from NewsGuard Technology, Inc., but restrictions apply to the availability of these data, which were used under license for the current study and thus cannot be made publicly available. However, data are available from the authors upon reasonable request, subject to licensing from NewsGuard. The data used in this study were current as of November 12, 2019 and do not reflect NewsGuard's regular updates of the data.

Supplementary Materials for "Political audience diversity and news reliability in algorithmic ranking"

We repeat the analysis of Fig. 3 for all diversity metrics (see Methods B) and summarize the results in Table S1. For each metric, we estimate the degree of linear association with news quality using the Pearson correlation coefficient. We also report the R² coefficient of determination and the two-sided p-value of the F-statistic as a measure of significance of the fit. Finally, we report the partial correlation coefficients obtained by controlling for the mean partisanship and the extremity of domain audiences. Each metric is positively correlated with quality at the user level, but we find that the relationship is strongest for the variance of audience partisanship. At the pageview level, however, the association disappears for all metrics but variance, which still produces a modest correlation. In Fig. 3 in the main text we show the relationship between the NewsGuard reliability scores of news domains and audience partisan diversity via linear regression. In Tables S2-S3 we report the associated summary tables for the user and pageview level, respectively. To ensure that the regression coefficients and associated errors are comparable across datasets, we standardize all independent variables prior to fitting the models to the data. As a robustness check, we also evaluate the recommendation algorithms using a longitudinal split of the data, in which all training data precede the test data in time. In this sense, these splits represent a true forecasting exercise. Despite a slightly larger loss of precision relative to CF (compare the left panel of Fig. 5 in the main text with the left panel of Fig. S7), our results remain qualitatively consistent with those shown in the main text. For the prior Figs. 4, 5, S1, S2, S3 and S4, the data for each user are randomly split into a training set (70%) and a testing set (30%), so that, for any given user, there is no overlap between the two sets. Note that each user is split independently of the others, so a given domain can appear in the training set of one user and in the testing set of another. Instead, in Figs. S6 and S7, the traffic that took place before a fixed boundary date (which is identical for all users) forms the training set, and the traffic that took place after forms the testing set. This means that the same domain can occur in both the training and the testing set. Data collection for the YouGov Pulse panel took place in 7 different time periods (see Table I). Since CF+D modifies the ratings produced by CF by adding a term that depends on diversity (see Eq. 5), we simulate this process by simply shuffling the diversity terms among the items before ranking them. This procedure ensures that we consider only lists obtained by shifting the ratings by the same amounts as CF+D. Fig. S8 shows the sampling distribution of the precision of re-rankings of the same magnitude as those of CF+D obtained using this process for k = 1 and k = 10. To sample from this distribution, we rank domains using the ratings computed from Eq. 4. We then compute in a separate labeled vector the diversity terms g(δ_d) obtained using the logistic function (Eq. 6), reshuffle the labels at random, obtaining for each term a new label d′, and finally apply the reshuffled term g(δ_{d′}) as in Eq. 5. We then re-rank based on the new ratings and compute the precision of the ranked list. This reshuffling is carried out separately for each user with at least k domains in their testing set. The precision is then averaged over all users. This procedure is repeated 1,000 times to obtain the sampling distribution. Finally, we compute a one-tailed p-value as the proportion of samples with precision higher than the observed value for CF+D. Stratification analysis without discounting: Figs. S9-S15 show the results of the stratification analysis without using the discounting model.
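The re-shuffling check described above (Fig. S8) can be sketched as follows; the dictionaries of CF ratings, diversity terms g(δ_d), and actual ratings are assumed inputs, and the observed CF+D precision is supplied separately.

```python
import numpy as np

def reshuffle_precision_null(cf_scores: dict, g_terms: dict, actual: dict,
                             k: int, n_rounds: int = 1000, seed: int = 0):
    """Sampling distribution of precision@k under random re-assignment of
    the diversity terms g(delta_d) across domains (the null model used to
    assess the drop in precision of CF+D)."""
    rng = np.random.default_rng(seed)
    domains = list(cf_scores)
    true_top = set(sorted(actual, key=actual.get, reverse=True)[:k])
    null_precisions = []
    for _ in range(n_rounds):
        shuffled = rng.permutation([g_terms[d] for d in domains])
        rating = {d: cf_scores[d] + g for d, g in zip(domains, shuffled)}
        top = set(sorted(rating, key=rating.get, reverse=True)[:k])
        null_precisions.append(len(top & true_top) / k)
    return np.array(null_precisions)

# One-tailed p-value for an observed CF+D precision (assumed variable
# `observed`): proportion of null samples with precision above it.
# p_value = float((reshuffle_precision_null(...) > observed).mean())
```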
References
Social media and fake news in the 2016 election
Evaluating the fake news problem at the scale of the information ecosystem
Exposure to social engagement metrics increases vulnerability to misinformation
Exposure to ideologically diverse news and opinion on Facebook
Neutral bots reveal political bias on social media
A survey on trust modeling
How algorithmic popularity bias hinders or promotes quality
Prioritizing original news reporting on Facebook
Filter bubbles, echo chambers, and online news consumption
Scale-free network growth by ranking
The few-get-richer: A surprising consequence of popularity-based rankings?
Anatomy of the long tail: Ordinary people with extraordinary tastes
Computing and applying trust in web-based social networks
Surfacing useful and relevant content: how news works
Avoiding the echo chamber about echo chambers: Why selective exposure to like-minded political news is less prevalent than you think
Less than you think: Prevalence and predictors of fake news dissemination on Facebook
(Almost) everything in moderation: New evidence on Americans' online media diets
Exposure to untrustworthy websites in the 2016 US election
Shilling attacks against recommender systems: a comprehensive survey
TweetCred: Real-Time Credibility Assessment of Content on Twitter
Feeling validated versus being correct: a meta-analysis of selective exposure to information
Disentangling the effects of social signals
Groups of diverse problem solvers can outperform groups of high-ability problem solvers
Factoring Fact-Checks: Structured Information Extraction from Fact-Checking Articles
Linguistic signals under misinformation and fact-checking: Evidence from user comments on social media
Accurately interpreting clickthrough data as implicit feedback
GroupLens: Applying collaborative filtering to Usenet news
Shilling recommender systems for fun and profit
The science of fake news
Opinion cascades and the unpredictability of partisan polarization
Identifying web browsing trends and patterns
Encouraging reading of diverse political viewpoints with a browser widget
Presenting Diverse Political Opinions: How and How Much
Entropy and inference, revisited
Rating process and criteria
Quantifying biases in online information exposure
Fighting misinformation on social media using crowdsourced judgments of news source quality
Truth of varying shades: Analyzing language in fake news and political fact-checking
GroupLens: An open architecture for collaborative filtering of netnews
Media bias monitor: Quantifying biases of social media news outlets at large-scale
Experimental study of inequality and unpredictability in an artificial cultural market
Detection of novel social bots by ensembles of specialized classifiers
The spread of low-credibility content by social bots
The wisdom of polarized crowds
Sorting the news: How ranking by popularity polarizes our politics
Bots increase exposure to negative and inflammatory content in online social systems
Online human-bot interactions: Detection, estimation, and characterization
The spread of true and false news online
Prevalence of low-credibility information on Twitter during the COVID-19 outbreak
Arming the public with artificial intelligence to counter social bots
Scalable and generalizable social bot detection through data selection
A structured response to misinformation: Defining and annotating credibility indicators in news articles
Improving recommendation lists through topic diversification
[Figure captions (fragments): Figs. 4 and 5 report, for each list length k, averages computed over the top-k recommendations of all users in the YouGov panel with ≥ k recommendations in their test sets, with bars representing the standard error of the mean; actual visits v are normalized using TF-IDF (see Methods C). The right panel of Fig. 4 shows the proportion of domains labeled "trustworthy" by NewsGuard, and the right panel of Fig. 5 shows the RMSE of predicted pageviews for the top-k ranked domains (lower is better). Both CF and CF+D compute the similarity between users using the Kendall τ correlation coefficient (see Methods C). Fig. S8 shows the distribution of precision obtained after re-ranking the domains by re-shuffling the diversity signal values g(δ_d) from the CF+D rating calculation (see Eqs. 5 and 6); the re-shuffling was repeated 1,000 times, the two distributions correspond to different values of k, and the one-sided p-values are 0.002 (k = 1) and 0.021 (k = 10). Figs. S9-S15 show the effect of CF and CF+D versus the baseline, by length of ranked list k, stratified by self-reported party ID from YouGov Pulse responses on a 7-point scale (1-3: Democrats, including people who lean Democrat but do not identify as Democrats; 4: Independents; ...), absolute slant of visited domains (terciles of scores from Bakshy et al. [5]), total online activity (TF-IDF terciles), distinct number of domains visited (terciles), baseline trustworthiness of visited domains (terciles), and average user-user similarity with the nearest n = 10 neighbors in the training set (terciles); bars represent the standard error of the mean, and changes in trustworthiness ΔQ are based on scores from NewsGuard [37].]

We thank NewsGuard for licensing the data and acknowledge Andrew Guess and Jason Reifler, Nyhan's coauthors on the research project that generated the web traffic data used in this study. We are also grateful to the organizers, chairs, and participants of the News Quality in the Platform Era