title: Tournesol: A quest for a large, secure and trustworthy database of reliable human judgments
authors: Hoang, Le-Nguyen; Faucon, Louis; Jungo, Aidan; Volodin, Sergei; Papuc, Dalia; Liossatos, Orfeas; Crulis, Ben; Tighanimine, Mariame; Constantin, Isabela; Kucherenko, Anastasiia; Maurer, Alexandre; Grimberg, Felix; Nitu, Vlad; Vossen, Chris; Rouault, Sébastien; El-Mhamdi, El-Mahdi
date: 2021-05-29

Today's large-scale algorithms have become immensely influential, as they recommend and moderate the content that billions of humans are exposed to on a daily basis. They are the de facto regulators of our societies' information diet, from shaping opinions on public health to organizing groups for social movements. This creates serious concerns, but also great opportunities to promote quality information. Addressing the concerns and seizing the opportunities is a challenging, enormous and fabulous endeavor, as intuitively appealing ideas often come with unwanted side effects, and as it requires us to think about what we deeply prefer. Understanding how today's large-scale algorithms are built is critical to determine what interventions will be most effective. Given that these algorithms rely heavily on machine learning, we make the following key observation: any algorithm trained on uncontrolled data must not be trusted. Indeed, a malicious entity could take control over the data, poison it with dangerously manipulative fabricated inputs, and thereby make the trained algorithm extremely unsafe. We thus argue that the first step towards safe and ethical large-scale algorithms must be the collection of a large, secure and trustworthy dataset of reliable human judgments. To achieve this, we introduce Tournesol, an open source platform available at https://tournesol.app. Tournesol aims to collect a large database of human judgments on what algorithms ought to widely recommend (and what they ought to stop widely recommending). We outline the structure of the Tournesol database, the key features of the Tournesol platform and the main hurdles that must be overcome to make it a successful project. Most importantly, we argue that, if successful, Tournesol may then serve as the essential foundation for any safe and ethical large-scale algorithm.

However, an immediate corollary of this is that such algorithms are manipulated by their data, especially given their current designs. If the data is full of discriminatory texts, then such "stochastic parrots" will repeat or favor such discriminatory texts [BGMS21]. What is especially concerning is that today's most influential algorithms are mostly powered by user-generated data downloaded from the web [SSP+13, WSM+19, WPN+19]. As a result, many of today's trained algorithms should be considered extremely unsafe. More generally, any algorithm trained on uncontrolled data must not be trusted. A recent line of research on Byzantine resilient learning algorithms provides new tools to protect algorithms from poisoned data or malicious actors [BEMGS17, EMGR18, EGR20, EM20, Rou21]. However, such protections cannot perform miracles. Especially in heterogeneous environments, impossibility theorems apply [EFG+20a]. In particular, there can be no algorithmic trick to protect against a majority of maliciously crafted data sources.
More generally, the safety of learning algorithms demands the safety of their training datasets. In particular, to design trustworthy ethical algorithms, it is essential to train them on large, secure and trustworthy datasets of human ethical judgments. To the best of our knowledge, today, there is no such dataset. As a result, according to our reasoning, there can currently be no safe learning algorithm. The goal of Tournesol, available at https://tournesol.app, is to remedy this state of affairs, by constructing the largest, most secure and most trustworthy dataset of reliable human judgments ever created.

Introducing Tournesol. More specifically, Tournesol aims to elicit comparison-based judgments on what videos are preferable to recommend widely, and on quality features such as "reliable and not misleading", "important and actionable" and "engaging and thought-provoking", among others. These judgments are then leveraged by a machine learning model, which infers a global score for each video. The model roughly combines the Bradley-Terry model [BT52, May18] and the Licchavi framework [FGH21], which provides Byzantine resilience and (approximate) strategyproofness. Byzantine resilience means that the outputs of the model are resilient to a minority of contributors with arbitrary behaviors, while strategyproofness means that it is in strategic contributors' best interests to provide judgments that match their true preferences. We stress that strategyproofness is particularly important to guarantee, as much as possible, the trustworthiness of our dataset. Moreover, contributors are asked to provide personal information, which we use to assess their trustworthiness. In particular, we rely on email verification from trusted email domains to protect Tournesol against Sybil attacks, i.e., attacks where a single entity creates and controls multiple identities [Dou02]. In the near future, we also plan to deploy a vouching mechanism to include more contributors without sacrificing too much security. Contributors are also asked to report their degrees and expertise, as well as demographic data.

Tournesol's interface allows contributors to either provide data publicly or privately. In the former case, their data will be appended to the Tournesol public database, which is readily downloadable from the Tournesol home page. We encourage its widespread reuse and remix, given appropriate attribution, under the conditions of the CC BY-SA license. Moreover, we ask that the entities who reuse the data pay great attention to strategyproofness considerations, for the trustworthiness of the Tournesol database depends on this. Finally, we ask any entity that reuses our data to do so responsibly and, as much as possible, for the good of all of humanity. In particular, we ask them to avoid reusing our data for, e.g., advertisement targeting, especially if what is advertised is not for the good of our contributors or of humanity.

Tournesol is deeply attached to transparency. Our code is open source, and nearly all of our discussions on bugs and new functionalities are publicly available. Our front end uses React [Gac15], and the back end relies on Django [HKM09], with machine learning algorithms running on TensorFlow [ABC+16]. Our server's statistics are also public. Note also that the Tournesol platform is still rapidly evolving. The present document will be updated accordingly every few months.
We refer however to the Tournesol wiki, available at https://wiki.tournesol.app, for more frequently updated information.

Structure of the paper. The rest of the paper is organized as follows. Section 2 presents the key elements of our database, and our motivations to collect the data we chose to collect. It also briefly discusses our privacy policy and options. Section 3 then presents the algorithm currently in use on Tournesol. We stress in particular that our algorithm, based on Licchavi [FGH21], obeys the principle "one contributor, one unit force", which makes it somewhat strategyproof. More importantly, we hope that this section will motivate and inspire new research into analyzing user-generated comparison-based judgment datasets like Tournesol's. Section 4 presents a preliminary analysis of our dataset. Since the data collection process is ongoing, we stress that the statistics we present are bound to evolve within the next months and years. Nevertheless, we hope to thereby give insights into the data we have collected so far, and to illustrate the kind of insights that the analysis of our database can yield. Section 5 then provides a list of research challenges that we regard as urgent to tackle for safer and more ethical algorithms in general, and for the specific case of Tournesol in particular. Finally, Section 6 provides a summary, as well as a call for contributions to make Tournesol a success, and to move towards safer and more ethical large-scale algorithms.

In this section, we describe the main contribution of Tournesol, namely its new, scalable, secure and trustworthy database of reliable human judgments. To increase the quality of our data, we ask contributors to provide comparison-based judgments, building upon a large literature on the value of such comparisons [Fes54, BT52, May18]. It is important to stress that the extent to which a piece of content should be recommended cannot be measured by a binary variable. Indeed, while some pieces of content are highly undesirable to recommend at all, others are somewhat desirable to recommend widely, and others still are very important to recommend at scale. The same remark arguably holds for all the other quality criteria considered by Tournesol. A video can "encourage better habits", be "layman-friendly", and promote "diversity and inclusion" to varying degrees. This suggests that videos should be scored on a continuous scale for each quality criterion. This is what Tournesol eventually proposes as an end product on its platform.

Contributors could have been asked to judge videos on a Likert scale [Lik32]. But, while it provides valuable information [Nor10, WTL16], the Likert scale has been widely criticized for being an unreliable measure that is difficult to interpret [Alb97, JKCP15, Sub16]. In the case of Tournesol, we were concerned in particular that contributors would abuse extreme values of the scale, or that different contributors would use the values of the scale to mean different things [LJMZ02]. Intuitively, for instance, if a math video is consistently reviewed by extremely rigorous contributors, it might end up with a worse "reliable and not misleading" score than a video on another topic that was reviewed by much less demanding contributors. Moreover, we aim to obtain a much more fine-grained judgment of video quality.
To circumvent the difficulties of extreme scoring and of incomparable individual scales, Tournesol asks contributors to provide comparison-based ratings (see Figure 2). Namely, contributors are asked to select two videos, and to tell Tournesol which of the two videos should rather be recommended at scale. Moreover, rather than a binary decision, the contributor is asked to provide the judgment by moving a slider on a continuous scale, from 0 to 100. When rating a quality criterion Q, the value 0 means that the contributor judges that the left video is far better in terms of Q, while the value 100 means that they believe the right video is far better in terms of Q. A report of 50 means that the contributor left the slider in the middle, as they judge that the two videos should be scored similarly according to Q. We believe that such comparative judgments will yield more reliable human judgments. We hope that this will in turn allow for the design of more reliable algorithm training.

Tournesol currently proposes ten quality criteria to rate, only one of which is enabled by default. This default quality criterion is "Should be largely recommended". Tournesol chose to single out this criterion as it is arguably the bottom-line criterion for constructing recommendation algorithms. Moreover, Tournesol chose to make it the only default quality criterion to facilitate the use of Tournesol for a large public of contributors. We were particularly concerned about the risk of overwhelming new contributors with too many complex queries. Nevertheless, Tournesol provides contributors with the possibility to rate nine other optional quality criteria:

Reliable and not misleading: Content that scores high on 'reliable and not misleading' should make nearly all viewers improve their global world model, despite viewers' biases and motivated reasoning.

Important and actionable: Content that scores high on 'important and actionable' should present data and arguments with major consequences, as well as actionable plans that would have a large impact.

Engaging and thought-provoking: Content that scores high on 'engaging and thought-provoking' should catch the attention of a larger audience, trigger their curiosity and a desire to find out more or to question their own beliefs.

Encourages better habits: Content that scores high on 'encourages better habits' should be successful at motivating viewers to grow and improve themselves.

Clear and pedagogical: Content that scores high on 'clear and pedagogical' should help viewers understand all the elements that lead to a conclusion.

Layman-friendly: Content that scores high on 'layman-friendly' should be accessible to a very large audience.

Diversity and inclusion: Content that scores high on 'diversity and inclusion' should celebrate diversity and be appealing to minority groups.

Resilience to backfiring risks: Content that scores high on 'resilience to backfiring risks' should be safe to recommend to all sorts of viewers, with limited risks of misunderstandings.

Entertaining and relaxing: Content that scores high on 'entertaining and relaxing' should entertain and relax viewers.

While detailed descriptions of all the criteria are provided on the Tournesol wiki, at https://wiki.tournesol.app/index.php/Quality_criteria, data scientists and researchers should probably not expect most contributors to thoroughly read our descriptions.
Arguably, most contributors will rather judge these criteria according to their own understanding, which will mostly be based on the names of the criteria. Note also that contributors can skip the judgment of a quality criterion; the database is thus partially sparse in this regard. Tournesol also proposes the option for contributors to assess their confidence in their ratings along each quality criterion, on a scale from 0 to 3, as illustrated in Figure 3. A confidence of 0 is equivalent to skipping the criterion altogether. Such data are currently used by our learning algorithm to determine both the contributors' scores and their impacts on the global Tournesol scores.

To guarantee the security of our data, Tournesol aims to verify that every account is owned and controlled by a human, and that this human only owns and controls this single account on the platform. In other words, Tournesol aims to obtain a Proof of Personhood [BKJ+17] to verify each active Tournesol account, and to thereby prevent Sybil attacks [Dou02]. Unfortunately, there is currently no reliable and scalable solution for Proof of Personhood. Today's main solution is email certification. More precisely, when they create a Tournesol account, contributors are asked to validate, if possible, an email address from a trusted email domain. The list of trusted email domains is currently managed manually. An email domain is considered trusted if it seems sufficiently unlikely that a large number of fake accounts can be created from this email domain. Clearly, this excludes email provider domains like @gmail.com and personal domains like @mypersonal-website.com. Indeed, the concern is not only that the email domain owner may maliciously create a large number of fake accounts; it is also that they may be hacked by a malicious entity that would create such fake accounts. The list of trusted email domains is available at https://tournesol.app/email_domains. It includes domains like @epfl.ch, @who.int and @rsf.org. Evidently, however, this solution is still highly imperfect. On one hand, it does not guarantee the absence of fake accounts. On the other hand, and perhaps more importantly, it excludes most potential contributors from participating.

Figure 4: Any account on Tournesol can vouch for another account. This helps Tournesol verify more accounts in a secure manner.

Tournesol also proposes a vouching mechanism (see Figure 4). Namely, any account can vouch for the authenticity of another account. More precisely, the account must vouch that the other account is used by a human who is not using any other account on the platform. Each email-verified account is then given a certain amount of vouching power, and each account must receive a certain amount of vouching (weighted by vouching power) to be certified. We refer to https://wiki.tournesol.app/index.php/Vouching_mechanism for more (updated) details. We are currently investigating the design of a reliable, Byzantine-resilient and scalable vouching mechanism (see Section 5.2).
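To make the vouching rule just described concrete, here is a minimal illustrative sketch in Python. It assumes, purely for illustration, that every email-verified account carries one unit of vouching power and that two units of received vouching certify an account; the actual parameters and the actual Byzantine-resilient mechanism are still under design (see Section 5.2), so none of the names or thresholds below should be read as Tournesol's deployed rules.

```python
# Minimal sketch of a weighted-vouching certification rule, under hypothetical
# parameters: email-verified accounts carry a vouching power of 1.0, and an
# account is certified once the vouching power it receives reaches 2.0.

VOUCH_POWER_VERIFIED = 1.0     # hypothetical power of a certified account
CERTIFICATION_THRESHOLD = 2.0  # hypothetical certification threshold

def certified_accounts(verified: set[str],
                       vouches: list[tuple[str, str]]) -> set[str]:
    """Return accounts certified by email verification or by received vouches.

    `vouches` is a list of (voucher, vouchee) pairs, each asserting that
    `vouchee` is a human controlling no other account on the platform.
    """
    certified = set(verified)
    changed = True
    while changed:  # iterate: newly certified accounts may themselves vouch
        changed = False
        received: dict[str, float] = {}
        for voucher, vouchee in vouches:
            if voucher in certified and vouchee not in certified:
                received[vouchee] = received.get(vouchee, 0.0) + VOUCH_POWER_VERIFIED
        for account, power in received.items():
            if power >= CERTIFICATION_THRESHOLD:
                certified.add(account)
                changed = True
    return certified

# Example: alice and bob are email-verified; both vouch for carol.
print(certified_accounts({"alice", "bob"},
                         [("alice", "carol"), ("bob", "carol")]))
# -> {'alice', 'bob', 'carol'}
```

A real mechanism would additionally have to bound the damage that a minority of malicious vouchers can cause, which is precisely the open design question discussed in Section 5.2.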
To better understand our contributors' rating patterns, and to identify which ratings follow from thoughtful reflection, Tournesol measures the contributors' response times. Tournesol also records the motions of the sliders. We believe this to be particularly important for volition learning (see Section 5.8).

Contributors can provide ratings publicly or privately. More precisely, each contributor can select the privacy setting of any video they rate. If a video is rated privately, then all its comparisons to any other video are recorded privately; namely, only Tournesol's server has access to such data. Conversely, all comparisons that involve two publicly rated videos are public, and can be downloaded from the Tournesol home page. Overall, we encourage transparency in our contributors, as we believe that this will foster important research on human judgments, and help make safer and more ethical algorithms. However, we acknowledge that, because of social and political pressures, some judgments are dangerous to make public. In such cases, Tournesol allows contributors to provide these judgments in a private way that nevertheless affects the Tournesol global scores, and thus what will be recommended by our platform.

Each Tournesol contributor has a dedicated Tournesol contributor page, where they can access their individual scores for different videos, as well as statistics based on their recommendations. On this page, we also ask contributors to provide personal information, publicly or privately. Again, we encourage transparency from our contributors, as this will greatly facilitate research and the design of solutions for safer and more ethical algorithms. But we acknowledge that this comes at a cost for our contributors, which they may want to avoid. Personal information of great interest to Tournesol includes the contributors' expertise and degrees, as well as their demographic data. Such data are particularly critical for Tournesol to understand its sampling bias, and to design debiasing solutions to avoid discriminatory algorithms.

In this section, we briefly describe the learning algorithm used by Tournesol to transform comparison-based ratings into individual and global scores. We hope that this will increase the transparency of Tournesol, inspire further research into leveraging our data, and stress the importance we assign to strategyproof learning. Figure 5 lists the current top recommendations on Tournesol. Note that each quality criterion is treated independently. Thus, without loss of generality, we only focus on a single criterion here.

Tournesol uses the Licchavi framework [FGH21]. More precisely, denote $[N] = \{1, \ldots, N\}$ the set of verified Tournesol contributors, and $[V] = \{1, \ldots, V\}$ the set of rated videos. We denote $\mathcal{D}_n = \{(v, w, r)\}$ a set of ratings by contributor $n \in [N]$. We assume that $r = (\mathrm{slider} - 50)/50 \in [-1, 1]$, where $\mathrm{slider} \in [0, 100]$ is the position of the slider discussed in Section 2.1. The variables $\theta_{nv}$ correspond to the (learned) score of video $v$ given by contributor $n$, while $\rho_v$ is the global score of video $v$. Finally, for clarity, we denote $\mathcal{D}$ and $\theta$ the tuples of the data $\mathcal{D}_n$ and of the scores $\theta_n$. Tournesol then considers a loss function of the form
$$\mathrm{Loss}(\theta, \rho \mid \mathcal{D}) = \sum_{n \in [N]} \sum_{(v, w, r) \in \mathcal{D}_n} \ell(\theta_{nv} - \theta_{nw}, r) + \lambda \sum_{n \in [N]} \sum_{v \in [V]} w_{nv} \left| \theta_{nv} - \rho_v \right| + \nu \sum_{v \in [V]} \rho_v^2, \qquad (1)$$
where $\ell$ is a loss-per-input function which will be detailed in the next section, $\lambda$ is the weight of collaboration, $w_{nv}$ is the weight of contributor $n$ on the score of video $v$, and $\nu$ is the weight of the regularization of the global scores. A precise analysis of this loss function is beyond the scope of this paper, and is currently being investigated. Tournesol's individual scores $\theta_{nv}$ and global scores $\rho_v$ are then computed by minimizing the Tournesol loss function. In particular, the global scores are used to make recommendations (see Figure 6) and are displayed on YouTube for users of our browser extension (see Figure 7).
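To make this concrete, here is a minimal Python sketch of the loss of equation (1), following our reconstruction above (the exact regularizers of the deployed TensorFlow implementation may differ in detail). The function and variable names are ours, and the loss-per-input ℓ is passed as a parameter, since it is only detailed in the next section.

```python
import math

def slider_to_rating(slider: int) -> float:
    """Map a slider position in [0, 100] to a rating r in [-1, 1] (Section 2.1)."""
    return (slider - 50) / 50

def tournesol_loss(theta, rho, data, weights, lam, nu, loss_per_input):
    """Sketch of the loss of equation (1), as reconstructed above.

    theta[n][v]:   individual score of video v for contributor n
    rho[v]:        global score of video v
    data[n]:       list of (v, w, r) comparisons by contributor n
    weights[n][v]: weight w_nv of contributor n on video v
    """
    total = 0.0
    for n, comparisons in data.items():
        for v, w, r in comparisons:  # comparison-based fit term
            total += loss_per_input(theta[n][v] - theta[n][w], r)
        for v, w_nv in weights[n].items():  # l1 collaborative regularization
            total += lam * w_nv * abs(theta[n][v] - rho[v])
    total += nu * sum(s * s for s in rho.values())  # global-score regularization
    return total

# Tiny example: one contributor, two videos, one comparison at slider = 20.
theta = {"n1": {"v1": 0.3, "v2": -0.3}}
rho = {"v1": 0.1, "v2": -0.1}
data = {"n1": [("v1", "v2", slider_to_rating(20))]}
weights = {"n1": {"v1": 0.5, "v2": 0.5}}
print(tournesol_loss(theta, rho, data, weights, lam=1.0, nu=1.0,
                     loss_per_input=lambda t, r: math.log(1 + math.exp(t * r))))
# ~0.749
```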
Let us now detail the loss-per-input function currently used by Tournesol. We derive this loss from the classical Bradley-Terry model [BT52], which assumes that video comparisons are binary: either video $v$ is better than $w$ (expressed by a rating $r = 1$), or vice versa ($r = -1$). The probability of $r = 1$ (i.e., preferring video $v$ to video $w$) depends on the difference $t = \theta_{nv} - \theta_{nw}$ between the videos' scores. The larger the difference $t$ is, the more probable it is that the contributor declares preferring $v$ to $w$. The loss is then the negative log-likelihood of the data, assuming independence of the different ratings, i.e.,
$$\ell(t, r) = \ln(1 + \exp(tr)). \qquad (2)$$
Note that this loss is also essentially the logistic regression loss. The usual interpretation of this loss is that $v$ will be one point above $w$ (i.e., $t = 1$) if it is $e$ times more likely to be rated better. However, this model is no longer adequate when contributors provide continuous ratings $r \in (-1, 1)$. In fact, with the Bradley-Terry model, providing a rating $r = 0$ is equivalent to not contributing at all. However, a rating $r = 0$ may be more accurately interpreted as the contributor saying that the two videos should have similar scores. The Bradley-Terry model fails to capture the nuanced information contained within continuous ratings. This remark calls for future research to investigate alternative learning algorithms, which may yield better interpretability, and may better capture the intuitive idea that, when a contributor judges $r \approx 0$, they are actually saying that the two compared videos should have similar scores.

We ran preliminary experiments to understand the impact of the hyperparameters of our learning model. Intuitively, $\lambda$ measures how much the global scores are used to adjust each contributor's local scores. Large values of $\lambda$ mean that the global scores are regarded as a very reliable prior. Smaller values of $\lambda$ imply that the contributor's scores are likely to quickly diverge from the global scores. In fact, $1/\lambda$ is the order of magnitude of the prior standard deviation of contributor $n$'s score for video $v$, given the common score $\rho_v$. Intuitively, the more diversity there is in contributors' judgments, the smaller the value of $\lambda$ should be. Future research will address how to dynamically adjust $\lambda$.

The hyperparameter $\nu$ weighs the prior regularization of the scores. More importantly, its value determines the Byzantine resilience of our learning model [FGH21]. Larger values of $\nu$ imply that a single contributor can hardly affect the global scores. As the number of contributors on Tournesol grows, we plan to decrease the value of $\nu$. In particular, when the cumulative weight of the contributors who rated a video is small compared to $\nu$, the global score of this video cannot be large. Further investigations of the impact of the hyperparameters, and of what a safe and ethical choice of hyperparameters should be, are needed. Note, however, that the effect of the hyperparameter $\nu$ vanishes in the limit of a large number of contributors. In this limit, the global scores are simply the medians of the contributors' scores. Likewise, the effect of the hyperparameter $\lambda$ vanishes in the limit of a large amount of data per contributor. In other words, if Tournesol is used by a large number of very active contributors, then the choice of the hyperparameters hardly matters. Currently, the platform uses the values $\lambda = \nu = 1$.

Finally, we defined $w_{nv} = R_{nv}/(C + R_{nv})$, where $R_{nv}$ is the number of times contributor $n$ rated video $v$ against other videos. This favors contributors who provided more ratings of a video. We set the hyperparameter $C = 3$. As a result, a contributor obtains half of their maximal influence by rating a video three times, and 75% of it by rating it nine times.
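As an illustration, the following short sketch implements the loss-per-input of equation (2) and the contributor weight $w_{nv}$ defined above; the function names are ours. It reproduces the two numerical claims about $C = 3$, and shows why a rating $r = 0$ carries no information under the binomial Bradley-Terry model.

```python
import math

def bt_loss(t: float, r: float) -> float:
    """Binomial Bradley-Terry loss of equation (2): l(t, r) = ln(1 + exp(t*r)),
    where t = theta_nv - theta_nw is the score difference and r is in [-1, 1]."""
    return math.log(1 + math.exp(t * r))

def contributor_weight(r_nv: int, c: float = 3.0) -> float:
    """Weight w_nv = R_nv / (C + R_nv), with C = 3 as used on the platform."""
    return r_nv / (c + r_nv)

print(contributor_weight(3))  # 0.5: half of maximal influence after 3 ratings
print(contributor_weight(9))  # 0.75: three quarters of it after 9 ratings

# A rating r = 0 yields a constant loss ln(2) regardless of the score
# difference t, i.e., it carries no information under this model, as
# discussed above.
print(bt_loss(0.0, 0.0), bt_loss(5.0, 0.0))  # both equal ln(2) ~ 0.693
```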
Note that, given that ratings are sparse, inferring individual scores from individual ratings only is suboptimal. In particular, this approach is vulnerable to Stein's paradox [Ste56, JS61]. Instead, to learn individual scores, even for non-verified contributors, we leverage the learned global scores $\rho^*$ (note that minimizing the loss (1) does this automatically for verified contributors). This corresponds to minimizing the following loss for every non-verified contributor $n$:
$$\mathrm{Loss}_n(\theta_n \mid \mathcal{D}_n, \rho^*) = \sum_{(v, w, r) \in \mathcal{D}_n} \ell(\theta_{nv} - \theta_{nw}, r) + \lambda \sum_{v \in [V]} w_{nv} \left| \theta_{nv} - \rho^*_v \right|. \qquad (3)$$
Such learned scores are used to provide individual recommendations on the non-verified contributors' Tournesol pages.

Tournesol applies the learning algorithm described above to each of the quality criteria. This allows users to obtain customizable recommendations and search results, by adjusting the importance they assign to the different quality criteria. The videos are then ranked based on the weighted sum of the scores per criterion, where the weights are given by the positions of the user's sliders (see Figure 5). However, we leave open the question of designing personalized robustly beneficial content recommendation, as discussed in Section 5.14. We stress that the Licchavi framework with such $\ell_1$ collaborative regularizations has been tied to strategyproofness theorems in [FGH21]. We acknowledge, however, that the strategyproofness theorem of [FGH21] does not quite apply to our setting, partly because our losses-per-input are not actually gradient-PAC, but more importantly because our queries are not coordinatewise. Nevertheless, our framework applies the fundamental fairness principle "one voter, one unit force" proposed by [EMFGH21], which suggests a reasonable amount of strategyproofness (and of Byzantine resilience). Future research is however needed to better determine the extent to which our algorithms are robust and strategyproof.

To highlight the value of our database, we present preliminary data analyses. Evidently, our data will evolve, hopefully drastically, in the coming months. Figure 9 displays the number of contributions per user among the current top contributors. Perhaps unsurprisingly, this statistic seems to follow a heavy-tailed distribution, with a few contributors providing most of the ratings, and the majority of them providing very few ratings. Fortunately, our learning algorithm, based on Licchavi with norm-based collaborative regularizations, fits the principle "one voter, one unit force" [EMFGH21], and thus prevents the leading contributors from having an uncontrollable influence on global scores.

The left part of Figure 10 shows the connectivity of the graph of video comparisons. In this graph, two videos are connected if they have been compared at least once on the Tournesol platform. While the graph features a dense center, many videos are only weakly connected to most other videos. The connectivity of the graph is important, as it is what guarantees that all videos are scored along the same scale. The right part of Figure 10 displays the commonalities between contributors. Namely, the nodes of the graph are contributors, and two contributors are connected if their lists of rated videos have a non-empty intersection.
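Since the connectivity of the comparison graph is what anchors all videos to a common scale, a simple diagnostic is to count the connected components of this graph. Below is a minimal union-find sketch (our own illustrative code, not Tournesol's): scores can only be compared meaningfully within a single component.

```python
def comparison_components(comparisons: list[tuple[str, str]]) -> int:
    """Number of connected components of the video comparison graph.

    Two videos are connected if they were compared at least once; scores are
    only anchored to a common scale within a connected component.
    """
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for v, w in comparisons:
        parent[find(v)] = find(w)  # union the two videos' components

    return len({find(x) for x in parent})

# Example: v1-v2-v3 form one component, v4-v5 another.
print(comparison_components([("v1", "v2"), ("v2", "v3"), ("v4", "v5")]))  # 2
```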
Figure 11a reports the correlations between quality criteria. It shows that the choice of our criteria is somewhat reasonable, as most criteria are only weakly correlated. Some criteria are however somewhat redundant, like "reliable and not misleading", "clear and pedagogical" and "resilience to backfiring risks". Similarly, "important and actionable" and "encourages better habits" are also quite correlated. As expected given Berkson's paradox [Ber46], the correlations decrease if we only consider the top 10% of videos on Tournesol (Figure 11b). This is important, as these are the videos that Tournesol users will be most exposed to.

Figure 11: Correlations between quality criteria, whose numbering is given in Table 1. (a) All videos rated on Tournesol. (b) Top 10% of videos rated on Tournesol. Ideally, a set of criteria should be orthogonal enough to provide complementary information with minimal effort on the contributor's side.

Table 1: Numbering of the criteria used in Figure 11.
1. Should be largely recommended
2. Reliable and not misleading
3. Important and actionable
4. Engaging and thought-provoking
5. Clear and pedagogical
6. Layman-friendly
7. Diversity and inclusion
8. Resilience to backfiring risks
9. Encourages better habits
10. Entertaining and relaxing

As it is not formally defined how contributors should rate a pair of videos, we expected many different rating styles. Figure 12 shows two contributors whose rating patterns are drastically different, despite rating similar videos. Namely, while contributor "aidjango" provided ratings close to "indifferent", contributor "le science4all" provided more extreme ratings. This suggests that the discrepancies between their individual scores are due to their rating styles, rather than to actual differences in their judgments. As discussed in Section 5.4, research should address how to adapt the learning model to contributors' different rating patterns.

Figure 13 shows the number of videos per Pareto rank. To recall, the Pareto rank of a video is the number of videos that need to be removed so that the video becomes Pareto-optimal, i.e., a video that no other video outperforms on all quality criteria. We note that, currently, there is a large number of Pareto-optimal videos (> 50), because the 10 dimensions allow for plenty of trade-offs. This highlights the usefulness of customizable recommendations. It is also noteworthy that many videos have rank 3 or less. This is likely due to the fact that current contributors mostly rate videos that they deem particularly worthy of recommendation. It will be interesting to see how these statistics evolve as our database grows, and as it comes to contain a more diverse range of video rating styles.
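For concreteness, here is a short sketch computing Pareto ranks as defined above; it is our own illustrative code. We interpret "outperforms on all quality criteria" as being strictly better on every criterion, which is one possible dominance convention; the number of dominating videos is then exactly the number of videos that must be removed for a video to become Pareto-optimal.

```python
def pareto_ranks(scores: list[list[float]]) -> list[int]:
    """Pareto rank of each video: the number of other videos that outperform
    it on all quality criteria, i.e., how many videos must be removed for it
    to become Pareto-optimal. 'Outperforms' is taken here as strictly better
    on every criterion (one dominance convention among several possible)."""
    def dominates(a: list[float], b: list[float]) -> bool:
        return all(x > y for x, y in zip(a, b))

    return [sum(dominates(other, video) for other in scores if other is not video)
            for video in scores]

# Example with two criteria: the first two videos are Pareto-optimal (rank 0);
# the third is outperformed on both criteria by the first video only.
print(pareto_ranks([[0.9, 0.2], [0.1, 0.8], [0.5, 0.1]]))  # -> [0, 0, 1]
```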
Tournesol raises numerous fascinating research challenges. Below, we sketch some of them. We would welcome potential collaborations on any of the following challenges, and we actively encourage the use of our public database for good and, in particular, for research purposes.

We expect the combination of many different quality criteria to yield a more reliable judgment of what content ought to be recommended, at scale or to a specific user. However, the appropriate aggregation of our different quality criteria is still unclear, especially given probable nonlinear phenomena. For instance, a reasonably reliable content seems vastly superior to a misleading content, whereas an extremely reliable content does not seem vastly superior to a reasonably reliable content. We hope to investigate how best to do this, by comparing aggregations of the optional quality criteria to the "Should be largely recommended" criterion, and by factoring in the discrepancies between different contributors' ratings, given their expertise and contributing profiles.

We hope to leverage vouching to verify more contributors in a Byzantine-resilient manner. However, it is so far unclear which algorithms best achieve this. We are currently investigating this relatively general problem (see Section 2.5), which is tightly connected to ideas such as the Web of Trust [Mue21] or reputation mechanisms [dORM+20].

Unfortunately, as in many online participatory projects [ADK+18], we expect huge participation imbalances. Young, male, technology-savvy individuals are probably more likely to participate in Tournesol, which means that they will be overrepresented in the Tournesol database. We are currently investigating how to leverage our demographic data to debias the Tournesol recommendations, e.g., by giving stronger voting rights to individuals whose communities are underrepresented in the Tournesol database.

The current "binomial Bradley-Terry" model assumes that all contributors rate in the same manner, given implicit score differences. In practice, as discussed in Section 4.4 and as depicted in Figure 12, different contributors have very different rating habits. To make sure that the learned individual scores are fairly comparable, the rating model should be hyperparameterized, and the hyperparameters should be learned and adjusted per contributor. We are currently investigating ways to do this.

On technical topics like vaccination or climate change, especially when misconceptions are widespread in the general population, it seems desirable to assign more voting rights to experts, especially when judging the reliability of content within their domains of expertise. This issue is intimately connected to Condorcet's jury problem [Con85, NP82]. We are currently investigating how best to leverage personal information to determine appropriate voting rights.

To understand the reliability of Tournesol's data, it is critical to understand the psychology of contributors when they provide judgments. We are currently investigating contributors' thought processes, and how rating on Tournesol affects contributors' reasoning. This work is inspired by [LKK+19], who interviewed participants in their participatory ethical algorithm design process, and found that, by the participants' own assessments, participation enabled them to improve their ethical judgments and increased their trust in the participatory system.

To increase the quality and the quantity of the database, it is crucial for Tournesol to help contributors select the videos to compare in the best possible way. This is a challenging task, as it encompasses several considerations. On one hand, the comparison should provide valuable data, typically by querying interesting comparisons, involving videos with too few ratings, or better connecting the graph of pairwise video comparisons. On the other hand, it is important that this demands as little cognitive investment from the contributor as possible. This typically requires suggesting that contributors compare videos that they have recently watched and assessed. Finding out how best to do this is a research challenge that we hope to investigate soon.

Despite our efforts, we cannot expect the Tournesol database to contain fully reliable human judgments.
We expect that many ratings will be provided by contributors who, when providing their judgments, did not consider all the possible ramifications and unwanted side effects of promoting a video's content at scale. In particular, some judgments will arguably be more reliable than others. Understanding which judgments are more reliable is a vast research program that is critical to the design of safer and more ethical algorithms. More reliable judgments are sometimes called volitions, rather than preferences. Volitions are also often described as second-order preferences: they correspond to what we would prefer to prefer, rather than to what we instinctively prefer [Tar10, HE19]. We hope to initiate research on volition learning by leveraging the metadata of our database (see Section 2.6).

While we believe that they provide a very reasonable amount of privacy protection, our current algorithms require a deeper analysis to understand the extent to which they guarantee the protection of private ratings. Future research should also investigate how to strengthen privacy without excessively harming the quality and the security of the Tournesol scores, and how to leverage private personal information about our contributors (which is currently not used) in a privacy-preserving manner. Perhaps most importantly, Tournesol would ideally be able to leverage private ratings to score videos without being a single point of failure for private data protection. Moving forward, a challenging research avenue is to investigate the extent to which the mission of Tournesol can be achieved without any transfer of private information.

A longer-term goal is to fully decentralize Tournesol. In this vision, the data would no longer be stored on Tournesol's server, but would be replicated appropriately on a large number of contributors' devices. Moreover, the computation of the Tournesol scores should also be decentralized, while guaranteeing Byzantine resilience. Recent research in fully decentralized Byzantine learning has provided the building blocks of such a decentralization [EFG+20b], but more research is needed to understand how to best do so in the context of Tournesol.

Right now, Tournesol only leverages the Tournesol database to compute individual and global scores. However, in the longer term, it seems desirable to leverage additional information, such as the videos' channels or their descriptions, to generalize the individual scores of a contributor to videos that they have not rated (and not even watched!). In particular, the use of natural language processing on video captions is a promising avenue, though this probably requires a larger database than the current one.

Right now, Tournesol imposes our model on every contributor, and in particular the binomial Bradley-Terry loss (see Section 3.2). However, especially as we move towards algorithms able to generalize to videos the contributor has not watched yet, it seems desirable to enable each contributor to choose which learning model they want to use. To achieve this, Licchavi must be adapted to enable the collaboration of different local learning models.

In the long run, we hope that Tournesol's database will be useful to design more sophisticated ethical algorithmic products, such as ethical language models [BGMS21]. Typically, when prompted, such language models should consistently produce texts that are robustly beneficial to communicate, either at scale or to targeted consumers.
Determining how to combine large language models [FZS21] with Tournesol's database to design safe and ethical language models is currently a very open research challenge.

For most large-scale algorithmic systems, such as recommendation algorithms, the objective function is the main avenue we have to make them robustly beneficial. The problem of making this objective function safe and desirable to optimize is called the alignment problem [Yud16, Rus19, Hoa19]. While a large, secure and trustworthy database of reliable human judgments seems critical to solving alignment, it is so far unclear how to best leverage the Tournesol database to achieve this, especially while being personalized [FGH21] and while being robust to Goodhart's law [EMH21]. This is arguably the most challenging and the most exciting of all the problems Tournesol raises.

Summary. In this paper, we introduced Tournesol, a platform whose goal is to collect and curate a large, secure and trustworthy database of reliable human judgments. We discussed the main features of the platform, the algorithm we use today to leverage this database to make more robustly beneficial recommendations, and the numerous research challenges that must be overcome to make Tournesol a success. We strongly believe that the creation of such a database will stimulate research and development on ethical algorithms, which should help improve the information diet of billions of people. To achieve this ambitious goal, however, we will need support.

Call for action. To date, many individuals have voluntarily contributed to our code base and our database, for which we are immensely grateful. However, we still need a large amount of software development to make our platform more contributor-friendly, more secure and more scalable. For this reason, the Tournesol Association is currently searching for funding to hire top developers to maintain and develop the platform. Just as critically, Tournesol needs to attract a large number of contributors and to obtain human judgments from a diverse population. We would also be hugely grateful for the support of fellow researchers and institutions. To help us, try out the platform, and please spread the word! Your contributions will improve the quality of online discourse for everyone.

References

Tensorflow: A system for large-scale machine learning.
Jean-François Bonnefon, and Iyad Rahwan. The moral machine experiment.
Persistent anti-muslim bias in large language models.
How facebook, twitch, and youtube are handling live streams of the capitol mob attack. The Verge.
The likert scale revisited.
The hype machine: How social media disrupts our elections, our economy and our health, and how we must adapt. Currency.
Machine learning with adversaries: Byzantine tolerant gradient descent.
Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems.
Limitations of the application of fourfold table analysis to hospital data.
On the dangers of stochastic parrots: Can language models be too big?
Proof-of-personhood: Redemocratizing permissionless cryptocurrencies.
Existential risk prevention as global priority.
Rank analysis of incomplete block designs: I. The method of paired comparisons.
Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix. L'imprimerie royale.
Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel.
Blockchain reputation-based consensus: A scalable and resilient mechanism for distributed mistrusting applications.
The sybil attack.
The division of labor in society.
Microsoft asia's a.i. 'girlfriend' has a state-imposed filter to avoid sex & politics. LaptrinhX.
Lê Nguyên Hoang, and Sébastien Rouault. Collaborative learning as an agreement problem. CoRR, abs.
Lê Nguyên Hoang, and Sébastien Rouault. Collaborative learning in the jungle.
Distributed momentum for byzantine-resilient learning. CoRR, abs.
On the strategyproofness of the geometric median. ArXiv.
The hidden vulnerability of distributed learning in byzantium.
On goodhart's law, with an application to value alignment.
Facebook language predicts depression in medical records.
A theory of social comparison processes.
Facebook has shut down 5.4 billion fake accounts this year.
Strategyproof learning: Building trustworthy user-generated datasets. ArXiv.
The cultural evolution of civilizations.
Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.
Introduction to React.
It's already too late to stop the ai arms race; we must manage it instead.
Rules for a flat world: Why humans invented law and how to reinvent it for a complex global economy.
Le fabuleux chantier: Rendre l'intelligence artificielle robustement bénéfique.
Recommendation algorithms, a neglected opportunity for public health.
The definitive guide to Django: Web development done right.
Towards robust end-to-end alignment.
Science communication desperately needs more aligned recommendation algorithms.
Bullying, cyberbullying, and suicide. Archives of Suicide Research.
The relations among social media addiction, self-esteem, and life satisfaction in university students.
Likert scale: Explored and explained.
Estimation with quadratic loss.
A technique for the measurement of attitudes. Archives of Psychology.
Cultural differences in responses to a likert scale.
Webuildai: Participatory framework for algorithmic governance.
Efficient Learning from Comparisons.
What makes online content viral.
Neurotics can't focus: An in situ study of online multitasking in the workplace.
A survey on bias and fairness in machine learning.
The radicalization risks of GPT-3 and advanced neural language models. CoRR, abs.
Will this time be different? A review of the literature on the impact of artificial intelligence on employment, incomes and growth.
Let's attest! Multi-modal certificate exchange for the web of trust.
Likert scales, levels of measurement and the "laws" of statistics.
Optimal decision rules in uncertain dichotomous choice situations.
Ethics of artificial intelligence.
Information overload in the information age: A review of the literature from business administration, business psychology, and related disciplines with a bibliometric approach and framework development.
Practical Byzantine-resilient Stochastic Gradient Descent.
Auditing radicalization pathways on youtube.
Factfulness: Ten reasons we're wrong about the world, and why things are better than you think.
Human compatible: Artificial intelligence and the problem of control. Penguin.
What are you optimizing for? Aligning recommender systems with human values.
The value learning problem.
Youtube's ai is the puppet master over most of what you watch.
Dirt cheap web-scale parallel text from the common crawl.
Inadmissibility of the usual estimator for the mean of a multivariate normal distribution.
Using likert type data in social science research: Confusion, issues and challenges.
A survey on hate speech detection using natural language processing.
Coherent extrapolated volition: A meta-level approach to machine ethics. Machine Intelligence Research Institute.
Time distortion when users at-risk for social media addiction engage in non-social media tasks.
A survey on deep learning techniques for privacy-preserving.
L'affaiblissement des corps intermédiaires par les plateformes Internet. Le cas des médias et des syndicats français au moment des Gilets jaunes.
Aligning recommender systems as cause area. Effective Altruism Forum.
Computational propaganda: Political parties, politicians, and political manipulation on social media.
Misinformation in social media: Definition, manipulation, and detection.
The reality game: How the next wave of technology will break the truth.
Superglue: A stickier benchmark for general-purpose language understanding systems.
GLUE: A multi-task benchmark and analysis platform for natural language understanding.
Another look at likert scales.
The ai alignment problem: Why it is hard, and where to start. Symbolic Systems Distinguished Speaker.
The design and implementation of xiaoice, an empathetic social chatbot.