key: cord-0459829-ic7fwd7c
authors: Sawicki, Jan; Ganzha, Maria; Paprzycki, Marcin; Bădică, Amelia
title: Exploring usability of Reddit in data science and knowledge processing
date: 2021-10-05
journal: nan
DOI: nan
sha: c82bc61f70bcecc48f43f83aede9428ef0f7caaa
doc_id: 459829
cord_uid: ic7fwd7c

This contribution argues that Reddit, as a massive, categorized, open-access dataset, is a useful data source for "almost any topic". Hence, it can be used in data science, e.g. for knowledge exploration. This statement is backed up by the presented analysis, based on 180 manually annotated papers related to Reddit itself, and on data acquired from popular databases of scientific papers. Finally, an open source tool is introduced, which provides easy access to Reddit resources and an exploratory data analysis of how Reddit covers selected topics. These functions can be used as a prelude to a broader exploration of Reddit's applicability.

Recently, social networks and content sharing networks have become popular repositories of data, used for information and knowledge processing (especially for information retrieval). The aim of this work is to explore the usability of Reddit as a data source. In this context, we present a review of scientific literature about Reddit itself and of its presence in scientific databases, and elaborate on its "topical coverage". Moreover, for the latter study, a specialized tool (Reddit-TUDFE) is introduced, which allows for a fast check of Reddit coverage of a selected topic. The key contributions of this work are answers to the following research questions (RQs):
• RQ1: What are the most popular methods to acquire Reddit data? (Do they allow capturing graph networks?)
• RQ2: What problems are the most researched when using Reddit as a dataset?
• RQ3: How does Reddit usage in data science change over time? Is it declining or is it increasing?
• RQ4: Are there any popular topics that are not (substantially) covered on Reddit?
• RQ5: Is Reddit used as a single dataset, or with datasets from other online platforms?

These questions are essential for further planned research. Positive answers would mean that Reddit is a proper choice for proceeding with the project of information retrieval about popular trends, using graph databases and complex networks. Moreover, positive answers would indicate that Reddit may be a competitor (or a companion) to explorations based on more popular data sources, like Twitter.

2. What is Reddit.
Let us start from a brief description of Reddit. It is a web content rating and discussion website [31]. It was created in 2005 and is ranked as the 17th most visited website in the world, with over 430 million monthly active users and a total of over 13 billion posts and comments. The structure of Reddit is illustrated in Figure 1. Reddit is divided into thematic subfora (so-called subreddits), dynamically created by its users. Therefore, the topic structure is systematically evolving in response to user needs. Each subreddit has its moderators, who may supervise submissions and comments. Comments are linked to submissions, or to earlier comments, forming a tree-like structure.

2.1. Content access rules and restrictions.
Most of the subreddits are public (for registered and non-registered users). There are some exceptions based, for instance, on karma points (i.e. user's score), comments, gold (i.e. Reddit's currency that can be purchased with real money), moderator status, time on Reddit, username and others. For instance, such a restriction can be applied even to a Harry Potter house preference (e.g. r/gryffindor). Here, let us note that the Reddit topic exploration tool (introduced in Section 5) is based only on access to publicly available data. Not only is the data on Reddit publicly accessible (with the exception of private communities), it is also made available via the official Reddit API.
However, in the course of the literature review, it was found that most researchers do not actually use the official Reddit API. Over 90% of the analyzed papers either use ready datasets scraped earlier from Reddit and posted online (possibly in an annotated form), or choose the Pushshift API [4]. None of the analysed papers stated an explicit reason for this choice (very few even mention how their datasets were retrieved). However, practical testing of the capabilities of the Reddit API and the Pushshift API shows that the key factor could have been that the Reddit API does not allow easy retrieval of historical data, while the Pushshift API does. Hence, when developing the Reddit data exploration tool, the Pushshift API was used.

3. Data acquisition and processing.
To explore Reddit, as seen by the scientists, a dataset of all, most recent, papers available on arXiv has been assembled: a total of 180 papers. All of them were related to Reddit and submitted to arXiv between 01-01-2019 and 01-03-2021 (and retrieved on 30-03-2021). This dataset has been processed both manually and automatically. First, the collected papers have been manually annotated with four attribute sets: topic (a general area of research), methods (theoretical approach, e.g. neural network, text embedding), dataset, and technologies (practical software, e.g. BERT [10]). Next, the obtained annotations were merged, using the arXiv identification code, with the publicly available data, i.e. the content (title and raw text) and the bibliometric metadata. This allowed extraction of the information presented in Section 4. All collected content has been converted to raw text files, using the PDF Miner software [38]. Next, the key features of titles and texts have been cleaned and mined using the NLTK framework [26] (for sentiment and subjectivity) and TF-IDF [36] for vectorization (the latter as implemented in the scikit-learn library [34]).

As a result of processing the collected data, a number of observations were formulated.
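The TF-IDF vectorization step mentioned above can be illustrated with a short Python sketch. It is not the authors' pipeline: the text attributes the vectorization to the scikit-learn library, whereas the sketch below re-implements a smoothed TF-IDF variant with the standard library only, so that it runs without dependencies. The function names (`tokenize`, `tfidf`) and the three sample titles are hypothetical and serve only as an illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(documents):
    """Return one {term: weight} dict per document.

    Assumed weighting: raw term frequency times a smoothed inverse
    document frequency, idf(t) = ln((1 + N) / (1 + df(t))) + 1,
    followed by L2 normalisation of each document vector.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Hypothetical paper titles, used only to exercise the function.
titles = [
    "Hate speech detection on Reddit",
    "Conversation modelling with Reddit threads",
    "Sarcasm detection with LSTM networks",
]
vectors = tfidf(titles)
for title, vector in zip(titles, vectors):
    ranked = sorted(vector.items(), key=lambda item: -item[1])
    print(title, "->", [term for term, _ in ranked[:2]])
```

Terms that occur in several titles (here "reddit" and "detection") receive a lower weight than terms unique to one title, which is the property that makes TF-IDF useful for separating papers by topic.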
Let us summarize the most important observations. As shown in Figure 2, there is a significant growth in the number of articles (related to Reddit) published after March 2020 (correlated with the outbreak of the COVID-19 pandemic) and in October 2020 (correlated with notification dates for many scientific conferences [40]). The latter fact was also verified during manual processing of the collected data. This suggests that Reddit was used to provide data related to the COVID-19 pandemic, and that it is used as a data source for contributions to conferences related to, broadly understood, data analytics. Next, as seen in Figure 3, the majority of papers were written by 2-4 authors, with one having 26 authors [13].

Fig. 4 (author labels recovered from the plot): Starnini, Chen Ling, Barry Bradlyn, Ivan Garibay, Jianfeng Gao, Matteo Cinelli, Alessandro Galeazzi, Walter Quattrociocchi, Michael Sirivianos, Lu Wang, Tristan Caulfield.

Finally, Figure 4 shows that the most prolific authors of Reddit-related papers were Savvas Zannettou (Max-Planck-Institute), Jeremy Blackburn (Binghamton University) and Gianluca Stringhini (Boston University). This seems to suggest that a large amount of scientific content, generated while studying Reddit posts, is delivered by a close circle of scientists.

Analysis of topic, methods and technology.
Topics, methods and technologies are key to answering RQ1 and RQ2. These were extracted manually from the collected papers. They are summarized in Figures 5, 6 and 7.

Fig. 5 (topic labels recovered from the plot): conversation analysis, covid, hate speech, conversation modelling, political, drugs, conversation generation, sarcasm, information spread, suicide, memes, arguments, conspiracy theories, image analysis, conversation prediction, mental health, trends analysis.

Figures 5 and 6 show clearly that the most popular research topic is conversation, which matches the fact that Reddit is a discussion forum.
Due to the timing of this work (overlapping with the COVID-19 pandemic), the second most common topic is COVID (see Figure 5). Since Reddit consists mostly of text-based discussions, it is not surprising that the two most common methods in Reddit-related research are text embeddings, used in text processing, and networks, used for social network analysis. Note that, in the reported results, "network" (understood as a graph) and "neural network" are separate terms. Regarding technologies (shown in Figure 7), over 45% of studies used the Pushshift API [4] for Reddit data extraction, and over 35% applied BERT [10] embeddings (and their variations) for natural language processing.

Finally, topics and methods have been combined in a correlation heatmap (Figure 8). Here, a few significant correlations have been established. However, they have to be considered keeping in mind that they materialize in the context of a specific dataset, created from contributions reporting research that used Reddit as a data source. Therefore, no claim is made that these observations can be immediately generalized beyond the dataset used in this work. However, based on general knowledge of the field, they seem to be in line with more general trends.
• Papers related to drugs typically use word embeddings. However, this can be related to the overall popularity of word embeddings in the research conducted in the early 2020s (see, for instance, the citation count for [10]).
• Networks are typically applied in the analysis of trends, e.g. topic popularity (this is a key finding for RQ1).
• Articles dealing with sarcasm often use LSTM networks.
• Research devoted to conversation generation typically applies the BLEU metric.

The topic of information and knowledge retrieval is one of the main aims of the undertaken analysis. Hence, this category was checked specifically.
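As a side note to the topic/method correlation heatmap discussed above (Figure 8), the following Python sketch shows one way such a table could be computed from manual annotations. It rests on assumptions: the text does not state which correlation measure was used, so the sketch uses the Pearson correlation of 0/1 indicator columns (the phi coefficient), and the six annotated "papers" are invented toy data, not the annotations of the 180 papers.

```python
import math

# Hypothetical annotations: each paper has a set of topics and a set of methods.
papers = [
    {"topics": {"drugs"}, "methods": {"word embeddings"}},
    {"topics": {"drugs"}, "methods": {"word embeddings"}},
    {"topics": {"sarcasm"}, "methods": {"lstm"}},
    {"topics": {"sarcasm"}, "methods": {"lstm", "word embeddings"}},
    {"topics": {"trends analysis"}, "methods": {"network"}},
    {"topics": {"trends analysis"}, "methods": {"network"}},
]


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant indicator carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def correlation_table(papers):
    """Map (topic, method) to the correlation of their 0/1 indicators."""
    topics = sorted({t for p in papers for t in p["topics"]})
    methods = sorted({m for p in papers for m in p["methods"]})
    table = {}
    for topic in topics:
        has_topic = [int(topic in p["topics"]) for p in papers]
        for method in methods:
            has_method = [int(method in p["methods"]) for p in papers]
            table[(topic, method)] = pearson(has_topic, has_method)
    return table


table = correlation_table(papers)
for (topic, method), value in sorted(table.items()):
    print(f"{topic:>16} x {method:<16} {value:+.2f}")
```

In this toy data, a topic that always co-occurs with one method (e.g. "trends analysis" and "network") reaches the maximal correlation, while pairs that never co-occur come out negative; a heatmap is then simply a colour-coded rendering of this table.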
Even though many works focus on information spreading in online communities [13, 41, 11, 12], there is hardly any focus purely on information/knowledge retrieval. There are precisely two papers (1% of the considered work) related to knowledge processing (specifically, knowledge graphs [6, 43]). Expanding the arXiv search, to capture all articles including the terms "knowledge" and "Reddit", resulted in 4 records, none of which is related to knowledge capture. Pairing the keyword "Reddit" with "information retrieval" or "information processing" yielded 0 results. Therefore, top knowledge processing/management-related conferences were searched, but only one contribution [18], about knowledge and Reddit, has been found (published by the K-CAP conference in 2011). This renders Reddit a source that is definitely underexplored in terms of knowledge/information mining.

Moving to RQ5, it was discovered that among papers that use Reddit, over 30% also use Twitter, which is a data source that is very often used for sentiment analysis [24]. Other datasets that have been utilized together with Reddit are: Facebook, 4chan, YouTube, and Gab. Each of them appears in less than 10% of the papers that used Reddit (details are shown in Figure 9). Datasets are rarely used in triplets, i.e. Reddit and two other datasets. The highest scoring triplets were: Reddit combined with Twitter and Facebook (6.6% of articles); Reddit used together with Twitter and 4chan (6%, e.g. [41, 42]); and Reddit studied jointly with Twitter and YouTube (5% of contributions, e.g. [5]). Finally, a single paper considers a combination of four datasets (i.e. Reddit, Twitter, Facebook, and Gab [7]).

An interesting use case of Reddit in a scientific environment has been found in "IEEE Top Programming Languages: Design, Methods, and Data Sources". This work shows a practical approach to an interesting research question; here, what are the top programming languages.
In this work Reddit is listed as one of the sources among others, such as Google Trends, Twitter, GitHub and Stack Overflow.

Fig. 9. Online platforms used as data sources together with Reddit (top platforms used in tandem with Reddit, % of articles using a platform).

4.3. Linguistic analysis.
During exploratory data analysis, various natural language processing techniques were applied. Among them, papers were also analysed linguistically. Specifically, sentiment analysis using the NLTK framework [26] and SentimentAnalyzer was applied. The observed polarization (depicted in Figure 10) indicates a negligible displacement towards positive sentiment. This was expected, and is consistent with previous studies on scientific literature sentiment [19]. However, the subjectivity measure (summarised in Figure 11) raised concerns. Admittedly, it has been claimed that scientific research may be subjective, as it needs to allow "leaps of faith" (see [9]). Moreover, some philosophers [30, 29] argue that subjectivity is intrinsic to human nature. However, it is also claimed (and for good reasons) that the foundation of the scientific method [32] revolves around aiming at objectivity. Hence, the results summarised in Figure 11, indicating a high level of subjectivity, were somewhat concerning. To establish the reason for this finding, the most "subjective" texts were studied directly. As a result, it was found that this is a false alarm. Specifically, the apparent shift towards subjectivity was caused by inaccuracy of the classifier (SentimentIntensityAnalyzer from nltk.sentiment). For further understanding, let us consider selected sentences from the most subjective (according to the NLTK metric) articles.
• "However, this openness formed a platform for the polarization of opinions and controversial discussions" [22] (score: 0.95)
• "(...) also presented an extended version of the study discussing potential racial bias in offensive content datasets (...)" [2] (score: 1.0)
• "All datasets only contain activity between 01/2015 and 10/2018" [16] (score: 1.0)

Moreover, let us also consider how the calculated subjectivity measure changes with a simple modification of selected statements (i.e. by removing particular words):
• Statement before transformation (score: 0.63): "Controversially initiated and non-controversially initiated cascades, (a,b,c) are controversially initiated posts' cascades while (d,e,f) are non-controversial posts' cascades where the red dots represent a comment labeled as controversial by Reddit that is directed to the post's author while a green dot is a comment labeled controversial by Reddit that is directed to another comment." [22]
• The same statement after transformation (score: 0.15): "initiated and initiated cascades, (a,b,c) are initiated posts' cascades while (d,e,f) are posts' cascades where the red dots represent a comment labeled as by Reddit that is directed to the post's author while a green dot is a comment labeled by Reddit that is directed to another comment." [22]

This suggests that simply using "subjective" (key)words (e.g. "controversial", "bias") in the text, regardless of their context, results in a radically increased value of the variable that is supposed to indicate subjectivity of the text. However, there are sentences that do not use such words and have also received a high subjectivity score. Hence, further research would be required into the way that the NLTK metric works and why, sometimes, it is rather misleading. However, this is outside the scope of the current contribution.

Reddit-based literature in scholarly databases.
Let us now address RQ3 and RQ4. Even though they cannot be unequivocally answered, possible answers can be experimentally explored.
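Before turning to the databases, the keyword-removal effect reported in the linguistic analysis can be made concrete with a small Python sketch. It deliberately does not use the NLTK classifier: the scorer below is a toy, context-free cue-word counter, the four-word lexicon `SUBJECTIVE_CUES` is an assumption made purely for illustration, and the resulting numbers are not comparable to the scores quoted in the text. The only point is that any score built from context-free keyword hits drops as soon as the cue words are deleted, even though the meaning of the sentence is otherwise unchanged.

```python
import re

# Hypothetical mini-lexicon of "subjective" cue words. A real analyser ships a
# far larger lexicon; this list is an assumption made for illustration only.
SUBJECTIVE_CUES = {"controversial", "controversially", "bias", "polarization"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cue_score(text):
    """Share of tokens that are subjective cue words (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in SUBJECTIVE_CUES)
    return hits / len(tokens)


def strip_cues(text):
    """Remove every cue word, keeping the remaining tokens in order."""
    return " ".join(t for t in tokenize(text) if t not in SUBJECTIVE_CUES)


# Sentence quoted in the text from [22].
sentence = (
    "However, this openness formed a platform for the polarization "
    "of opinions and controversial discussions"
)
before = cue_score(sentence)
after = cue_score(strip_cues(sentence))
print(f"before: {before:.2f}, after: {after:.2f}")
```

A scorer of this kind cannot distinguish a paper that is itself opinionated from a paper that neutrally describes controversial content, which is consistent with the false alarm described above.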
To verify the change over time of the number of scholarly papers related to Reddit, between 2010 and 2021, 10 databases have been analysed and queried for the term "reddit". As shown in Figure 12, the number of found articles rises year by year (RQ3). Table 1 shows how many articles related to Reddit (e.g. using it as a data source, processing it, analysing the comments, etc.) have been indexed in scientific databases. Observations that can be made on the basis of the results found in Table 1 are:
• the number of Reddit articles is quite small, yet representative,
• the number of Reddit papers is somewhat proportional in each database, so it can be stated that the literature is fairly evenly spread across the Internet.

Outlying results found in Google Scholar.
The only database with trends inconsistent with the others is Google Scholar (Figure 12). Although it is one of the most widely known databases that index scientific publications [27, 17], it has already received both praise and criticism [20, 39]. The main problems of Google Scholar pointed out in the literature are: (i) difficulties in estimating the actual size of the database [33, 21], (ii) gender- and race-related bias in displaying contributions [23], (iii) favoring incremental work [23], (iv) favoring larger research communities [23], (v) limited indexing of files [20] (... instead of skilled librarians) [28, 14], (vi) "uncertain quality of Google Scholar's performance" [14], (vii) "Google Scholar's inability or unwillingness to elaborate on what documents its system crawls" [14], and (viii) limitations of bibliometric analysis [28]. Moreover, Google Scholar is inconsistent between the declared number of results of a query and the actual number of returned results (e.g. a query returns 1000 actual results, while 58,600 are declared). This finding may correspond to the already reported Google Scholar inconsistencies [33, 21] and lack of transparency [14].
Therefore, Google Scholar can be treated as an outlier and disregarded in conclusions drawn from this experiment.

Trends.
The next experiment explored the presence of popular trends on Reddit. This was done based on Google Trends, an analytical website which provides information about the popularity of search queries in the Google search engine. For all Global Google Trends 2020, their Reddit presence has been measured (see Table 2). Overall, 79% of top Google Trends have a dedicated subreddit, while all of them are widely discussed. Table 2 illustrates the top three in each Google Trends category.

Table 2. Global Google Trends 2020 (top 3 in each Google Trends category) and their appearance on Reddit ("subreddit": there exists a dedicated subforum; "discussion": the topic is present in (a) subreddit(s) of a broader topic).

(... [26], removal of stopwords, punctuation, numbers).
4. Generates and displays wordclouds of post titles and content.

The code follows state-of-the-art solutions for code sharing [35] and is publicly available on GitHub as a Jupyter Notebook [37]. To illustrate the capabilities of the developed application, let us present a few examples, in two groups, in Figures 14 and 13. The wordclouds are built from posts related to a subreddit dedicated (or closest) to the searched topic. Reddit-TUDFE allows one to quickly check if, and how, a particular topic is covered. Note that similar examples can be derived for any other topic, while Reddit also shows potential in, for instance, building ontologies or semantic graphs. However, this possibility is out of scope of this contribution. In Figure 13:
• The left subfigure shows the result for the phrase "music", a generic term, which is certainly discussed on Reddit. One may see particular genres: rock, pop, rap, relaxing, electronic, etc.
• The middle subfigure displays results for the phrase "rock", a somewhat narrower, but still vague, music-related (sub)topic, which is also present on Reddit, including artists/bands like: Rolling Stones, AC/DC, Led Zeppelin, Queen, Pink etc.
• The right subfigure contains a strictly specific topic, i.e. the band "The Beatles", which is also widely covered on Reddit. Here one may see, among others, individual band members: John Lennon, Paul McCartney, Ringo Starr, and George Harrison.

Another example is summarized in Figure 14. (... 01-05-2020) displays that the main phrases changed to: "deaths" (due to COVID-19 infection) and "lockdown" (the preventive measures against the spread of the virus). Figure 14 (right) (before 08-12-2020, i.e. close to the introduction of the first vaccine) shows the general interest in phrases like "vaccine" and "pfizer" (the company that developed the vaccine [3]). Note that analysing the evolution of a thematic ecosystem is just one of the possible applications of the Reddit-TUDFE tool. Most importantly, it allows quickly checking whether a given topical domain contains live (evolving over time) information.

6. Concluding remarks.
This work provides evidence that Reddit is a robust, but underutilized, resource for information retrieval and knowledge capture, in almost any field of interest. Based on the performed exploratory analysis, the following answers to the research questions formulated at the beginning of this work can be given:
• RQ1: Reddit offers publicly available data, which can be easily retrieved with the Pushshift API.
• RQ2: The most popular techniques for Reddit information processing are: text embeddings, neural networks, and graph networks.
• RQ3: Reddit is trending in scientific research, as more and more articles using it are published every year.
• RQ4: Reddit covers the majority (79%) of topics that appear in Global Google Trends, sustaining the claim that Reddit is a robust source of knowledge about "everything trendy".
• RQ5: Reddit is most commonly used in tandem with Twitter.

These conclusions render Reddit a perfect candidate for future research, especially given the presence of graph networks among common research methods and the high coverage of popular trends. Finally, this analysis and the Reddit-TUDFE tool provide a solid foundation for future research on Reddit and its potential in information retrieval.

References.
[1] A comprehensive overview of the COVID-19 literature: Machine learning-based bibliometric analysis
[2] Trawling for trolling: A dataset
[3] Pfizer: The miracle vaccine for COVID-19?
[4] The Pushshift Reddit dataset
[5] YouTube recommendations and effects on sharing across online social platforms
[6] Building and using personal knowledge graph to improve suicidal ideation detection on social media
[7] Echo chambers on social media: A comparative analysis
[8] Exploratory data analysis
[9] The science of subjectivity
[10] BERT: Pre-training of deep bidirectional transformers for language understanding
[11] Danish stance classification and rumour resolution
[12] BUT-FIT at SemEval-2019 task 7: Determining the rumour stance with pre-trained deep bidirectional transformers
[13] Deep agent: Studying the dynamics of information spread and evolution in social networks
[14] Scholarish: Google Scholar and its value to the sciences
[15] Google Scholar to overshadow them all? Comparing the sizes of 12 academic search engines and bibliographic databases
[16] To act or react: Investigating proactive strategies for online community moderation
[17] Suitability of Google Scholar as a source of scientific information and as a source of data for scientific evaluation: review of the literature
[18] How to model the shapes of molecules? Combining topology and ontology using heterogeneous specifications
[19] Analyzing scientific papers based on sentiment analysis, Information System Department, Faculty of Computers and Information, Cairo University
[20] Google Scholar: the pros and the cons
[21] Google Scholar revisited, Online Information Review
[22] Controversial information spreads faster and further in Reddit
[23] Benefits and pitfalls of Google Scholar
[24] Sentiment analysis of Twitter data: a survey of techniques
[25] First-wave COVID-19 transmissibility and severity in China outside Hubei after control measures, and second-wave scenario planning: a modelling impact assessment
[26] NLTK: The natural language toolkit
[27] Google Scholar as a data source for research assessment
[28] Google Scholar: the big data bibliographic tool, Research analytics: boosting university productivity and competitiveness through scientometrics
[29] Subjectivity in qualitative research
[30] Science and subjectivity: Understanding objectivity of scientific knowledge
[31] The anatomy of Reddit: An overview of academic research
[32] Philosophiae naturalis principia mathematica
[33] Methods for estimating the size of Google Scholar
[34] Scikit-learn: Machine learning in Python
[35] Why Jupyter is data scientists' computational notebook of choice
[36] Data Mining
[37] Using the Jupyter notebook as a tool for open science: An empirical study
[38] Pdfminer: Python PDF parser and analyzer, Retrieved on
[39] Comparing test searches in PubMed and Google Scholar
[40] How scientific conferences will survive the coronavirus shock
[41] Towards understanding the information ecosystem through the lens of multiple web communities
[42] Disinformation warfare: Understanding state-sponsored trolls on Twitter and their influence on the web
[43] Grounded conversation generation as guided traverses in commonsense knowledge graphs

Acknowledgement.
This work has been supported in part by the joint research project "Novel methods for development of distributed systems" under the agreement on scientific cooperation between the Polish Academy of Sciences and the Romanian Academy.