key: cord-0428330-dwmoy2rr authors: Shimada, Kyosuke; Kazama, Kazuhiro; Yoshida, Mitsuo; Ohmukai, Ikki; Sato, Sho title: Analysis of Leading Communities Contributing to arXiv Information Distribution on Twitter date: 2021-12-15 journal: nan DOI: 10.1145/3486622.3493947 sha: 6086f142db96460bb9fe5239c77a8d56cac8f10c doc_id: 428330 cord_uid: dwmoy2rr To analyze the impact that arXiv is having on the world, in this paper we propose an arXiv information distribution model on Twitter, which has a three-layer structure: arXiv papers, information spreaders, and information collectors. First, we use the HITS algorithm to analyze the arXiv information diffusion network with users as nodes, which is created from three types of behavior on Twitter regarding arXiv papers: tweeting, retweeting, and liking. Next, we extract communities from the network of information spreaders with positive authority and hub degrees using the Louvain method, and analyze the relationship and roles of information spreaders in communities using research field, linguistic, and temporal characteristics. From our analysis using the tweet and arXiv datasets, we found that information about arXiv papers circulates on Twitter from information spreaders to information collectors, and that multiple communities of information spreaders are formed according to their research fields. It was also found that different communities were formed in the same research field, depending on the research or cultural background of the information spreaders. We were able to identify two types of key persons: information spreaders who lead the relevant field in the international community and information spreaders who bridge the regional and international communities using English and their native language. In addition, we found that it takes some time to gain trust as an information spreader. 
In recent years, preprint servers such as arXiv have been actively used for research information exchange and conference management, and it has been pointed out that Twitter plays a particularly important role in such academic information distribution [2] . It is assumed that the characteristics of information distribution on Twitter differ greatly from that of the conventional one-way information distribution using academic journals as a medium. It is important to clarify its characteristics in order to understand contemporary academic information distribution. For example, one of the possible characteristics is the role that Twitter users play in the distribution of preprints. Expert users in the relevant research field discover important arXiv papers and introduce them on Twitter. When users who read the tweets decide that the tweets are important and repeatedly like and retweet them, the information spreads widely. In other words, users involved in arXiv information distribution play two types of roles: information spreaders and information collectors, and the strength of each aspect varies greatly depending on the characteristics of the user. In this paper, we model arXiv information distribution on Twitter in three layers: arXiv papers, information spreaders, and information collectors, especially focusing on information spreaders, who we believe play a particularly important role. First, we use the Hyperlink-Induced Topic Search (HITS) algorithm to analyze the arXiv information diffusion network with users as nodes, which is created from three types of behavior on Twitter regarding arXiv papers: tweeting, retweeting, and liking. Next, we extract communities from the network of information spreaders with positive authority and hub degrees using the Louvain method, and analyze the relationship and roles of information spreaders in communities using research field, linguistic, and temporal characteristics. 
There is some relevant research available on information retrieval, the sharing of arXiv preprints, and social media. A user survey by the arXiv team at Cornell University found that the impact of Social Networking Services (SNS) was not evaluated because there were no SNS options in the answers. However, they were mainly looking for arXiv preprints from Google search, Google Scholar, and the actual arXiv homepage [12] . It has been reported that even in SNS, especially on Twitter, the users who spread academic information are not necessarily academic, but many of them have backgrounds in social sciences and humanities [7, 11] , and academic users follow the bots of each preprint server to obtain information, spread preprint information, and discuss it [2] . In other words, Twitter plays an essential role in the distribution of academic information. Regarding preprints that have changed with the development of the web, Boya et al. [20] quantitatively showed the overall trends and impacts using the preprint datasets collected in the Microsoft Academic Graph (MAG) from 1991 to 2019 [13, 15] . Preprint data of arXiv accounted for about 60% of the overall sample. In particular, most physics, mathematics, and computer science fields are posted on arXiv. The authors noted that the number of preprints in computer science and biology has been increasing over the last ten years. Unlike other fields, international conferences are the main forum for presenting results in computer science, for example, and many preprints are submitted to conferences on machine learning. Many of the studies on arXiv have evaluated data in specific fields, such as physics and mathematics [5, 18] . However, Charles et al. [16] and Jialiang et al. [10] have conducted surveys limited to the computer science field [6] , which is growing particularly rapidly. Charles et al. 
derived the percentage of papers published in arXiv by cross-referencing between DBLP and arXiv preprints and clarified the usage environment of arXiv in the computer science field. Therein, the ratio of electronic editions was exceptionally high in theoretical computer science and machine learning. Meanwhile, though the usage rate is increasing in other fields, it still remains close to zero. Jialiang et al. quantified how many preprints submitted to arXiv were eventually published in peer-reviewed journals. The preprints posted in the computer science field of arXiv from 2008 to 2017 were investigated using Bidirectional Encoder Representations from Transformers (BERT), and the changes from the preprint version to the official publication were captured. In addition, arXiv users may submit a preprint to arXiv and then publish the paper in a journal or international conference after peer review. Several studies have analyzed such usage patterns using academic information databases. Larivière et al. [9] analyzed two data sources, arXiv and Web of Science (WoS). They found that about 64% of the arXiv preprints between 1991 and 2011 were included in WoS, and 93% of those were in mathematics, physics, earth science, and space science. In particular, in astronomy, astrophysics, nuclear physics, and particle physics, most of the papers included in WoS were also submitted as preprints to arXiv. In mathematics and physics, a high percentage also contributed arXiv preprints, though the percentage was low in certain sub-fields. Shuai et al. [14] analyzed the relationship between the number of mentions on Twitter, the number of arXiv downloads, and the number of citations of papers for 4,606 preprints between October 2010 and May 2011. As a result, it was clarified that the number of mentions on Twitter has a moderate correlation with the number of arXiv downloads several months after the preprint posting, as well as the number of early citations. 
Many of the preprint subjects downloaded from arXiv and mentioned on Twitter were in the realms of astrophysics, high energy physics, and mathematics, with nearly 70% of the papers peaking within five days of submission. Based on the above, it is clear that studies on how users engage with academic information and on the relationship between preprints and peer-reviewed papers are available; however, such research has not considered the dual role of users in collecting and disseminating academic information on social media. Thus, we believe it is necessary to investigate such users, who contribute to the spread of arXiv preprints, and examine their duality. In this paper, arXiv information distribution on Twitter is presented as a three-layer model, as shown in Figure 1 , instead of detecting or estimating the diffusion path of the arXiv paper's information on a graph of users or tweets. The first layer comprises arXiv papers. The reason we call them "arXiv papers" instead of "arXiv preprints" is because we do not use other databases in order to narrow the scope to preprints only. The second layer consists of information spreaders, who spread information by tweeting or retweeting the URL of arXiv articles. The third layer is made up of information collectors, who retweet or like tweets only if they find the information to be valuable. In this model, we assume that the same user can have two different roles for arXiv papers: information spreader and information collector. Using this model, we focus on the information spreaders, who contribute to arXiv information distribution in particular, and analyze them in terms of the importance of the user, the community to which the user belongs, and the characteristics of the users or communities. 
On Twitter, information spreaders who are retweeted or liked by many information collectors are considered important and reliable, while information collectors who retweet or like many tweets of important information spreaders are considered trustworthy. We therefore created an information diffusion network with users as nodes by combining the information diffusion of arXiv papers on Twitter. We used the authority weight and the hub weight calculated by Kleinberg's HITS algorithm [8] to represent the importance of a user as an information spreader and as an information collector, respectively. Suppose that a user $u_i$ ($i = 1, \ldots, N$) performs two roles, as an information spreader $s_i$ and as an information collector $c_i$. When an information collector $c_j$ retweets or likes the tweet of an information spreader $s_i$, the information is considered to have propagated from $s_i$ to $c_j$. The information diffusion network of the users can be represented by an adjacency matrix $A$ with $N$ rows and $N$ columns whose elements $A_{i,j}$ record the diffusion from $s_i$ to $c_j$. The authority weight $a_i$ and hub weight $h_i$ of user $u_i$ are obtained by the following procedure: (1) Initialize with 1 each element of an authority vector $\mathbf{a} = (a_1, \ldots, a_N)^\top$, which represents the importance of each user as an information spreader, and of a hub vector $\mathbf{h} = (h_1, \ldots, h_N)^\top$, which represents the importance of each user as an information collector. (2) Using the adjacency matrix $A$, repeatedly apply the HITS update equations (Equations 1 and 2) followed by normalization until the values of $\mathbf{a}$ and $\mathbf{h}$ converge. (3) From this result, create a pair $(a_i, h_i)$ of authority weight and hub weight for user $u_i$. Degree centrality, such as indegree and outdegree, is easily affected by the size of an information spreader community and tends to be higher for nodes in large communities.
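The iterative procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `hits` and `_normalize`, the L2 normalization, and the convergence tolerance are assumptions. The update orientation ($\mathbf{a} \leftarrow A\mathbf{h}$, $\mathbf{h} \leftarrow A^\top\mathbf{a}$) is also an assumption, derived from the stated convention that $A_{i,j} = 1$ when collector $c_j$ retweets or likes spreader $s_i$.

```python
from math import sqrt


def _normalize(vec):
    """Scale a vector to unit L2 norm (assumed normalization); leave zero vectors as-is."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def hits(adj, max_iter=1000, tol=1e-10):
    """Return (authority, hub) weights for a diffusion matrix.

    adj[i][j] == 1 means collector j retweeted or liked a tweet of spreader i
    (assumed orientation, following the definition in the text).
    """
    n = len(adj)
    a = [1.0] * n  # step (1): initialize every element with 1
    h = [1.0] * n
    for _ in range(max_iter):
        # A spreader is important if trustworthy collectors pick up its tweets.
        new_a = [sum(adj[i][j] * h[j] for j in range(n)) for i in range(n)]
        # A collector is trustworthy if it picks up tweets of important spreaders.
        new_h = [sum(adj[i][j] * new_a[i] for i in range(n)) for j in range(n)]
        new_a, new_h = _normalize(new_a), _normalize(new_h)
        delta = max(abs(x - y) for x, y in zip(a + h, new_a + new_h))
        a, h = new_a, new_h
        if delta < tol:  # step (2): stop once both vectors have converged
            break
    return a, h


# Toy network: user 0 is collected by users 1 and 2, user 1 by user 2.
adj = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
authority, hub = hits(adj)
```

In this toy network, user 0 ends up with positive authority but zero hub weight, the bot-like pattern described in the text, while user 1 has both weights positive.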
However, the authority and hub weights tend to be higher when there is a bipartite graph structure of information spreaders and information collectors in the community, making them more suitable than degree centrality for comparison and analysis across multiple communities. We created an information spreader network focusing on information spreaders who contribute to arXiv information distribution, and extracted communities of information spreaders that are similar from the viewpoint of information collectors, using the following procedure: (1) For each information spreader $s_i$ with authority weight $a_i > 0$ and hub weight $h_i > 0$, create the set $C_i = \{ c_j \mid A_{i,j} = 1 \}$ of information collectors who retweeted or liked $s_i$, where $A_{i,j}$ is the element of the adjacency matrix of the information diffusion network. (2) If the Szymkiewicz-Simpson coefficient $\mathrm{overlap}(C_i, C_j)$ of the information spreaders $s_i$ and $s_j$ is greater than or equal to the threshold $\theta$, add the edge $e_{i,j}$; the resulting graph is the information spreader network $G$. (3) Split $G$ into communities $G_1, \ldots, G_K$ using the Louvain method [1]. The analysis is limited to users with positive authority and hub weights for two reasons: users with zero authority weight are not involved in information diffusion, and users with positive authority weight but zero hub weight are either bots or accounts dedicated to information diffusion. Both groups are therefore excluded from the analysis. In addition, we used betweenness centrality [4] in the information spreader network to identify important information spreaders that interconnect multiple information spreader communities. We analyzed information spreader communities from three perspectives: the research field characteristics of users, linguistic characteristics, and temporal characteristics. Assuming that users' expertise is manifested in the categories of the arXiv papers about which they spread or collect information, we analyzed the academic trends of information spreader communities using the categories of arXiv papers.
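Steps (1) and (2) of the community extraction procedure above can be illustrated with a short Python sketch. The Szymkiewicz-Simpson (overlap) coefficient of two sets is $|X \cap Y| / \min(|X|, |Y|)$. The function names, the dictionary input layout, and the screen names below are hypothetical; the default threshold of 0.5 is the value reported later in the paper. Step (3), the Louvain method, is not shown and would be applied to the resulting weighted edges with a community detection library of choice.

```python
from itertools import combinations


def overlap_coefficient(x, y):
    """Szymkiewicz-Simpson coefficient: |X & Y| / min(|X|, |Y|)."""
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


def build_spreader_edges(collectors_by_spreader, threshold=0.5):
    """Connect two spreaders when their collector sets overlap enough.

    collectors_by_spreader maps a spreader id to the set of collectors who
    retweeted or liked that spreader (hypothetical input layout).
    Returns {(spreader_u, spreader_v): coefficient} for pairs at or above threshold.
    """
    edges = {}
    for u, v in combinations(sorted(collectors_by_spreader), 2):
        weight = overlap_coefficient(
            collectors_by_spreader[u], collectors_by_spreader[v]
        )
        if weight >= threshold:
            edges[(u, v)] = weight
    return edges


# Hypothetical example: alice and bob share most of their audience, carol does not.
collectors = {
    "alice": {"c1", "c2", "c3"},
    "bob": {"c2", "c3"},
    "carol": {"c4"},
}
edges = build_spreader_edges(collectors)
```

Because the coefficient divides by the smaller set, a spreader with a small audience that is fully contained in a larger spreader's audience still receives an edge, which is one reason this measure suits accounts of very different sizes.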
The arXiv papers are stored in 11 different archives, such as Computer Science (cs), and each category is represented as a string consisting of the archive name as the major category, followed by a period and a subcategory for the research field (e.g., cs.AI). In addition, physics is divided into several archives. Therefore, gr-qc (General Relativity and Quantum Cosmology), nlin (Nonlinear Sciences), nucl (Nuclear Theory, Nuclear Experiment), and quant-ph (Quantum Physics) are collectively denoted as physics*, as in the "arXiv submission rate statistics". Since information spreaders are also information collectors, we distinguish between the categories of the arXiv papers about which a user spread information and those about which the user collected information, calling them the information spread category and the information collection category, respectively. For example, the information spread category is thought to represent the user's current expertise, while the information collection category represents the fields that the user would like to refer to in the future. In arXiv, multiple categories can be assigned to a single paper; however, we use only the first assigned category in this analysis. We call this the main category of the arXiv paper. In addition, for each user, the main categories of the papers that were used to spread or collect information are obtained, and the most frequent category is used as the user's information spread category or information collection category. Since the cultural background of users is reflected in the language they use, we analyzed the cultural trends of information spreader communities using the language information about users in each community. On Twitter, people tend to use their native language or the language of their organization for daily communication such as tweeting and replying. This is called communication language (CL).
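As an illustration of the category handling described above (the main category, the physics* grouping, and the per-user most frequent category), a minimal Python sketch follows. The function names and the input layout (one list of category strings per paper) are hypothetical. Leaving the `physics` archive itself outside physics* is an assumption, since the text only names gr-qc, nlin, nucl, and quant-ph.

```python
from collections import Counter


def main_category(paper_categories):
    """Return the first assigned category of a paper, e.g. 'cs.LG'."""
    return paper_categories[0]


def major_category(category):
    """Map a category such as 'cs.AI' to its archive, grouping physics* archives."""
    archive = category.split(".")[0]
    # Archives named in the text as physics*; 'nucl' covers nucl-th and nucl-ex.
    if archive in {"gr-qc", "nlin", "quant-ph"} or archive.startswith("nucl"):
        return "physics*"
    return archive


def user_category(papers):
    """Most frequent main category among the papers a user spread (or collected).

    papers is a list of category lists, one per paper (hypothetical layout).
    Returns None when the user has no papers.
    """
    counts = Counter(main_category(cats) for cats in papers)
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical user who mentioned three papers.
spread_papers = [["cs.LG", "stat.ML"], ["cs.CL"], ["cs.LG"]]
spread_category = user_category(spread_papers)
```

The same `user_category` function serves both roles: passing the papers a user tweeted gives the information spread category, and passing the papers a user liked or retweeted gives the information collection category.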
On the other hand, users may use a language different from their mother tongue, such as English, when presenting international academic papers. This is called profile language (PL). The linguistic characteristics of a user or a community on Twitter are defined using these two types of linguistic identifiers. For the communication language, we use the lang field obtained from the Twitter API. The profile language is determined by applying Python's langdetect library (https://pypi.org/project/langdetect/) to the profile text after URL removal. In both cases, the language is represented by a two-letter lowercase code as specified in ISO 639-1. However, if the Twitter API or langdetect cannot determine the language, it is expressed as "UD" (undecided). Since the degree of influence that a user has on Twitter is considered to depend on the period during which the user was actively posting, we analyzed the activity of the information spreader community using the mention period of each user. The mention period is the number of days between the user's first and last tweets mentioning any arXiv paper. It is defined only for users with two or more such tweets. We used the Twitter API to collect tweets about arXiv papers posted between March 21, 2007 and January 18, 2020. The shortened URLs in the tweets were expanded, and the result was used as the mention dataset for arXiv papers. The details of the mention dataset are shown in Table 1. Note that the Twitter API returns a maximum of 100 retweets and likes for a tweet. However, since only 5,600 cases exceeded this limit for likes and 1,449 cases for retweets, this is not considered to be a major problem. The details of the authority and hub weights of the users are shown in Table 2. Users with positive authority weight make up 11.0% of the total, indicating that only a small portion of the users involved in arXiv information distribution are information spreaders.
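The mention period defined above (the number of days between a user's first and last tweets mentioning arXiv papers, for users with two or more such tweets) can be computed as in the following Python sketch. The function name and the choice of returning `None` for users with fewer than two mentions are assumptions made for illustration.

```python
from datetime import date


def mention_period(mention_dates):
    """Days between the first and last arXiv-mentioning tweets of one user.

    mention_dates is a list of datetime.date objects, one per mentioning tweet.
    Returns None when there are fewer than two tweets, since the mention
    period is only defined for users with two or more mentions.
    """
    if len(mention_dates) < 2:
        return None
    return (max(mention_dates) - min(mention_dates)).days


# Hypothetical user with three mentions in January 2019.
period = mention_period([date(2019, 1, 1), date(2019, 1, 31), date(2019, 1, 15)])
```

Note that under this definition two mentions on the same day give a mention period of 0, which is distinct from the undefined case of a single mention.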
Furthermore, 3.5% of users have positive authority weight but zero hub weight, because they send out information on arXiv papers but do not receive information from others. These users were excluded from the information spreader network. With a Szymkiewicz-Simpson coefficient threshold of 0.5, the information spreader network had 5,497 nodes and 20,453 edges. Applying the Louvain method to the information spreader network resulted in 169 communities. Its largest connected component had 5,034 nodes (91.6%) and 20,119 edges (98.4%), and contained 26 communities. In addition, the bibliographic information of 1,645,129 arXiv papers, collected from arXiv using OAI-PMH (http://www.openarchives.org/OAI/openarchivesprotocol.html) on February 26, 2020, was used as the arXiv metadata set. We analyzed the time-series variation in the categories of arXiv papers mentioned on Twitter. First, we show the time-series variation in the number of mentioned papers by the category of arXiv papers in Figure 2. The horizontal axis represents the year, and the vertical axis represents the number of mentions of arXiv papers in each category for that year. It can be seen that the number of mentions of arXiv papers has increased rapidly in several fields since 2009. A closer look reveals that immediately after 2009, the fields of mathematics (math) and physics (physics*) were frequently mentioned, but after 2011, mentions of computer science (cs) increased rapidly, and after 2018, it became the most mentioned field.
Comparing these results with the number of submitted papers in the "arXiv submission rate statistics", we can see that the increase in the number of mentions basically follows the increase in the number of submitted papers, but in the field of computer science, the increase in the number of mentions is much larger than the increase in the number of submitted papers. Next, we show the time-series change of detailed categories in computer science, where the number of mentions has been increasing, especially in recent years, in Figure 3. From the above results, it can be seen that arXiv was utilized by researchers in mathematics and physics in the early stages. However, recently, due to the rapid development of machine learning and deep learning, the use of arXiv by researchers in the field of computer science has increased significantly. One reason for this may be that in computer science, there is a tendency to place a higher value on international conferences than on academic journals as a place to present papers [17]; the increasing use of preprint servers and Twitter in international conferences could be another major factor. We show the visualization result of the community structure extracted from the largest connected component of the information spreader network in Figure 4. After laying out nodes with Gephi's ForceAtlas2 algorithm, the node size was varied according to the authority weight and the edge thickness according to the Szymkiewicz-Simpson coefficient. The top 10 communities by number of users are assigned different colors, and the remaining smaller communities share a single color. To investigate the research field characteristics of information spreader communities, Table 3 shows, for each of the top 10 communities, the top three information spread categories and information collection categories together with the number of users in each. The information spread categories refer to the category of arXiv papers mentioned by the user.
The information collection categories refer to the category of arXiv papers that the user has liked or retweeted. The top three categories and the number of users are shown for each. From the information spread categories, we can classify information spreader communities into two major research fields: physics and machine learning. The dense communities 4, 7, 8, and 13 on the upper left of the visualization result are machine learning communities, and the dispersed communities 0, 2, 1, and 6 on the right are physics communities. It can be seen that the information spreaders of the same research field can be divided into several communities. Furthermore, when we compare information spread categories with information collection categories, they show essentially the same tendency. However, in communities 8, 7, and 4, the number of users of machine learning (cs.LG) is significantly larger in the information collection category than in the information spread category. This suggests that machine learning information is particularly sought after in these communities. Next, we analyzed the linguistic characteristics of information spreader communities. Table 4 shows the top three communication languages and profile languages of each community in descending order of the number of users. UD in this table represents the users whose language could not be identified. The most common language in most communities is English. However, it is Japanese in communities 7 and 1. Among the machine learning communities 8, 7, 4, and 13, Japanese is the main language in the second-largest community 7 and English is the main language in the others, while the research fields of all four communities are very similar. In contrast, among the physics communities 0, 2, 1, and 6, Japanese is the main language in community 1 and English is the main language in the others, while the research fields of the communities differ.
This result indicates that arXiv paper information is mainly diffused internationally using English. Even in communities 7 and 1, where Japanese is the main language, English is the second most commonly used language. However, since Japanese is in a very different language family from English, it can be assumed that Japanese is used for communication among Japanese researchers or developers who have many opportunities to interact with each other daily. Furthermore, in communities 7 and 1, there are more English users and fewer Japanese users in profile languages than in communication languages. This may be because they write their profiles in English for international academic activities, but communicate in their native language for regional communication and contributions. In a research field such as machine learning, which has had a large impact on the world, it can be assumed that in addition to an international community that uses English as a common language for research activities, a regional community has emerged that communicates using its native language while focusing on the same papers. When we examined the components other than the largest connected component, we found community 105, whose main communication and profile language is Korean, and community 83, whose main communication language is Japanese; however, these are small, isolated communities with six and two users, respectively. We analyzed key people in information spreader communities using authority weight and betweenness centrality. First, we analyzed the top 20 users in authority weight. The ranking of the authority weight, screen names, community numbers (CN), communication languages (CL), profile languages (PL), authority weights ($a_i$), and hub weights ($h_i$) are shown in Table 5. The numbers in parentheses in this table are the rankings in hub weight. Users with hub weight 0 are not included in the information spreader network, so they are not assigned community numbers.
Table 5 includes Twitter accounts of prominent researchers such as Miles Brundage (@Miles_Brundage), who is a member of OpenAI, and Ian Goodfellow (@goodfellow_ian), who proposed GANs; prominent companies such as DeepMind (@DeepMind), which developed AlphaGo; and bots such as @arxiv_org and @StatMLPapers. However, their hub weights are not necessarily high; the bots in particular have a hub weight of 0. When we examined which communities the top 20 information spreaders belonged to, we found that most of them belonged to communities 8 or 4. Next, we analyzed the top 20 users in betweenness centrality to find out which users act as bridges between communities. The rankings of the authority weight, screen names, community numbers (CN), communication languages (CL), profile languages (PL), and betweenness centrality are shown in Table 6. The top users in betweenness centrality differ significantly from the top users in authority weight. Many users in communities 8 and 4 decreased in rank, while users in community 7 increased in rank. The number of users belonging to communities other than communities 8, 4, and 7 also increased. In particular, when we focus on community 7, the regional machine learning community, Daisuke Okanohara (@hillbig), the COO/representative director of machine learning start-ups, ranks 17th in authority weight and 4th in betweenness centrality, and Yuta Kashino (@yutakashino), an entrepreneur, ranks 68th in authority weight and 20th in betweenness centrality, which is a remarkable improvement. These users spread information to their community by introducing arXiv papers in Japanese. Their profile language is English and their communication language is Japanese, indicating that they are key persons with a special role: bridging the regional and international communities. Next, we visualized the positions of key people with high authority weight and betweenness centrality in the information spreader network.
The results of the visualization are shown in Figure 5. Only the nodes of the same information spreader network as in Figure 4 are drawn, in the same arrangement, with larger nodes for higher authority weight or betweenness centrality and more transparent nodes for lower authority weight or betweenness centrality. Figure 5 (a) shows that the information spreaders with high authority weight exist in communities 8, 7, and 4, which are relatively large. In contrast, Figure 5 (b) shows that the information spreaders with high betweenness centrality are distributed across more communities and are less concentrated in any single community. Finally, we analyzed the temporal characteristics of the behavior of information spreaders. First, we show the mean, maximum, median, and standard deviation ($\sigma$) of the mention period for each community in Table 7. The results show that most of the communities tend to have long mention periods. In addition, we show the relationship between mention periods and authority weights in Figure 6. From this result, it can be seen that users with a long mention period do not necessarily have high authority weights, but the authority weight does not increase unless the mention period is at least somewhat long. In other words, relatively long-term activity by a key person in the arXiv information distribution network is correlated with a high authority weight. However, the mention period of community 13 is particularly short compared with the other communities in Table 7. Following up on this observation, we examined the time-series variation in the number of mentions by each community in Figure 7. The horizontal axis represents the year and the vertical axis represents the number of mentions in the community in that year. The results indicate that community 13 is a relatively young community, with a rapid increase in mentions since 2016.
In community 13, the most mentioned arXiv paper was about Google's neural machine translation in 2016 [19], and the most liked and retweeted arXiv paper was about Google's BERT in 2019 [3]. Considering that the main category of community 13 in Table 3 is natural language processing (cs.CL), we can see that this community is a group of information spreaders who spread information about natural language processing, especially deep learning. In other words, even in the same technical field of machine learning, different communities were formed due to the difference in the backgrounds of the information spreaders between computer vision (cs.CV) and natural language processing (cs.CL). In this paper, we modeled arXiv information distribution by assuming that a user has two types of roles, information spreader and information collector, which enabled us to eliminate bots and focus on users who contribute highly to information distribution on Twitter. Furthermore, we attempted to analyze the characteristics of users in more detail than previous studies by using pairs of contrasting perspectives: authority weight and betweenness centrality, information spread category and information collection category, and communication language and profile language. From these results, we found that information about arXiv papers circulates on Twitter from information spreaders to information collectors and that multiple communities of information spreaders are formed according to their research fields. It was also found that different communities were formed in the same research field, depending on the research or cultural background of the information spreaders. We were able to identify two types of key persons: information spreaders who lead the relevant field in the international community and information spreaders who bridge the regional and international communities using English and their native language.
In addition, we found that it takes some time to gain trust as an information spreader. This analysis was performed on data from before the COVID-19 pandemic. We plan to analyze the new role of arXiv and its changes owing to the current global circumstances. Furthermore, we plan to study a new bibliographic index based on the analysis of academic information distribution on social media.
[1] Fast unfolding of communities in large networks
[2] Accelerating scholarly communication: The transformative role of preprints
[3] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[4] A set of measures of centrality based on betweenness
[5] Citing and Reading Behaviours in High-Energy Physics: How a Community Stopped Worrying about Journals and Learned to Love Repositories
[6] CoRR: a computing research repository
[7] A systematic identification and analysis of scientists on Twitter
[8] Authoritative Sources in a Hyperlinked Environment
[9] arXiv E-prints and the journal of record: An analysis of roles and relationships
[10] How many preprints have actually been printed and why: a case study of computer science preprints on arXiv
[11] Academic information on Twitter: A user survey
[12] arXiv@25: Key findings of a user survey
[13] A Web-scale system for scientific knowledge exploration
[14] How the Scientific Community Reacts to Newly Submitted Preprints: Article Downloads, Twitter Mentions, and Citations
[15] An Overview of Microsoft Academic Service (MAS) and Applications
[17] Conferences versus journals in computer science
[18] Preprints as accelerator of scholarly communication: An empirical analysis in Mathematics
[19] Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
[20] Is preprint the future of science? A thirty year journey of online preprint services
This work was supported by JSPS KAKENHI Grant Number JP19H04421.