key: cord-0142817-ipmjc5bj authors: Powell, James; Sentz, Kari title: Tracking Short-Term Temporal Linguistic Dynamics to Characterize Candidate Therapeutics for COVID-19 in the CORD-19 Corpus date: 2021-01-09 journal: nan DOI: nan sha: be5faca38c2f4439d72ab30b134fee712dd0cdf4 doc_id: 142817 cord_uid: ipmjc5bj
Scientific literature tends to grow as a function of funding and interest in a given field. Mining such literature can reveal trends that may not be immediately apparent. The CORD-19 corpus is a growing body of scientific literature associated with COVID-19. We examined the intersection of a set of candidate therapeutics identified in a drug-repurposing study with temporal instances of the CORD-19 corpus to determine whether it was possible to find and measure changes associated with them over time. We propose that the techniques we used could form the basis of a tool to pre-screen new candidate therapeutics early in the research process.
Diachronic word analysis is a Natural Language Processing (NLP) technique for characterizing the evolution of words over time. Often used for historical linguistic studies, it can also be applied to scientific literature [Tshitoyan et al. 2019] and can reveal early evidence of scientific discoveries before they become widely known.
Drug-repurposing studies aim to identify existing drugs that might be useful in treating other diseases. The availability of large amounts of data about drugs and infectious agents such as viruses has enabled such studies to be performed in silico. In early 2020, a number of repurposing studies were undertaken to identify potential treatments for COVID-19.
The CORD-19 corpus [Wang et al. 2020] was established in March 2020 as a repository for research related to SARS-CoV-2 and other coronaviruses. It aggregates content from PubMed, bioRxiv, medRxiv, and other sources, and it is updated with new publications on a regular basis. Figure 1 illustrates the growth of CORD-19 through mid-2020.
SenSys 2020, November 16-19, 2020, Yokohama, Japan
Using CORD-19, we conducted a diachronic survey of candidate therapeutics identified in one of the more exhaustive drug-repurposing studies conducted to date, undertaken in February 2020 at Oak Ridge National Laboratory. This study, detailed in [Smith and Smith 2020], analyzed drugs in the SWEET-LEAD database for potential antiviral properties. The study produced a dataset identifying over 9,000 existing approved drugs and supplements as potential candidate therapeutics for COVID-19. Most importantly for our purposes, this dataset included commonly used drug or supplement names for each candidate. Our survey considered the following questions:
• How many candidate therapeutics appear in CORD-19?
• Do references to the candidates change over time?
Using temporal snapshots of CORD-19 spanning March 13 to June 30, 2020, we computed frequency and semantic representations for each candidate therapeutic found in the corpus. For each temporal instance, we computed a TF/IDF (Term Frequency/Inverse Document Frequency) score for each candidate. TF/IDF is a common metric for evaluating the relative importance of terms in a corpus. To perform semantic analysis, we first computed diachronic word embeddings for each temporal instance of the corpus. These embeddings were aligned with one another to ensure that terms from each temporal instance were comparable.
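The per-snapshot TF/IDF scoring described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `tfidf_scores`, the smoothed IDF variant, the averaging of term frequency over documents, and the toy snapshots with placeholder drug names are all assumptions made for illustration, since the text does not specify which TF/IDF variant or tooling was used.

```python
import math
from collections import Counter


def tfidf_scores(docs, candidates):
    """Score each candidate term for one corpus snapshot.

    docs: list of tokenized documents (lists of lowercase tokens).
    candidates: candidate names, each already encoded as a single token.
    Returns {candidate: mean TF * smoothed IDF} for candidates that occur.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    scores = {}
    for term in candidates:
        doc_freq = sum(1 for c in counts if c[term] > 0)
        if doc_freq == 0:
            continue  # candidate not detected in this snapshot
        # Smoothed IDF (an assumed variant; the source does not name one).
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        # Length-normalized term frequency, averaged over all documents.
        mean_tf = sum(
            c[term] / len(doc) for c, doc in zip(counts, docs) if doc
        ) / n_docs
        scores[term] = mean_tf * idf
    return scores


# Toy snapshots keyed by date; tokens and drug names are placeholders.
snapshots = {
    "2020-03-13": [
        ["drug_a", "inhibits", "virus"],
        ["virus", "spike", "protein"],
    ],
    "2020-06-30": [
        ["drug_a", "trial", "results"],
        ["drug_a", "drug_b", "trial"],
        ["virus", "spike", "protein"],
    ],
}
candidates = ["drug_a", "drug_b", "drug_c"]

scores_by_date = {
    date: tfidf_scores(docs, candidates)
    for date, docs in sorted(snapshots.items())
}
for date, scores in scores_by_date.items():
    detected = len(scores) / len(candidates)
    print(date, f"detected={detected:.0%}",
          {t: round(s, 3) for t, s in scores.items()})
```

Scoring each snapshot independently keeps the numbers comparable only in a relative sense, because the document count changes between snapshots. The fraction of candidates with a score in each snapshot corresponds to the first survey question, and the change in a candidate's score between snapshots to the second.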
The technique we used is based on TWEC [3]. It uses a negative-sampling approximation of the softmax to maximize the probability that the set of words surrounding a target word is representative of that word's context in a given time slice, when multiplied by the mean of the atemporal word embedding vectors from the compass for the same set of context words around the target word. Since the TWEC embedding model did not account for phrases, we incorporated an additional step to identify them. Phrases (including drug names) were then specially encoded to allow them to be treated like words. Figure 2 shows TF/IDF versus the mean embedding distance to the compass for candidate therapeutics.
Because diachronic embedding instances were aligned with one another, we were able to isolate a given candidate and visualize its semantic trajectory over time (Figure 3). As the trajectory is based on nearest neighbors at a given time, subtle changes in semantic associations become apparent [Stewart et al. 2017]. Nodes along the path represent the candidate embedding vector and its two closest terms at each point in time.
We detected 14% (1,267) of the candidate therapeutics in CORD-19 on March 13, increasing to 26% (2,361) by June 30. For candidates detected in multiple adjacent temporal instances of the corpus, we were able to measure their changes over time. We found that many candidates exhibited increases in frequency and stable or strengthening semantic associations. However, given the nature of this corpus, we suspected that some would exhibit other kinds of change over time.
We found that some candidate therapeutics did exhibit different patterns of semantic association. Using heatmap visualizations as described in [Xu and Crestani 2017], we can illustrate two additional recurring patterns of behavior. Some candidates exhibited weakening semantic associations over time (Figure 4), while others exhibited an abrupt, persistent shift to a different pattern (Figure 5).
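The two semantic measurements used above, a candidate's mean distance to the compass and its nearest neighbors in each aligned time slice, can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes cosine distance as the distance measure (the text does not name one), represents already-aligned embeddings as plain dictionaries, and uses hypothetical function names and two-dimensional toy vectors with placeholder terms.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_distance_to_compass(term, slices, compass):
    """Mean cosine distance between a term's slice vectors and its compass vector."""
    dists = [
        1.0 - cosine_similarity(emb[term], compass[term])
        for emb in slices.values()
        if term in emb
    ]
    return sum(dists) / len(dists) if dists else None


def nearest_neighbors(term, emb, k=2):
    """The k terms most similar to `term` within one time slice."""
    target = emb[term]
    ranked = sorted(
        ((cosine_similarity(target, vec), other)
         for other, vec in emb.items() if other != term),
        reverse=True,
    )
    return [other for _, other in ranked[:k]]


def trajectory(term, slices, k=2):
    """Nearest neighbors of `term` in each time slice, in date order."""
    return {
        date: nearest_neighbors(term, emb, k)
        for date, emb in sorted(slices.items())
        if term in emb
    }


# Toy, already-aligned embeddings; all vectors and terms are placeholders.
compass = {"drug_a": [1.0, 0.0]}
slices = {
    "2020-03-13": {
        "drug_a": [1.0, 0.0],
        "antiviral": [0.9, 0.1],
        "treatment": [0.8, 0.3],
        "toxicity": [0.0, 1.0],
    },
    "2020-06-30": {
        "drug_a": [0.0, 1.0],
        "antiviral": [0.9, 0.1],
        "treatment": [0.8, 0.3],
        "toxicity": [0.1, 1.0],
    },
}

print(trajectory("drug_a", slices))
print(mean_distance_to_compass("drug_a", slices, compass))
```

In this toy data the candidate's closest neighbor changes between the two slices, which is the kind of abrupt shift the heatmaps are meant to expose. Comparing vectors across slices is only meaningful because the slices share one coordinate space, which is what the alignment to the compass provides.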
Additionally, we found that these changes were not strongly correlated with changes to a target's frequency scores.
Our diachronic survey of candidate therapeutics for COVID-19 in the CORD-19 corpus found that some candidates exhibited weakening or abrupt changes in their semantic associations. We speculate that this could be related to the publication of new research that positively or negatively affected consideration of a candidate therapeutic as a treatment for COVID-19. Future work will investigate how to detect and quantify these patterns and determine whether there are any correlations between a target's rank and its magnitude of change.
[Smith and Smith 2020] Repurposing Therapeutics for Covid-19.
[Stewart et al. 2017] ... Eric Bell, and Svitlana Volkova. 2017. Measuring, predicting and visualizing short-term change in word representation and usage in VKontakte social network.
[Tshitoyan et al. 2019] Unsupervised word embeddings capture latent knowledge from materials science literature.
[Wang et al. 2020] CORD-19: The Covid-19 Open Research Dataset.
[Xu and Crestani 2017] Temporal Semantic Analysis and Visualization of Words.