COVID-19 Kaggle Literature Organization
Maksim Ekin Eren, Nick Solovyev, Edward Raff, Charles Nicholas, Ben Johnson (the first two authors contributed equally to the paper)
2020-08-04. DOI: 10.1145/3395027.3419591

The world has faced the devastating outbreak of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2), or COVID-19, in 2020. Research on the subject was fast-tracked to the point that scientists struggled to keep up with new findings. With this increase in the scientific literature, there arose a need to organize those documents. We describe an approach to organize and visualize the scientific literature on or related to COVID-19 using machine learning techniques so that papers on similar topics are grouped together. By doing so, the navigation of topics and related papers is simplified. We implemented this approach using the widely recognized CORD-19 dataset to present a publicly available proof of concept.

In 2020, the world began to endure a health crisis of unprecedented scale. The scientific community was quick to respond. By one estimate, more than 23,000 papers related to COVID-19 were published between January 1st and May 13th of 2020, a figure that was doubling every 20 days [4]. The influx of information left health professionals and the scientific community overwhelmed and unable to quickly find important publications. In this paper, we present a proof-of-concept tool that organizes documents on or related to COVID-19 through clustering and dimensionality reduction. Clustering the texts and using the results as labels for a two-dimensional scatter plot groups similar papers together and separates dissimilar ones. To this end, we parsed the COVID-19 Open Research Dataset (CORD-19) [11], a collection of publicly available COVID-19 literature.

In response to the COVID-19 pandemic, the Allen Institute for AI [5], the Chan Zuckerberg Initiative (CZI), Georgetown University's Center for Security and Emerging Technology, Microsoft Research, IBM, and the National Library of Medicine at the National Institutes of Health prepared the CORD-19 dataset. CORD-19 consists of over 141,000 scholarly articles related to the coronavirus family of viruses. The White House Office of Science and Technology Policy put out a call to action asking citizen scientists to help better understand the virus through analysis using data science techniques [19]. The dataset was publicly hosted on Kaggle. Users of the site quickly responded to the call to action and began experimenting with the data.

Our analysis focused on using natural language processing, clustering, and dimensionality reduction methods to produce a structured organization of the literature. SciSpacy's biomedical parser was used to format the texts.
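As a rough illustration of that formatting step, the snippet below tokenizes a body text with scispaCy. It is a minimal sketch, assuming the en_core_sci_sm model is installed alongside scispaCy; the helper name and the toy sentence are ours rather than the authors' code, and the corpus-specific stop words shown are the ones noted later in the paper.

```python
# Minimal sketch of tokenizing a paper's body text with scispaCy (assumed model:
# en_core_sci_sm). Names and the example sentence are illustrative only.
import spacy

nlp = spacy.load("en_core_sci_sm")

# Corpus-specific stop words mentioned in the paper, in addition to spaCy's defaults.
CUSTOM_STOP_WORDS = {"doi", "fig", "medrxiv"}


def tokenize(body_text):
    """Lower-case the text, drop punctuation/whitespace tokens, and remove stop words."""
    doc = nlp(body_text)
    return [
        tok.text.lower()
        for tok in doc
        if not tok.is_punct
        and not tok.is_space
        and not tok.is_stop
        and tok.text.lower() not in CUSTOM_STOP_WORDS
    ]


if __name__ == "__main__":
    print(tokenize("The SARS-CoV-2 spike protein binds ACE2 (doi: 10.1000/xyz)."))
```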
After pre-processing, term frequency-inverse document frequency (tf-idf) [22] was used to vectorize the body text of the papers, giving a feature vector X1 for each document. A reduced feature vector X2 was then produced for each document by applying Principal Component Analysis (PCA) [2, 10] while keeping 95% of the variance. The X2 vectors were then clustered with k-means [14] to produce plot labels. The X1 vectors were reduced to two dimensions for visualization purposes with t-distributed stochastic neighbor embedding (t-SNE) [23]. The output of these algorithms was displayed in an interactive plot that allows users to quickly access the original papers and navigate related work.

Our solution was first hosted on the Kaggle website, and we received positive feedback from a surprising range of sources [13]. Several users have told us how this tool has helped them in their work and research, including an application to other domains in biology. The Jupyter notebook we built continues to be forked, or at least visited, roughly ten times a day on average over the last several weeks. The rest of this paper details the approach and discusses the results.

Visualization of high-dimensional data has proven to be a common objective across many domains. Plotting a representation of data, reduced to two or three dimensions, can provide valuable insights for data scientists. Most common approaches try to preserve the topology of the data: these algorithms map points from one space to another in an effort to preserve the latent structure of the data [7]. Another common objective is to preserve local and global structure in a single map, such that there is a clear distinction between differing instances. Preserving the topology in this way allows for a comprehensive visualization, making it easier to find similar instances as well as dissimilar ones.

Several visualization and search tools using CORD-19 have already been built. Haselwimmer [9] uses literature-based discovery (LBD) [8] to create a graph of interconnected topics found in the CORD-19 data. Wolffram and King [24] use Latent Dirichlet Allocation (LDA) [3] to form a metric of similarity between papers. Another topic modeling project used a different data source to analyze discussions of COVID-19: Ordun et al. [20] utilized LDA, UMAP [15, 18], and other approaches to visualize COVID-19 related posts on Twitter.

Scientific literature visualization has also been used across broader domains. Paperscape, for example, visualizes 1.7 million scientific research articles [6]. The papers in this tool are organized by way of references/citations to each other. Such clustering will not capture the content of the documents, as it does not delve into the text. In our work, we utilize the body text of each paper to better determine topic similarity. In another example [17], Tucker tensor decomposition [12] is used to cluster Shakespeare's plays in an effort to demonstrate how tensors and their decompositions might support cluster analysis of malware specimens. We build upon the idea of clustering similar documents in our work.

Our analysis pipeline consists of several steps: we clean our dataset, vectorize it for use by machine learning algorithms, obtain labels for the documents, and finally visualize our results.
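To make the vectorize/label/visualize stages concrete, the following is a minimal sketch, assuming scikit-learn (which the paper uses) and a list of cleaned body texts re-joined into strings. The function and variable names are ours rather than the authors' exact implementation; k = 20 and the 2^12 feature cap follow the values reported later in the paper, and the remaining parameters are illustrative defaults.

```python
# Minimal sketch of the overall pipeline described above, assuming scikit-learn.
# "docs" stands for the cleaned, tokenized body texts re-joined into strings.
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE


def organize(docs, k=20, max_features=2**12):
    # Vectorize: tf-idf feature vectors X1, one per document.
    # (Densified because scikit-learn's PCA expects a dense array.)
    X1 = TfidfVectorizer(max_features=max_features).fit_transform(docs).toarray()

    # Reduce: PCA keeping 95% of the variance gives the embedding X2.
    X2 = PCA(n_components=0.95, random_state=42).fit_transform(X1)

    # Label: k-means on X2 produces one cluster label per paper.
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X2)

    # Visualize: t-SNE projects X1 down to two dimensions for plotting.
    Y = TSNE(n_components=2, init="pca", random_state=42).fit_transform(X1)
    return Y, labels
```

Note the split mirrored from the description above: clustering is done on the PCA-reduced X2, while the two-dimensional projection comes from the full tf-idf vectors X1.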
The approach described in this paper took approximately 8 hours to run and was computed on a machine with an Intel Core i7-4790K CPU @ 4.00GHz and 16GB of DDR3 RAM. Here we present the steps in our pipeline.

Our pre-processing approach begins with a dataframe of n papers. We first removed duplicates, papers containing only an abstract, and papers written in a language other than English. Removing non-English papers was especially important, as they would be clustered separately from the rest of the papers on account of language differences alone. After cleaning the dataset, we obtained a corpus of 49,967 papers. The body text of each instance was then tokenized using spaCy's parser. In the tokenization process we removed function (stop) words, i.e., frequently occurring words carrying only trivial semantic meaning. A few examples of stop words found in our dataset are "doi", "fig", and "medrxiv". We also removed punctuation and capitalization.

The next step was to vectorize the tokenized texts using Scikit-Learn's [21] tf-idf, creating the feature vectors X1. Tf-idf converts the string-formatted data into a measure of how important each word is to a document relative to the literature as a whole. The number of tf-idf features was limited to 2^12 (4,096) to limit memory usage and act as a noise filter. After vectorization, PCA was applied to X1 to reduce the dimensions just enough to preserve 95% of the variance, forming a new embedding X2. PCA finds principal components that are mutually orthogonal; the goal of each principal component is to maximize the variance in the dataset which that component "explains". Once the principal components have been identified, the dataset can be reduced to d dimensions by projecting it onto a hyperplane defined by the first d principal components [1]. By keeping a large number of dimensions with PCA, much of the information is preserved while some noise and outliers are removed from the data.

By clustering the literature, we can generate labels for similar papers. When the dataset is reduced to two dimensions, some clusters overlap or are not easily distinguishable within the scatter plot. To reduce this effect, we used k-means, a popular iterative, unsupervised clustering algorithm. First, k instances are randomly selected as initial centroids. An instance is said to belong to cluster i if centroid i is closer to that instance, under the l2 norm, than any other centroid. The centroids are then updated at the end of each pass. This clustering approach was applied to the feature vectors X2 to obtain a cluster label for each research article.

To use k-means, the value of k must be specified, and we determined this value using distortion. Distortion is the sum of the squared distances between every instance vector and its closest centroid. Increasing the number of clusters will reduce the distortion, but this may come at the cost of splitting valid clusters in half. It is common practice to pick a value of k at the inflection point of the distortion curve [1]; beyond that point, improvements in distortion are minimal and decrease with each additional cluster. As shown in Figure 1, this inflection point is at approximately k = 20. We used PCA to perform dimensionality reduction and make the clustering problem easier for k-means, as described above.
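As an illustration of this selection of k, the following is a minimal sketch of the distortion (elbow) curve, assuming scikit-learn and matplotlib; X2 stands in for the PCA-reduced matrix and the k range and synthetic data are arbitrary. In scikit-learn, the fitted model's inertia_ attribute is exactly the sum of squared distances to the closest centroid used here as distortion.

```python
# Minimal sketch of the k-means elbow selection described above (illustrative only).
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans


def elbow_curve(X2, k_values):
    """Return the distortion (sum of squared distances to closest centroid) for each k."""
    distortions = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X2)
        distortions.append(km.inertia_)  # inertia_ is the distortion defined above
    return distortions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X2 = rng.normal(size=(500, 50))  # stand-in for the PCA-reduced corpus
    ks = list(range(2, 31, 2))
    plt.plot(ks, elbow_curve(X2, ks), marker="o")
    plt.xlabel("k (number of clusters)")
    plt.ylabel("distortion")
    plt.savefig("elbow.png")  # choose k near the curve's inflection point
```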
For visualization purposes we turned to t-SNE, a popular non-linear projection algorithm. The t-SNE algorithm generates a probability distribution that represents the interrelationships of the data, and then creates a low-dimensional space that follows this probability distribution. This approach combats the crowding problem [23]. Using t-SNE, the feature matrix X1, consisting of 49,967 instances, is projected down to two dimensions, yielding an embedding Y that can be plotted on the Cartesian plane. Figure 2 is a scatter plot of the t-SNE-reduced data with the k-means results as labels.

A common challenge with dimensionality reduction techniques is the verification of results. Often these approaches work well on synthetic datasets but fall short when applied to real-world data. In our study, we tried to verify the success of our work in two ways: manually exploring the clusters and identifying what professionals could glean from our tool.

A qualitative check was performed through manual analysis of the t-SNE visualization of CORD-19. In Figure 2, clusters are visually apparent. For example, the red cluster, C-6, contains papers on infectious diseases carried by bats. The publications in C-6 are not limited to coronavirus studies in bats; literature on other viruses, such as lyssaviruses and filoviruses observed in bat populations, is also present within the cluster. We noted with interest that t-SNE grouped papers covering specific strains of virus in close proximity. A sub-cluster within C-6, for example, includes papers dealing with SARS in bats.

We also noted agreement between the clustering labels and the clusters that can be seen on the plot. t-SNE and k-means were applied independently to X1 and X2, respectively. While t-SNE reduced the data to two dimensions, k-means clustered a higher-dimensional representation of the data. Despite this, both methods agreed on the bounds of many of the more apparent clusters formed by t-SNE. However, disagreement between the two approaches is still visible in some sections of the plot. This can be explained by two factors. First, in a visualization problem like this one, the decision boundaries are not clear. The k-means elbow method indicates that k = 20 is an optimal choice; a higher k would undoubtedly result in lower distortion, but this could come at the price of splitting larger clusters in half. As a result, smaller clusters can be detected by t-SNE but might be identified as part of one or several larger clusters by k-means. The other, perhaps more apparent, factor is that k-means was applied in a higher-dimensional space, the structure of which cannot be accurately conveyed in a two-dimensional plot.
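For reference, a labeled scatter plot of this kind can be produced as in the minimal sketch below, assuming matplotlib and that the two-dimensional embedding Y and the k-means labels have already been computed (for example by the organize() sketch earlier); the stand-in data and file name are illustrative.

```python
# Minimal sketch of plotting the 2-D t-SNE embedding colored by k-means cluster.
import matplotlib.pyplot as plt
import numpy as np


def plot_clusters(Y, labels, out_path="cord19_tsne.png"):
    plt.figure(figsize=(10, 10))
    plt.scatter(Y[:, 0], Y[:, 1], c=labels, s=3, cmap="tab20")  # one color per cluster
    plt.title("t-SNE projection of the corpus, colored by k-means cluster")
    plt.xticks([])  # t-SNE axes carry no intrinsic meaning
    plt.yticks([])
    plt.savefig(out_path, dpi=200)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.normal(size=(1000, 2))            # stand-in for the real embedding
    labels = rng.integers(0, 20, size=1000)   # stand-in for the k-means labels
    plot_clusters(Y, labels)
```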
We had help from Utah State University biomedical engineers Andrew James Walters and Dylan Ellis in further understanding the content of the clusters. Using the plot, Walters and Ellis were able to provide insights into the topics and their relationships across different clusters. Niche topics like animal infection and viral reservoirs tended to form tightly grouped clusters, most likely because of heavy use of topic-specific terminology such as species names. Clusters with multi-disciplinary topics such as epidemiology and virology tended either to be more spread out or to share an overarching theme with other clusters. Using the t-SNE plot, Walters and Ellis also found specific, noteworthy information: several new papers described the effectiveness of different drugs and broad-spectrum antivirals that could combat COVID-19 symptoms. The plot allowed Walters and Ellis to quickly scan related work and find these publications.

Our work has also been applied to other domains in biology. Grace Reed, a biology researcher from the University of California, Santa Cruz, uses our techniques to cluster literature on the evolution of tropical reef fish and their response to plastic pollution. Reed's work focuses on using the resulting clusters to identify the topics that will form the basis for future research in the area.

We have developed a tool for COVID-19 literature visualization. It is easy to use, performs well qualitatively, and can be applied to other domains. The tool is able to map papers based on latent interrelationships in the text. Feedback from professionals indicates that it is useful for finding new and relevant material.

References
Aurélien Géron. 2019. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media.
Who Wrote the 15th Book of Oz?
Scientists are drowning in COVID-19 papers. Can new tools keep them afloat?
CORD-19, COVID-19 Open Research Dataset.
Quantifying neighbourhood preservation in topographic mappings.
A Survey on Literature Based Discovery Approaches in Biomedical Domain.
Principal Component Analysis and Factor Analysis.
COVID-19 Open Research Dataset Challenge (CORD-19).
Tensor Decompositions and Applications.
Incredible response from the AI community to the @WhiteHouse Call to Action to help scientists analyze coronavirus research!
Least squares quantization in PCM.
UMAP: Uniform Manifold Approximation and Projection.
ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing.
Mr. Shakespeare, Meet Mr. Tucker.
Bringing UMAP Closer to the Speed of Light with GPU Acceleration.
The White House Office of Science and Technology Policy. 2020. Call to Action to the Tech Community on New Machine Readable.
Exploratory Analysis of Covid-19 Tweets using Topic Modeling, UMAP, and DiGraphs.
Scikit-learn: Machine learning in Python.
Visualizing Data using t-SNE.