key: cord-0221600-6fqu3f14
authors: Parmar, Mihir; Ambalavanan, Ashwin Karthik; Guan, Hong; Banerjee, Rishab; Pabla, Jitesh; Devarakonda, Murthy
title: COVID-19: Comparative Analysis of Methods for Identifying Articles Related to Therapeutics and Vaccines without Using Labeled Data
date: 2021-01-05
journal: nan
DOI: nan
sha: 6ee35d08a37075a9bfd3e495cf859e54cbd32201
doc_id: 221600
cord_uid: 6fqu3f14

Here we propose an approach to analyzing text classification methods based on the presence or absence of task-specific terms (and their synonyms) in the text. We applied this approach to study six transfer-learning and unsupervised methods for screening articles relevant to COVID-19 vaccines and therapeutics. The analysis revealed that while a BERT model trained on search-engine results generally performed well, it misclassified relevant abstracts that did not contain task-specific terms. We used this insight to create a more effective unsupervised ensemble.

1 Introduction

The COVID-19 Open Research Dataset (CORD-19) is a machine-readable collection of scientific articles relevant to COVID-19. Finding articles relevant to COVID-19 vaccines and therapeutics in the dataset has many practical uses. Our goal was to study how different transfer-learning and unsupervised methods perform on this task, since these methods do not require labeled data.

We formulated the task as a classification problem and considered six different transfer-learning or unsupervised techniques for it: (1) BERT's next sentence prediction; (2) a model trained on a different dataset to identify treatments; (3) clinical semantic text similarity; (4) lexicon-based semantic similarity; (5) a model trained on Google Scholar search results; and (6) TF-IDF based scorers. The input to the systems was the title, abstract, and journal name, and the output was one of three labels: vaccine-related, therapeutics-related, or other. Where relevant, we used BERT (Devlin et al., 2018) and its variants.

An important challenge was to understand the performance characteristics of these models at a deeper level than the standard performance metrics allow, so that the insight could be used to improve a promising method or to develop an effective ensemble. Most error analyses tend to be ad hoc.

We characterized the performance of the methods for four categories obtained by taking the cross-product of two factors: (1) whether an article was rated as a positive or negative class; and (2) whether the terms vaccines, therapeutics, and their synonyms appeared in the article (abstract) or not. We manually double-annotated a small test set (203 articles). Results from our analysis showed that while a method trained on Google search results generally performed well, it misclassified a large portion of the articles in one category. This insight helped us formulate an effective ensemble, which achieved an F-measure of 0.65.

We used the COVID-19 Open Research Dataset (CORD-19) (Wang et al., 2020), which contains around 59K articles that were previously screened for COVID-19 and associated illnesses and were collected from peer-reviewed publications and archival services. For our experiments, we used the text consisting of [title + abstract + journal name] for each article. We removed the articles that were missing any of these fields, which left about 47K articles. We also manually labeled 203 articles from the CORD-19 dataset with three classes: (1) Vaccine, (2) Therapeutics, and (3) Other. Each article was judged separately by two authors of this paper. The inter-annotator agreement, measured with the kappa score, was 0.83. There were 39 vaccine-related articles, 28 therapeutics-related articles, and 136 other articles in the test dataset.
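For reference, agreement between two annotators is commonly measured with Cohen's kappa. The sketch below assumes that this is the kappa variant meant here (the text does not say which one was used); the function name `cohen_kappa` and the toy labels are illustrative only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    Assumption: the paper's "kappa score" is Cohen's kappa.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example (not the paper's data).
annotator_1 = ["vaccine", "vaccine", "other", "other"]
annotator_2 = ["vaccine", "other", "other", "other"]
kappa = cohen_kappa(annotator_1, annotator_2)
```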

In this approach, we formulated the task as Next Sentence Prediction (NSP) in SciBERT (Beltagy et al., 2019). We further pre-trained this model using MLM (Masked Language Modeling) on the abstract and title text of the CORD-19 dataset. We handcrafted one passage each for vaccines and therapeutics to use as the first "sentence". These passages were created from sentences in articles that were positive for vaccines and therapeutics, respectively, and strongly indicated relevancy (based on manual analysis). The passages are shown in Appendix A.

To test whether an article is about therapeutics, the therapeutics passage and the article text (title + abstract + journal name) are used as the "sentence" pair in the next sentence prediction mode of SciBERT. If SciBERT predicts that the article is the next sentence with a probability of 0.999 or higher, then this model outputs a "therapeutics" label. Similarly, the article is scored for vaccines, using the vaccine passage as the first sentence. If both probabilities are 0.999 or higher, the final predicted label is determined by the higher of the two.
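The decision rule described above can be sketched as follows. This is a minimal illustration that takes the two next-sentence probabilities as given; the function name `nsp_label`, the fallback to "other" when neither probability passes the threshold, and the tie-breaking order are assumptions, not details stated in the text.

```python
NSP_THRESHOLD = 0.999  # probability threshold stated in the text


def nsp_label(p_vaccine, p_therapeutics, threshold=NSP_THRESHOLD):
    """Map two next-sentence probabilities to a label.

    p_vaccine / p_therapeutics: probability that the article text follows
    the vaccine / therapeutics passage (assumed to come from the NSP head).
    """
    passing = {}
    if p_vaccine >= threshold:
        passing["vaccine"] = p_vaccine
    if p_therapeutics >= threshold:
        passing["therapeutics"] = p_therapeutics
    if not passing:
        return "other"  # assumption: no passing score means "other"
    # If both pass, the higher probability wins (exact ties go to "vaccine"
    # only because of insertion order; the text does not cover this case).
    return max(passing, key=passing.get)
```

For example, `nsp_label(0.9991, 0.9999)` yields "therapeutics", while `nsp_label(0.5, 0.998)` yields "other".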

In this scorer, we used a SciBERT model that was previously fine-tuned on the Clinical Hedges dataset to identify articles describing treatments. Clinical Hedges (Wilczynski et al., 2005) is a dataset of articles from MEDLINE that were manually annotated for study purpose, including therapeutics, etiology, diagnostics, and others. Articles were also annotated for other study characteristics that are not of interest here. We defined the task as binary classification, i.e., therapeutics or not therapeutics. We combined this model with the NSP model described in Section 2.2.1 as follows: at prediction time, the text documents categorized as therapeutics or vaccine by the NSP model were used as the input to the Clinical Hedges model. If the Clinical Hedges model categorized the text as therapeutics, then the final label is therapeutics; otherwise the label is vaccine. The Clinical Hedges model consisted of SciBERT (fine-tuned on the Clinical Hedges data) with a Feed-Forward Network (FFN) head to generate the probability distribution for the binary classification.
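The cascade of the two models can be written as a small rule. The sketch below assumes the NSP label and the Clinical Hedges binary decision are already available; `cascade_label` is a hypothetical name, and passing "other" through unchanged is an assumption that follows from only the NSP positives being forwarded.

```python
def cascade_label(nsp_prediction, ch_is_therapeutics):
    """Combine the NSP label with the Clinical Hedges binary decision.

    nsp_prediction: "vaccine", "therapeutics", or "other" from the NSP model.
    ch_is_therapeutics: True if the Clinical Hedges model predicts therapeutics.
    """
    if nsp_prediction == "other":
        # Assumption: articles not flagged by NSP never reach the second model.
        return "other"
    # Only NSP positives are re-labeled by the Clinical Hedges model.
    return "therapeutics" if ch_is_therapeutics else "vaccine"
```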

In this clinical Semantic Textual Similarity (STS) approach, we fine-tuned BERT on the Clinical STS dataset from the n2c2 Challenge 2019 (Cer et al., 2017) to score semantic similarity between a manually created query and a given text-segment. We input the pair ([CLS] + query + [SEP] + text-segment) to BERT and used the output CLS representation in a linear regression model to predict similarity in the range of 0 to 5 (Cer et al., 2017). Separate queries were manually created for vaccines and therapeutics and are shown in Appendix B.

The text-segments were obtained by splitting the concatenation of the title, abstract, and journal name on sentence boundaries using NLTK. The average of the top n (n = 3) text-segment scores was used as the article-level score (Yang et al., 2019; Kotzias et al., 2015). As suggested in (Wang et al., 2018), we used 2.0 (out of 5.0) as the threshold for classifying an article as positive for a query type (i.e., therapeutics or vaccine), and the final label was based on the higher of the two scores if both were above 2.0.
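The aggregation from text-segment scores to an article label can be sketched as below. The per-segment similarity scores are taken as given; the function names, the strict "greater than" comparison at the threshold, and the fallback to "other" are assumptions for illustration.

```python
def article_score(segment_scores, n=3):
    """Average of the top-n segment scores (n = 3 in the text)."""
    top = sorted(segment_scores, reverse=True)[:n]
    return sum(top) / len(top) if top else 0.0


def sts_label(vaccine_scores, therapeutics_scores, threshold=2.0, n=3):
    """Label an article from its per-segment similarity scores (0 to 5 scale)."""
    scores = {
        "vaccine": article_score(vaccine_scores, n),
        "therapeutics": article_score(therapeutics_scores, n),
    }
    # Assumption: a score must be strictly above the threshold to count.
    passing = {label: s for label, s in scores.items() if s > threshold}
    if not passing:
        return "other"
    return max(passing, key=passing.get)
```

With segment scores of 4.0, 3.0, 2.0, and 0.5 for the vaccine query, the article-level score is 3.0, which clears the 2.0 threshold.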

The intuition here is that vaccine-related articles will have many tokens whose embeddings are similar to those of words like "vaccines". The contextual embeddings for all tokens are first obtained by processing the [title + abstract + journal name] of the articles through the BioBERT model. If a token appears in multiple documents, we take the average of the embeddings of the token across the documents.

Next, starting with the seed words "vaccine", "vaccines", "vaccination", and "vaccinations", we find additional tokens (extended-seeds) whose embeddings are within a small cosine distance of the embeddings of the seed words. The cosine-distance threshold is set to the largest distance among the closest 1000 pairs. Subsequently, for each article we find the top-k (k = 50) words in its abstract whose embeddings are closest to any of the extended-seeds; we call these the representative-words.

Lastly, the article-score is the average of the cosine similarity values of the representative-words that appear in its abstract. The process was repeated for the seed words for therapeutics. The final label for an article is the argmax of the vaccine and therapeutics article-scores.
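A simplified sketch of the article-scoring step follows. It assumes one averaged embedding per token stored in a plain dictionary (standing in for the BioBERT-derived vectors) and takes the extended-seeds as given; the names `cosine`, `lexicon_score`, and `lss_label`, and the toy two-dimensional vectors, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def lexicon_score(abstract_tokens, embeddings, extended_seeds, k=50):
    """Mean similarity of the k abstract words closest to any extended-seed."""
    seeds = [embeddings[s] for s in extended_seeds if s in embeddings]
    if not seeds:
        return 0.0
    sims = []
    for token in set(abstract_tokens):
        if token in embeddings:
            # Similarity to the closest extended-seed.
            sims.append(max(cosine(embeddings[token], s) for s in seeds))
    top = sorted(sims, reverse=True)[:k]  # the representative-words
    return sum(top) / len(top) if top else 0.0


def lss_label(vaccine_score, therapeutics_score):
    """Argmax over the two article-scores (ties go to vaccine by assumption)."""
    return "vaccine" if vaccine_score >= therapeutics_score else "therapeutics"


# Toy embeddings, for illustration only.
toy = {"vaccine": [1.0, 0.0], "dose": [1.0, 0.0], "protein": [0.0, 1.0]}
score = lexicon_score(["dose", "protein"], toy, ["vaccine"], k=2)
```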

In this approach, we trained a model using results from the Google search engine, inspired by the knowledge hunting methods (Prakash et al., 2019; Emami et al., 2018) . Google Scholar™ was queried with "coronavirus vaccine" and "coronavirus therapeutics", and the respective top 1000 results were scraped. The intersection of the search results for each query and the CORD-19 dataset was designated as the set of positive samples for the corresponding label.

To obtain negative samples, additional Google Scholar queries that were semantically unrelated to coronavirus vaccines or therapeutics were carefully constructed (see Appendix C). The top 1000 Google Scholar™ results from these queries were intersected with the CORD-19 dataset to get the required negative samples. Any articles common to the negative sample set and either of the two positive sample sets were removed from the negative samples. Any article common to the vaccine and the therapeutics positive samples was kept in the set in which the article's rank was better.
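The construction of the weakly labeled set amounts to a few set operations, sketched below. Search results are represented as ranked lists of article identifiers (best first); the function name `build_weak_labels`, the identifier format, and the handling of equal ranks are assumptions.

```python
def build_weak_labels(vaccine_hits, therapeutics_hits, negative_hits, cord19_ids):
    """Return {article_id: label} from ranked search results.

    vaccine_hits / therapeutics_hits / negative_hits: ranked id lists.
    cord19_ids: set of ids present in CORD-19.
    """
    def ranks(hits):
        # Keep only hits that are in CORD-19, with their best (lowest) rank.
        out = {}
        for rank, article_id in enumerate(hits):
            if article_id in cord19_ids:
                out.setdefault(article_id, rank)
        return out

    vac, ther = ranks(vaccine_hits), ranks(therapeutics_hits)
    labels = {}
    for article_id in vac.keys() | ther.keys():
        v_rank = vac.get(article_id, float("inf"))
        t_rank = ther.get(article_id, float("inf"))
        # An article in both positive sets stays where it ranks better
        # (equal ranks go to "vaccine" by assumption).
        labels[article_id] = "vaccine" if v_rank <= t_rank else "therapeutics"
    for article_id in negative_hits:
        # Negatives that also appear as positives are dropped from the negatives.
        if article_id in cord19_ids and article_id not in labels:
            labels[article_id] = "other"
    return labels
```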

A SciBERT model was fine-tuned on 80% of this weakly labeled dataset, and the remaining 20% was used for validation. The concatenation of each article's title, abstract, and journal name was given as the input. The trained (fine-tuned) model was used to predict the label for the remaining articles in CORD-19.

In this approach, n representative vaccine-related queries ($VQ_1, VQ_2, \ldots, VQ_n$) and n representative therapeutics-related queries ($TQ_1, TQ_2, \ldots, TQ_n$) were manually created, and their Cartesian product forms a list of query pairs. The title and abstract were concatenated as the input text for an article. Given the k-th article as input, we first compute its tf-idf vector $a_k$. For each query pair $(VQ_i, TQ_j)$, where $i \in [1, n]$ and $j \in [1, n]$, we obtain the tf-idf vectors $(v_i, t_j)$. The vaccine-score $vs_{k,ij}$, therapeutics-score $ts_{k,ij}$, and other-score $os_{k,ij}$ are computed as follows:

The intuition behind the other-score $os_{k,ij}$ is that if the angle between $a_k$ and $v_i$ is $\alpha$ and the angle between $a_k$ and $t_j$ is $\beta$, then the vaccine-score is $\cos\alpha$, the therapeutics-score is $\cos\beta$, and the other-score is $\cos\left(\frac{\alpha+\beta}{2} + \pi\right)$. The accumulated scores are computed as follows:
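The per-pair scores, as described by the angles above, can be sketched as follows. The sketch takes the tf-idf vectors as plain lists and covers a single query pair only; the accumulation over pairs is not reproduced because its formula is not given here. The names `cosine` and `pair_scores` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pair_scores(a_k, v_i, t_j):
    """Vaccine-, therapeutics-, and other-score for one query pair."""
    vs = cosine(a_k, v_i)  # cos(alpha)
    ts = cosine(a_k, t_j)  # cos(beta)
    # Recover the angles; clamp to [-1, 1] to guard against rounding error.
    alpha = math.acos(max(-1.0, min(1.0, vs)))
    beta = math.acos(max(-1.0, min(1.0, ts)))
    # Other-score exactly as stated in the text: cos((alpha + beta) / 2 + pi).
    other = math.cos((alpha + beta) / 2 + math.pi)
    return vs, ts, other


# Toy vectors: the article matches the vaccine query and not the other one.
vs, ts, other = pair_scores([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```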

The class with the highest accumulated score is the final vote for the k-th article. In our experiment, we selected four different queries, which are listed in Appendix B.

We aggregated the output of all the scorers into a single prediction using majority voting. We used two majority-voting schemes: (1) majority of all scorers (MV 6); and (2) majority of three methods (MV 3) selected based on the performance analysis described in the next section. Each scorer contributes one vote, in the form of a label (vaccine, therapeutics, or other), for each article. The label with the most votes is the overall prediction for the article. Since there is an even number of scorers in the majority of all, when there is a tie between two labels, a positive class is always preferred, and between positive classes one is chosen randomly.
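The voting rule, including the tie-breaking described above, can be sketched as follows. The function name `majority_vote` and the optional `rng` argument (used only to make the random choice reproducible) are assumptions.

```python
import random
from collections import Counter

POSITIVE = ("vaccine", "therapeutics")


def majority_vote(votes, rng=None):
    """Return the label with the most votes, preferring positive classes on ties."""
    chooser = rng if rng is not None else random
    counts = Counter(votes)
    top = max(counts.values())
    tied = [label for label, count in counts.items() if count == top]
    if len(tied) == 1:
        return tied[0]
    # On a tie, a positive class is preferred over "other".
    positives = sorted(label for label in tied if label in POSITIVE)
    if positives:
        # Between tied positive classes, one is chosen at random.
        return chooser.choice(positives)
    return tied[0]
```

For example, three votes for "vaccine" and three for "other" yield "vaccine".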

First, we used the individual systems described earlier and the majority of all (MV 6) to predict labels (classes) for the manually annotated test dataset. As is customary, we calculated precision (P), recall (R), and F-measure (F) only for the positive classes (i.e., Vaccine and Therapeutics).

To characterize the performance of the systems relative to human judgment, a random sample of 119 of the 203 articles was categorized as shown in Table 1. The categories are a cross-product of two factors: (1) whether or not specific lexicons are present in the article text (abstract); and (2) whether or not the article was manually judged as a positive class (vaccines or therapeutics).
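The categorization can be sketched as a small function. Table 1 is not reproduced here, so the category numbering below (1: terms present, positive; 2: terms present, negative; 3: terms absent, positive; 4: terms absent, negative) is an assumption inferred from how the categories are discussed in the results; the lexicon and the substring matching are likewise illustrative, not the authors' term list.

```python
# Hypothetical lexicon; the paper's actual terms and synonyms are not listed.
LEXICON = ("vaccine", "vaccination", "therapeutic", "therapy", "treatment")
POSITIVE_LABELS = ("vaccine", "therapeutics")


def category(text, gold_label, lexicon=LEXICON):
    """Assign one of four categories from term presence and the gold label.

    Assumed numbering: 1 = terms present / positive, 2 = terms present /
    negative, 3 = terms absent / positive, 4 = terms absent / negative.
    """
    lowered = text.lower()
    has_term = any(term in lowered for term in lexicon)
    is_positive = gold_label in POSITIVE_LABELS
    if has_term:
        return 1 if is_positive else 2
    return 3 if is_positive else 4
```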

While 58% of the articles are in category 4, there was a significant number of articles in the other categories as well, which makes this categorization useful for analyzing the methods. From the performance of the different systems on these categories, we could draw preliminary conclusions about the scorer design vis-a-vis how the subjects of article screening (i.e., vaccines and therapeutics) were discussed in the abstracts. Based on the analysis, a majority of three scorers (MV 3) was formed and its performance was analyzed on the test dataset. The implementation of all the scorers is available online (see footnote 1).

Among the individual systems, GS (the model trained on Google Scholar search results) and MS (the tf-idf based scorer) outperformed all the others and achieved the same best F-score (0.50) (see Table 2). GS achieved the best precision (0.50), while MS achieved the best recall (0.76). NSP (next sentence prediction) and LSS (lexicon-based semantic similarity) performed rather poorly across the board, and CH (Clinical Hedges) and STS (clinical semantic text similarity) achieved moderate performance. As expected, the performance of MV 6 (majority of all) was in the middle of the range.

Table 3 shows the performance of the systems for the four categories introduced in Table 1. We highlighted in bold the best two systems for each category. We note that not all systems performed uniformly across the four categories. While GS performed well on categories 1, 2, and 4, it did poorly on category 3. On the other hand, MS and CH performed better on category 3. This suggested that combining the predictions of GS, CH, and MS might result in a system that performs better than any of them. We created a majority-of-three voting system (MV 3) based on this observation and found that it achieved the best results on the test dataset (the highest F-measure, 0.65, and precision, 0.59, and the second-best recall, 0.69).

1 https://github.com/md-labs/covid19-kaggle

Using the labels obtained from the Google Scholar search results to fine-tune the SciBERT model helped GS perform well across most categories. It performed poorly, however, on the category 3 articles, where terms specific to vaccines and therapeutics were not present. This observation suggests that the term-based relevance matching commonly used in search engines may not be adequate for accurate screening (classification) of articles for a study topic.

The robust tf-idf based scoring helped MS make better decisions on category 3 articles, in addition to category 1 articles. However, a lack of query-term discrimination might have undermined its performance for categories 2 and 4. The surprisingly better performance of NSP on categories 2 and 4 was due to its tendency to classify most articles as "Other". CH benefited from transfer learning for category 3. These observations, enabled by our analysis method, helped us combine GS, MS, and CH as a select majority-voting group and achieve the best overall performance.

We proposed a new approach to analyzing the performance of text classification methods. We applied this approach to study six transfer-learning and unsupervised methods for screening articles related to COVID-19 vaccines and therapeutics. These methods are of particular value since timely results can be obtained without the time and effort of generating a large labeled dataset. We used a novel 2x2 categorization of articles to understand the performance of the systems, and from the analysis formulated an effective voting ensemble of systems. Our methodology showed that while a weakly supervised model based on search-engine results generally performed well, it misclassified articles that did not contain the task-specific lexicon. Combining it with a tf-idf system and a transfer-learning system yielded better results. The key contribution of this paper is the novel approach to analyzing text classification methods to gain insights into their performance characteristics.

SciBERT: A pretrained language model for scientific text

SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation

BERT: Pre-training of deep bidirectional transformers for language understanding

A knowledge hunting framework for common sense reasoning

From group to individual labels using deep features

Combining knowledge hunting and neural language models to solve the Winograd schema challenge

CORD-19: The COVID-19 open research dataset

MedSTS: A resource for clinical semantic textual similarity

An overview of the design and methods for retrieving high-quality studies for clinical care

Simple applications of BERT for ad hoc document retrieval

Vaccine-related query:Vaccine is a substance used to stimulate the production of antibodies and provide immunity against diseases. They are treated to act as an antigen without inducing the disease. When the virulent version of an agent comes along, the immune system is prepared to respond due to the generation of B cells (memory and plasma cells), which will generate antibodies that will bind to pathogens and destroy them. Vaccine researchers are working on the development of a vaccine candidate expressing the viral spike protein of SARS-CoV-2 using a messenger RNA vaccine. Scientists are also focusing on the development of a chimpanzee adenovirus-vectored vaccine candidate against COVID-19. In addition, scientists are also working to see if vaccines developed for SARS coronavirus are effective against COVID-19.

Therapeutics is the branch of medicine concerned with the therapeutics of disease and the action of remedial agents. There is no specific antiviral therapy and therapeutics given by doctors is largely supportive, consisting of supplemental oxygen and conservative fluid administration. Drugs like Chloroquine, Hydroxychloroquine, Lopinavir, Ritonavir, Azithromycin and Tocilizumab are being prescribed by doctors in ICU testing. The drug Remdesivir has shown promise against other coronaviruses in animal models. Patients with respiratory failure require intubation. Patients in shock require urgent fluid resuscitation and administration of empiric antimicrobial therapy. Corticosteroid therapy is not recommended for viral pneumonia; however, use may be considered for patients with refractory shock or acute respiratory distress syndrome.

Vaccine-related query:
• vaccine vaccination dose antitoxin serum immunization inoculation for COVID-19 or coronavirus related research work.

Therapeutics-related query:
• Therapeutics therapeutics therapy drug antidotes cures remedies medication prophylactic restorative panacea for COVID-19 or coronavirus related research work.

• Coronavirus transmission, incubation and environment stability
• Coronavirus ethical and social science considerations
• Coronavirus information sharing and intersectoral collaboration