title: Document Classification for COVID-19 Literature
authors: Gutiérrez, Bernal Jiménez; Zeng, Juncheng; Zhang, Dongdong; Zhang, Ping; Su, Yu
date: 2020-06-15

The global pandemic has made it more important than ever to quickly and accurately retrieve relevant scientific literature for effective consumption by researchers in a wide range of fields. We provide an analysis of several multi-label document classification models on the LitCovid dataset, a growing collection of 8,000 research papers regarding the novel 2019 coronavirus. We find that pre-trained language models fine-tuned on this dataset outperform all other baselines, and that BioBERT and the novel Longformer model surpass all others with almost equivalent micro-F1 and accuracy scores of around 81% and 69% on the test set. We evaluate the data efficiency and generalizability of these models as essential features of any system prepared to deal with an urgent situation like the current health crisis. Finally, we explore 50 errors made by the best performing models on LitCovid documents and find that they often (1) correlate certain labels too closely together and (2) fail to focus on discriminative sections of the articles; both are important issues to address in future work. Both data and code are available on GitHub.

The COVID-19 pandemic has made it a global priority for research on the subject to be developed at unprecedented rates. Researchers in a wide variety of fields, from clinicians to epidemiologists to policy makers, must all have effective access to the most up-to-date publications in their respective areas. Automated document classification can play an important role in organizing the stream of articles by field and topic to facilitate the search process and speed up research efforts. To explore how document classification models can help organize COVID-19 research papers, we use the LitCovid dataset (Chen et al., 2020), a collection of 8,000 newly released scientific papers compiled by the NIH to facilitate access to the literature on all aspects of the virus. This dataset is updated daily, and every new article is manually assigned one or more of the following 8 categories: General, Transmission Dynamics (Transmission), Treatment, Case Report, Epidemic Forecasting (Forecasting), Prevention, Mechanism and Diagnosis. We leverage these annotations and the articles made available by LitCovid to compile a timely new dataset for multi-label document classification.

Apart from addressing the pressing needs of the pandemic, this dataset also offers an interesting document classification setting which spans different biomedical specialties while sharing one overarching topic. This setting is distinct from other biomedical document classification datasets, which tend to exclusively distinguish between biomedical topics such as hallmarks of cancer (Baker et al., 2016), chemical exposure methods (Baker, 2017) or diagnosis codes (Du et al., 2019). The dataset's shared focus on the COVID-19 pandemic also sets it apart from open-domain and academic paper classification datasets such as IMDB or the arXiv Academic Paper Dataset (AAPD) (Yang et al., 2018), in which no shared topic can be found across most of the documents, and it poses unique challenges for document classification models.
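To make the multi-label setup concrete, the sketch below shows one way the eight category annotations could be encoded as a binary indicator matrix with scikit-learn; the semicolon-separated label strings are a hypothetical input format, not the LitCovid release format.

```python
# Minimal sketch: encoding the eight LitCovid topic labels as multi-label targets.
# The raw label format below is hypothetical, for illustration only.
from sklearn.preprocessing import MultiLabelBinarizer

CATEGORIES = [
    "Prevention", "Treatment", "Diagnosis", "Mechanism",
    "Case Report", "Transmission", "Forecasting", "General",
]

# Toy annotations: each article carries one or more category labels.
raw_labels = ["Prevention", "Treatment;Diagnosis", "Case Report"]

mlb = MultiLabelBinarizer(classes=CATEGORIES)
Y = mlb.fit_transform([s.split(";") for s in raw_labels])
print(Y.shape)  # (3, 8): one binary indicator column per category
```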
We evaluate a number of models on the LitCovid dataset and find that fine-tuning pre-trained language models yields higher performance than traditional machine learning approaches and neural models such as LSTMs (Adhikari et al., 2019b; Kim, 2014; Liu et al., 2017). We also notice that BioBERT (Lee et al., 2019), a BERT model pre-trained on the original BERT corpus plus a large set of PubMed articles, performs slightly better than the original BERT base model. We also observe that the novel Longformer (Beltagy et al., 2020) model, which allows for processing longer sequences, matches BioBERT's performance when 1024 subwords are used instead of 512, the maximum for BERT models. We then explore the data efficiency and generalizability of these models as crucial aspects to address for document classification to become a useful tool against outbreaks like this one. Finally, we discuss some issues found in our error analysis: current models often (1) correlate certain categories too closely with each other and (2) fail to focus on discriminative sections of a document, getting distracted by introductory text about COVID-19. Both issues suggest avenues for future improvement.

Datasets. In this section, we describe the LitCovid dataset in more detail and briefly introduce the CORD-19 dataset, which we sampled to create a small test set for evaluating model generalizability.

The LitCovid dataset is a collection of recently published PubMed articles which are directly related to the 2019 novel coronavirus. The dataset contains upwards of 14,000 articles, and approximately 2,000 new articles are added every week, making it a comprehensive resource for keeping researchers up to date with the current COVID-19 crisis. For a large portion of the articles in LitCovid, either the full article or at least the abstract can be downloaded directly from their website. For our document classification dataset, we select the 8,002 articles out of the original 14,000+ which contain a full text or abstract. As seen in Table 1, these selected articles contain on average approximately 51 sentences and 1,200 tokens, reflecting the roughly even split between abstracts and full articles we observe from inspection. Each article in LitCovid is assigned one or more of the following 8 topic labels: Prevention, Treatment, Diagnosis, Mechanism, Case Report, Transmission, Forecasting and General. Even though every article in the corpus can be labelled with multiple tags, most articles (around 76%) contain only one label. Table 2 shows the label distribution for the subset of LitCovid used in the present work. We note that there is a large class imbalance, with the most frequently occurring label appearing almost 20 times as often as the least frequent one. We split the LitCovid dataset into train, dev and test sets with a 7:1:2 ratio.

In order to test how our models generalize to a different setting, we asked biomedical experts to label a small set of 100 articles found only in CORD-19. Each article was labelled independently by two annotators. For articles which received two different annotations (around 15%), a third annotator broke ties. Table 1 shows the statistics of this small set and Table 2 shows its category distribution.

In the following section we provide a brief description of each model and the implementations used. We use micro-F1 (F1) and accuracy (Acc.) as our evaluation metrics, as done in Adhikari et al. (2019a). All reproducibility information can be found in Appendix A.
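For reference, the snippet below computes both metrics on toy multi-label indicator matrices with scikit-learn; it is a minimal sketch rather than the evaluation scripts we actually use (see Appendix A).

```python
# Minimal sketch of the two evaluation metrics on toy multi-label data.
# Predictions and gold labels are binary indicator matrices of shape
# (n_documents, n_categories).
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 1]])

micro_f1 = f1_score(y_true, y_pred, average="micro")
# For multi-label inputs, sklearn's accuracy_score is exact-match (subset)
# accuracy: a document counts as correct only if every label matches.
exact_match = accuracy_score(y_true, y_pred)

print(f"micro-F1: {micro_f1:.3f}, accuracy: {exact_match:.3f}")
```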
To compare against simpler but competitive traditional baselines, we use the default scikit-learn (Pedregosa et al., 2011) implementations of logistic regression and a linear support vector machine (SVM) for multi-label classification, which train one classifier per class using a one-vs-rest scheme. Both models use a TF-IDF weighted bag-of-words representation as input.

Using Hedwig, a document classification toolkit, we evaluate the following models: KimCNN (Kim, 2014), XML-CNN (Liu et al., 2017), as well as an unregularized and a regularized LSTM (Adhikari et al., 2019b). We notice that they all perform similarly and slightly better than the traditional methods.

Using the same Hedwig toolkit, we evaluate the performance of DocBERT (Adhikari et al., 2019a) on this task with a few different pre-trained language models. We fine-tune BERT base, BERT large (Devlin et al., 2019) and BioBERT (Lee et al., 2019), a version of BERT base which was further pre-trained on a collection of PubMed articles. We find that all BERT models achieve their best performance at their maximum sequence length of 512 subwords. Additionally, we fine-tune the pre-trained Longformer (Beltagy et al., 2020) in the same way and find that it performs best with a maximum sequence length of 1024. We find that all pre-trained language models outperform the traditional and neural baselines by a sizable margin in both accuracy and micro-F1 score. The best performing models are the Longformer and BioBERT, both achieving a similar micro-F1 score of around 81% on the test set, with accuracies of 69.2% and 68.5%, respectively.

In this section, we explore data efficiency and model generalizability, and discuss potential ways to improve performance on this task in future work. During a sudden healthcare crisis like this pandemic, it is essential for models to produce useful results as soon as possible. Since labelling biomedical articles is a very time-consuming process, achieving peak performance using less data becomes highly desirable. We thus evaluate the data efficiency of these models by training each of the models shown in Figure 1 on 1%, 5%, 10%, 20% and 50% of our training data and reporting the micro-F1 score on the dev set. When selecting the data subsets, we sample each category independently to make sure all categories are represented. We observe that pre-trained models are much more data-efficient than other models and that BioBERT is the most efficient, demonstrating the importance of domain-specific pre-training. We also notice that BioBERT performs worse than other pre-trained models on 1% of the data, suggesting that its pre-training prevents it from leveraging potentially spurious patterns when there is very little data available.

[Table 4, first row: article "Analysis on epidemic situation and spatiotemporal changes of COVID-19 in Anhui" (abstract excerpt omitted); gold label: Forecasting; predicted labels include Prevention.]

To effectively respond to this pandemic, experts must not only learn as much as possible about the current virus but also thoroughly understand past epidemics and similar viruses. Thus, it is crucial for models trained on the LitCovid dataset to successfully categorize articles about related epidemics. We therefore evaluate some of our baselines on such articles using our labelled CORD-19 subset.
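For concreteness, the following is a minimal sketch of the TF-IDF one-vs-rest baseline described at the start of this section; the toy documents and labels are purely illustrative, and the defaults shown may differ from the exact scikit-learn configuration used in our experiments.

```python
# Minimal sketch of the TF-IDF + one-vs-rest logistic regression baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Toy training documents and a 3-category indicator matrix
# (columns: Prevention, Treatment, Diagnosis); purely illustrative.
train_texts = [
    "social distancing and mask policies reduce spread",
    "clinical trial of an antiviral drug for treatment",
    "evaluation of rt-pcr testing accuracy",
    "public health measures and mask compliance",
]
Y_train = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

baseline = make_pipeline(
    TfidfVectorizer(),                          # TF-IDF weighted bag-of-words
    OneVsRestClassifier(LogisticRegression()),  # one binary classifier per label
)
baseline.fit(train_texts, Y_train)
print(baseline.predict(["mask wearing in public spaces"]))  # 1x3 indicator row
```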
We find that the micro-F1 and accuracy metrics drop by around 10 and 30 points, respectively. This large drop in performance from a relatively small domain shift indicates that the models have trouble ignoring the overarching COVID-19 topic and isolating the relevant signals for each category. It is interesting to note that Mechanism is the only category for which BioBERT performs better on CORD-19 than on LitCovid. This could be because Mechanism articles use distinctive technical language and there are enough training samples for the models to learn it, in contrast with Forecasting, which also uses specific language but has far fewer training examples. BioBERT's binary F1 scores for each category on both datasets can be found in Appendix B.

We analyze 50 errors made by both of the highest scoring models, BioBERT and the Longformer, on LitCovid documents to better understand their performance. We find that 34% of these were annotation errors for which our best performing model predicted the correct label. We also find that 10% of the errors were nearly impossible to classify using only the text available on LitCovid; the full articles are needed to make better-informed predictions. From the rest of the errors we identify some aspects of this task which should be addressed in future work. We first note that these models often correlate certain categories, namely Prevention, Transmission and Forecasting, much more closely than necessary. Even though these categories are semantically related and some overlap exists, the Transmission and Forecasting tags are predicted in conjunction with the Prevention tag much more frequently than is observed in the gold labels, as can be seen from the table in Appendix C. Future work should attempt to explicitly model correlation between categories to help the model recognize the particular cases in which labels should occur together. The first row in Table 4 shows a document labelled as Forecasting which is also incorrectly predicted with a Prevention label, exemplifying this issue. Finally, we observe that models have trouble identifying discriminative sections of a document because of how much introductory content on the pandemic can be found in most articles. Future work should explicitly model the gap in relevance between introductory sections and crucial sentences such as thesis statements and article titles. In Table 4, the second and third examples would be classified correctly more easily if certain sentences were ignored and others attended to more thoroughly. This could also increase interpretability, facilitating analysis and further improvement.

We provide an analysis of document classification models on the LitCovid dataset for the COVID-19 literature. We determine that fine-tuning pre-trained language models yields the best performance on this task. We study the generalizability and data efficiency of these models and discuss some important issues to address in future work.

Appendix A. We split the LitCovid dataset into train, dev and test sets with a 7:1:2 ratio. We adopt micro-F1 and accuracy as our evaluation metrics, following Adhikari et al. (2019a). We use scikit-learn (Pedregosa et al., 2011) and the Hedwig evaluation scripts to evaluate all models. For preprocessing, tokenization and sentence segmentation, we use the NLTK library. All the document classification models used in the paper (logistic regression, SVM, DocBERT, LSTM, Reg-LSTM, XML-CNN, KimCNN) are run based on their publicly released implementations, strictly following their instructions.
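To illustrate this preprocessing step, the snippet below applies NLTK sentence segmentation and word tokenization to a toy abstract; it is a minimal sketch, not our exact pipeline.

```python
# Minimal sketch of NLTK-based preprocessing (sentence segmentation and word
# tokenization); the example abstract is a toy string, not from LitCovid.
import nltk
nltk.download("punkt", quiet=True)      # tokenizer models (older NLTK versions)
nltk.download("punkt_tab", quiet=True)  # tokenizer models (newer NLTK versions)
from nltk.tokenize import sent_tokenize, word_tokenize

abstract = ("The novel coronavirus spreads rapidly between humans. "
            "We model transmission dynamics and forecast new case counts.")
sentences = sent_tokenize(abstract)                 # split into sentences
tokens = [word_tokenize(s) for s in sentences]      # tokenize each sentence
print(len(sentences), sum(len(t) for t in tokens))  # 2 sentences, 18 tokens
```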
We used the following pre-trained language models: BioBERT, BERT base, BERT large and the Longformer. For reproducibility, we list the key hyperparameters, their tuning bounds and the number of parameters for each model in Table A1. For logistic regression and the SVM, all hyperparameters were left at their scikit-learn defaults and are therefore excluded from this table. For all models we train for a maximum of 30 epochs with a patience of 5. We use the micro-F1 score for all hyperparameter tuning. All models were run on NVIDIA GeForce GTX 1080 GPUs.

References

Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019a. DocBERT: BERT for document classification. arXiv preprint arXiv:1904.08398.

Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019b. Rethinking complex neural network architectures for document classification. In Proceedings of NAACL-HLT 2019.

Simon Baker, Ilona Silins, Yufan Guo, Imran Ali, Johan Högberg, Ulla Stenius, and Anna Korhonen. 2016. Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinformatics.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Ohio Supercomputer Center. 1987. Ohio Supercomputer Center. Columbus, OH.

Qingyu Chen, Alexis Allot, and Zhiyong Lu. 2020. Keep up with the latest coronavirus research. Nature, 579:193.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019.

Acknowledgments. This research was sponsored in part by the Ohio Supercomputer Center (Ohio Supercomputer Center, 1987). The authors would also like to thank Lang Li and Tanya Berger-Wolf for helpful discussions.