Classification of COVID-related articles using machine learning
Deepthi Godavarthi, A. Mary Sowjanya
Mater Today Proc, 2021-02-28. DOI: 10.1016/j.matpr.2021.01.480

The COVID-19 pandemic has placed the entire world in a precarious condition. What began as a serious issue in China is now being witnessed by citizens all over the world. Scientists are working hard to find treatments and vaccines for the coronavirus, also referred to as COVID. With the rapidly growing literature, it has become a major challenge for the medical community to find answers to questions related to COVID-19. We propose a machine learning-based system that uses the text classification applications of NLP to extract information from the scientific literature. Classifying large volumes of textual data makes the search process easier and therefore useful for scientists. The main aim of our system is to classify COVID-related abstracts by the journals in which they were published, so that a researcher can refer to articles of interest from the relevant journals instead of searching through all the articles. In this paper we describe the methodology needed to build such a system. The system is evaluated on the COVID-19 Open Research Dataset using classifiers such as KNN and the MLP classifier, and an explainer built on the XGBoost model is used to show the model predictions.

The district of Wanzhou was affected by the COVID-19 pandemic in China. With the lockdown of Wuhan and the surrounding places on January 23, 2020, Wanzhou became an enclosed centre for epidemiological investigations, providing an opportunity to understand the transmission dynamics and other risk factors associated with the spread of SARS-CoV-2, the agent of COVID-19. Forty-seven other Chinese cities implemented the same measures to tackle COVID-19. Most COVID-19 cases are very mild in severity [1, 2], which reduces the likelihood that affected individuals will seek testing and medical care [3].

The CORD-19 corpus [4], introduced by the Allen Institute for AI and other research groups, consists of over 200,000 scholarly articles, of which 100,000 are available with full text, about COVID-19, SARS-CoV-2, and related coronaviruses such as SARS and MERS. Various AI-based information retrieval and NLP techniques have been applied to this dataset to extract important information. We propose a machine learning-based system that uses the text classification applications of NLP to extract information from this scientific literature. The main aim is to classify the COVID-related articles according to the journals in which they were published; the task can therefore be treated as a multi-class classification problem.

On 7 January 2020 the raging virus was identified as a coronavirus. It had >95% homology with the bat coronavirus and >70% similarity with SARS-CoV. The number of COVID-related articles has kept increasing since February 2020, making it very difficult for human analysts to go through all of them. Applications such as identifying objects in images, speech-to-text conversion, matching news items, and recommending products related to user interests can be handled with machine learning techniques [5], which make use of the class of techniques known as deep learning [6]. A deep learning-based system that uses NLP question answering methods to mine the literature has been proposed [7], and CovidQA, a question-answering dataset, was developed for COVID-19 [8].
Shuja et al. [9] formulated a research domain taxonomy and identified features of datasets concerning their type, methods, and applications. Santos et al. [10] presented a dataset on COVID-19 that provides an overview of research activities, making it easy to find scientists and researchers who are active in combating the disease. Data mining models to predict COVID-19 patients' recovery were implemented using an epidemiological dataset of COVID-19 patients in South Korea [11]. An assessment of information flow and of the quality of scientific collaboration, both important for finding solutions to the pandemic, was carried out in [12]. A dataset consisting of the COVID-19 updates provided online by the Nigeria Centre for Disease Control from February to September was made available [13]. Saefi et al. [14] examined knowledge, practice, and attitude related to COVID-19 among undergraduate students in Indonesia and presented a dataset based on their survey. An automated theme-based visualization method that combines data modelling, information mapping, and trend analysis was proposed [15]. Machine learning techniques have been used to extract activities and trends from COVID-related articles [16]. For articles from the US, the gender distribution of first and last authors of COVID-related medical papers was compared with that of articles published in the same journals in 2019 [17]. Kieuvongngam et al. [18] performed text summarization on COVID-19 articles using advances in pre-trained NLP models, BERT and OpenAI GPT-2. Chamola et al. [19] presented an in-depth review of important aspects of COVID-19 drawn from reliable sources, as well as of technologies such as IoT, Unmanned Aerial Vehicles (UAVs), blockchain, Artificial Intelligence (AI), and 5G that can reduce the impact of COVID-19. An overview of feature extraction methods for recognizing isolated or segmented characters was presented in [20]. Dhole et al. [21] proposed a method in which natural language interpretation and classification techniques are used for disease diagnosis. The use of the ROC curve for evaluating the performance of machine learning algorithms was investigated in [22]. Machine learning techniques have been used to illustrate the text classification process [23], and an overview of text classification algorithms was presented in [24].

Our machine learning-based system consists of three modules: 1) text analysis, 2) data processing, and 3) machine learning. The first module performs language detection, named entity recognition, data cleaning, length analysis, and word counting. The second module performs target encoding and dataset partitioning. The third module performs feature engineering with a vectorizer, feature selection, model design, training and testing, evaluation, and explainability. The architecture of our system is shown in Fig. 1.

There are several journals in the dataset. In our model we considered a subset of three journals, bioRxiv, PLoS One, and Virology, represented in Fig. 2. The proportion of Virology articles is small compared to bioRxiv and PLoS One, so the dataset was resampled to address this imbalance. Data cleaning was performed first, followed by the extraction of insights from the raw data, which were added as new columns in a data frame and used by the classification model.

First, we detect the language in which each article was written. Since there may be articles in multiple languages, we retain only the articles written in English; the langdetect package is applied to the articles of the dataset, as in the sketch below.
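A minimal sketch of this journal filtering and language-detection step, assuming the CORD-19 metadata has been loaded into a pandas data frame with journal and abstract columns (the file path and column names are illustrative assumptions, not taken from the paper):

```python
import pandas as pd
from langdetect import detect, LangDetectException

# Load the CORD-19 metadata file (path and column names are assumptions).
df = pd.read_csv("metadata.csv", usecols=["journal", "abstract"]).dropna()

# Keep only the three journals considered in this study.
journals = ["bioRxiv", "PLoS One", "Virology"]
df = df[df["journal"].isin(journals)]

def detect_language(text: str) -> str:
    """Return an ISO 639-1 language code, or 'unknown' if detection fails."""
    try:
        return detect(text)
    except LangDetectException:
        return "unknown"

# Add the detected language as a new column and keep English abstracts only.
df["lang"] = df["abstract"].apply(detect_language)
df = df[df["lang"] == "en"]
```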
As shown in Fig. 3, the detected language is added as a new column across the complete dataset.

Named entity recognition (NER) is the process of tagging named entities in raw text with predefined categories such as names of persons, organizations, quantities, and locations. Training an NER model from scratch takes a lot of time because it requires a rich dataset, so we use the NER tools provided by spaCy, which offers several NLP models able to identify a wide range of entity categories. The spaCy model en_core_web_lg is applied to the abstracts. For each abstract, all the recognized entities are inserted into one column together with a count of the number of times each entity occurs in the text. One more column is then created for every tag category, in which the count of each entity is placed. This gives a macro view of the distribution of tag types.

The raw data must be prepared so that the machine learning model can handle it. The text cleaning steps depend on the type of data and the task at hand. Usually, strings are converted to lower case and punctuation is removed before the text is split into tokens. Since not all tokens are necessary, we can remove those that do not carry useful information. For example, words like "and" and "the" occur throughout the dataset and contribute little; such words are called stopwords and can be removed. While removing stopwords we must be careful, because removing the wrong token may cause the loss of important information. Word transformation techniques such as stemming and lemmatization are applied to reduce words to their root form. All these preprocessing steps are written in one function and applied to the complete dataset, as shown in Fig. 7.

We then check whether one category tends to be longer than the others, since length alone could be a useful feature for building a model. There are various ways to measure the length of text data. Here the observations are divided into three samples based on the journal names (bioRxiv, PLoS One, Virology), and we compare the histograms and densities of the samples. A variable is considered predictive if the distributions differ, because that indicates different patterns across the groups. Although the three groups have a similar length distribution, they have different sizes, so density plots are essential; these are depicted in Figs. 8 and 9. The text length analysis and the most frequent words in a particular journal are shown in Figs. 10 and 11.

CountVectorizer from scikit-learn is used to calculate word frequencies. This vectorizer converts a set of documents into a matrix of token counts. The information can be visualized with a word cloud, in which the frequency of each tag is conveyed by font size and colour. The word counts for the various journals are shown in Fig. 12.

We encode the target variable 'y', which holds the journal name, and add a new column named 'y_id' to the data frame to store the encoded journal, as shown in Fig. 13. The dataset is partitioned into a training set (70%) and a test set (30%) to evaluate the performance of the model, so that the model can be fit on the training set (Fig. 14).

In the bag-of-words model, a vocabulary is built from the document corpus and the number of times each word appears in a document is counted. A document is represented by a vector whose length equals the vocabulary size, with the words of the vocabulary acting as features. As the number of documents grows, the vocabulary becomes larger, which results in a huge feature matrix. A short sketch of these preprocessing and counting steps is given below.
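A minimal sketch of the cleaning function and the bag-of-words representation described above, continuing from the data frame in the earlier sketch and assuming NLTK and scikit-learn are available (the stopword list, lemmatizer, and column names are illustrative choices, not necessarily those used in the paper):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    """Lower-case, strip punctuation, remove stopwords, and lemmatize."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = [lemmatizer.lemmatize(tok) for tok in text.split()
              if tok not in stop_words]
    return " ".join(tokens)

# Apply the cleaning function to every abstract (column names are assumed).
df["abstract_clean"] = df["abstract"].apply(preprocess)

# Bag-of-words: each document becomes a vector of raw token counts.
bow = CountVectorizer()
X_counts = bow.fit_transform(df["abstract_clean"])
print(X_counts.shape)  # (number of documents, vocabulary size)
```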
To reduce this dimensionality problem, preprocessing was performed. Common words may occur with the highest frequency in the dataset yet have little impact on the target variable, so raw term frequency is not necessarily a good representation of text; instead of simple counting, tf-idf can be used.

Creating features from raw text for a machine learning model is called feature engineering and is considered the most important phase of text classification; it is the process of extracting information from data in order to create features. Here we use a tf-idf vectorizer with a limit of 10,000 words, capturing both unigrams and bigrams. The vectorizer is fit on the preprocessed corpus of the training set to extract the vocabulary and create the feature matrix, as shown in Fig. 15. The feature matrix x_train has a shape of 4873 (documents in the training set) x 10,000 (vocabulary length). We can look up a word in the vocabulary to find its position.

To reduce the dimensionality of the matrix we can drop some unimportant columns using feature selection, in which only a subset of relevant variables is retained. The number of features was thereby reduced from 10,000 to 627 by keeping only the most relevant ones. This reduced list of words is then given as input to refit the vectorizer on the corpus, producing a smaller feature matrix with a reduced vocabulary. Fig. 16 shows the new feature matrix x_train with a shape of 4873 x 627.

Support vector machine, XGBoost, and MLP classifiers are used to train the model, after which predictions can be made based on the learned relationships. Such an approach is suitable even when the dataset is very large, because each feature is taken into consideration independently, the probability of each category is calculated, and the category with the highest probability is predicted. All the models are trained on the feature matrix and tested on the transformed test set, and a scikit-learn pipeline is then built. A pipeline is an object consisting of a list of transformations followed by a final estimator; the tf-idf vectorizer and the model are placed in this pipeline so that it can transform the test data and produce predictions. Metrics such as accuracy, confusion matrix, ROC, precision, recall, F1-score, and support were used for evaluation, as shown in Figs. 17 to 19.

The performance of the bag-of-words model was evaluated: it classified 76% of the test set correctly (accuracy 0.76) with KNN and 83% with the MLP classifier, and the XGBoost algorithm improved the accuracy from 0.76 to 0.84.

Explainability is the process of describing the internal mechanics of a system in human terms; for this purpose the LIME explainer package was used. Random observations from the test set were considered and the model predictions were inspected. The words "virus" and "protein" pointed the model in the right direction (Virology), as shown in Fig. 20. Table 1 shows that XGBoost and the MLP classifier perform almost identically on this dataset.
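A minimal end-to-end sketch of the pipeline described above, assuming scikit-learn, xgboost, and lime are installed and continuing from the earlier sketches; the feature limits mirror those reported in the text, but the variable names, the chi-squared selector, and the class-name ordering are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from xgboost import XGBClassifier
from lime.lime_text import LimeTextExplainer

# 70/30 split of the cleaned abstracts and the encoded journal labels.
X_train, X_test, y_train, y_test = train_test_split(
    df["abstract_clean"], df["y_id"], test_size=0.30, random_state=42)

# tf-idf features (unigrams + bigrams, 10,000-word limit), feature selection
# down to the 627 most informative terms, then an XGBoost classifier.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=10000, ngram_range=(1, 2))),
    ("select", SelectKBest(chi2, k=627)),
    ("clf", XGBClassifier()),
])
pipe.fit(X_train, y_train)
print(classification_report(y_test, pipe.predict(X_test)))

# LIME explanation for a single test abstract; the class names are assumed
# to follow the encoding order used for y_id.
explainer = LimeTextExplainer(class_names=["bioRxiv", "PLoS One", "Virology"])
explanation = explainer.explain_instance(
    X_test.iloc[0], pipe.predict_proba, num_features=10)
print(explanation.as_list())
```

Keeping the vectorizer and the classifier in one pipeline, as described in the text, means the same transformations are applied consistently to both the training set and any new test data at prediction time.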
We have described a system consisting of three modules, namely text analysis, data processing, and the machine learning model. The system uses the CORD-19 dataset, which consists of various scientific articles related to COVID-19. We believe that with this system it will be easier for the community to retrieve the required articles from the journals of their interest and thereby help combat the pandemic. In the future we wish to develop a deep learning system with increased accuracy.

Author contributions: Godavarthi Deepthi: Conceptualization, Methodology, Software, Visualization, Writing - original draft. A. Mary Sowjanya: Data curation, Supervision, Validation, Writing - review & editing.

References
[1] Mild or moderate Covid-19.
[2] Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: summary of a report of 72314 cases from the Chinese Center for Disease Control and Prevention.
[3] Disease severity determines health-seeking behaviour amongst individuals with influenza-like illness in an internet-based cohort.
[4] CORD-19: The COVID-19 Open Research Dataset.
[5] Deep learning.
[6] On the Origin of Deep Learning.
[7] CAiRE-COVID: A Question Answering and Multi-Document Summarization System for COVID-19 Research.
[8] Rapidly Bootstrapping a Question Answering Dataset for COVID-19.
[9] COVID-19 open source data sets: a comprehensive survey.
[10] COVID-19: A scholarly production dataset report for research analysis.
[11] Predictive Data Mining Models for Novel Coronavirus (COVID-19) Infected Patients' Recovery.
[12] Preliminary analysis of COVID-19 academic information patterns: a call for open science in the times of closed borders.
[13] An exploratory assessment of a multidimensional healthcare and economic data on COVID-19 in Nigeria.
[14] Survey data of COVID-19-related knowledge, attitude, and practices among Indonesian undergraduate students.
[15] Visualising COVID-19 Research.
[16] Target specific mining of COVID-19 scholarly articles using one-class approach.
[17] COVID-19 medical papers have fewer women first authors than expected.
[18] Automatic Text Summarization of COVID-19 Medical Research Articles using BERT and GPT-2.
[19] A Comprehensive Review of the COVID-19 Pandemic and the Role of IoT, Drones, AI, Blockchain, and 5G in Managing Its Impact.
[20] Feature extraction methods for character recognition - a survey.
[21] NLP Based Retrieval of Medical Information for Diagnosis of Human Diseases.
[22] The use of the Area Under the ROC Curve in the Evaluation of Machine Learning Algorithms.
[23] Text classification using machine learning techniques.
[24] Text classification algorithms: a survey.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.