key: cord-0228992-y8665dlm
authors: Kocaman, Veysel; Talby, David
title: Improving Clinical Document Understanding on COVID-19 Research with Spark NLP
date: 2020-12-07
journal: nan
DOI: nan
sha: 1b6cbc2d96e18176929f92ed885457fcaf3ef009
doc_id: 228992
cord_uid: y8665dlm

Following the global COVID-19 pandemic, the number of scientific papers studying the virus has grown massively, leading to increased interest in automated literature review. We present a clinical text mining system that improves on previous efforts in three ways. First, it can recognize over 100 different entity types including social determinants of health, anatomy, risk factors, and adverse events in addition to other commonly used clinical and biomedical entities. Second, the text processing pipeline includes assertion status detection, to distinguish between clinical facts that are present, absent, conditional, or about someone other than the patient. Third, the deep learning models used are more accurate than previously available, leveraging an integrated pipeline of state-of-the-art pretrained named entity recognition models, and improving on the previous best performing benchmarks for assertion status detection. We illustrate extracting trends and insights, e.g. most frequent disorders and symptoms, and most common vital signs and EKG findings, from the COVID-19 Open Research Dataset (CORD-19). The system is built using the Spark NLP library, which natively supports scaling to use distributed clusters, leveraging GPUs, configurable and reusable NLP pipelines, healthcare-specific embeddings, and the ability to train models to support new entity types or human languages with no code changes.

The COVID-19 pandemic brought a surge of academic research about the virus, resulting in 23,634 new publications between January and June of 2020 (da Silva, Tsigaris, and Erfanmanesh 2020) and accelerating to 8,800 additions per week from June to November in the COVID-19 Open Research Dataset (Wang et al. 2020). Such a high volume of publications makes it impossible for researchers to read each publication, resulting in increased interest in applying natural language processing (NLP) and text mining techniques to enable semi-automated literature review (Cheng, Cao, and Liao 2020).

In parallel, there is a growing need for automated text mining of electronic health records (EHRs) in order to find clinical indications that new research points to. EHRs are the primary source of information for clinicians tracking the care of their patients. Information fed into these systems may be found in structured fields for which values are entered electronically (e.g. laboratory test orders or results) (Liede et al. 2015), but most of the time the information in these records is unstructured, making it largely inaccessible for statistical analysis (Murdoch and Detsky 2013). These records include information such as the reason for administering drugs, previous disorders of the patient, or the outcome of past treatments, and they are the largest source of empirical data in biomedical research, allowing for major scientific findings in highly relevant disorders such as cancer and Alzheimer's disease (Perera et al. 2014).

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

A primary building block in such text mining systems is named entity recognition (NER), which is regarded as a critical precursor for question answering, topic modelling, information retrieval, etc. (Yadav and Bethard 2019). In the medical domain, NER recognizes the first meaningful chunks out of a clinical note, which are then fed down the processing pipeline as an input to subsequent downstream tasks such as clinical assertion status detection (Uzuner et al. 2011), clinical entity resolution (Tzitzivacos 2007), and de-identification of sensitive data (Uzuner, Luo, and Szolovits 2007) (see Figure 1). However, segmentation of clinical and drug entities is considered to be a difficult task in biomedical NER systems because of the complex orthographic structures of named entities (Liu et al. 2015).

The next step following an NER model in the clinical NLP pipeline is to assign an assertion status to each named entity given its context. The status of an assertion explains how a named entity (e.g. clinical finding, procedure, lab result) pertains to the patient by assigning a label such as present ("patient is diabetic"), absent ("patient denies nausea"), conditional ("dyspnea while climbing stairs"), or associated with someone else ("family history of depression"). In the context of COVID-19, applying accurate assertion status detection is crucial, since most patients will be tested for and asked about the same set of symptoms and comorbidities, so limiting a text mining pipeline to recognizing medical terms without context is not useful in practice.

In this study, we introduce a set of pre-trained NER models that are all trained on biomedical and clinical datasets within a Bi-LSTM-CNN-Char deep learning architecture, and a Bi-LSTM based assertion detection module built on top of the Spark NLP software library. We then illustrate how to extract knowledge and relevant information from unstructured electronic health records (EHR) and the COVID-19 Open Research Dataset (CORD-19) by combining these models in a pipeline. Using state-of-the-art deep learning architectures, Spark NLP's NER and Assertion modules can also be extended to other human languages with zero code changes. Moreover, by utilizing Apache Spark, both training and inference of full NLP pipelines can scale to make the most of distributed Spark clusters. For brevity, the implementation details and training metrics of these models are kept out of the scope of this study.

Figure 1: Named Entity Recognition is a fundamental building block of medical text mining pipelines, and feeds downstream tasks such as assertion status, entity linking, de-identification, and relation extraction.

The specific novel contributions of this paper are:

• Introducing a medical text mining pipeline composed of state-of-the-art, healthcare-specific NER models

• Introducing a clinical assertion status detection model that establishes a new state-of-the-art level of accuracy on a widely used benchmark

• Describing how to apply these models in a unified, performant, and scalable pipeline on documents from the CORD-19 dataset.

The remainder of the paper is organized as follows: Section 2 introduces the Spark NLP library, summarizes the NER and assertion detection model frameworks it implements, and elaborates on the named entities in each pre-trained NER model. Section 4 explains how to build a prediction pipeline to extract named entities and assign assertion statuses from a set of documents on a cluster with Spark NLP. Section 5 discusses benchmarking speed and scalability issues, and Section 6 concludes this paper by summarizing key points and future directions.

The deep neural network architecture for named entity recognition in Spark NLP is based on the BiLSTM-CNN-Char framework. It is a modified version of the architecture proposed by Chiu and Nichols (2016): a neural network that automatically detects word-level and character-level features using a hybrid bidirectional LSTM and CNN architecture, eliminating the need for most feature engineering steps. The detailed architecture of the framework in the original paper is illustrated in Figure 2, and sample predictions from a set of pre-trained clinical NER models on a text taken from the CORD-19 dataset are shown in Figure 3. In Spark NLP, this architecture is implemented using TensorFlow, and has been heavily optimized for accuracy, speed, scalability, and memory utilization. This setup has been tightly integrated with Apache Spark to let the driver node run the entire training using all of its available cores. There is a CUDA version of each TensorFlow component to enable training models on GPU when available. Spark NLP provides open-source APIs in Python, Java, Scala, and R, so that users do not need to be aware of the underlying implementation details (TensorFlow, Spark, etc.) in order to use it.

The full list of the entities for each pre-trained medical NER model is available in Appendix D, the accuracy metrics are given in Table 1, and sample Python code for training an NER model from scratch is in Appendix C.

The deep neural network architecture for assertion status detection in Spark NLP is based on a Bi-LSTM framework, and is a modified version of the architecture proposed by Fancellu, Lopez, and Webber (2016). Its goal is to classify the assertions made on given medical concepts as being present, absent, or possible in the patient, conditionally present in the patient under certain circumstances, hypothetically present in the patient at some future point, or mentioned in the patient report but associated with someone else (Uzuner et al. 2011).

In the proposed implementation, input units depend on the target tokens (a named entity) and the neighboring words, which are explicitly encoded as a sequence using word embeddings. Similar to Fancellu, Lopez, and Webber (2016), we have observed that 95% of the scope tokens (neighboring words) fall in a window of 9 tokens to the left and 15 to the right of the target tokens in the same dataset. We therefore implemented the same window size and used a learning rate of 0.0012, dropout of 0.05, batch size of 64, and a maximum sentence length of 250. The model has been implemented within Spark NLP as an annotator called AssertionDLModel. After training for 20 epochs and measuring accuracy on the official test set, this implementation exceeds the latest state-of-the-art accuracy benchmarks, as summarized in Table 2.

Table 2: Assertion detection model test metrics. Our implementation exceeds the benchmarks in the latest best model (Uzuner et al. 2011) in 4 out of 6 assertion labels, and in overall accuracy.

When we call fit() on the pipeline with a Spark data frame, its "text" column is fed into the DocumentAssembler() transformer and a new column "document" is created as the initial entry point to Spark NLP for any Spark data frame. Then, the "document" column is fed into the SentenceDetector() module to split the text into an array of sentences, and a new column "sentence" is created. Then, the "sentence" column is fed into Tokenizer(), each sentence is tokenized, and a new column "token" is created. Then, tokens are normalized (basic text cleaning) and word embeddings are generated for each. Now the data is ready to be fed into the NER models and then into the assertion model. Since assertion status labels are assigned to a medical concept that is given as an input to the assertion detection model, the NER and assertion models must work together sequentially.

In Spark NLP, we handle this interaction by feeding the output of the NER models to an NER converter, which creates chunks from labeled entities, and then feeding these chunks to the assertion status detection model within the same pipeline. The flow diagram of such a pipeline can be seen in Figure 4. As the flow diagram shows, in Spark NLP each generated (output) column is pointed to the next module as an input, depending on its input column specifications. Sample Python code for such a prediction pipeline can be seen in Appendix B. This enables users to easily configure arbitrary pipelines, such as running 20 pre-trained NER models within one pipeline, as we do in this analysis of the CORD-19 dataset.
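The column wiring described above can be checked mechanically before a pipeline is run. The following is a minimal sketch in plain Python, not Spark NLP code: the `Stage` tuple and the `check_wiring` helper are hypothetical names introduced for illustration, and the stage list simply mirrors the column names used in Appendix B.

```python
from typing import List, NamedTuple


class Stage(NamedTuple):
    """Hypothetical stand-in for a pipeline stage: a name plus its columns."""
    name: str
    inputs: List[str]
    output: str


def check_wiring(stages: List[Stage], initial_columns: List[str]) -> List[str]:
    """Return the columns available after all stages have run.

    Raises ValueError if a stage asks for a column that neither the initial
    data frame nor an earlier stage provides.
    """
    available = list(initial_columns)
    for stage in stages:
        missing = [col for col in stage.inputs if col not in available]
        if missing:
            raise ValueError(f"{stage.name} is missing input columns: {missing}")
        available.append(stage.output)
    return available


# Stage list mirroring the column names of the prediction pipeline (assumed
# from Appendix B; the Stage objects themselves are illustrative only).
stages = [
    Stage("DocumentAssembler", ["text"], "document"),
    Stage("SentenceDetector", ["document"], "sentence"),
    Stage("Tokenizer", ["sentence"], "token"),
    Stage("WordEmbeddingsModel", ["sentence", "token"], "embeddings"),
    Stage("NerDLModel", ["sentence", "token", "embeddings"], "ner"),
    Stage("NerConverter", ["sentence", "token", "ner"], "ner_chunk"),
    Stage("AssertionDLModel", ["sentence", "ner_chunk", "embeddings"], "assertion"),
]

columns = check_wiring(stages, initial_columns=["text"])
print(columns)
```

Because the assertion stage consumes "ner_chunk", placing it before the NER converter would fail this check, which is the sequential dependency between NER and assertion models that the text describes.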

NLP pipelines configured this way are easily reproducible, since they are serializable and directly expressed in code. They also simplify experimentation: for example, comparing multiple NER and assertion status models in the same run (while benefiting from the fact that data and embeddings are only loaded into memory once), or trying different text cleaning steps before the NER stage (such as stopword removal, lemmatization, or automated spell correction).

While the CORD-19 text mining pipeline scales to process an arbitrary number of articles, for purposes of concrete demonstration the next two tables show results on a random sample of 100 articles. The number of recognized named entities for the selected entity classes can be seen in Table 4. The number of entities detected in each document (20 NER models, over 10 documents) can be seen in Table 5. The most frequent phrases for the selected entity types can be found in Table 6. The predictions from the assertion status detection model for Disease Syndrome Disorder are shown in Table 7.
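The "most frequent phrases per entity type" view described above amounts to a grouped frequency count over the extracted chunks. A minimal sketch in plain Python follows; the `top_terms` helper, the entity type labels, and the sample chunks are all made up for illustration and are not results from this study.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def top_terms(
    chunks: Iterable[Tuple[str, str]], k: int = 10
) -> Dict[str, List[Tuple[str, int]]]:
    """Group (entity_type, text) pairs by type and return the k most
    frequent texts for each type, after lowercasing and trimming."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for entity_type, text in chunks:
        counts[entity_type][text.strip().lower()] += 1
    return {etype: counter.most_common(k) for etype, counter in counts.items()}


# Made-up chunks for illustration only; these are not outputs of the pipeline.
sample_chunks = [
    ("Symptom", "cough"),
    ("Symptom", "Cough"),
    ("Symptom", "headache"),
    ("Drug_Ingredient", "oseltamivir"),
    ("Symptom", "cough "),
    ("Drug_Ingredient", "oseltamivir"),
]

result = top_terms(sample_chunks, k=2)
print(result["Symptom"])  # [('cough', 3), ('headache', 1)]
```

On a distributed data frame the same aggregation would normally be expressed as a group-by and count; the in-memory version here only illustrates the logic.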

One benefit of this system compared to previous work is the variety of medical entity types that can be recognized: as detailed in Appendix D, this NLP pipeline extracts over 100 entity types. While most clinical named entity recognition projects focus on symptoms, treatments, and drugs, and most biomedical projects focus on chemicals, proteins, and genes, this pipeline goes beyond these and can also extract:

• Entities related to social determinants of health such as age and gender, race and ethnicity, diet, social history, employment, relationship status, alcohol use, sexual activity and orientation, substance, process

Table 4 shows that this variety is useful in practice in the context of COVID-19 research. On just 10 randomly selected documents and 20 entity types, there are over 60 cases of more than a hundred instances of one entity type found within one paper. In fewer than 10% of the cells were fewer than 10 entities recognized for a specific entity type in a specific document. This suggests that text mining approaches that ignore these entity types fail to take advantage of a lot of the clinical insight that COVID-19 research papers include. Table 7 shows how an accurate assertion status detection model can help in filtering this large number of entities, in order to focus researchers and downstream algorithms on the most clinically relevant insights. In this small sample, 'systemic disease' is a present clinical condition; 'infectious diseases' and 'disorders of immunity' are hypothetical; while 'skin diseases' and 'parvovirus' are associated with someone else.

Consider a common use case of building an automated knowledge graph that links patient symptoms to drugs they are taking, existing conditions, or past procedures. Having assertion status detection results, and so being able to filter down to only the symptoms and drugs that actually apply to the patient, will have a substantial impact on the accuracy of the bottom-line results. Since more than a thousand entities are recognized in each research paper, and there are hundreds of thousands of published COVID-19 papers, doing this automatically, accurately, and at scale is required.
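The filtering step in this use case can be sketched as follows. This is an illustrative plain-Python example under stated assumptions: the `Finding` record, the `keep_present` helper, and the label strings ("present", "absent", "conditional", "associated_with_someone_else") are hypothetical stand-ins for the pipeline output, and the sample texts are the example phrases given earlier for each assertion status.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Finding:
    """Hypothetical record pairing an extracted chunk with its assertion status."""
    text: str
    entity_type: str
    assertion: str


def keep_present(findings: List[Finding]) -> List[Finding]:
    """Keep only findings asserted as present in the patient."""
    return [f for f in findings if f.assertion == "present"]


# Illustrative inputs based on the example phrases for each status.
findings = [
    Finding("diabetic", "problem", "present"),                          # "patient is diabetic"
    Finding("nausea", "problem", "absent"),                             # "patient denies nausea"
    Finding("dyspnea", "problem", "conditional"),                       # "dyspnea while climbing stairs"
    Finding("depression", "problem", "associated_with_someone_else"),   # "family history of depression"
]

graph_nodes = keep_present(findings)
print([f.text for f in graph_nodes])  # ['diabetic']
```

Without the filter, all four mentions would enter the knowledge graph as if they described the patient; with it, only one does.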

The design of Spark NLP pipelines as described in Figure 4, where new columns are added to an existing (potentially distributed) data frame with each additional pipeline step, is optimized for parallel execution. It is designed for the case where different rows may reside on different machines, benefiting from the optimizations and design of Spark ML.

Table 6: The 10 most frequent terms for the selected entity types, predicted by parsing 100 articles from the CORD-19 dataset (Wang et al. 2020) with an NER model named jsl_ner_wip in Spark NLP. From the model's predictions, we can get valuable information about the most frequent disorders or symptoms mentioned in the papers, or the most common vital sign and EKG findings, without reading the papers. According to this table, the most common symptoms are cough and inflammation, while the most common drug ingredients mentioned are oseltamivir and antibiotics. We can also say that cardiogenic oscillations and ventricular fibrillation are the common observations from EKGs, while fever and hypothermia are the most common vital signs.

In order to evaluate how fast the pipeline works and how effectively it scales to make use of a compute cluster, we ran the same Spark NLP prediction pipelines in local mode and in cluster mode. In local mode, a single Dell server with 32 cores and 32 GB memory was used. In cluster mode, 10 machines with 32 GB and 16 cores each were used, in a Databricks cluster on AWS. The performance results are shared in Figure 5. These benchmarks show that tokenization is 20x faster while entity extraction is 3.5x faster on the cluster, compared to the single machine run. This indicates that speedup depends on the complexity of the task. For example, tokenization provides super-linear speedup (i.e. growing machines by 10x improves speed by more than 10x), while NER delivers sub-linear speedup (because it is a more computationally complex task).
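The super-linear versus sub-linear distinction can be made concrete by dividing each reported speedup by the factor by which resources grew. The sketch below is a small arithmetic illustration; the `parallel_efficiency` helper is a hypothetical name, and the scale factor of 10 counts machines, following the text. Counting cores instead (32 locally versus 10 × 16 = 160 on the cluster) would give a factor of 5.

```python
def parallel_efficiency(speedup: float, scale_factor: float) -> float:
    """Speedup divided by the resource growth factor.

    A value above 1 means super-linear scaling, below 1 sub-linear.
    """
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return speedup / scale_factor


MACHINES = 10  # cluster size relative to the single-machine run

# Speedups as reported for the two benchmarked tasks.
reported_speedups = {"tokenization": 20.0, "entity extraction": 3.5}

efficiencies = {
    task: parallel_efficiency(speedup, MACHINES)
    for task, speedup in reported_speedups.items()
}

for task, eff in efficiencies.items():
    kind = "super-linear" if eff > 1 else "sub-linear" if eff < 1 else "linear"
    print(f"{task}: efficiency {eff:.2f} ({kind})")
```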

In this study, we introduced a set of pretrained named entity recognition and assertion status detection models that are trained on biomedical and clinical datasets with deep learning architectures on top of Spark NLP. We then presented how to extract relevant facts from the CORD-19 dataset by applying state-of-the-art NER and assertion status models in a unified and scalable pipeline, and shared the results to illustrate extracting valuable information from scientific papers.

The results suggest that papers present in CORD-19 include a wide variety of the many entity types that this new NLP pipeline can recognize, and that assertion status detection is a useful filter on these entities. This bodes well for the richness of downstream analysis that can be done using this now structured and normalized data, such as clustering, dimensionality reduction, semantic similarity, visualization, or graph-based analysis to identify correlated concepts. One future research direction is to apply these downstream analyses to the richer, scalable, and more accurate insights that this NLP pipeline generates.

Figure 5: Comparing the Spark NLP document parsing pipeline in standalone and cluster mode. Tests show that tokenization is 20x faster while entity extraction is 3.5x faster in cluster mode when compared to standalone mode.

Since NER and assertion status models in Spark NLP are trainable, it is easy to add support for a new language like German, French, or Spanish, as long as there is annotated data for it. Spark NLP currently supports 46 languages, and 3 languages for healthcare: English, German, and Spanish. Spark NLP provides production-grade libraries for popular programming languages (Python, Scala, Java, and R) and has an active community, frequent releases, public documentation, and freely available code examples. Future work in this space includes adding support for additional languages and additional entity types, and extending the NLP pipeline further by adding relation extraction and entity resolution models.

BIO (Begin, Inside and Outside) and BIOES (Begin, Inside, Outside, End, Single) are schemes for encoding entity annotations as token tags. In BIO, words tagged with O are outside of named entities, and the I-XXX tag is used for words inside a named entity of type XXX. Whenever two entities of type XXX are immediately next to each other, the first word of the second entity is tagged B-XXX to highlight that it starts another entity. BIOES (also known as BILOU), on the other hand, is a slightly more sophisticated annotation method that distinguishes between the end of a named entity and single-word entities. In this scheme, for example, a word describing a gene entity is tagged with "B-Gene" if it is at the beginning of the entity, "I-Gene" if it is in the middle of the entity, and "E-Gene" if it is at the end of the entity. Single-word gene entities are tagged with "S-Gene". All other words not describing entities of interest are tagged as "O".
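The relationship between the two schemes can be illustrated with a small converter. This is a minimal sketch in plain Python, not part of Spark NLP; the `bio_to_bioes` function name is hypothetical. It accepts both the BIO variant described above (B- only between adjacent entities of the same type) and the common variant in which every entity starts with B-.

```python
from typing import List, Optional


def _label(tag: str) -> Optional[str]:
    """Return the entity type of a tag, or None for the outside tag 'O'."""
    return None if tag == "O" else tag.split("-", 1)[1]


def bio_to_bioes(tags: List[str]) -> List[str]:
    """Convert a BIO tag sequence into the equivalent BIOES sequence."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        prev_tag = tags[i - 1] if i > 0 else "O"
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # An entity starts at an explicit B- tag, or when the previous
        # token is outside or belongs to a different entity type.
        starts = prefix == "B" or _label(prev_tag) != label
        # The entity continues only if the next token is I- of the same type.
        continues = next_tag == f"I-{label}"
        if starts:
            bioes.append(f"B-{label}" if continues else f"S-{label}")
        else:
            bioes.append(f"I-{label}" if continues else f"E-{label}")
    return bioes


# A three-word gene entity immediately followed by a single-word gene entity.
bio_tags = ["O", "I-Gene", "I-Gene", "I-Gene", "B-Gene", "O"]
bioes_tags = bio_to_bioes(bio_tags)
print(bioes_tags)  # ['O', 'B-Gene', 'I-Gene', 'E-Gene', 'S-Gene', 'O']
```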

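B Defining a Spark NLP Pipeline

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

# Entry point: turn the raw "text" column into a "document" annotation
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split each document into sentences
sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Split each sentence into tokens
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Healthcare-specific word embeddings
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# Pretrained clinical NER model
clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level NER tags into entity chunks
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

# Assign an assertion status to each entity chunk
clinical_assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"])\
    .setOutputCol("assertion")

# Assumed final step (not present in the extracted listing): chain the stages in order
nlpPipeline = Pipeline(stages=[
    documentAssembler, sentenceDetector, tokenizer, word_embeddings,
    clinical_ner, ner_converter, clinical_assertion])

# Fragment of the NER training listing (Appendix C). Its opening lines are
# missing from the source; the NerDLApproach() constructor and the variable
# name nerTagger are assumptions.
nerTagger = NerDLApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(10)\
    .setDropout(0.5)\
    .setLr(0.001)\
    .setPo(0.005)\
    .setBatchSize(8)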

An overview of literature on COVID-19, MERS and SARS: Using text mining and latent Dirichlet allocation

Named entity recognition with bidirectional LSTM-CNNs

Publishing volumes in major databases related to Covid-19

NCBI disease corpus: a resource for disease name recognition and concept normalization

Neural networks for negation scope detection

2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records

MIMIC-III, a freely accessible critical care database

Introduction to the bio-entity recognition task at JNLPBA

Validation of International Classification of Diseases coding for bone metastases in electronic health records using technology-enabled abstraction

Effects of semantic features on machine learning-based drug name recognition systems: word embeddings vs. manually constructed dictionaries

The inevitable application of big data to health care

Overview of BioNLP shared task 2013

Factors associated with response to acetylcholinesterase inhibition in dementia: a cohort study from a secondary mental health care case register in London

Anatomical entity mention recognition at literature scale

Semeval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (ddiextraction 2013)

A silver standard corpus of human phenotype-gene relations

Identifying risk factors for heart disease over time: Overview of 2014 i2b2/UTHealth shared task Track 2

Evaluating temporal relations in clinical text: 2012 i2b2 Challenge

International Classification of Diseases 10th edition (ICD-10): main article

Evaluating the state-of-the-art in automatic de-identification

2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text

CORD-19: The Covid-19 Open Research Dataset

A survey on recent advances in named entity recognition from deep learning models

ner_bionlp (Nédellec et al. 2013)
Entities: cellular component, organ, cancer, organism substance, multi, simple chemical, tissue, anatomical system, organism subdivision, immaterial anatomical entity, organism, developing anatomical structure, amino acid, gene or gene product, pathological formation, cell

ner_cellular (Kim et al. 2004)
Entities: dna, cell line, cell type, rna, protein

ner_clinical (Uzuner et al. 2011)
Entities: treatment, problem, test

ner_deid (Stubbs et al. 2015)
Entities: location, contact, date, profession, name, age, id

ner_deid_enriched (Stubbs et al. 2015)
Entities: idnum, country, date, profession, medicalrecord, username, organization, zip, id, healthplan, location, device, hospital, city, email, doctor, street, state, patient, bioid, url, phone, fax, age

ner_diseases (Dogan, Leaman, and Lu 2014)
Entities: disease

ner_drugs (Henry et al. 2020), (Segura Bedmar, Martínez, and Herrero Zazo 2013)
Entities: drug

ner_events_clinical (Sun, Rumshisky, and Uzuner 2013)
Entities: test, problem, clinical dept, occurrence, date, time, evidential, treatment, frequency, duration

jsl_ner_wip_clinical
Entities: triglycerides, oncological, female reproductive status, form, time, date, alcohol, medical history header, race ethnicity, temperature, drug brandname, frequency, fetus newborn, sexually active or sexual orientation, disease syndrome disorder, section header, social history header, strength, cerebrovascular disease, family history header, employment, weight, pregnancy, total cholesterol, diet, ekg findings, gender, drug ingredient, vaccine, substance, oxygen therapy, internal organ or component, blood pressure, overweight, obesity, birth entity, heart disease, diabetes, substance quantity, treatment, death entity, route, modifier, test, clinical dept, communicable disease, psychological condition, hypertension, direction, o2 saturation, hyperlipidemia, imagingfindings, vs finding, allergen, dosage, kidney disease, bmi, smoking, pulse, ldl, symptom, labour delivery, relationship status, external body part or region, hdl, respiration, procedure, height, vital signs header, relativetime, relativedate, injury or poisoning, medical device, test result, duration, age, admission discharge

ner_medmentions_coarse
Entities: pathologic function, geographic area, group, diagnostic procedure, organic chemical, organism attribute, mental or behavioral dysfunction, organization, research activity, therapeutic or preventive procedure, biomedical or dental material, mammal, genetic function, body system, substance, daily or recreational activity, quantitative concept, health care activity, molecular function, indicator, reagent, or diagnostic aid, body substance, virus, eukaryote, disease or syndrome, spatial concept, anatomical structure, body part, organ, or organ component, laboratory procedure, sign or symptom, nucleic acid, nucleoside, or nucleotide, food, mental process, prokaryote, nucleotide sequence, professional or occupational group, cell, biologic function, manufactured object, molecular biology research technique, gene or genome, chemical, neoplastic process, pharmacologic substance, tissue, qualitative concept, amino acid, peptide, or protein, fungus, population group, body location or region, clinical attribute, injury or poisoning, medical device, cell component, plant

ner_posology (Henry et al. 2020)
Entities: form, dosage, strength, drug, route, frequency, duration

ner_risk_factors (Stubbs et al. 2015)
Entities: family hist, smoker, obese, medication, hypertension, hyperlipidemia, phi, diabetes, cad

ner_human_phenotype_go_clinical (Sousa, Lamurias, and Couto 2019)
Entities: go, hp

ner_human_phenotype_gene_clinical (Sousa, Lamurias, and