key: cord-0844815-hl4wppxa authors: Newton, Adam J.H.; Chartash, David; Kleinstein, Steven H.; McDougal, Robert A. title: Pipeline for retrieval of COVID-19 immune signatures date: 2021-12-30 journal: bioRxiv DOI: 10.1101/2021.12.29.474353 sha: b876499f7c8c0732bb03844d2940096a10108560 doc_id: 844815 cord_uid: hl4wppxa

Objective: The accelerating pace of biomedical publication has made retrieving papers and extracting specific scientific information from them a key challenge. A timely example of such a challenge is to retrieve the subset of papers that report on immune signatures (coherent sets of biomarkers), in order to understand the immune response mechanisms that drive differential SARS-CoV-2 infection outcomes. A systematic and scalable approach is needed to identify and extract COVID-19 immune signatures in a structured and machine-readable format.
Materials and Methods: We used SPECTER embeddings with SVM classifiers to automatically identify papers containing immune signatures. A generic web platform was used to manually screen papers and to allow anonymous submission.
Results: We demonstrate a classifier that retrieves papers with human COVID-19 immune signatures with a positive predictive value of 86%. Semi-automated queries to the corresponding authors of these publications requesting signature information achieved a 31% response rate. This demonstrates the efficacy of using an SVM classifier with document embeddings of the abstract and title to retrieve papers with scientifically salient information, even when that information is rarely present in the abstract. Additionally, classification based on the embeddings identified the type of immune signature (e.g., gene expression vs. other types of profiling) with a positive predictive value of 74%.
Conclusion: Coupling a classifier based on document embeddings with direct author engagement offers a promising pathway to build a semistructured representation of scientifically relevant information.
Through this approach, partially automated literature mining can help rapidly create semistructured knowledge repositories for automatic analysis of emerging health threats.

The rapid growth in scientific publications [1] presents a challenge for researchers seeking a comprehensive understanding of the literature. This challenge is of particular importance in emerging disciplines and domains without existing comprehensive reviews or widely accepted frameworks for representing the field. The COVID-19 pandemic is one such example of an emerging publication phenomenon. While machine learning has provided many solutions for search problems related to information retrieval (IR) [2], the application of IR to specific scientific domains remains an active area of research [3, 4]. Researchers have leveraged search engines to retrieve relevant literature, with keyword searches [5] or alerts [6], but these approaches usually require substantial further refinement. Once relevant sources have been retrieved, information has to be extracted from the text. For some domains, machine-consumable structures make specific data types trivial to extract, e.g. genes [7] and proteins [8]; however, integrating this information with a more comprehensive data model remains challenging. There are many methods to obtain salient information from identified sources, including: manual curation, e.g. HIPC [5]; rule-based semi-automated extraction of metadata from an abstract, e.g. the metadata suggestions for ModelDB [9]; and PICO (population, intervention, control, and outcomes) extraction [10], which tags words related to the PICO elements in randomized control trials. Given the novelty of the scientific domain of COVID-19 research, it is difficult to know what information characterizes this subfield and how it will be presented in a paper. A semi-automated, human-in-the-loop approach therefore facilitates a solution.

COVID-19 may affect the human immune system in different ways.
These effects, which could be at the level of changes in gene expression, proteins, metabolites, antibodies, etc., may vary by population (e.g. young vs. old), disease severity (e.g. mild vs. severe), etc., with each pattern of effects constituting an immune signature for the disease. For some diseases (e.g. cervical cancer [11]), immune signatures have shown potential as predictors of survival or other clinical outcomes. Unfortunately, identifying papers containing human immune signatures, and locating those immune signatures within publications, is non-trivial. Immune signatures can appear in the text, figures, or tables, with dozens of distinct signatures in a single publication, and may not be presented as the principal finding. We developed a semi-automated pipeline (Figure 1) that utilizes human-in-the-loop learning. As part of this pipeline we created and validated a literature classifier that uses abstracts and titles to retrieve papers likely to contain human COVID-19 immune signatures from a corpus of scientific literature. The pipeline then uses author solicitation: authors are asked to fill out a structured form describing the immune signature(s) in their papers. Author-supplied signatures from over thirty such papers are available on our website at covid-signatures.org.

Generic online platform. We developed a general-purpose online literature review, author solicitation, and information sharing platform powered by the Django web framework (djangoproject.com) for templating and user management, a MongoDB database backend (mongodb.com), Bootstrap (getbootstrap.com) for layout, and jQuery (jquery.com) for streamlined scripting (Figure 1). The pipeline logs timestamps and change history, including authenticated user IDs associated with the changes, to allow auditing and error recovery.
To respect website visitor privacy, no cookies or other tracking mechanisms are sent to users who are not explicitly logged in; cookies are used only to preserve authentication status across page loads. In particular, data providers who enter data using a special emailed link or via the unsolicited data entry process are not considered logged in and are not sent any browser tracking information. This generic platform is freely available at github.com/mcdougallab/pipeline. To adapt it for COVID-19 immune signatures, a JSON-encoded configuration file was used to specify database details, paper categories, explanatory text, data solicitation forms, email templates, etc.

Reviewer interface. Expert review was performed using the aforementioned platform (Figure 1). A limited set of rules based on whole-word matching (e.g. the word "patients" implies that the paper studied humans) was used to tag the abstracts so that reviewers could examine papers by tag if desired (see supplementary material Table 1). Three expert reviewers, each with at least five years of graduate training in computational immunology, examined the papers in the queue to determine whether or not they contained immune signatures and, if so, of what type. The reviewers were presented with a title that links to the paper's full text, the abstract, selected metadata, and buttons to indicate their conclusions. To support reviewer corrections to automatic database population, an edit button allowed changes to the title and URL, which were then pushed to the server via an AJAX call. For papers with a COVID-19 immune signature, reviewers were asked to choose from three broad classes of immune signatures. We included two additional review queues: "let's discuss" for papers where the category was not obvious, and "review article" for work that may describe a human immune signature but is not the primary source. An additional, auto-saved notes field allowed reviewers to make notes for themselves and for any future discussion.
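The whole-word tagging rules described above can be sketched as follows. This is a minimal illustration: the rule set here is hypothetical (the actual list is in supplementary Table 1), and only the whole-word-matching mechanism follows the text.

```python
import re

# Hypothetical subset of the whole-word tagging rules; the real rule list
# used by the pipeline is given in supplementary material Table 1.
TAG_RULES = {
    "human": ["patients", "patient"],
    "mouse": ["mice", "murine"],
}

def tag_abstract(abstract):
    """Return the set of tags whose trigger words appear as whole words."""
    tags = set()
    for tag, words in TAG_RULES.items():
        for word in words:
            # \b anchors enforce whole-word matching, so "patients" does not
            # fire inside a longer token such as "impatients"
            if re.search(r"\b" + re.escape(word) + r"\b", abstract, re.IGNORECASE):
                tags.add(tag)
    return tags
```

Reviewers could then filter the queue by any of the tags returned for each abstract.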
After 288 papers were reviewed, we tasked one of the expert reviewers with re-reviewing the papers to identify key words from the abstract that, in their judgment, made it more (e.g. "IL-8") or less (e.g. "influenza") likely that a paper would contain a COVID-19 immune signature. These identified terms were highlighted in the abstracts during the review phase (see supplementary material Table 2).

For interoperability with other COVID-19 literature analysis efforts through the use of shared identifiers, we leveraged the Allen Institute's COVID-19 Open Research Dataset (CORD-19) [12]. CORD-19 provides a corpus with clear information retrieval benchmarks (see the TREC-COVID challenge [4, 13, 14]), standardized machine-readable data, and a SPECTER document embedding of each title and abstract [15]. To focus on primary sources that could contain COVID-19 immune signatures, we filtered this dataset to exclude:
• PubMed papers with a "Comment", "Review", "Editorial", or "News" article type.
• Papers from journals whose title includes "rev" as a whole word or as the start of a word, to avoid review journals.
• Papers published before December 1, 2019.
• Papers that do not explicitly mention "COVID" or a related term (e.g. "2019-nCoV") in the paper title or abstract.
CORD-19 is regularly updated (often weekly); we use these updates to add new papers to our pipeline. The results presented here were obtained with the CORD-19 release of 8 November 2021. Duplicate and near-duplicate papers frequently end up in the corpus, as many papers are released on preprint services before publication in a journal and CORD-19 includes both preprint and journal publications. The CORD-19 unique identifier is linked to a conceptual document, which may include multiple versions of the manuscript. Near-duplicate papers were identified using the SPECTER embedding, with documents within a certain distance grouped together.
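The four exclusion criteria above can be expressed as a single predicate over a metadata record. The sketch below assumes a simplified record layout (`article_types`, `journal`, `publish_date`, `title`, `abstract`); the real CORD-19 metadata schema differs in detail.

```python
import re
from datetime import date

# Related-term list is illustrative; the paper's full synonym list is not given.
COVID_TERMS = re.compile(r"\b(covid|2019-ncov|sars-cov-2)\b", re.IGNORECASE)
# "rev" as a whole word or as the start of a word flags likely review journals.
REVIEW_JOURNAL = re.compile(r"\brev", re.IGNORECASE)
EXCLUDED_TYPES = {"Comment", "Review", "Editorial", "News"}

def keep_paper(paper):
    """Apply the four exclusion criteria to one (simplified) metadata record."""
    if EXCLUDED_TYPES & set(paper.get("article_types", [])):
        return False
    if REVIEW_JOURNAL.search(paper.get("journal", "")):
        return False
    if paper["publish_date"] < date(2019, 12, 1):
        return False
    text = paper.get("title", "") + " " + paper.get("abstract", "")
    return bool(COVID_TERMS.search(text))
```

Running this predicate over each record yields the filtered corpus that feeds the rest of the pipeline.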
Trial and error showed that a threshold of 30 units in the SPECTER embedding space excluded most duplicates while preserving distinct articles. Within a given group, only the most recently released paper was used for further analysis. There are near duplicates in the corpus where one of the duplicates is missing metadata and the other is not; this can lead to one being filtered from the dataset by our preprocessing while the other is not. For this reason, we also remove entries with near duplicates in our excluded set.

Two-stage SVM classifier. We developed a two-stage Support Vector Machine (SVM) classifier for the filtered CORD-19 literature, using sklearn version 0.24.2 [16] with Python 3.6.10. For the first stage, the SVM model (polynomial kernel of degree 4) simply seeks to determine whether a paper contains an immune signature or not; this SVM was trained by grouping the three immune signature classes into one super class and the "review article" and "no signature" classes into another. A second SVM model (polynomial kernel of degree 5), trained only on the papers confirmed by our expert reviewers to contain a COVID-19 immune signature, was used to predict the type of immune signature present for those papers predicted by the first classifier to contain one. Probabilities were obtained from the SVM classifiers using Platt scaling [17]. Both SVM models were trained and applied on the SPECTER embeddings of the title and abstract, not directly on the text. We store both the paper metadata and the predicted probabilities in the database.

Once papers containing COVID-19 immune signatures had been identified, either manually or automatically, we contacted the corresponding author(s) to request details of the immune signatures in the paper, corresponding to our data model (Figure 2). Our platform provides a form with a unique URL for each entry in the database.
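The two-stage classifier can be sketched with sklearn as below. The kernel degrees and Platt scaling follow the text; all other hyperparameters, the helper names, and the synthetic label encoding are assumptions, not the authors' configuration.

```python
import numpy as np
from sklearn.svm import SVC

def train_two_stage(embeddings, has_signature, signature_type):
    """embeddings: (n, d) SPECTER vectors; has_signature: 0/1 labels;
    signature_type: class labels, meaningful only where has_signature == 1."""
    # Stage 1: does the paper contain an immune signature at all?
    # probability=True gives Platt-scaled probabilities.
    stage1 = SVC(kernel="poly", degree=4, probability=True)
    stage1.fit(embeddings, has_signature)
    # Stage 2: which type of signature, fit only on confirmed positives.
    mask = has_signature == 1
    stage2 = SVC(kernel="poly", degree=5, probability=True)
    stage2.fit(embeddings[mask], signature_type[mask])
    return stage1, stage2

def predict(stage1, stage2, embeddings, threshold=0.5):
    p_sig = stage1.predict_proba(embeddings)[:, 1]
    types = stage2.predict(embeddings)
    # A type prediction is only meaningful where stage 1 fires.
    return p_sig, np.where(p_sig >= threshold, types, None)
```

Both the probability and the predicted type would then be stored in the database alongside the paper metadata.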
This form identifies two classes of data: data that are global and apply to the entire form, and data that pertain to a specific fact about the paper (in our case, a specific immune signature), of which there can be many. The field names are configurable via the JSON configuration file, but for this project the form asks for global data identifying the paper and contributor (reference, contributor, organization, and email address), and specific instance data about each immune signature (description, location in the paper, tissue, immune exposure, cohort, comparison, any repository ID, analysis platform, response components, and response direction). Once the global data is entered, a button on the data entry form (visible to authorized, authenticated users only) generates an email from a template and the field data. The email recipients receive a link to a page for just their specific paper, which does not require a login. Each field on the form has associated help text and examples. All fields are editable by the email recipient except for the paper reference. An arbitrary number of immune signatures may be entered for each paper. When the contributors press the "submit" button, the entered data is stored in the database and logged in a separate file, allowing administrators to revert to a previous version in the case of accidental or malicious changes after initial data entry. A typical data entry form is shown in Figure 2. To allow third-party manual solicitations, a "submit your immune signatures" button on the covid-signatures.org homepage opens an entry form that is the same as the one seen by solicited contributors, except without the pre-filled global fields and with an editable paper reference field. These entries are assigned an automatically generated internal identifier, which the website administrators can later map to a CORD-19 identifier for entry into the pipeline for analysis.
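A configuration for this platform might look like the sketch below. The field names follow those listed in the text, but the key names and overall layout are hypothetical; the actual schema is defined by the code at github.com/mcdougallab/pipeline.

```json
{
  "database": {"uri": "mongodb://localhost:27017", "name": "covid_signatures"},
  "categories": ["type A", "type B", "type C", "review article", "no signature"],
  "global_fields": ["reference", "contributor", "organization", "email"],
  "instance_fields": ["description", "location", "tissue", "immune_exposure",
                      "cohort", "comparison", "repository_id",
                      "analysis_platform", "response_components",
                      "response_direction"],
  "email_template": "templates/solicitation.txt"
}
```

Separating global from per-signature fields in the configuration is what allows an arbitrary number of immune signatures per paper while keeping one contributor record per form.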
We used sklearn 0.24.2 [16] to perform k-means clustering and for the SVM and logistic regression classifiers, with tolerance set to 10⁻⁷. Uniform manifold approximation and projection (UMAP) 0.4.6 [19] was used to visualize the clusters. SciPy 1.5.4 [20] was used for statistical tests. To evaluate the classifier, we used the Natural Language Toolkit (NLTK) 3.5 [21] for tokenization, excluding English stop words and words with fewer than three characters. WordNet [22] was used to lemmatize the words. TF-IDF computation was facilitated by the sklearn package, including 1- and 2-grams. The logistic regression classification used an inverse regularization strength of 50. When comparing word frequencies, we excluded words that occurred fewer than five times in the titles and abstracts of the selected papers. We used Gensim 3.8.3 [23] to perform Latent Dirichlet Allocation (LDA) [24] with 1000 iterations and 100 passes on the filtered CORD-19 abstracts and titles, excluding words that occurred in more than 80% or fewer than 5% of abstracts.

Our overall workflow involved the development of a training set of papers containing COVID-19 immune signatures, SPECTER- and SVM-powered identification of papers likely to have these immune signatures, expert review of a subset of papers, data solicitation from the authors, and data dissemination on the covid-signatures.org site (summarized in Figure 1). As we envision this workflow generalizing to other semi-structured data acquisition efforts, we developed a generic online platform to streamline its application. The steps in the pipeline correspond to those of a generic web service pipeline originally developed for ModelDB, a computational neuroscience model repository. Data acquisition utilized CORD-19, which required preprocessing, including substantial filtering of the dataset and removal of duplicate entries.
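The TF-IDF baseline can be sketched as below. The paper used NLTK tokenization with WordNet lemmatization; to keep this example self-contained, a plain regex tokenizer with a tiny stand-in stop list replaces that step, while the n-gram range, the C = 50 inverse regularization strength, and the 10⁻⁷ tolerance follow the text.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in for NLTK stop-word removal; the paper used NLTK's English list.
_STOP = {"the", "and", "was", "were", "with", "for", "that", "this"}

def tokenize(text):
    # Lowercase, keep alphabetic tokens of three or more characters,
    # drop stop words (no lemmatization in this simplified version).
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if len(t) >= 3 and t not in _STOP]

# TF-IDF over 1- and 2-grams, then logistic regression with C = 50.
model = make_pipeline(
    TfidfVectorizer(tokenizer=tokenize, token_pattern=None, ngram_range=(1, 2)),
    LogisticRegression(C=50, tol=1e-7, max_iter=1000),
)
```

Fitting this pipeline on labeled titles and abstracts gives the TF-IDF comparison classifier evaluated against the SPECTER-based SVM.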
Triage divides the articles into one of three relevant or two not-relevant classes, which are then used to build a classifier that provides further articles for triage as well as papers directly for solicitation. Solicitation is done in a semi-automated fashion using an email template and an online form for the author to complete. After solicitation, the information is made publicly available via covid-signatures.org.

(Figure 3). This significant sub-clustering suggests that the SPECTER embedding effectively preserves information on whether or not a paper contains an immune signature. These results suggest that we can use the SPECTER embedding as the basis to construct a classifier to identify papers with COVID-19 immune signatures.

Inter-rater reliability of signature presence. To test the reliability of our expert reviews, we selected 100 articles at random from the set of papers to be reviewed independently by two reviewers. The two reviewers had 92% agreement in determining whether a paper contained a COVID-19 immune signature. Cohen's kappa coefficient [26], a robust measure of inter-rater reliability that accounts for chance agreement, shows very good agreement (0.84; 95% confidence interval 0.73-0.95). Papers where reviewers did not agree on the presence of an immune signature were not included in the training set for the classifier.

Classifier bootstrap and performance. An SVM classifier [16] was fit to the 316 papers unambiguously identified by reviewers as either containing or not containing immune signatures. To achieve reliable classifier predictions, we iteratively se-

We evaluated the performance of the SVM classifier on our training set using leave-one-out cross-validation (LOOCV). The receiver operating characteristic (ROC) area under the curve (AUC) was 0.916 (Figure 5A).
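The inter-rater agreement statistics above can be computed directly with sklearn. The double-reviewed labels themselves are not public, so the labels below are synthetic and illustrative only.

```python
from sklearn.metrics import cohen_kappa_score

# Synthetic binary labels (signature present = 1) from two reviewers;
# the real study used 100 double-reviewed papers.
reviewer_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
reviewer_b = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

# Raw percent agreement, as reported alongside kappa in the text.
agreement = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / len(reviewer_a)

# Cohen's kappa corrects the raw agreement for chance agreement
# derived from each reviewer's marginal label frequencies.
kappa = cohen_kappa_score(reviewer_a, reviewer_b)
```

For these synthetic labels the raw agreement is 0.9 while kappa is lower (about 0.78), illustrating how kappa discounts agreement expected by chance.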
With a probability threshold of 0.5, the SVM classifier had a PPV of 86%, with accuracy 85%, sensitivity 92%, selectivity 74%, and F1 score 89% (Table 1).

We evaluated whether the SPECTER embedding was essential for achieving the high performance of the classifier. We compared the SVM classifier using the SPECTER embeddings to the widely used approach of Term Frequency-Inverse Document Frequency (TF-IDF) [27] over the titles and abstracts, with both an SVM classifier and logistic regression. TF-IDF with an SVM classifier achieved an AUC of 0.928 using LOOCV, which was very similar to the SPECTER approach. Likewise, TF-IDF with logistic regression had similar performance, with an AUC of 0.928 (Figure 5A). As these alternative approaches performed similarly, we chose to use the SPECTER embedding for the rest of this study, as it was supplied with the CORD-19 dataset and did not require any additional text processing. Independent expert review of articles showed substantial agreement between the three immune signature classes (Table 3).

Features driving the classification. To discover which features our classifiers were using to determine the presence of an immune signature, we compared papers with high and low probabilities of containing COVID-19 immune signatures. For the high-probability papers, we selected the 500 articles with the highest predicted probability of containing an immune signature based on the classifier. As a comparison group, we selected 500 papers with a low probability (~0.1) of containing an immune signature. This low probability was chosen for the comparison set to avoid selecting marginal papers from the corpus and those not written in English. A log-likelihood comparison of word frequencies [28] between these two groups identified 171 words with significantly different frequencies, of which 38% were more frequent in papers containing immune signatures (χ² tests with 1 degree of freedom, at a 5% threshold adjusted for 876 comparisons).
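The LOOCV evaluation of the stage-1 classifier can be sketched as below: each paper is scored by a model trained on all the others, and the held-out Platt-scaled probabilities feed the ROC AUC. The data in the usage test is synthetic; the paper reports an AUC of 0.916 on its 316-paper training set.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def loocv_auc(embeddings, labels):
    """Leave-one-out ROC AUC for the degree-4 polynomial SVM; other
    hyperparameters are sklearn defaults, not the authors' settings."""
    clf = SVC(kernel="poly", degree=4, probability=True)
    # cross_val_predict with LeaveOneOut scores each sample with a model
    # trained on the remaining n - 1 samples.
    probs = cross_val_predict(clf, embeddings, labels, cv=LeaveOneOut(),
                              method="predict_proba")[:, 1]
    return roc_auc_score(labels, probs)
```

Sweeping a threshold over the same held-out probabilities yields the PPV, sensitivity, and selectivity reported at threshold 0.5.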
The top 10 differences are shown (supplementary material Table 3) and are consistent with the focus of the classifier on human immunology, e.g. "patient", "health", "cell", and "severe". To further interrogate the classifier, we applied LDA topic modeling to the CORD-19 corpus, filtered to include only potentially relevant entries as described in the methods (Figure 6). When comparing the sets of papers described above, papers predicted to contain immune signatures were predominantly related to topics 6 and 8, which appear to describe immunological and clinical work. There were significant differences in the topic composition across all topics (Mann-Whitney U tests at the Dunn-Šidák corrected 5% significance level for 10 tests).

Next, we consider which features may be driving the "second stage" classifier in assigning different signature types. We chose a random sample of 500 papers most likely to con- (supplementary material Table 3), with bold typeface indicating that they occurred with greater relative frequency in that sample. These word frequencies suggest that one way the classifier predicts that a paper contains an immune signature may be the presence of certain words associated with immune signatures (e.g. "patient", "cell", etc.). In contrast, the type of immune signature may be determined by a reduction in the frequency of confounding words, such as "patient", "clinical", etc., for type A immune signatures. We performed LDA topic modeling for the different types of immune signatures. While papers with all types of signatures were predominantly associated with topics 6 and 8, there were significant differences between them, with type A being 61% topic 6, type B 12%, and type C only 3%. Topic modeling of the papers for which we solicited further information showed they were predominantly topics 6 and 8 (Figure 6D).
There were significant differences in the contribution of topic 8 between papers where authors responded and papers where they did not, at the 5% level with Dunn-Šidák correction for 10 comparisons (Mann-Whitney U: topic 6, p=0.015; topic 8, p=0.0045).

We used SPECTER and machine learning to analyze titles and abstracts of papers from CORD-19 to predict the pres-

To evaluate which features might be driving the classifier, we considered the word frequencies and performed LDA topic modeling (Figure 6). Word frequencies suggested that the type of immune signature may be determined by the absence of confounding words, particularly for the broader signature types B and C, where the significant differences in word frequencies were mostly due to words occurring less frequently [32]. Thus, signatures included references to "NK cell", "CD3 T cells", and "gammadelta T cells", which could then be mapped to terms from the Cell Ontology: "natural killer cell" (CL:0000623), "T cell" (CL:0000084), and "gamma-delta T cell" (CL:0000798), respectively. This gives us a data set that represents immune signatures as they are likely to appear in publications, which can assist in future development of methods to detect immune signatures and their component parts. We focused the machine learning and natural language processing on the paper titles and abstracts, but there is more information available in the full text.

Identifying papers containing COVID-19 immune signatures and collecting data and contextual information (metadata) regarding the signatures can help speed scientific progress in this area. As has been seen with vaccination [35] and inflammation [36], providing access to such a resource supports secondary and comparative analyses, resulting in a broader understanding of immune system response.
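The Dunn-Šidák corrected comparison above can be reproduced as follows. The per-group topic weights here are hypothetical stand-ins (the real per-paper topic compositions are not reproduced in the text), but the corrected threshold for 10 comparisons is exactly the one used for the reported p-values.

```python
from scipy.stats import mannwhitneyu

def sidak_threshold(alpha, m):
    """Dunn-Sidak corrected per-test significance level for m comparisons:
    1 - (1 - alpha)^(1/m), which keeps the family-wise error rate at alpha
    for independent tests."""
    return 1 - (1 - alpha) ** (1 / m)

# For the 10 topic comparisons, the per-test threshold is ~0.00512.
threshold = sidak_threshold(0.05, 10)

# Hypothetical topic-8 weights for responding vs. non-responding authors;
# the paper's real comparison gave p = 0.0045 for topic 8.
responders = [0.42, 0.55, 0.38, 0.61, 0.47, 0.52, 0.58, 0.44]
nonresponders = [0.21, 0.33, 0.18, 0.29, 0.25, 0.31, 0.27, 0.35]
stat, p = mannwhitneyu(responders, nonresponders, alternative="two-sided")
```

Note that the reported topic 6 p-value (0.015) exceeds this corrected threshold, which is why only the topic 8 difference is called significant.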
We have demonstrated that this pipeline approach (Figure 1) is able to identify and classify relevant papers from a large and varied corpus (Figure 5), starting from only a few examples. Our method has also shown that authors are often willing to provide useful clinical and contextual information about their work. This paper describes the development of a pipeline to retrieve papers containing COVID-19 immune signatures and for their semi-automated curation. Within this pipeline, we incorporated machine learning to classify papers from a bootstrapped sample of an existing repository (CORD-19). We found that the SPECTER embedding provides a good reduced representation of a paper and its relatedness to other papers, which can be adopted for the purpose of identifying scientifically salient features of the paper (in this case, immune signatures). However, SPECTER was not a necessary component, as TF-IDF with logistic regression had similar performance to the SPECTER approach. Thirty-one percent of authors of papers with immune signatures voluntarily provided semistructured representations in response to a request from our team, regardless of the immune signature type. Given its start as a neuroinformatics tool, the successful application to COVID-19 demonstrates that this pipeline approach is readily adaptable for other fields to identify papers containing scientifically relevant features, which can be further processed (by data solicitation, manual curation, or automated means) to extract the relevant data for presentation in a unified knowledge base or dashboard.

Table 3. Differences in relative frequencies between papers predicted to have COVID-19 immune signatures, by signature type. The ten words with the largest differences in relative word frequencies are shown. Bold text indicates the word had a higher relative frequency in the set; light text that it was less frequent. Relative frequencies are given as the occurrences of the word as a percentage of the corpus.
References.
National Science Foundation National Science Board. Publication output: US trends and international comparisons.
Modern information retrieval.
On the current state of scholarly retrieval systems. Engineering, Technology & Applied Science Research.
TREC-COVID: rationale and structure of an information retrieval shared task for COVID-19.
Pipeline to promote discovery and sharing of computational neuroscience research.
UniProt: a worldwide hub of protein knowledge. Nucleic Acids Research.
Automated metadata suggestion during repository submission.
Pretraining to recognize PICO elements from randomized controlled trial literature.
Identification of a prognostic immune signature for cervical cancer to predict survival and response to immune checkpoint inhibitors.
The COVID-19 Open Research Dataset. ArXiv.
TREC-COVID: constructing a pandemic information retrieval test collection.
A comparative analysis of system features used in the TREC-COVID information retrieval challenge.
SPECTER: Document-level representation learning using citation-informed transformers.
Scikit-learn: Machine learning in Python.
Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.
UMAP: Uniform manifold approximation and projection for dimension reduction.
SciPy 1.0: fundamental algorithms for scientific computing in Python.
NLTK: the Natural Language Toolkit.
WordNet: An Electronic Lexical Database. Language, Speech, and Communication.
Software Framework for Topic Modelling with Large Corpora.
A new look at the statistical model identification. Automatic Control.
A coefficient of agreement for nominal scales.
A statistical interpretation of term specificity and its application in retrieval.
Comparing corpora using frequency profiling.
Twenty years of ModelDB and beyond: building essential modeling tools for the future of neuroscience.
Turning the tide of data sharing.
Construction of the literature graph in Semantic Scholar.
Reporting and connecting cell type names and gating definitions through ontologies.
Text miner for hypergraphs using output space sampling.
A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature.
Computational resources for high-dimensional immune analysis from the Human Immunology Project Consortium.
Compendium of immune signatures identifies conserved and species-specific biology in response to inflammation.

Acknowledgments. We thank Daniel Chawla and Bram Gerritsen for reviewing papers and providing feedback about the pipeline interface. S.H.K. receives consulting fees from Peraton. All other authors report no conflicts of interest.

Table 2. Terms identified by a reviewer that we highlight in the interface to facilitate classification.