An Evaluation of Two Commercial Deep Learning-Based Information Retrieval Systems for COVID-19 Literature
Sarvesh Soni; Kirk Roberts
J Am Med Inform Assoc, 2020-11-17. DOI: 10.1093/jamia/ocaa271

The COVID-19 pandemic has resulted in a tremendous need for access to the latest scientific information, leading to both corpora for COVID-19 literature and search engines to query such data. While most search engine research is performed in academia with rigorous evaluation, major commercial companies dominate the web search market. Thus, it is expected that commercial pandemic-specific search engines will gain much higher traction than academic alternatives, leading to questions about the empirical performance of these tools. This paper seeks to empirically evaluate two commercial search engines for COVID-19 (Google and Amazon) in comparison to academic prototypes evaluated in the TREC-COVID task. We performed several steps to reduce bias in the manual judgments to ensure a fair comparison of all systems. We find that the commercial search engines sizably under-performed those evaluated under TREC-COVID. This has implications for trust in popular health search engines and for developing biomedical search engines for future health crises.

combining term- and neural-based retrieval models by balancing memorization and generalization dynamics [18].

We use the topics from Round 1 of the TREC-COVID challenge for our evaluation [2, 3]. These topics are information need statements for important COVID-19 topic areas. Each topic consists of three fields with increasing granularity: a (keyword-based) query, a (natural language) question, and a (longer descriptive) narrative. Four example topics are presented in Table 1. Participants return a "run" consisting of a ranked list of documents for each topic. Round 1 used 30 topics and evaluated against the April 10, 2020 release of CORD-19.

Table 1. Four example topics from Round 1 of the TREC-COVID challenge. A category is assigned (for this paper, not TREC-COVID) to each topic based on both the topic's research field and function, which allows us to classify the performance of the systems on certain kinds of topics.

Query: coronavirus social distancing impact
Question: has social distancing had an impact on slowing the spread of COVID-19?
Narrative: seeking specific information on studies that have measured COVID-19's transmission in one or more social distancing (or non-social distancing) approaches.

We use the question and narrative fields to query the systems, following the companies' recommendations to use fully formed queries with questions and context. We use two variants for querying the systems: question only and question+narrative. As we accessed these systems in the first week of May 2020, the systems could have been using the latest version of CORD-19 available at that time (the May 1 release). Thus, we filter the result lists, only including documents from the April 10 release.

We compare the performance of the Amazon and Google systems with the five top submissions to TREC-COVID Round 1 (on the basis of bpref scores). It is valid to compare the Amazon and Google systems with the submissions from Round 1 because all these systems are similarly built without using any relevance judgments from TREC-COVID. Relevance judgments (or assessments) for TREC-COVID are carried out by individuals with biomedical expertise.
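To make the filtering step concrete, below is a minimal sketch of restricting a retrieved run to documents present in the April 10, 2020 CORD-19 release. It assumes the release's metadata.csv file (which keys papers by the cord_uid column) and a run stored in the standard TREC run format; the file names are hypothetical, not those used in the study.

```python
import csv

# Hypothetical paths; the April 10, 2020 CORD-19 release ships a
# metadata.csv with one row per paper, keyed by the cord_uid column.
METADATA_APRIL_10 = "cord19_2020-04-10/metadata.csv"
RAW_RUN = "amazon_question_narrative.run"          # results as retrieved
FILTERED_RUN = "amazon_question_narrative.filtered.run"

# Collect the document identifiers present in the April 10 release.
with open(METADATA_APRIL_10, newline="", encoding="utf-8") as f:
    april10_ids = {row["cord_uid"] for row in csv.DictReader(f)}

# Keep only results whose documents appear in the April 10 release,
# preserving the original ranking order within each topic.
with open(RAW_RUN) as fin, open(FILTERED_RUN, "w") as fout:
    for line in fin:
        # Assumed standard TREC run format: topic Q0 doc_id rank score tag
        doc_id = line.split()[2]
        if doc_id in april10_ids:
            fout.write(line)
```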
Pooling is used to select documents for assessment, consisting of the top-ranked results from different submissions. A document is judged as relevant, partially relevant, or not relevant. Since the two evaluated systems did not participate in the pooling, the official TREC-COVID judgments do not include many of their top documents. It has recently been shown that pooling effects can negatively impact post-hoc evaluation of systems that did not participate in the pooling [19]. Therefore, to create a level ground for comparison, we performed additional relevance assessments for the evaluated systems such that the top 10 documents from all the commercial runs are judged (following the pooling strategy of TREC-COVID for the submitted runs with priority 1). In total, 141 documents were assessed by two individuals involved in performing the relevance judgments for TREC-COVID.

For the error analysis, each document is assigned one of five categories: NA to COVID-19 (not applicable), tangential (not relevant at all), partially tangential (not relevant but there is a common link with the topic, e.g., quarantine), partially relevant (answers only a part of the topic), and relevant (provides an answer to the topic). We keep the partially relevant category because the documents selected for error analysis were previously judged as either partially relevant or not relevant.

The number of documents used for each topic (topic-minimums) is shown in Figure 1. On average, approximately 43 documents are evaluated per topic, with a median of 40.5. This is another reason for using a topic-wise minimum rather than cutting all systems off at the lowest return count across topics (which would be 25 documents). Having a topic-wise cut-off allowed us to evaluate runs with the maximum possible depth while keeping the evaluation fair. The topic-wise counts of newly annotated documents for relevance and error analysis are also included in Figure 1.

The evaluation results of our study are presented in Table 2. Among the commercial systems that we evaluated, the Amazon question+narrative variant consistently performed better than any other variant on all measures other than bpref. For bpref, the Google question-only variant performed best. Note that the best run from TREC-COVID (a run from the sabir team), after cutting using topic-minimums, still performed better than the other four TREC-COVID runs included in our evaluation. Interestingly, this best run also performed substantially better than all the variants of both commercial systems on all calculated metrics. We discuss this system further below.

The performances of all the commercial runs and the best run from TREC-COVID (referred to here as "sabir") on standard evaluation metrics, broken down by topic category, are shown in Figure 2. The Amazon system performed better than the Google system on almost all topic categories. Among the function-based categories, all systems performed best on "Treatment", whereas among the research field-based categories the best results differed between TREC-COVID and the commercial runs (sabir performed best on the "Clinical" category, while most of the commercial variants performed best on "Biological"). Sabir consistently outperformed the commercial system variants on all categories except "Biological" (among the research field categories) and "Effect" (among the function categories), in both of which a commercial system had an edge.

The results of our error analysis are shown in Figure 3. The commercial systems made about twice as many tangential errors as sabir.
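The topic-wise cutoff and the reported metrics (bpref, among others) can be reproduced along the lines of the following sketch. It assumes runs and relevance judgments are held in plain Python dictionaries and uses the pytrec_eval package for scoring; the data structures are hypothetical stand-ins for the actual TREC-COVID run and qrels files.

```python
import pytrec_eval


def cut_to_topic_minimums(runs):
    """Truncate every run to the smallest number of results any run
    returned for each topic (the topic-wise minimum described above)."""
    # runs: {run_name: {topic_id: [(doc_id, score), ...] in rank order}}
    topics = {t for run in runs.values() for t in run}
    minimums = {
        t: min(len(run.get(t, [])) for run in runs.values()) for t in topics
    }
    return {
        name: {t: ranked[: minimums[t]] for t, ranked in run.items()}
        for name, run in runs.items()
    }


def evaluate_run(run, qrels, measures=("bpref", "map", "ndcg")):
    """Score one (already truncated) run with trec_eval measures."""
    # qrels: {topic_id: {doc_id: graded relevance as an int}}
    evaluator = pytrec_eval.RelevanceEvaluator(qrels, set(measures))
    run_dict = {
        t: {doc_id: float(score) for doc_id, score in ranked}
        for t, ranked in run.items()
    }
    per_topic = evaluator.evaluate(run_dict)
    # Macro-average each measure over topics, as in standard TREC reporting.
    return {
        m: sum(scores[m] for scores in per_topic.values()) / len(per_topic)
        for m in measures
    }
```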
The commercial variants with the narrative field made slightly more errors in the first three categories than the corresponding variants with only the question. Note that the number of documents annotated as relevant during the error analysis is roughly the same for all the systems (and thus does not create an unfair situation for any particular system).

We evaluated two commercial IR systems targeted toward CORD-19. For comparison, we also included the five best runs from TREC-COVID. We annotated an additional 141 documents from the commercial system runs to ensure a fair comparison with the TREC-COVID runs. We found that the best system from TREC-COVID in terms of bpref outperformed all commercial system variants on all evaluated measures. We broke down the system performances by different categories of topics and further annotated a set of 660 documents to conduct an error analysis.

The commercial systems often employ cutting-edge technologies, such as Amazon Comprehend Medical (ACM) and BERT, as part of their pipelines. Also, the availability of computational resources such as CPUs and GPUs may be better in industry than in academic settings. This follows a common concern in academia, namely that the resource requirements for advanced machine learning methods (e.g., GPT-3 [20]) are well beyond the capabilities available to the vast majority of researchers. Instead, these results demonstrate the potential pitfalls of deploying a deep learning-based system without proper tuning. The sabir (sab20.*) system does not use machine learning at all: it is based on the very old SMART system [21] and does not utilize any biomedical resources. It is instead manually tuned based on an analysis of the data fields available in CORD-19. In subsequent rounds of TREC-COVID, sabir has since been overtaken by systems that do use machine learning with relevant training data. The lesson, then, for future emerging health events is that deploying "state-of-the-art" methods without event-specific data may be dangerous, and in the face of uncertainty simple may still be best. On the other hand, the strengths of the commercial systems must be acknowledged: they are capable of serving large numbers of users, can be rapidly disseminated, and, while their performance suffers compared to simpler systems, it is good enough that they are still likely "useful", though this term is much-debated in IR research.

As evident from Figure 1, many documents retrieved by the commercial systems were not part of the April 10 CORD-19 release. We queried these systems after another version of the CORD-19 dataset was released, which may have led to the retrieval of more articles from the newer release. However, the relevance judgments used here are from the initial rounds of TREC-COVID and thus would not include all documents from the latest version of CORD-19. Thus, for a fair comparison, we pruned the document lists and performed additional relevance judgments. We have included in the supplemental material the evaluation results that would have been obtained without our modifications; these make the commercial systems look far worse. Yet, as addressed above, that would not have been a "fair" comparison, and thus the corrective measures described above were necessary to ensure a scientifically valid comparison.

We evaluated two commercial IR systems against the TREC-COVID data. To facilitate a fair comparison, we cut all runs at topic-specific thresholds and performed additional relevance judgments beyond those provided by TREC-COVID.
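The additional-judgment step can be viewed as a shallow pool over the commercial runs. Below is a minimal sketch, assuming the runs and official qrels are already loaded into hypothetical Python dictionaries, of collecting the top-10 documents per topic that lack an official TREC-COVID judgment and therefore require a new manual assessment.

```python
def pool_unjudged_top_k(commercial_runs, official_qrels, k=10):
    """Collect, per topic, the top-k documents from each commercial run
    that have no official TREC-COVID judgment, so they can be sent for
    additional manual assessment (mirroring the priority-1 pooling depth)."""
    # commercial_runs: {run_name: {topic_id: [doc_id, ...] in rank order}}
    # official_qrels:  {topic_id: {doc_id: relevance}}
    to_assess = {}  # {topic_id: set of doc_ids needing a new judgment}
    for run in commercial_runs.values():
        for topic_id, ranked_docs in run.items():
            judged = official_qrels.get(topic_id, {})
            pool = to_assess.setdefault(topic_id, set())
            for doc_id in ranked_docs[:k]:
                if doc_id not in judged:
                    pool.add(doc_id)
    return to_assess
```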
We found that the top performing system from TREC-COVID remained the best performing system, outperforming the commercial systems on all metrics. Interestingly, this best performing run comes from a simple system that does not apply machine learning. Thus, blindly applying machine learning without specific labeled data (a condition that may be unavoidable in a rapidly emerging health crisis) may be detrimental to system performance.

The authors thank Meghana Gudala and Jordan Godfrey-Stovall for conducting the additional retrieval assessments. This work was supported in part by the National Science Foundation (NSF) under award OIA-1937136.

Figure 1. A bar chart with the number of documents for each topic as used in our evaluations (after filtering the documents based on the April 10 release of the CORD-19 dataset and setting a threshold at the minimum number of documents for any given topic). The total numbers of documents additionally annotated for relevance and error analysis are shown as circle and cross marks on the bars corresponding to each topic.

Figure 2. Analysis of system performances on the basis of different categories of topics. Research Field: categories based on the field of study in biomedical informatics. Function: categories based on the functional aspect of COVID-19 as expressed in the topic's information need.

Figure 3. Total number of documents retrieved by the systems (among the top 10 documents per topic) based on different categories of errors. NA to COVID-19: document not applicable to COVID-19. Partially Tangential: not relevant, but there is a common link with the topic (e.g., quarantine). Tangential: not relevant at all. Partially Relevant: answers only a part of the topic. Relevant: provides an answer to the topic.

The results without taking into account our additional annotations, i.e., using only the relevance judgments from TREC-COVID rounds 1 and 2, are presented in Table S1. Similarly, the results without setting an explicit threshold on the number of documents returned by the systems are shown in Table S2. The results without either of the two modifications are provided in Table S3. The additional 141 relevance assessments are made available as part of the supplemental data. The topic categories assigned to each topic with respect to research field and function are shown in Table S4.

The following are the error category annotation guidelines provided to the annotators. The guidelines were applied in two phases. In the first phase, a random set of 60 documents was annotated by both annotators (on a 6-level relevance scale); Cohen's kappa was 0.55. The relevance scale was revised into a 5-level scale during the reconciliation process (performed by an independent adjudicator). In the second phase, the remaining 600 documents were split between the two annotators.
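As a concrete illustration of the first-phase agreement computation, the sketch below computes Cohen's kappa for two annotators' labels over the same set of double-annotated documents, using scikit-learn; the label lists shown are hypothetical placeholders, not the actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 6-level labels (0-5) from the two annotators for the same
# double-annotated documents; the real annotations live in the
# supplemental data and are not reproduced here.
annotator_1 = [5, 4, 2, 0, 3, 1]  # one label per shared document
annotator_2 = [5, 3, 2, 1, 3, 1]

# Unweighted Cohen's kappa, treating the levels as nominal categories.
kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```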
Our assessments for error analysis (for a total of 660 documents) are available in the supplemental data. The relevance/non-relevance scale for error analysis has 5 levels:

- Do not choose this if the information is potentially applicable to COVID-19 when combined with other COVID-19-specific information.
- Positive example: choose this option in the following case:
  - Here, the study can be applicable to COVID-19 in the context of the given question.
- Study is tangential to the question's information needs.
  - Here, there is no common link between the question and the document (other than coronavirus).
- There is a common link between the question and the document other than COVID-19.
  - E.g., the question asks for best practices to quarantine, but the document talks about the incidence of infection in people under home quarantine.
    - Here, the common link is quarantine, i.e., both the question and the document talk about quarantine.
  - E.g., the question asks for the origin of the coronavirus, but the document is about strategies to find the origin of the coronavirus.
    - Here, the common link is the origin of the virus, i.e., both the question and the document discuss something about the origin of the virus.
  - E.g., the question asks for the origin of the coronavirus, but the document is about a study that finds no evidence that the virus was lab engineered.
- Same as TREC-COVID.
- Though all the documents were previously annotated as either "Partially Relevant" or "Not Relevant", an option to annotate a document as "Relevant" is available. This is because of the subjective nature of these judgments.

An example topic referenced in the guideline examples above:

Question: what are best practices in hospitals and at home in maintaining quarantine?
Narrative: seeking information on best practices for activities and duration of quarantine for those exposed to and/or infected with the virus.

References

CORD-19: The COVID-19 Open Research Dataset.
TREC-COVID: Rationale and Structure of an Information Retrieval Shared Task for COVID-19.
TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection.
Introducing medical language processing with Amazon Comprehend Medical.
Comprehend Medical: A Named Entity Recognition and Relationship Extraction Web Service.
Assessment of Amazon Comprehend Medical: Medication Information Extraction.
A Comparative Analysis of Speed and Accuracy for Three Off-the-Shelf De-Identification Tools.
Pre-training of Deep Bidirectional Transformers for Language Understanding.
Deep learning in clinical natural language processing: a methodical review.
Evaluation of Dataset Selection for Pre-Training and Fine-Tuning Transformer Language Models for Clinical Question Answering.
BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
Publicly Available Clinical BERT Embeddings.
A Pretrained Language Model for Scientific Text.
AWS launches machine learning enabled search capabilities for COVID-19 dataset.
An NLU-Powered Tool to Explore COVID-19 Scientific Literature.
Zero-shot Neural Retrieval via Domain-targeted Synthetic Query Generation. arXiv:2004.14503 [cs].
Characterizing Structural Regularities of Labeled Data in Overparameterized Models. arXiv:2002.03206 [cs, stat].
On the reliability of test collections for evaluating systems of different types.
Language Models are Few-Shot Learners. arXiv:2005.14165 [cs].
Implementation of the SMART information retrieval system.