Participation in TREC 2020 COVID Track Using Continuous Active Learning
Xue Jun Wang, Maura R. Grossman, and Seung Gyu Hyun
November 3, 2020

Abstract. We describe our participation in all five rounds of the TREC 2020 COVID Track (TREC-COVID). The goal of TREC-COVID is to contribute to the response to the COVID-19 pandemic by identifying answers to many pressing questions and by building infrastructure to improve search systems [8]. All five rounds of the Track challenged participants to perform a classic ad-hoc search task on the new CORD-19 data collection. Our solution addressed this challenge by applying the Continuous Active Learning model (CAL) and its variations. Our results placed us among the top-scoring manual runs, and we remained competitive within all categories of submissions.

As the spread of COVID-19 continues around the globe, researchers, clinicians, and policy makers involved in its response are constantly searching for reliable information on the virus. This presents those of us in the information retrieval (IR) and text-processing communities with a unique opportunity to contribute to the response to this pandemic by building infrastructure to improve search systems and by helping to identify answers to some of today's most pressing questions [8]. The task of TREC-COVID is for participants to retrieve the most relevant documents from the CORD-19 data-set for a given set of topics. To address this challenge, we implemented a system based on CAL, following the work of Grossman and Cormack [3, 4] and using the tool kit provided as part of the Baseline Model Implementation (BMI) created by Roegiest and Cormack [6], with ourselves serving as the human assessors.

In this section, we discuss prior research on CAL. We then discuss prior research on BMI, which provides the tool kits we relied on heavily for this challenge.

Continuous Active Learning (CAL). CAL is a method for finding virtually all relevant information on a particular subject within a vast sea of electronically stored information (ESI): it repeatedly refines its understanding of which of the remaining documents are most likely to be of interest, based on the user's feedback regarding the documents already judged [4]. This protocol is most famously used in technology-assisted review (TAR) for electronic discovery in legal matters, where it has achieved the best results reported in the scientific literature to date [2]. Building on the CAL protocol, many implementations, such as BMI, have been highly successful at ad-hoc retrieval tasks, for example in the TREC 2015/2016 Total Recall Tracks [7, 5] and the TREC 2019 Decision Track [1].

Baseline Model Implementation (BMI). BMI is an augmented version of CAL. It is autonomous and was initially made available to participants of the TREC 2015/2016 Total Recall Tracks [7, 5], as well as the TREC 2019 Decision Track [1], to provide a baseline for comparison. However, BMI turned out to be highly competitive, with none of the manual participants achieving consistently superior results to this fully automated method [6]. While BMI has been shown generally to outperform human-in-the-loop CAL implementations [6], it requires labelled data, which was very limited, if available at all, for TREC-COVID; thus, we chose to insert a human back into the loop to make judgements.
All other components, such as the creation of feature vectors and the learner, were taken directly from the BMI tool kits.

Document Set Processing. The document set used in the TREC-COVID Challenge is the COVID-19 Open Research Data-set (CORD-19). Our team opted to judge a document's relevance using only the information available in the metadata file (year, authors, publisher, title, abstract), based on the work of Zhang et al. [9], which shows that participants achieve higher recall with CAL when presented with only a single short excerpt rather than an entire document.

CAL. The following outlines our specific implementation of CAL.

• STEP 1: Create a hypothetical relevant document, known as a synthetic document. To create the synthetic documents, we concatenated the query, question, and narrative components of the topics file provided by TREC-COVID, as shown in Figures 1 and 2.

• STEP 2: Use a machine-learning algorithm to suggest the next most-likely relevant document. The machine-learning algorithm we chose is Sofia-ML, which Roegiest and Cormack used in their participation in the TREC 2015 Total Recall Track [6].

• STEP 3: Review the suggested documents and provide relevance feedback to the learning algorithm, indicating whether each suggested document is actually relevant or not. To do this, we sorted the results given by Sofia-ML in decreasing order of confidence and presented the top-most result to the human assessor through a text-based user interface. The judgement made by our human assessors is one of {0: not relevant, 1: partially relevant, 2: relevant}, corresponding to the annotations made by biomedical experts as part of TREC-COVID following each round. As Sofia-ML does not distinguish between relevant and partially relevant judgements, both were treated as relevant in training.

• STEP 4: Repeat Steps 2 and 3 until very few, if any, of the suggested documents are relevant. Using the same stopping condition as in [5], we aimed to stop when the following criterion was met:

    n ≥ a · m + b,

where m is the number of relevant documents reviewed, n is the number of irrelevant documents reviewed, a is a constant that determines how many non-relevant documents are to be reviewed in the course of finding each relevant document, and b is a constant that represents a fixed overhead for the number of irrelevant documents that must be reviewed.

S-CAL. One of the major drawbacks of the CAL method outlined above is the impractical number of documents that must be reviewed when the number of relevant documents is large. Scalable Continuous Active Learning (S-CAL) [3] addresses this issue by:

1. Segmenting the corpus into batches and allowing assessors to label only a small finite sample of documents from each successive batch; and

2. Temporarily augmenting each training set with a set of 100 random documents from the corpus (which, for a large corpus, are with high probability not relevant), labelled not relevant.

However, the stopping condition for S-CAL outlined in [3] was still infeasible to achieve given CORD-19 and our team size; thus, we exchanged the initial dynamic stopping condition for a static goal of assessing 300 documents per topic.
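To make the workflow above concrete, the following is a minimal sketch of the review loop and the stopping test, not the BMI code itself. The helper functions score_documents (standing in for Sofia-ML scoring over the feature vectors) and ask_assessor (standing in for our text-based UI) are hypothetical, and the constants a and b are illustrative placeholders rather than the values actually used.

```python
import random

def cal_review_loop(synthetic_doc, corpus, score_documents, ask_assessor,
                    a=1.0, b=100):
    """Minimal sketch of the review loop described above (not the BMI code).

    score_documents(training_set, candidates) -> list of (doc_id, confidence)
    ask_assessor(doc_id) -> 0 (not relevant), 1 (partially relevant), 2 (relevant)
    """
    # Step 1: seed the training set with the synthetic document, labelled relevant.
    training = [(synthetic_doc, 1)]
    unjudged = set(corpus)
    m = 0  # relevant (or partially relevant) documents reviewed
    n = 0  # non-relevant documents reviewed

    while unjudged:
        # S-CAL-style augmentation: temporarily add 100 random documents
        # labelled not relevant; for a large corpus these are almost surely
        # truly non-relevant.
        augmentation = [(d, 0) for d in random.sample(sorted(unjudged),
                                                      min(100, len(unjudged)))]

        # Step 2: score the unjudged documents and take the most-likely-relevant one.
        ranked = score_documents(training + augmentation, unjudged)
        top_doc, _ = max(ranked, key=lambda pair: pair[1])

        # Step 3: obtain a human judgement and feed it back to the learner.
        judgement = ask_assessor(top_doc)
        label = 1 if judgement in (1, 2) else 0  # partially relevant counts as relevant
        training.append((top_doc, label))
        unjudged.remove(top_doc)
        if label == 1:
            m += 1
        else:
            n += 1

        # Step 4: stop once few suggested documents are relevant, i.e. n >= a*m + b.
        if n >= a * m + b:
            break

    return training
```

Dropping the random augmentation inside the loop recovers the plain CAL procedure; keeping it reflects the S-CAL modification described above.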
Hyper-parameter Tuning. Given the availability of labelled data after the first round, we performed hyper-parameter tuning on both the loop type and the lambda value to better fit CORD-19. Finding no significant differences in our tests, we decided to continue with our initial values taken from [6], which had been chosen after discussion with the author of Sofia-ML and on the basis of their internal experiments.

Creating Runs. To generate the results for our runs, we created lists of 1,000 documents ordered as shown in Figure 3.

Key-Term Highlighting. Key-term highlighting is a feature commonly provided by IR systems, such as Google, to assist human readers in processing information. Following the online example of CAL given as a supplement to [4], shown in Figure 4, we chose to highlight the five highest-scoring words in a document, according to Sofia-ML, in our UI for assessors, as shown in Figure 5.
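As a rough illustration of this highlighting step (a sketch only, not our actual UI code), the function below selects the terms of a document that contribute most to a linear relevance score of the kind Sofia-ML produces. The names top_terms_to_highlight, doc, and model are hypothetical; in our system the weights came from the trained Sofia-ML model.

```python
def top_terms_to_highlight(doc_term_values, model_weights, k=5):
    """Return the k terms whose contribution to the linear relevance score
    is largest, i.e. the terms to highlight in the assessor UI.

    doc_term_values: {term: feature value in this document (e.g. tf-idf)}
    model_weights:   {term: learned weight from the relevance model}
    """
    contributions = {
        term: value * model_weights.get(term, 0.0)
        for term, value in doc_term_values.items()
    }
    return sorted(contributions, key=contributions.get, reverse=True)[:k]

# Toy example: the UI would highlight occurrences of the returned terms
# in the title and abstract shown to the assessor.
doc = {"covid": 0.8, "vaccine": 0.6, "mask": 0.4, "transmission": 0.5,
       "the": 0.1, "weather": 0.2}
model = {"covid": 2.1, "vaccine": 1.7, "mask": 0.9, "transmission": 1.3,
         "weather": -0.4}
print(top_terms_to_highlight(doc, model))
```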
Results and Discussion. Table 1 shows the specifications of our system for each round of TREC-COVID, and Table 2 shows our results. From these, we are able to make some interesting observations:

1. Despite our human assessors having provided more labelled documents in Round 2 than in Round 1, our performance decreased. One possible explanation is that, through the use of the key-term highlighting feature, our human assessor(s) traded label quality for quantity, resulting in an overall poorer model.

2. Despite being able to provide more labelled documents in Round 5 than in Round 4, our performance once again decreased. One possible explanation is that we did not perform the quality control necessary when adding human assessors, once again trading label quality for quantity and resulting in overall poorer performance.

3. The runs ordered by method (iii) of Figure 3 consistently outperformed our other runs. This could imply that the documents judged not relevant by our assessors are still more relevant than Sofia-ML's labelling of unseen documents.

Conclusion. In this paper, we report on our participation in Rounds 1 through 5 of the TREC 2020 COVID Track, describing our approach, results, and lessons learned. We initially used CAL [4], implemented using tools from the BMI tool kit [6], with ourselves as the annotators. The large human labelling effort required by our system motivated us to implement a key-term highlighting feature, to adopt S-CAL [3], and to recruit more human assessors. The results in Table 3 show us to be among the top-scoring manual runs and competitive within all categories of submissions throughout all rounds. Our results in Table 2 also raise an age-old question of quantity versus quality when it comes to data in IR.

Table 1: System specifications, submitted runs, and notes for each round.

Round 1
  System: Document set processing, CAL, 1 assessor
  Runs: xj4wang run1: ordered by method (i)
  Notes: Being pressed for time, we were unable to reach our stopping condition, prematurely stopping after 40 document assessments for each topic. Using sort -rn instead of sort -rg resulted in documents with exponentially low confidence being sorted to the top during both the assessing process and the run creation.

Round 2
  System: Same as Round 1, + key-term highlighting
  Runs: xj4wang run3: ordered by method (i)
  Notes: Being pressed for time, we were unable to reach our stopping condition, prematurely stopping after 60 document assessments for each topic.

Round 3
  System: Same as Round 2, ± switching out CAL for S-CAL, + 1 additional assessor (2 in total)
  Runs: xj4wang run1: ordered by method (iii); xj4wang run2: ordered by method (ii); xj4wang run3: ordered by method (i)
  Notes: Being pressed for time, we were unable to reach our stopping condition for every topic.

Round 4
  System: Same as Round 3, + 1 additional assessor (3 in total)
  Runs: Same as Round 3

Round 5
  System: Same as Round 4, + 2 additional assessors (5 in total)
  Runs: Same as Round 3

References

[1] UWaterlooMDS at the TREC 2019 Decision Track.
[2] Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery.
[3] Scalability of Continuous Active Learning for Reliable High-Recall Text Classification.
[4] Continuous Active Learning for TAR.
[5] TREC 2016 Total Recall Track Overview.
[6] Total Recall Track Tools Architecture Overview.
[7] TREC 2015 Total Recall Track Overview.
[8] TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection.
[9] Effective User Interaction for High-Recall Retrieval: Less Is More.

Acknowledgements. A special thanks goes to Gordon Cormack for his valuable guidance and to Anmol Singh for his insights. We would also like to thank Charlotte Stinson, Eric Sheen, and Solaiappan Alagappan for the time and effort they spent assessing these documents.