key: cord-0134518-h12imgn2
title: COUGH: A Challenge Dataset and Models for COVID-19 FAQ Retrieval
authors: Zhang, Xinliang Frederick; Sun, Heming; Yue, Xiang; Lin, Simon; Sun, Huan
date: 2020-10-24
sha: 4a7066fb8ad3f5efbd35b55cdda2026ef0d9e0b0
doc_id: 134518
cord_uid: h12imgn2

We present a large, challenging dataset, COUGH, for COVID-19 FAQ retrieval. Similar to a standard FAQ dataset, COUGH consists of three parts: FAQ Bank, Query Bank and Relevance Set. The FAQ Bank contains ~16K FAQ items scraped from 55 credible websites (e.g., CDC and WHO). For evaluation, we introduce the Query Bank and Relevance Set, where the former contains 1,236 human-paraphrased queries while the latter contains ~32 human-annotated FAQ items for each query. We analyze COUGH by testing different FAQ retrieval models built on top of BM25 and BERT, among which the best model achieves 48.8 under P@5, indicating a great challenge presented by COUGH and encouraging future research for further improvement. Our COUGH dataset is available at https://github.com/sunlab-osu/covid-faq.

Many institutional websites today maintain an FAQ page to help users find relevant information for commonly asked questions. The FAQ retrieval task is defined as ranking FAQ items {(q_i, a_i)} from a collection given a user query Q (Karan and Šnajder, 2016). In contrast to common Information Retrieval (IR), FAQ retrieval introduces three new challenges: 1) brevity of FAQ texts in comparison with IR documents; 2) need for topic-specific knowledge; 3) usage of the new question field in FAQ items (Karan and Šnajder, 2016; Sakata et al., 2019). However, FAQ retrieval is under-studied compared with other IR applications such as open-domain QA (Chen and Yih, 2020).

In this work, we specifically study FAQ retrieval for COVID-19, a contagious and fatal pandemic which is still evolving on a daily basis. Many websites, such as those of the CDC and WHO, provide quality information on COVID-19 and update their FAQ pages regularly. To gain better insights into FAQ retrieval research and advance COVID-19 information search, we present an FAQ dataset, COUGH, consisting of an FAQ Bank, a Query Bank and a Relevance Set, following the standard of constructing an FAQ dataset (Manning et al., 2008). The FAQ Bank contains 15,919 FAQ items scraped from 55 authoritative institutional websites (see the full list in Tables A4 and A5). COUGH covers a wide range of topics on COVID-19, from general information about the virus to specific COVID-related instructions (e.g., for a healthy diet). For evaluation, we further construct the Query Bank and Relevance Set, including 1,236 crowd-sourced queries and their relevance to a set of FAQ items judged by annotators. Examples from COUGH are shown in Figure 1.

Our dataset poses several new challenges to existing methods (e.g., answer fields are longer, noisier and harder to match than question fields). The diversity of FAQ items, reflected in varying query forms and lengths as well as in narrative styles, also contributes to these challenges.

Table 1: Comparison of COUGH with representative counterparts. *: Extracted from existing resources (e.g., the COVID-19 Twitter dataset (Chen et al., 2020)). **: Not applicable, either not in English or not publicly available.

The contribution of this work is two-fold. First, we construct a challenging dataset, COUGH, to aid the development of COVID-19 FAQ retrieval models. Second, we evaluate various FAQ retrieval models across different settings, explore their limitations, and encourage future work along this line.
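For concreteness, the three components can be pictured as simple records like the minimal sketch below. The field names and example values are our own illustration rather than the released schema; the question text is borrowed from a Figure 1 example quoted later in the paper, and the scores follow the 4-point annotation scheme described in Section 3.

```python
# Illustrative sketch of COUGH's three parts; field names are assumptions,
# not the schema of the released files.

faq_item = {
    "question": ("Can I get sick with COVID-19 from touching food, the food "
                 "packaging, or food contact surfaces, if the coronavirus was present on it?"),
    "answer": "Currently there is no evidence that ...",  # illustrative, truncated
    "source": "Food and Drug Administration",             # one of the 55 scraped websites
    "language": "en",                                      # 19 languages in total
    "form": "question",                                    # "question", "query string", or "forum"
}

query_item = {
    "query": "Is it possible to catch the coronavirus from food wrappers?",  # a human paraphrase
    "template_id": 17,                                     # hypothetical id of the query template
}

relevance_item = {
    "query_id": 42,
    "faq_id": 1337,
    "raw_scores": [4, 3, 4],   # at least 3 annotations per (query, FAQ item) tuple, on a 1-4 scale
    "mean_score": 3.67,
}
```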
COVID-19 & FAQ Datasets. Since the outbreak of COVID-19, the community has witnessed many datasets released to advance research on COVID-19, for example, CORD-19 (Wang et al., 2020), CODA-19 (Huang et al., 2020), COVID-Q (Wei et al., 2020), Weibo-Cov (Hu et al., 2020), and the Twitter dataset (Chen et al., 2020). All of them aim to aggregate resources to combat COVID-19. The works most related to ours are Sun and Sedoc (2020) and Poliak et al. (2020), both of which constructed a collection of COVID-19 FAQs by scraping authoritative websites. However, the dataset in the former work is not available yet and the latter work does not evaluate models on their dataset, so there is still a great need to understand how existing models would perform on the COVID-19 FAQ retrieval task. In the open domain, several FAQ datasets appeared recently, such as FAQIR (Karan and Šnajder, 2016), StackFAQ (Karan and Šnajder, 2018) and LocalGov (Sakata et al., 2019). Unfortunately, as shown in Table 1, the scale of existing FAQ datasets is too small, and their answers are much shorter than those in COUGH, which may not characterize the difficulty of FAQ retrieval tasks in real-world scenarios. Moreover, in contrast to all prior datasets, COUGH covers multiple query forms (e.g., question and query string forms) and has many annotated FAQs for each user query, whereas queries in existing FAQ datasets are limited to the question form and have far fewer annotations.

FAQ Retrieval Methods. FAQ retrieval focuses on retrieving the most-matched FAQ items given a user query (Karan and Šnajder, 2018).

We developed scrapers adapted from Poliak et al. (2020) and added special features to the COUGH dataset.

Web scraping: We collect FAQ items from authoritative international organizations, state governments and other credible websites, including reliable encyclopedias and medical forums. Moreover, we scrape three types of FAQs: question (i.e., an interrogative statement), query string (i.e., a string of words to elicit information) and forum (FAQs scraped from medical forums) forms. Inspired by Manning et al. (2008), we loosen the constraint that queries must be in question form since we want to study a more generic and challenging problem. We also scrape 6,768 non-English FAQs to increase language diversity. Overall, we scraped 15,919 FAQ items covering all three forms and 19 languages.
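As a rough illustration of this scraping step (not the authors' released scrapers), a per-site scraper could look like the sketch below; the URL and CSS selectors are hypothetical placeholders that would need to be tailored to each website.

```python
# Hedged sketch of a per-site FAQ scraper; the URL and selectors are hypothetical.
import requests
from bs4 import BeautifulSoup

def scrape_faq_page(url, question_selector, answer_selector, source, form="question"):
    """Return a list of FAQ items from one page; selectors are site-specific."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    questions = [q.get_text(strip=True) for q in soup.select(question_selector)]
    answers = [a.get_text(strip=True) for a in soup.select(answer_selector)]
    return [
        {"question": q, "answer": a, "source": source, "form": form}
        for q, a in zip(questions, answers)
    ]

# Hypothetical usage; a real deployment needs one selector pair per source website.
items = scrape_faq_page(
    "https://example.org/covid-faq",      # placeholder URL
    question_selector="div.faq h3",
    answer_selector="div.faq p.answer",
    source="Example Health Department",
)
```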
Following Manning et al. (2008) and Karan and Šnajder (2016), we do not crowdsource queries from scratch, but instead ask annotators to paraphrase our provided query templates. That way, we ensure that 1) collected queries are pertinent to COVID-19; 2) collected queries are not too simple; 3) the chance of getting similar user queries is reduced.

Phase 1: Query Template Creation: We sample 5% of FAQ items from each English non-forum source (each source contributes at least one item to ensure wide topic coverage, and similar sampled FAQ items are removed) and use the question part as the template. For example, the templates of the two paraphrased queries in Figure 1 are "Can humans become infected with the COVID-19 from an animal source?" and "Can I get sick with COVID-19 from touching food, the food packaging, or food contact surfaces, if the coronavirus was present on it?".

Phase 2: Paraphrasing for Queries: In this phase, each annotator is expected to give three paraphrases for each query template. Besides providing shallow paraphrases (e.g., word substitution), annotators are encouraged to give deep paraphrases (i.e., grammatically different but semantically similar or identical) to simulate the noisy and diverse environment of real scenarios. In the end, we obtain 1,236 human-paraphrased user queries.

Phase 1: Initial Candidate Pool Construction: For each user query, as suggested by previous work (Manning et al., 2008; Karan and Šnajder, 2016; Sakata et al., 2019), we run 4 models (see Section 5.2), BM25 (Q-q), BM25 (Q-q+a), BERT (Q-q), and BERT (Q-a) fine-tuned on COUGH, to instantiate a candidate FAQ pool. Each model complements the others and contributes its top-10 relevant FAQ items. We then take the union to remove duplicates, giving an average pool size of 32.2.

Phase 2: Human Annotation: Each annotator gives each (Query, FAQ item) tuple a score based on the annotation scheme (i.e., 4/Matched, 3/Useful, 2/Useless and 1/Non-relevant; Table A2 details the meaning of these four scores), adapted from Karan and Šnajder (2016) and Sakata et al. (2019). To alleviate annotation bias, each tuple has at least 3 annotations. In the finalized Relevance Set, we keep all raw scores and include: 1) the mean of the annotations; 2) four suggested aggregation schemes to obtain binary labels (detailed in Appendix B). Users of COUGH can also try other aggregation measures. Among the 1,236 user queries, there are 35 "unanswerable" queries that have no associated positive FAQ item.

Besides the generic goals of large size, diversity, and low noise, COUGH features 5 additional aspects.

Varying Query Forms: As indicated in Table 2, there are multiple query forms. In evaluation, we include both question (Questions 1 and 3 in Figure 1) and query string (Question 2 in Figure 1) forms. These two distinct forms differ in query format (interrogative vs. declarative), average answer length (123.89 vs. 89.60) and topics. The question form usually relates to general information about the virus, while the query string form often searches for more specific instructions concerning COVID-19 (e.g., a healthy diet during the pandemic).

Answer Nature: Table 1 shows that the answer fields in COUGH are much longer than those in any prior dataset. We also observe that answers might contain content which is not directly pertinent to the query, which partially explains their length. For example, in COUGH, the answer to the query "What is novel coronavirus" contains extra information comparing it with other viruses. Such lengthy and noisy answers manifest the difficulty of FAQ retrieval in real scenarios.

Language Correctness in Query Bank: Most queries in our Query Bank are properly spelled and grammatically correct, so we can prioritize investigating model performance under a less noisy setting. Furthermore, our dataset can support a controlled study on the impact of spelling and grammatical errors: one can simulate various kinds of spelling and grammatical errors, inject them in a controlled manner into the Query Bank, and systematically evaluate how model performance changes under different levels of noise.
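To make this controlled-noise idea concrete, the sketch below shows one hypothetical way to inject character-level typos at a chosen rate; the noise model is our own illustration and is not shipped with COUGH.

```python
# Hedged sketch of controlled typo injection into Query Bank queries.
# The noise model (drop, duplicate, or case-flip random letters) is our own choice.
import random

def inject_typos(query: str, noise_rate: float, seed: int = 0) -> str:
    """Corrupt roughly `noise_rate` of the alphabetic characters in a query."""
    rng = random.Random(seed)
    out = []
    for ch in query:
        if ch.isalpha() and rng.random() < noise_rate:
            op = rng.choice(["drop", "double", "swap_case"])
            if op == "drop":
                continue
            elif op == "double":
                out.extend([ch, ch])
            else:
                out.append(ch.swapcase())
        else:
            out.append(ch)
    return "".join(out)

# Evaluate the same retriever on increasingly noisy copies of the Query Bank.
for rate in (0.0, 0.05, 0.10, 0.20):
    noisy_queries = [inject_typos(q, rate) for q in ["Why are we social distancing?"]]
    # ... run retrieval and recompute P@5 / MAP / nDCG@5 on noisy_queries ...
```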
Large-scale Relevance Annotation: Many existing FAQ datasets overlooked annotation scale (Table 1); this hurts evaluation reliability, since many true positive (Query, FAQ item) tuples are omitted. Following Manning et al. (2008), for each user query we constructed a large-scale candidate pool to reduce the chance of missing true positive tuples. The annotation procedure yielded 39,760 annotated tuples, each of which is annotated by at least 3 people to reduce annotation bias.

Multilinguality: COUGH includes 6,768 FAQ items covering 18 non-English languages; statistics of non-English items can be found in Table 2. Figure 2 shows the language distribution (excluding English) of FAQ items in the COUGH dataset. Like English FAQ items, non-English FAQ items are also presented in both question and query string forms. The detailed breakdown of the non-English portion by source and language is shown in Table A5. However, due to budget limits, we did not proceed to the annotation phase for non-English data, so there are no non-English human-paraphrased user queries or relevance judgements.

Annotation Quality: We discard low-quality paraphrased queries (~24%) and relevance annotations (~11%). Further, we show that ~74% of annotated tuples have high agreement, where multiple people vote for the same relevance class. More details of quality checking can be found in Section 8.1.

In this work, we focus on unsupervised sparse and dense retrievers and discuss their limitations. Supervised learning is less popular for this task since it is too costly to collect a large-scale Query Bank and its associated relevance judgements (Sakata et al., 2019; Mass et al., 2020). Further, there are 3 configurable modes, Q-q, Q-a and Q-q+a, where a user query Q can be matched against the question q, the answer a, or the concatenation q+a.

(1) BM25 is a nonlinear combination of term frequency, document frequency and document length. (2) BERT (Devlin et al., 2019) is a pretrained language model. We use its variant, Sentence-BERT (Reimers and Gurevych, 2019), to encode Q, q and a separately to generate sentence representations.

Fine-tuning: Similar to Henderson et al. (2017) and Karpukhin et al. (2020), we leverage in-batch negatives to fine-tune BERT on the FAQ Bank. For the Q-q mode, we use GPT-2 (Radford et al., 2019) to generate synthetic questions to match with Q. For the Q-a mode, an FAQ item (q, a) itself is a positive pair.

Re-rank: In the Q-a mode, answers are quite long, so the importance of selecting the most-related spans from relevant answers to catch the nuance is amplified. As detailed in Reimers and Gurevych (2019) and Humeau et al. (2020), a cross-encoder can perform self-attention between query and answer, resulting in a richer extraction mechanism. We re-rank the top-10 retrieved answers using a cross-encoder BERT.

(3) CombSum (Mass et al., 2020) first computes three matching scores between the user query and FAQ items via the BM25 (Q-q), BERT (Q-q) and fine-tuned BERT (Q-a) models. The three scores are then normalized and combined by averaging. We also evaluate a variant with no BERT (Q-a) included.

Evaluation Setting: For the scope of this work, we only evaluate on the 1,201 "answerable" English non-forum queries, and leave the "unanswerable", non-English and forum ones for future research, as great challenges have already been observed under the current setting. However, we encourage investigators to utilize those three categories for other potential applications (e.g., multi-lingual IR, transfer learning in IR).

Evaluation Metrics: Following previous work (Manning et al., 2008; Karan and Šnajder, 2016, 2018; Sakata et al., 2019; Mass et al., 2020), we adopt P@5 (Precision at 5), MAP (Mean Average Precision) and nDCG@5 (Normalized Discounted Cumulative Gain) as evaluation metrics.
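For readers reimplementing the evaluation, a minimal sketch of these three metrics is given below; it assumes binary relevance labels for P@5 and MAP and graded 1-4 scores for nDCG@5, and is our own illustration rather than the authors' evaluation script.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k=5):
    """P@k: fraction of the top-k retrieved FAQ items that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for i in top_k if i in relevant_ids) / k

def average_precision(ranked_ids, relevant_ids):
    """AP for one query; MAP is the mean of AP over all queries."""
    hits, score = 0, 0.0
    for rank, i in enumerate(ranked_ids, start=1):
        if i in relevant_ids:
            hits += 1
            score += hits / rank
    return score / max(len(relevant_ids), 1)

def ndcg_at_k(ranked_ids, graded_relevance, k=5):
    """nDCG@k with graded relevance (e.g., the 1-4 annotation scores)."""
    def dcg(gains):
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))
    gains = [graded_relevance.get(i, 0) for i in ranked_ids[:k]]
    ideal = sorted(graded_relevance.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if ideal else 0.0
```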
6 Analysis

Quantitative Analysis. Models' results under aggregation scheme A (annotated tuples with a mean score ≥ 3 are positives) are listed in Table 3; results under the other schemes are in Appendix B.1. The current best P@5 and MAP, 48.8 and 37.3, are not satisfying, showing large room for improvement and confirming that COUGH is challenging. We observe that the Q-q mode consistently performs better than the Q-a mode. This is because question fields are more similar to user queries than answer fields. As shown in Section 4, the answer nature (lengthy and noisy), albeit well characterizing the FAQ retrieval task in real scenarios, does bring up a great challenge. Utilizing the cross-encoder for re-ranking can yield better results since it can select query-aware features from answers; this is a possible step towards handling long and noisy answers better. We also find that fine-tuning under the Q-a mode can improve performance (e.g., from 9.6 to 37.1 under P@5), but might hurt it under the Q-q mode due to noise introduced by synthetic queries. Moreover, the best overall performances are achieved by BERT (Q-q) and CombSum, which is in line with Mass et al. (2020). However, CombSum without fine-tuned BERT (Q-a) performs worse than the original one, indicating that answer fields can serve as supplementary resources for information missing from the question field.

Qualitative Analysis. To understand fine-tuned BERT (Q-q) better, we conduct case analyses in Table 4 to show its major types of errors, hoping to further improve it in the future. Currently, fine-tuned BERT (Q-q) suffers from the following issues: 1) it is biased towards responses with similar texts (e.g., "antibody tests" and "antibody testing"); 2) it fails to capture semantic similarities in complex environments (e.g., pragmatic reasoning is required to understand that "limited ability" indicates results are not accurate for diagnosing COVID-19). Interesting future work includes: 1) handling long and noisy answer fields, e.g., via salient span selection; 2) further improving semantic understanding or reasoning skills, beyond lexical match.

In this paper, we introduce COUGH, a large, challenging dataset for COVID-19 FAQ retrieval. COUGH features varying query forms, long and noisy answers, and multilinguality. COUGH also serves as a better evaluation benchmark since it has quality, larger-scale relevance annotations. We discuss the limitations of current FAQ retrieval models via comprehensive experiments, and encourage future research to further improve FAQ retrieval.

We thank our hired AMT workers for their annotations. We thank all anonymous reviewers for their helpful comments. We thank Emmett Jesrani for revising an earlier version of the paper. This research was sponsored in part by the Patient-Centered Outcomes Research Institute Funding ME-2017C1-6413, the Army Research Office under cooperative agreement W911NF-17-1-0412, NSF Grant IIS1815674, NSF CAREER 1942980, and the Ohio Supercomputer Center (Center, 1987). The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notice herein.

Table 5: Comparison of base costs to reference tasks. Base Cost per Unit: the cost of annotating one single item (e.g., one QA pair, one paraphrase). All costs are in US cents. *: Additional bonuses were awarded to quality annotators; for example, for our relevance judgement task, we award a 10-cent bonus for every 100 quality annotations.
8 Ethical Considerations

8.1 Dataset

IRB approval. All FAQ items were collected in a manner consistent with the terms of use of the original sources and the intellectual property and privacy rights of the original authors of the texts (i.e., the source owners). This project was approved by the IRB (institutional review board) at our institution as Exempt Research, i.e., a human subject study that presents no greater than minimal risk to participants. We consulted data officers at our institution about copyrights. They informed us that "Website content is generally copyrighted. However, you could claim the concept of fair use which allows the use of copyrighted material without permission from the copyright holder when it is used for research, scholarship, and teaching". We also consulted Section 107 of the U.S. Copyright Act (https://www.copyright.gov/title17/92chap1.html#107) and ensured that our collection action fell under the fair use category. We release our dataset under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (https://creativecommons.org/licenses/by-nc-sa/4.0/).

Annotation via crowdsourcing. Crowdsourcing involved in this work was conducted on Amazon Mechanical Turk (AMT). In the crowdsourcing step, all participants were required to read and sign an informed consent form before participating and were not allowed to proceed without signing. The AMT mechanism, which automatically anonymizes annotators' identities, ensures that participants' privacy rights were respected in the crowdsourcing process. We determined the compensation for each annotation task by evaluating similar tasks on AMT. Table 5 shows the costs of reference tasks at the time we published our tasks. Overall, taking cognitive complexity into consideration, our base cost per unit is on the same level as or higher than the reference tasks. Thus, we can safely conclude that crowd workers participating in our annotation tasks were fairly compensated. The overall total cost is $2,683. Considering our competitive base cost per unit and the additional generous bonuses, we believe that participating annotators were well motivated to contribute high-quality annotations.

Quality check. During the crowdsourcing phase, we filtered out low-quality annotations. Specifically, we only kept 76.45% of human-paraphrased queries for the construction of the Query Bank, after manually checking every single paraphrased query. When constructing the Relevance Set, for each annotator, we sampled a certain number of annotations; if the sampled annotations did not pass the screening, we dropped all annotations made by that annotator and republished the work. After such iterative checking, we kept 89.20% of annotations in the end. After crowdsourcing, we conducted post-hoc quality checking on both the Query Bank and the Relevance Set. We manually checked all 1,236 user queries and found that all of them make sense, are related to COVID-19, and are properly written. Due to the subjectivity of the relevance judgement task, we evaluated the quality of the relevance annotations in two ways: 1) we find that 73.5% of (Query, FAQ item) tuples have high agreement, where multiple people vote for the same relevance class; 2) we re-judged the relevance of 1,000 randomly sampled tuples by hiring two research assistants, and the matching level is 76.5%. Overall, the post-hoc checking confirms that our COUGH dataset is of high quality.
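As an illustration of the first check, the sketch below computes an agreement rate under the assumption that "high agreement" means a strict majority of annotators chose the same relevance class; the exact criterion behind the 73.5% figure may differ.

```python
# Hedged sketch of a per-tuple agreement rate; the definition of "high agreement"
# (strict majority on the same class) is our own assumption.
from collections import Counter

def high_agreement_rate(annotations_per_tuple):
    """Fraction of (Query, FAQ item) tuples where a majority picked the same class."""
    def majority_agrees(scores):
        _, top_count = Counter(scores).most_common(1)[0]
        return top_count > len(scores) / 2
    agreed = sum(1 for scores in annotations_per_tuple if majority_agrees(scores))
    return agreed / len(annotations_per_tuple)

# Three annotated tuples with raw 1-4 scores: two of them reach majority agreement.
print(high_agreement_rate([[4, 4, 3], [2, 3, 4], [1, 1, 1]]))  # -> 0.666...
```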
Annotation Protocols. To further help ethics committees and the public judge the fairness of our annotation process, the annotation protocols for both annotation tasks are listed in Appendix A. Figures A1 and A2 show the interfaces designed for the annotation process.

We published our annotation batches on the Amazon Mechanical Turk platform. The annotation protocols are provided below to facilitate future research in FAQ retrieval. Figures A1 and A2 show the user interfaces designed for both annotation tasks.

Protocol for the Query Bank construction task: For this task, you are expected to give one shallow paraphrase and two deep paraphrases for the query template. Note that the query can be either in question form or query string form. Shallow paraphrase: applying word substitution, sentence reordering and other lexical tricks (e.g., extracting salient phrases from the response) to the original query to come up with another query without changing the meaning. Deep paraphrase: the paraphrased query should look dramatically (i.e., grammatically) different from the original query, which is more than shallow paraphrasing; however, the paraphrased query should share the same (or almost the same) semantic meaning as the original query.

Protocol for the Relevance Set construction task: For this task, you will see an FAQ item retrieved by an automatic tool for a particular user query, and your job is to judge the relevance of the FAQ item based on the annotation scheme shown below. Matched: the candidate FAQ perfectly matches the user query (the query part of the FAQ is semantically identical to the user query, and the answer part of the FAQ well answers the user query). Useful: the candidate FAQ doesn't perfectly match the user query but may still give some or enough information to help answer the user query (the query part of the FAQ is semantically similar to the user query, and you can either extract or infer some information from the answer which could be useful to the user query; or, alternatively, the candidate FAQ provides too much extra information which is not necessary). Useless: the candidate FAQ is topically related to the user query but doesn't provide useful information (the query part of the FAQ is somewhat related to the user query, but you can't get any useful information out of the answer part to confidently answer the user query). Non-relevant: the candidate FAQ is completely unrelated to the query.

In this work, we introduce four aggregation schemes to obtain binary labels. A. Annotated (Query, FAQ item) tuples with a mean score ≥ 3 are positives. B. Annotated (Query, FAQ item) tuples with a mean score > 3 are positives. C. Annotated (Query, FAQ item) tuples that have at least one "Matched" annotation are positives. D. For each annotated (Query, FAQ item) tuple, we convert "Matched" and "Useful" to positive annotations, and "Useless" and "Non-relevant" to negative annotations; we then apply majority voting over the converted binary annotations. Results based on aggregation schemes B, C and D are shown in Tables A1, A2 and A3, respectively. Results based on aggregation scheme A are shown in Table 3.
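A minimal sketch of the four schemes, assuming the raw 1-4 scores per tuple as input, is given below; the tie-breaking in scheme D's majority vote is not specified above, so treating ties as negative is our own assumption.

```python
def aggregate(raw_scores, scheme="A"):
    """Turn raw 1-4 scores for one (Query, FAQ item) tuple into a binary label."""
    mean = sum(raw_scores) / len(raw_scores)
    if scheme == "A":
        return mean >= 3
    if scheme == "B":
        return mean > 3
    if scheme == "C":
        return any(s == 4 for s in raw_scores)            # at least one "Matched"
    if scheme == "D":
        positives = sum(1 for s in raw_scores if s >= 3)  # Matched/Useful -> positive
        return positives > len(raw_scores) / 2            # majority vote; tie -> negative (assumption)
    raise ValueError(f"unknown scheme {scheme}")

print(aggregate([4, 3, 2], scheme="A"))  # mean 3.0 -> True
print(aggregate([4, 3, 2], scheme="B"))  # mean 3.0 -> False
```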
We first preprocess user queries and FAQ items with the NLTK Porter stemmer (https://www.nltk.org/). For the baselines, including BM25 (https://pypi.org/project/rank-bm25/) and Sentence-BERT (https://github.com/UKPLab/sentence-transformers; we use the distilbert-base-nli-stsb-quora-ranking model card), we take the standard off-the-shelf versions. More specifically, we keep the default k1 of 2 and b of 0.75 for BM25 over the Q-q, Q-a and Q-q+a settings. When deploying the synthetic query generation model (i.e., GPT-2), hyper-parameters are set as instructed by Mass et al. (2020) (see their Section 3.4). We adopt the in-batch negatives training strategy to fine-tune both Sentence-BERT and the cross-encoder BERT. For both BERT models, we use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-5 and fine-tune for up to 10 epochs. We set the batch sizes to 24 and 4 for Sentence-BERT and the cross-encoder BERT, respectively. All experiments are conducted on a single GeForce GTX 2080 Ti 12 GB GPU (with significant CPU resources).

Figure A1: User interface for the Query Bank construction task (the HIT shows the query template, collects three paraphrases, and advertises a $0.6 bonus for completing 30 HITs with high quality).

Figure A2: User interface for the Relevance Set construction task (the HIT shows a user query, e.g., "Why are we social distancing?", together with a candidate FAQ answer about limiting in-person interactions to slow the spread of disease, and asks the annotator to choose Matched, Useful, Useless or Non-relevant).

Table A4 (excerpt): per-source FAQ counts, e.g., California Department of Health 28, Government of Canada 131, Children's Hospital Los Angeles 73, Cleveland Clinic 15, CNN 112, Government of Colorado 66, Delaware Department of Health 71, Food and Drug Administration ...
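Tying together the implementation details above, the sketch below shows how the off-the-shelf retrievers and the in-batch-negative fine-tuning could be wired up with the stated hyper-parameters. It is our reconstruction, not the authors' released code: MultipleNegativesRankingLoss from the sentence-transformers library is used as one realization of in-batch negatives, and the two FAQ items are illustrative.

```python
# Hedged sketch of the Appendix C setup: BM25 and Sentence-BERT retrieval in the
# Q-q mode, plus in-batch-negative fine-tuning for the Q-a mode.
from torch.utils.data import DataLoader
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, InputExample, losses, util
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, whitespace-tokenize, and Porter-stem a query or FAQ field."""
    return [stemmer.stem(tok) for tok in text.lower().split()]

faq_items = [  # two illustrative (question, answer) pairs
    ("What is a novel coronavirus?",
     "A novel coronavirus is a new coronavirus that has not been previously identified."),
    ("Why are we social distancing?",
     "We need to limit in-person interactions to slow the spread of disease ..."),
]
questions = [q for q, _ in faq_items]

# Sparse retriever: BM25 over the question field (Q-q mode), k1=2, b=0.75 as stated above.
bm25 = BM25Okapi([preprocess(q) for q in questions], k1=2.0, b=0.75)
bm25_scores = bm25.get_scores(preprocess("Why are we social distancing?"))

# Dense retriever: off-the-shelf Sentence-BERT with the stated model card (Q-q mode).
sbert = SentenceTransformer("distilbert-base-nli-stsb-quora-ranking")
query_emb = sbert.encode("Why are we social distancing?", convert_to_tensor=True)
faq_emb = sbert.encode(questions, convert_to_tensor=True)
dense_scores = util.cos_sim(query_emb, faq_emb)[0]

# In-batch negatives for the Q-a mode: each (q, a) pair is a positive, and the other
# answers in the same batch act as negatives (batch size 24, lr 1e-5, up to 10 epochs).
train_examples = [InputExample(texts=[q, a]) for q, a in faq_items]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=24)
loss = losses.MultipleNegativesRankingLoss(sbert)
sbert.fit(
    train_objectives=[(train_loader, loss)],
    epochs=10,
    optimizer_params={"lr": 1e-5},
)
```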