title: BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models authors: Thakur, Nandan; Reimers, Nils; Rücklé, Andreas; Srivastava, Abhishek; Gurevych, Iryna date: 2021-04-17 Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to help researchers broadly evaluate the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We leverage a careful selection of 18 publicly available datasets from diverse text retrieval tasks and domains and evaluate 10 state-of-the-art retrieval systems, including lexical, sparse, dense, late-interaction and re-ranking architectures, on the BEIR benchmark. Our results show that BM25 is a robust baseline and that re-ranking and late-interaction-based models on average achieve the best zero-shot performance, however at high computational cost. In contrast, dense and sparse retrieval models are computationally more efficient but often underperform other approaches, highlighting the considerable room for improvement in their generalization capabilities. We hope this framework allows us to better evaluate and understand existing retrieval systems, and contributes to accelerating progress towards more robust and generalizable systems in the future. BEIR is publicly available at https://github.com/UKPLab/beir. Major natural language processing (NLP) problems rely on a practical and efficient retrieval component as a first step to find relevant information. Challenging problems include open-domain question answering [8], claim verification [60], duplicate question detection [78], and many more. Traditionally, retrieval has been dominated by lexical approaches like TF-IDF or BM25 [55]. However, these approaches suffer from the lexical gap [5] and can only retrieve documents containing keywords that also appear in the query. Further, lexical approaches treat queries and documents as bags of words and do not take word order into consideration. So far, it is unclear how well existing trained neural models perform on other text domains or textual retrieval tasks. Even more importantly, it is unclear how well different approaches, like sparse embeddings vs. dense embeddings, generalize to out-of-distribution data. In this work, we present a novel, robust and heterogeneous benchmark called BEIR (Benchmarking IR), comprising 18 retrieval datasets for the comparison and evaluation of model generalization. Prior retrieval benchmarks [19, 50] suffer from a comparatively narrow evaluation, focusing either on a single task, like question answering, or on a certain domain. In BEIR, we focus on diversity: we include nine different retrieval tasks: fact checking, citation prediction, duplicate question retrieval, argument retrieval, news retrieval, question answering, tweet retrieval, bio-medical IR, and entity retrieval. Further, we include datasets from diverse text domains, datasets that cover broad topics (like Wikipedia) and specialized topics (like COVID-19 publications), different text types (news articles vs.
Tweets), datasets of various sizes (3.6k–15M documents), and datasets with different query lengths (average query length between 3 and 192 words) and document lengths (average document length between 11 and 635 words). We use BEIR to evaluate ten diverse retrieval methods from five broad architectures: lexical, sparse, dense, late-interaction, and re-ranking. From our analysis, we find that no single approach consistently outperforms all others on all datasets. Further, we notice that the in-domain performance of a model does not correlate well with its generalization capabilities: models fine-tuned with identical training data might generalize differently. In terms of efficiency, we find a trade-off between performance and computational cost: computationally expensive models, like re-ranking and late-interaction models, perform the best. More efficient approaches, e.g., those based on dense or sparse embeddings, can substantially underperform traditional lexical models like BM25. Overall, BM25 remains a strong baseline for zero-shot text retrieval. Finally, we notice that there can be a strong lexical bias in the datasets included in the benchmark, likely because lexical models are predominantly used during the annotation or creation of datasets. This can put non-lexical approaches at an unfair disadvantage. We analyze this for the TREC-COVID [65] dataset: we manually annotate the missing relevance judgements for the tested systems and see a significant performance improvement for non-lexical approaches. Hence, future work requires better, unbiased datasets that allow a fair comparison for all types of retrieval systems. With BEIR, we take an important step towards a single, unified benchmark to evaluate the zero-shot capabilities of retrieval systems. It allows us to study when and why certain approaches perform well, and hopefully steers innovation towards more robust retrieval systems. We release BEIR and an integration of diverse retrieval systems and datasets in a well-documented, easy-to-use and extensible open-source package. BEIR is model-agnostic, welcomes methods of all kinds, and also allows easy integration of new tasks and datasets. More details are available at https://github.com/UKPLab/beir. To our knowledge, BEIR is the first broad, zero-shot information retrieval benchmark. Existing works [19, 50] do not evaluate retrieval in a zero-shot setting in depth; they focus either on a single task, on small corpora, or on a certain domain. This setting hinders the investigation of model generalization across a diverse set of domains and task types. MultiReQA [19] consists of eight question-answering (QA) datasets and evaluates sentence-level answer retrieval given a question. It only tests a single task, and five out of its eight datasets are from Wikipedia. Further, MultiReQA evaluates retrieval over rather small corpora: six out of eight tasks have fewer than 100k candidate sentences, which benefits dense retrieval over lexical retrieval, as previously shown [54]. KILT [50] consists of five knowledge-intensive tasks including a total of eleven datasets. The tasks involve retrieval, but it is not the primary task. Further, KILT retrieves documents only from Wikipedia. Information retrieval is the process of searching for and returning relevant documents for a query from a collection. In this paper, we focus on text retrieval and use document as a cover term for text of any length in the given collection and query for the user input, which can also be of any length.
Traditionally, lexical approaches like TF-IDF and BM25 [55] have dominated textual information retrieval. Recently, there has been strong interest in using neural networks to improve or replace these lexical approaches. In this section, we highlight a few neural approaches and refer the reader to Lin et al. [37] for a recent survey of neural retrieval. Retriever-based: Lexical approaches suffer from the lexical gap [5]. To overcome this, earlier techniques proposed to improve lexical retrieval systems with neural networks. Sparse methods such as docT5query [48] identify document expansion terms using a sequence-to-sequence model that generates possible queries for which the given document would be relevant. DeepCT [11], on the other hand, uses a BERT [13] model to learn relevant term weights in a document and generates a pseudo-document representation. Both methods still rely on BM25 for the remaining parts. Similarly, SPARTA [79] learns token-level contextualized representations with BERT and converts the document collection into an efficient inverted index. More recently, dense retrieval approaches were proposed. They are capable of capturing semantic matches and try to overcome the (potential) lexical gap. Dense retrievers map queries and documents into a shared, dense vector space [18]. This allows document representations to be pre-computed and indexed. A bi-encoder architecture based on pre-trained Transformers has shown strong performance for various open-domain question-answering tasks [19, 31, 35, 43]. This dense approach was recently extended by hybrid lexical-dense approaches, which aim to combine the strengths of both [17, 57, 42]. Another, parallel line of work proposed an unsupervised domain-adaptation approach [35, 43] for training dense retrievers by generating synthetic queries on a target domain. Lastly, ColBERT [32] (Contextualized late interaction over BERT) computes multiple contextualized token-level embeddings for queries and documents and uses a maximum-similarity function for retrieving relevant documents. Re-ranking-based: Neural re-ranking approaches use the output of a first-stage retrieval system, often BM25, and re-rank the retrieved documents to produce a better ranking. Significant performance improvements were achieved with the cross-attention mechanism of BERT [46], however at the cost of high computational overhead [53]. BEIR aims to provide a one-stop zero-shot evaluation benchmark for diverse retrieval tasks. To construct a comprehensive evaluation benchmark, the selection methodology is crucial for collecting tasks and datasets with the desired properties. For BEIR, the methodology is motivated by the following four factors: (i) Diverse tasks: Information retrieval is a versatile task, and the lengths of queries and indexed documents can differ between tasks. Sometimes queries are short, like a keyword, while in other cases they can be long, like a news article. Similarly, indexed documents can sometimes be long and, for other tasks, short like a tweet. (ii) Diverse domains: Retrieval systems should be evaluated on various types of domains, from broad ones like news or Wikipedia to highly specialized ones such as scientific publications in one particular field. Hence, we include domains that are representative of real-world problems and that range from generic to specialized. (iii) Task difficulty: The benchmark should be challenging, so the difficulty of each included task has to be sufficient.
If a task is easily solved by any algorithm, it will not be useful for comparing the various models under evaluation. We reviewed several tasks from the existing literature and selected popular tasks that are recently developed, challenging, and not yet fully solved by existing approaches. (iv) Diverse annotation strategies: Creating retrieval datasets is inherently complex and subject to annotation biases (see Section 6 for details), which hinders a fair comparison of approaches. To reduce the impact of such biases, we selected datasets that have been created in many different ways: some were annotated by crowd-workers, others by experts, and others are based on feedback from large online communities. In total, we include 18 English zero-shot evaluation datasets from 9 heterogeneous retrieval tasks. As the majority of the evaluated approaches are trained on the MS MARCO [45] dataset, we also report performances on this dataset, but do not include the outcome in our zero-shot comparison. We refer the reader to Appendix D, where we motivate each of the 9 retrieval tasks and 18 datasets, with intuitive examples in Table 8. We additionally provide dataset licenses in Appendix E and links to the datasets in Table 5. Table 1 summarizes the statistics of the datasets provided in BEIR. The majority of datasets contain binary relevance judgements, i.e., relevant or non-relevant, and a few contain fine-grained relevance judgements. Some datasets contain only a few relevant documents per query (< 2), while others, like TREC-COVID [65], can contain up to 500 relevant documents for a query. Only 8 out of the 19 datasets (including MS MARCO) have training data, underscoring the practical importance of zero-shot retrieval benchmarking. All datasets except ArguAna [67] have short queries (either a single sentence or 2-3 keywords). Figure 1 shows an overview of the tasks and datasets in the BEIR benchmark. Information retrieval (IR) is ubiquitous: many datasets are available for each task, and there are many more tasks involving retrieval. However, it is not feasible to include all of them in the benchmark. We aimed to cover a balanced mixture of a wide range of tasks and datasets and took care not to overweight a specific task like question answering. Future datasets can easily be integrated into BEIR, and existing models can quickly be evaluated on any new dataset. The BEIR website hosts an actively maintained leaderboard with all datasets and models. The datasets present in BEIR are selected from diverse domains, ranging from Wikipedia, scientific publications, Twitter, and news to online user communities, and many more. To measure the diversity in domains, we compute the domain overlap between all dataset pairs using a weighted Jaccard similarity [26] score over unigram word frequencies. For more details on the theoretical formulation of the similarity score, please refer to Appendix F. Figure 2 shows a heatmap with the pairwise weighted Jaccard scores and a clustered force-directed placement diagram. Nodes (datasets) that are close in this graph have a high word overlap, while nodes far apart have a low overlap. From Figure 2, we observe a rather low weighted Jaccard word overlap across different domains, indicating that BEIR is a challenging benchmark where approaches must generalize well to diverse out-of-distribution domains.
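To make the domain-overlap computation concrete, the following is a minimal sketch of the weighted Jaccard similarity over normalized unigram frequencies (formally defined in Appendix F). The whitespace tokenizer and the toy corpora are illustrative assumptions, not the exact preprocessing used to produce Figure 2.

```python
from collections import Counter

def unigram_freqs(texts):
    """Normalized unigram frequencies of a dataset (simple whitespace tokenization)."""
    counts = Counter(token for text in texts for token in text.lower().split())
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def weighted_jaccard(freqs_s, freqs_t):
    """Weighted Jaccard similarity: sum of element-wise minima over sum of maxima."""
    vocab = set(freqs_s) | set(freqs_t)
    num = sum(min(freqs_s.get(w, 0.0), freqs_t.get(w, 0.0)) for w in vocab)
    den = sum(max(freqs_s.get(w, 0.0), freqs_t.get(w, 0.0)) for w in vocab)
    return num / den if den > 0 else 0.0

# Toy example with two tiny "corpora" (illustrative only)
source = ["covid vaccine efficacy study", "clinical trial of a covid vaccine"]
target = ["climate change claim verification", "sea level rise evidence"]
print(weighted_jaccard(unigram_freqs(source), unigram_freqs(target)))
```

Datasets from very different domains, as in this toy example, yield scores close to zero, which matches the low cross-domain overlap observed in Figure 2.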
The BEIR software provides an easy-to-use Python framework (pip install beir) for model evaluation. It contains extensive wrappers to replicate experiments and evaluate models from well-known repositories, including Sentence-Transformers [53], Transformers [72], Anserini [74], DPR [31], Elasticsearch, ColBERT [32], and Universal Sentence Encoder [75]. It also computes standard ranking metrics, such as nDCG (Normalised Discounted Cumulative Gain), for any top-k hits. One can use the BEIR benchmark to evaluate existing models on new retrieval datasets and to evaluate new models on the included datasets. Datasets are often scattered online and provided in various file formats, making the evaluation of models across datasets difficult. BEIR introduces a standard format (corpus, queries and qrels) and converts existing datasets into this simple, universal data format, allowing faster evaluation on an increasing number of datasets. Depending on the nature and requirements of real-world applications, retrieval tasks can be either precision- or recall-focused. To obtain comparable results across models and datasets in BEIR, we argue that it is important to leverage a single evaluation metric that can be computed comparably across all tasks. Decision-support metrics such as Precision and Recall, which are rank-unaware, are not suitable. Binary rank-aware metrics such as MRR (Mean Reciprocal Rank) and MAP (Mean Average Precision) fail to evaluate tasks with graded relevance judgements. We find that Normalised Discounted Cumulative Gain (nDCG@k) provides a good balance, suitable for tasks with both binary and graded relevance judgements. We refer the reader to Wang et al. [71] for the theoretical advantages of the metric. For our experiments, we utilize the Python interface of the official TREC evaluation tool [63] and compute nDCG@10 for all datasets.
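The evaluation workflow described above can be sketched with the released Python package. The example below is a minimal sketch: the module paths, dataset URL, and model name follow the public repository's examples as best we recall them and should be treated as assumptions that may differ across package versions.

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
from beir.retrieval import models

# Download a BEIR dataset in the standard (corpus, queries, qrels) format
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# Zero-shot evaluation of a dense retriever trained on MS MARCO
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="dot", k_values=[10, 100])
results = retriever.retrieve(corpus, queries)

# nDCG@10 is the main metric reported in the paper
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg["NDCG@10"])
```

Because every dataset is converted to the same (corpus, queries, qrels) format, swapping in a different dataset or retriever only changes the download URL or the model wrapper, not the evaluation code.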
We use BEIR to compare diverse, recent, state-of-the-art retrieval architectures with a focus on transformer-based neural approaches. We evaluate publicly available pre-trained checkpoints, which we provide in Table 6. Due to the length limitations of transformer-based networks, we use only the first 512 word pieces of each document in our experiments across all neural architectures. We group the models based on their architecture: (i) lexical, (ii) sparse, (iii) dense, (iv) late-interaction, and (v) re-ranking. Besides the included models, the BEIR benchmark is model-agnostic, and different model configurations can easily be incorporated into the benchmark in the future. (i) Lexical Retrieval: (a) BM25 [55] is a commonly used bag-of-words retrieval function based on token matching between two high-dimensional sparse vectors with TF-IDF token weights. We use Anserini [36] with the default Lucene parameters (k1=0.9 and b=0.4). We index the title (if available) and passage as separate fields for documents. In our leaderboard, we also tested Elasticsearch BM25 and Anserini + RM3 expansion, but found Anserini BM25 to perform the best. (ii) Sparse Retrieval: (a) DeepCT [11] uses a bert-base-uncased model trained on MS MARCO to learn term weights (term frequencies). It generates a pseudo-document in which keywords are repeated according to the learnt term frequencies. We use the original setup of Dai and Callan [11] in combination with BM25 with default Anserini parameters, which we empirically found to perform better than the tuned MS MARCO parameters. (b) SPARTA [79] computes similarity scores between non-contextualized query embeddings and contextualized document embeddings from BERT. These scores can be pre-computed for a given document, resulting in a 30k-dimensional sparse vector. As the original implementation is not publicly available, we re-implemented the approach and fine-tune a DistilBERT [56] model on MS MARCO. (iii) Dense Retrieval: Among the dense retrievers we evaluate (DPR, ANCE, TAS-B, and GenQ), GenQ generates synthetic queries on the target domain to adapt a dense retriever. Due to resource constraints, we cap the maximum number of target documents in each dataset at 100K. For retrieval, we continue to fine-tune the TAS-B model using in-batch negatives on the synthetic query-document pairs. Note that GenQ creates an independent model for each task. (iv) Late-Interaction: (a) ColBERT [32] encodes the query and the passage into bags of multiple contextualized token embeddings. The late interactions are aggregated by taking, for each query token, the maximum dot-product over all passage token embeddings, and summing these maxima. We use the ColBERT model as a dense retriever (end-to-end retrieval as defined in [32]): first, top-k candidates are retrieved using approximate nearest-neighbour search with faiss [29] (faiss depth = 100), and ColBERT then re-ranks them by computing the aggregated late interactions. We train a bert-base-uncased model with a maximum sequence length of 300 on the MS MARCO dataset for 300K steps. (v) Re-ranking model: (a) BM25 + CE [70] re-ranks the top-100 hits retrieved by a first-stage BM25 (Anserini) model. We evaluated 14 different cross-attentional re-ranking models that are publicly available on the HuggingFace model hub and found that a 6-layer MiniLM [70] cross-encoder with a hidden size of 384 offers the best performance on MS MARCO. The model was trained on MS MARCO using a knowledge-distillation setup with an ensemble of three teacher models (BERT-base, BERT-large, and ALBERT-large), following the setup of Hofstätter et al. [24]. In this section, we evaluate and analyze how retrieval models perform on the BEIR benchmark. Table 2 reports the results of all evaluated systems on the selected benchmark datasets. As a baseline, we compare the retrieval systems against BM25. Figure 3 shows on how many datasets each model performs better or worse than BM25. 1. In-domain performance is not a good indicator of out-of-domain generalization. We observe that BM25 heavily underperforms neural approaches by 7-18 points on in-domain MS MARCO. However, BEIR reveals it to be a strong baseline for generalization, generally outperforming many other, more complex approaches. This stresses the point that retrieval methods must be evaluated on a broad range of datasets. (Table 2: In-domain and zero-shot performances on the BEIR benchmark. All scores denote nDCG@10. The best score on a given dataset is marked in bold, and the second best is underlined. Corresponding Recall@100 performances can be found in Table 9. ‡ indicates in-domain performance.) 2. Although the term-weighting sparse models DeepCT and SPARTA perform well in-domain on MS MARCO, they completely fail to generalize, underperforming BM25 on nearly all datasets. In contrast, the document-expansion-based docT5query is able to add new relevant keywords to a document and performs strongly on the BEIR datasets. It outperforms BM25 on 11/18 datasets while providing competitive performance on the remaining datasets. 3. Dense retrieval models have issues with out-of-distribution data. Dense retrieval models (esp. ANCE and TAS-B), which map queries and documents independently to vector spaces, perform strongly on certain datasets, while on many other datasets they perform significantly worse than BM25.
For example, dense retrievers are observed to underperform on datasets with a large domain shift compared to what they were trained on, like BioASQ, or with a task shift, like Touché-2020. DPR, the only evaluated model not trained on MS MARCO, overall generalizes the worst on the benchmark. 4. Re-ranking and late-interaction models generalize well to out-of-distribution data. The cross-attentional re-ranking model (BM25+CE) performs the best and is able to outperform BM25 on almost all (16/18) datasets. It only fails on ArguAna and Touché-2020, two retrieval tasks that are extremely different from the MS MARCO training dataset. The late-interaction model ColBERT computes token embeddings independently for the query and document, and scores (query, document) pairs with a cross-attention-like MaxSim operation. It performs somewhat weaker than the cross-attentional re-ranking model, but is still able to outperform BM25 on 9/18 datasets. It appears that cross-attention and cross-attention-like operations are important for good out-of-distribution generalization. 5. TAS-B provides the best zero-shot generalization performance among its dense counterparts. It outperforms ANCE on 14/18 and DPR on 17/18 datasets, respectively. We speculate that the reason lies in TAS-B's strong training setup, which combines in-batch negatives with a Margin-MSE loss. This training objective (with strong ensemble teachers in a knowledge-distillation setup) yields strong generalization performance. 6. The TAS-B model prefers to retrieve shorter documents. TAS-B underperforms ANCE on two datasets: TREC-COVID by 17.3 points and Touché-2020 by 7.8 points. We observed that these models retrieve documents of vastly different lengths, as shown in Figure 4. On TREC-COVID, TAS-B retrieves documents with a median length of a mere 10 words, versus 160 words for ANCE. Similarly, on Touché-2020, the median is 14 words for TAS-B vs. 89 words for ANCE. As discussed in Appendix H, this preference for shorter or longer documents is due to the loss function used. Trade-off between performance and retrieval latency: The best out-of-distribution generalization, achieved by re-ranking the top-100 BM25 documents and by the late-interaction model, comes at the cost of high latency (> 350 ms); these systems are the slowest at inference. In contrast, dense retrievers are 20-30x faster (< 20 ms) than the re-ranking models and follow a low-latency pattern. On CPU, the sparse models dominate in terms of speed (20-25 ms). Lexical, re-ranking and dense methods have the smallest index sizes (< 3 GB) for storing 1M documents from DBPedia. SPARTA requires the second-largest index, as it stores a 30k-dimensional sparse vector per document, while ColBERT requires the largest index, as it stores multiple 128-dimensional dense vectors per document. Index sizes are especially relevant when corpora grow: ColBERT requires ~900 GB to store the BioASQ index (~15M documents), whereas BM25 only requires 18 GB.
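As an illustration of the late-interaction scoring discussed above, the following is a minimal sketch of a ColBERT-style MaxSim aggregation over token embeddings. The random tensors stand in for contextualized query and document token embeddings and are purely illustrative; the real model produces them with a trained BERT encoder.

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score.

    query_emb: [num_query_tokens, dim], doc_emb: [num_doc_tokens, dim].
    For each query token, take the maximum dot-product over all document
    tokens, then sum these maxima over the query tokens.
    """
    sim = query_emb @ doc_emb.T              # [num_query_tokens, num_doc_tokens]
    return sim.max(dim=1).values.sum()       # sum of per-query-token maxima

# Illustrative example with random "token embeddings" (dim = 128, as in ColBERT)
torch.manual_seed(0)
query = F.normalize(torch.randn(8, 128), dim=-1)
doc = F.normalize(torch.randn(300, 128), dim=-1)
print(maxsim_score(query, doc).item())
```

Because every document token embedding must be stored in the index, the memory footprint grows with document length and corpus size, which explains the large ColBERT index sizes reported above.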
Creating a perfectly unbiased evaluation dataset for retrieval is inherently complex and is subject to multiple biases induced by (i) the annotation guidelines, (ii) the annotation setup, and (iii) the human annotators. Further, it is impossible to manually annotate the relevance of all (query, document) pairs. Instead, existing retrieval methods are used to obtain a pool of candidate documents, which are then marked for their relevance; all other, unseen documents are assumed to be irrelevant. This is a source of selection bias [39]: a new retrieval system might retrieve vastly different results than the systems used for annotation, and these hits are automatically assumed to be irrelevant. Many BEIR datasets are found to be subject to a lexical bias, i.e., a lexical retrieval system like TF-IDF or BM25 was used to retrieve the candidates for annotation. For example, in BioASQ, candidates were retrieved for annotation via term matching with boosting tags [61]. The creation of Signal-1M (RT) involved retrieving tweets for a query with 8 techniques, 7 of which rely on lexical term-matching signals [59]. Such a lexical bias disfavours approaches that do not rely on lexical matching, like dense retrieval methods, as retrieved hits without lexical overlap are automatically assumed to be irrelevant, even though they might be relevant for a query. In order to study the impact of this particular type of bias, we conducted a study on the recent TREC-COVID dataset. TREC-COVID used a pooling method [38, 40] to reduce the impact of the aforementioned bias: the annotation set was constructed from the search results of the various systems participating in the challenge. Table 4 shows the Hole@10 rate [73] for the tested systems, i.e., how many of each system's top-10 hits have not been seen by annotators. The results reveal large differences between approaches: lexical approaches like BM25 and docT5query have rather low Hole@10 values of 6.4% and 2.8%, respectively, indicating that the annotation pool contained the top hits from lexical retrieval systems. In contrast, dense retrieval systems like ANCE and TAS-B have much higher Hole@10 values of 14.4% and 31.8%, indicating that a large fraction of the hits found by these systems have not been judged by annotators. Next, we manually annotated the missing judgements (holes) for all systems, following the original annotation guidelines. During annotation, we were unaware of which system had retrieved a given document, to avoid a preference bias. In total, we annotated 980 query-document pairs in TREC-COVID. We then re-computed nDCG@10 for all systems with these additional annotations. As shown in Table 4, we observe that lexical approaches improve only slightly, e.g., docT5query just from 0.713 to 0.714 after adding the missing relevance judgements. In contrast, for the dense retrieval system ANCE, the performance improves from 0.654 (slightly below BM25) to 0.735, which is 6.7 points above the BM25 performance. Similar improvements are observed for ColBERT (5.8 points). Even though many systems contributed to the TREC-COVID annotation pool, the annotation pool is still biased towards lexical approaches.
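To make the Hole@10 analysis above concrete, the following is a minimal sketch of how such a rate can be computed from run results and relevance judgements. The dictionary structures mirror the (results, qrels) format used for evaluation in this paper, and the toy inputs are illustrative only.

```python
def hole_at_k(results, qrels, k=10):
    """Fraction of top-k retrieved documents that have no relevance judgement at all.

    results: {query_id: {doc_id: score}} as produced by a retrieval system.
    qrels:   {query_id: {doc_id: relevance}} containing all judged documents
             (both relevant and non-relevant).
    """
    holes, total = 0, 0
    for qid, doc_scores in results.items():
        judged = qrels.get(qid, {})
        top_k = sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]
        holes += sum(1 for doc_id in top_k if doc_id not in judged)
        total += len(top_k)
    return holes / total if total else 0.0

# Toy example: 2 of the 3 top-ranked documents for "q1" were never judged
results = {"q1": {"d1": 3.2, "d7": 2.9, "d9": 2.5}}
qrels = {"q1": {"d1": 2, "d3": 0}}
print(f"Hole@3 = {hole_at_k(results, qrels, k=3):.2f}")  # 0.67
```

A high Hole@k for a system means that many of its top-ranked documents were never judged and are therefore scored as irrelevant by default, which is exactly the disadvantage observed for the dense retrievers on TREC-COVID.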
In this work, we presented BEIR: a heterogeneous benchmark for information retrieval. We provided a broad selection of target tasks ranging from narrow expert domains to open-domain datasets, including nine different retrieval tasks spanning 18 diverse datasets. By open-sourcing BEIR, with a standardized data format and easy-to-adapt code examples for many different retrieval strategies, we take an important step towards a unified benchmark to evaluate the zero-shot capabilities of retrieval systems. It hopefully steers innovation towards more robust retrieval systems and towards new insights into which retrieval architectures perform well across tasks and domains. We studied the effectiveness of ten different retrieval models and demonstrated that in-domain performance cannot predict how well an approach will generalize in a zero-shot setup. Many approaches that outperform BM25 in an in-domain evaluation perform poorly on the BEIR datasets. Cross-attentional re-ranking, the late-interaction model ColBERT, and the document-expansion technique docT5query performed well overall across the evaluated tasks. Our study on annotation selection bias highlights the challenge of evaluating new models on existing datasets: even though TREC-COVID is based on the predictions of many systems, contributed by a diverse set of teams, we found largely different Hole@10 rates for the tested systems, negatively affecting non-lexical approaches. Better datasets that use diverse pooling strategies are needed for a fair evaluation of retrieval approaches. By integrating a large number of diverse retrieval systems into BEIR, creating such diverse pools becomes significantly simpler. We provide the following additional sections with details and information that complement the discussion in the main paper:
• Limitations of the BEIR benchmark in Appendix B.
• Training and in-domain evaluation task details in Appendix C.
• Description of all zero-shot tasks and datasets used in BEIR in Appendix D.
• Details of dataset licenses in Appendix E.
• Overview of the weighted Jaccard similarity metric in Appendix F.
• Overview of the capped recall at k metric in Appendix G.
• Length preference of dense retrieval systems in Appendix H.
Even though we cover a wide range of tasks and domains in BEIR, no benchmark is perfect, and ours has its limitations. Making these explicit is critical to understanding the results on the benchmark and, for future work, to improving upon it. 1. Although we aim for a diverse retrieval evaluation benchmark, due to the limited availability of multilingual retrieval datasets, all datasets covered in the BEIR benchmark are currently English. It is worthwhile to add more multilingual datasets [2, 77] (in consideration of the selection criteria) as a next step for the benchmark. Future work could include multi- and cross-lingual tasks and models. 2. Most of our tasks have average document lengths of up to a few hundred words, roughly equivalent to a few paragraphs. Including tasks that require the retrieval of longer documents would be highly relevant. However, as transformer-based approaches often have a length limit of 512 word pieces, a fundamentally different setup would be required to compare approaches. 3. Multi-factor Search: Until now, we focused on pure textual search in BEIR. In many real-world applications, further signals are used to estimate the relevancy of documents, such as PageRank [49], recency [16], authority score [33], or user interactions such as click-through rates [51]. The integration of such signals into the tested approaches is often not straightforward and is an interesting direction for research. 4. Retrieval can often be performed over multiple fields. For example, for scientific publications we have the title, the abstract, the document body, the author list, and the journal name. So far, we have focused only on datasets that have one or two fields. 5. In our benchmark, we focus on evaluating models that are able to generalize well over a broad range of retrieval tasks.
Naturally, in the real world, specialized models are available for some tasks or domains, and these can easily outperform generic models because they focus on, and perform well on, a single task, say question answering. Such task-specific models do not necessarily need to generalize across all diverse tasks. We use the MS MARCO Passage Ranking dataset [45], which contains 8.8M passages and an official training set of 532,761 query-passage pairs, for fine-tuning the majority of the retrievers. The dataset contains queries from Bing search logs, with one text passage from various web sources annotated as relevant. We find the dataset useful for training, as it covers a wide variety of topics and provides the highest number of training pairs. It has been extensively explored and used for fine-tuning dense retrievers in recent work [46, 17, 15]. We use the official MS MARCO development set for our in-domain evaluation, which has been widely used in prior research [46, 17, 15]. It has 6,980 queries. Most of the queries have only one document judged relevant; the labels are binary. Following the selection criteria mentioned in Section 3, we include 18 evaluation datasets that span 9 heterogeneous tasks. Each dataset mentioned below contains a document corpus, denoted by T, and test queries for evaluation, denoted by Q. We additionally provide dataset website links in Table 5 and intuitive examples in Table 8. We now describe each task and dataset included in the BEIR benchmark: Bio-medical information retrieval is the task of searching for relevant scientific documents, such as research papers or blogs, for a given scientific query in the biomedical domain [28]. We consider a scientific query as input and retrieve bio-medical documents as output. TREC-COVID [65] is an ad-hoc search challenge based on the CORD-19 dataset containing scientific articles related to the COVID-19 pandemic [69]. We include the July 16, 2020 version of the CORD-19 dataset as corpus T and use the final cumulative judgements with the query descriptions from the original task as queries Q. NFCorpus [7] contains natural language queries harvested from NutritionFacts (NF). We use the original splits with all content sources from NF (videos, blogs, and Q&A posts) as queries Q and the annotated medical documents from PubMed as corpus T. BioASQ [61] Task 8b is a biomedical semantic question answering challenge. We use the original train and test splits provided in Task 8b as queries Q and collect around 15M articles from PubMed provided in Task 8a as our corpus T. Retrieval in open-domain question answering [8] is the task of retrieving the correct answer for a question without a predefined location for the answer. In open-domain tasks, models must retrieve over an entire knowledge source (such as Wikipedia). We consider the question as input and the passage containing the answer as output. Natural Questions [34] contains Google search queries and documents with paragraphs and answer spans within Wikipedia articles. We did not use the NQ version from ReQA [1], as it focuses on queries with short answers. Instead, we parsed the HTML of the original NQ dataset and include more complex development queries that often require a longer passage as answer than in ReQA. We filtered out queries without an answer, with a table as the answer, or with conflicting Wikipedia pages. We retain 2,681,468 passages as our corpus T and 3,452 test queries Q from the original dataset.
HotpotQA [76] contains multi-hop questions that require reasoning over multiple paragraphs to find the correct answer. We include the original full-wiki task setting, utilizing the processed Wikipedia passages as corpus T. We hold out 5,447 randomly sampled queries from the training set as our dev split and use the original (paper) task's development split as our test split Q. FiQA-2018 [44] Task 2 consists of opinion-based question answering. We include financial data by crawling StackExchange posts under the Investment topic from 2009-2017 as our corpus T. We randomly sample 500 and 648 queries Q from the original training split as our dev and test splits, respectively. Twitter is a popular micro-blogging website on which people post real-time messages (i.e., tweets) about their opinions on a variety of topics and discuss current issues. We consider a news headline as input and retrieve relevant tweets as output. The Signal-1M Related Tweets [59] task retrieves relevant tweets for a given news article title. The Related Tweets task provides news articles from the Signal-1M dataset [10], which we use as queries Q. We construct our Twitter corpus T by manually scraping tweets from the tweet-ids provided in the relevance judgements, using the Python package Tweepy (https://www.tweepy.org). The TREC-NEWS [58] 2019 track involves background linking: given a news headline, we retrieve relevant news articles that provide important context or background information. We include the original shared-task query descriptions (single sentences) as our test queries Q and the TREC Washington Post collection as our corpus T. For simplicity, we convert the original exponential-gain relevance judgements to linear labels. Robust04 [64] provides a robust dataset focusing on the evaluation of poorly performing topics. We include the original shared-task query descriptions (single sentences) as our test queries Q and the complete documents from TREC disks 4 and 5 as our corpus T. Argument retrieval is the task of ranking argumentative texts from a collection of focused arguments (output) in order of their relevance to a textual query (input) on different topics. The ArguAna Counterargs Corpus [67] involves the task of retrieving the best counterargument to an argument. We include pairs of arguments and counterarguments scraped from the online debate portal as corpus T. We consider the arguments present in the original test split as our queries Q. Touché-2020 [6] Task 1 is a conversational argument retrieval task. We use the conclusion as title and the premise of the arguments present in args.me [66] as corpus T. We include the shared Touché-2020 task data as our test queries Q. The original relevance judgements (qrels) file also included negative judgements (-2) for non-arguments present within the corpus; for simplicity, we set these to zero. Duplicate question retrieval is the task of identifying duplicate questions asked in community question answering (cQA) forums. A given query is the input and the duplicate questions are the output. CQADupStack [25] is a popular dataset for research in community question answering (cQA). The corpus T comprises queries from 12 different StackExchange subforums: Android, English, Gaming, Gis, Mathematica, Physics, Programmers, Stats, Tex, Unix, Webmasters and Wordpress. We utilize the original test split for our queries Q; the task involves retrieving the duplicate query (title + body) for an input query title. We evaluate each StackExchange subforum separately and report the overall mean score across subforums in BEIR.
The Quora Duplicate Questions dataset identifies whether two questions are duplicates. Quora originally released the dataset with 404,290 question pairs. We add transitive closures to the original dataset. Further, we split it into train, dev, and test sets with a ratio of about 85%, 5% and 10% of the original pairs. We remove all overlaps between the splits and ensure that a question in one split does not appear in any other split, to mitigate the transductive classification problem [27]. We obtain 522,931 unique questions as our corpus T, and 5,000 dev and 10,000 test queries Q, respectively. Entity retrieval involves retrieving the unique Wikipedia pages for entities mentioned in a query. This is crucial for tasks involving entity linking (EL). The entity-bearing query is the input, and the entity title and abstract are retrieved as output. DBPedia-Entity-v2 [21] is an established entity retrieval dataset. It contains a set of heterogeneous entity-bearing queries Q comprising named entities, IR-style keywords, and natural language queries. The task involves retrieving entities from the English part of the DBpedia corpus T from October 2015. We randomly sample 67 queries from the test split as our dev set. Citations are a key signal of relatedness between scientific papers [9]. In this task, the model attempts to retrieve cited papers (output) for a given paper title as input. SCIDOCS [9] contains a held-out pool of about 30K scientific papers as corpus T. We consider direct citations (1 out of the 7 tasks mentioned in the original paper) as the task best suited for retrieval evaluation in BEIR. The task includes 1K papers as queries Q, each with 5 relevant papers and 25 (randomly selected) uncited papers. Fact checking verifies a claim against a large collection of evidence [60]. The task requires knowledge about the claim and reasoning over multiple documents. We consider a sentence-level claim as input and the relevant document passage verifying the claim as output. FEVER [60], the Fact Extraction and VERification dataset, was collected to facilitate automatic fact checking. We utilize the original paper splits as queries Q and retrieve evidence from the pre-processed Wikipedia abstracts (June 2017 dump) as our corpus T. Climate-FEVER [14] is a dataset for the verification of real-world climate claims. We include the original dataset claims as queries Q and retrieve evidence from the same FEVER Wikipedia corpus T. We manually added a few (25) Wikipedia articles that were missing from our corpus but present in the relevance judgements. SciFact [68] verifies scientific claims using evidence from the research literature containing scientific paper abstracts. We use the original publicly available dev split from the task, containing 300 queries, as our test queries Q, and include all documents from the original dataset as our corpus T. The authors of 4 out of the 19 datasets in the BEIR benchmark (NFCorpus, FiQA-2018, Quora, Climate-FEVER) do not report a dataset license in the paper or in a repository; we overview the licenses of the remaining datasets. The weighted Jaccard similarity J(S, T) [26] intuitively measures the overlap of the unique words present in both datasets. More formally, the normalized frequency of a unique word k in a dataset is calculated as the frequency of word k divided by the sum of the frequencies of all words in that dataset. S_k denotes the normalized frequency of word k in the source dataset S, and T_k the normalized frequency in the target dataset T.
The weighted Jaccard similarity between S and T is defined as:

J(S, T) = \frac{\sum_{k} \min(S_k, T_k)}{\sum_{k} \max(S_k, T_k)},

where the sum is over all unique words k present in datasets S and T. Recall at k is calculated as the fraction of the relevant documents that are successfully retrieved within the top k extracted documents. More formally, the R@k score is calculated as:

R@k = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{|A_i \cap \hat{A}_i^{k}|}{|A_i|},

where Q is the set of queries, A_i is the set of relevant documents for the i-th query, and \hat{A}_i^{k} denotes the top k documents of the scored list \hat{A}_i provided by the model. However, measuring recall can be counterintuitive if a high number of relevant documents (> k) is present in a dataset. For example, consider a hypothetical dataset with 500 relevant documents for a query. Even a perfect ranking that fills all top-100 positions with relevant documents would produce a maximum R@100 score of 0.2, which is quite low and unintuitive. To avoid this, we cap the recall score (R_cap@k) at k for datasets where the number of relevant documents for a query is greater than k. It is defined as:

R\_cap@k = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{|A_i \cap \hat{A}_i^{k}|}{\min(k, |A_i|)},

where the only difference lies in the denominator: we compute the minimum of k and |A_i| instead of the |A_i| used in the original recall. As we show in Figure 4, TAS-B prefers to retrieve shorter documents, while ANCE, in comparison, retrieves longer documents. The difference is especially extreme for the TREC-COVID dataset: TAS-B retrieves many top-hit documents containing only a title and an empty abstract, while ANCE retrieves top-hit documents with a non-empty abstract. Identifying the source of this contrasting behaviour is difficult, as TAS-B and ANCE use different models (DistilBERT vs. RoBERTa-base), different loss functions (InfoNCE [62] vs. Margin-MSE [24] with in-batch negatives), and different hard-negative mining strategies. Hence, we decided to harmonize the training setup and to alter it in just one aspect: the similarity function. Dense models require a similarity function to retrieve relevant documents for a given query within an embedding space. This similarity function is also used when training dense models with the InfoNCE [62] loss, using n in-batch negatives for each query q and a scaling factor τ:

\mathcal{L} = -\log \frac{\exp(\mathrm{sim}(q, d^{+})/\tau)}{\exp(\mathrm{sim}(q, d^{+})/\tau) + \sum_{j=1}^{n} \exp(\mathrm{sim}(q, d_{j})/\tau)},

where d^{+} denotes the relevant (positive) document for query q and d_1, ..., d_n denote the in-batch negative documents. Commonly used similarity functions sim(q, d) are cosine-similarity and dot-product. We trained two distilbert-base-uncased models with an identical training setup on MS MARCO (identical training parameters) and only changed the similarity function from cosine-similarity to dot-product. As shown in Table 10, we observe significant performance differences on some BEIR datasets. For TREC-COVID, the dot-product model achieves the biggest improvement with 15.3 points, while on a majority of the other datasets it performs worse than the cosine-similarity model. We observe that these (nearly) identical models retrieve documents with vastly different lengths, as shown in the violin plots in Table 10. For all datasets, we find the cosine-similarity model to prefer shorter documents over longer ones. This is especially severe for TREC-COVID: a large fraction of the scientific papers (approx. 42k out of 171k) consist only of a publication title without an abstract. The cosine-similarity model prefers retrieving these documents. In contrast, the dot-product model primarily retrieves longer documents, i.e., publications with an abstract. Cosine-similarity uses vectors of unit length and thereby has no notion of the encoded text length. In contrast, for dot-product, longer documents result in vectors with higher magnitudes, which can yield higher similarity scores for a query.
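The following is a minimal sketch of the experiment described above: an InfoNCE loss with in-batch negatives in which only the similarity function (cosine vs. dot-product) is switched. The random "embeddings", batch size, and temperature are illustrative assumptions, not the exact training configuration used for Table 10, where the embeddings would come from a distilbert-base-uncased encoder.

```python
import torch
import torch.nn.functional as F

def infonce_loss(q_emb, d_emb, sim="dot", tau=1.0):
    """InfoNCE with in-batch negatives.

    q_emb, d_emb: [batch, dim]; d_emb[i] is the positive document for q_emb[i],
    all other documents in the batch act as negatives.
    sim: "dot" keeps vector magnitudes (length-aware),
         "cos" normalizes to unit length (length-agnostic); a smaller
         temperature (e.g. tau=0.05) is then commonly used.
    """
    if sim == "cos":
        q_emb = F.normalize(q_emb, dim=-1)
        d_emb = F.normalize(d_emb, dim=-1)
    scores = q_emb @ d_emb.T / tau              # [batch, batch] similarity matrix
    labels = torch.arange(q_emb.size(0))        # positives lie on the diagonal
    return F.cross_entropy(scores, labels)

# Illustrative usage with random "embeddings" (in practice: encoder outputs)
torch.manual_seed(0)
queries, docs = torch.randn(8, 768), torch.randn(8, 768)
print(infonce_loss(queries, docs, sim="cos", tau=0.05).item())
print(infonce_loss(queries, docs, sim="dot").item())
```

The single `sim` switch is the only difference between the two trained models compared in Table 10, which isolates the effect of the similarity function on the preferred document length.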
Further, as we observe in Figure 5, relevance judgement scores are not uniformly distributed over document lengths: in some datasets, longer documents are annotated with higher relevancy scores, while in others, shorter documents are. This can be due either to the annotation process, e.g., the candidate selection method prefers short or long documents, or to the task itself, where shorter or longer documents may be more relevant to the user's information need. Hence, it can be more advantageous to train a model with either cosine-similarity or dot-product, depending on the nature and needs of the specific task.
Table 8 (excerpt): example queries and relevant documents from the BEIR datasets.
• Query: "Senate launches bill to remove immunity for websites hosting illegal content, spurred by Backpage.com" → Document: "The legislation, along with a similar bill in the House, sets the stage for a battle between Congress and some of the Internet's most powerful players, including Google and various free-speech advocates, who believe that Congress shouldn't regulate Web content or try to force websites to police themselves more rigorously..."
• Query: "What were the causes for the Islamic Revolution relative to relations with the U.S.?" → Document: "BFN [Editorial: "Sow the Wind and Reap the Whirlwind"] Yesterday marked the 14th anniversary of severing of diplomatic relations between the Islamic Republic and the United States of America. Several occasions arose in the last decade and a half for improving Irano-American relations..."
• Query: "Should the government allow illegal immigrants to become citizens?" → Document (title: "America should support blanket amnesty for illegal immigrants."): "Undocumented workers do not receive full Social Security benefits because they are not United States citizens " nor should they be until they seek citizenship legally. Illegal immigrants are legally obligated to pay taxes..."
• Query: "Command to display first few and last few lines of a file" → Document (title: "Combing head and tail in a single call via pipe"): "On a regular basis, I am piping the output of some program to either 'head' or 'tail'. Now, suppose that I want to see the first AND last 10 lines of piped output, such that I could do something like ./lotsofoutput | headtail..."
• Query: "How long does it take to methamphetamine out of your blood?" → Document: "How long does it take the body to get rid of methamphetamine?"
• DBPedia — Query: "Paul Auster novels" → Document (title: "The New York Trilogy"): "The New York Trilogy is a series of novels by Paul Auster. Originally published sequentially as City of Glass (1985), Ghosts (1986) and The Locked Room (1986), it has since been collected into a single volume."
• Document (title: "Application of CFD in building performance simulation for the outdoor environment: an overview"): "This paper provides an overview of the application of CFD in building performance simulation for the outdoor environment, focused on four topics..."
• Climate-FEVER — Query: "Sea level rise is now increasing faster than predicted due to unexpectedly rapid ice melting." → Document (title: "Sea level rise"): "A sea level rise is an increase in the volume of water in the world 's oceans, resulting in an increase in global mean sea level. The rise is usually attributed to global climate change by thermal expansion of the water in the oceans and by melting of Ice sheets and glaciers..."
Table 9: In-domain and zero-shot retrieval performance on BEIR datasets. Scores denote Recall@100.
The best retrieval performance on a given dataset is marked in bold, and the second-best performance is underlined. ‡ indicates in-domain retrieval performance. For datasets with a large number of relevant documents per query, the capped Recall@100 score (Appendix G) is shown.
ReQA: An Evaluation for End-to-End Answer Retrieval Models
XOR QA: Cross-lingual Open-Retrieval Question Answering
Modeling of the question answering task in the yodaqa system
Semantic Parsing on Freebase from Question-Answer Pairs
Bridging the lexical chasm: statistical approaches to answer-finding
Overview of Touché 2020: Argument Retrieval
A full-text learning to rank dataset for medical information retrieval
Reading Wikipedia to Answer Open-Domain Questions
SPECTER: Document-level Representation Learning using Citation-informed Transformers
What do a Million News Articles Look like?
Context-Aware Term Weighting For First Stage Passage Retrieval
BERT: Pre-training of deep bidirectional transformers for language understanding
Billion-scale similarity search with GPUs
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Dense Passage Retrieval for Open-Domain Question Answering
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
Authoritative Sources in a Hyperlinked Environment
Natural Questions: a Benchmark for Question Answering Research
Cicero Nogueira dos Santos, Ramesh Nallapati, Zhiheng Huang, and Bing Xiang. 2020. Embedding-based Zero-shot Retrieval through Query Generation
Toward reproducible baselines: The open-source IR reproducibility challenge
Pretrained Transformers for Text Ranking: BERT and Beyond
Fairness in Information Retrieval
On Biases in Information retrieval models and evaluation
The Curious Incidence of Bias Corrections in the Pool
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Sparse, Dense, and Attentional Representations for Text Retrieval
Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation
WWW'18 Open Challenge: Financial Opinion Mining and Question Answering
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
Passage Re-ranking with BERT
From doc2query to docTTTTTquery. Online preprint
Document Expansion by Query Prediction
The PageRank Citation Ranking: Bringing Order to the Web
Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. 2020. KILT: a Benchmark for Knowledge Intensive Language Tasks
How Does Clickthrough Data Reflect Retrieval Quality?
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes
The Probabilistic Relevance Framework: BM25 and Beyond
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index
TREC 2019 News Track Overview. In TREC.
A Data Collection for Evaluating the Retrieval of Related Tweets to News Articles
FEVER: a Large-scale Dataset for Fact Extraction and VERification
An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition
Representation Learning with Contrastive Predictive Coding
Pytrec_eval: An Extremely Fast Python Interface to trec_eval
Overview of the TREC 2004 Robust Retrieval Track
TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection
Building an Argument Search Engine for the Web
Retrieval of the Best Counterargument without Prior Topic Knowledge
Fact or Fiction: Verifying Scientific Claims
MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
A theoretical analysis of NDCG ranking measures
Transformers: State-of-the-Art Natural Language Processing
Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
Anserini: Enabling the Use of Lucene for Information Retrieval Research
Multilingual Universal Sentence Encoder for Semantic Retrieval
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval
Multi-factor duplicate question detection in stack overflow
SPARTA: Efficient Open-Domain Question Answering via Sparse Transformer Matching Retrieval
Table 5: Original dataset website (link) for all datasets present in BEIR.
Table 6: Public model checkpoints for each model.
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes] No supplemental material is attached to this submission; further supplemental material can be found in our repository mentioned in the URL.
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A] The used datasets provide a specific dataset license, which we follow.
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No] We re-use existing datasets, most of which are freely available. Most datasets are from less sensitive sources, like Wikipedia or scientific publications, where we do not expect personally identifiable information. Checking for offensive content in more than 50 million documents is difficult, and removing it would alter the underlying datasets.
5. If you used crowdsourcing or conducted research with human subjects...