title: Zero-Shot Dense Retrieval with Momentum Adversarial Domain Invariant Representations
authors: Xin, Ji; Xiong, Chenyan; Srinivasan, Ashwin; Sharma, Ankita; Jose, Damien; Bennett, Paul N.
date: 2021-10-14

Dense retrieval (DR) methods conduct text retrieval by first encoding texts in the embedding space and then matching them by nearest neighbor search. This requires strong locality properties from the representation space, i.e., the close allocation of each small group of relevant texts, which are hard to generalize to domains without sufficient training data. In this paper, we aim to improve the generalization ability of DR models from source training domains with rich supervision signals to target domains without any relevance labels, in the zero-shot setting. To achieve that, we propose Momentum adversarial Domain Invariant Representation learning (MoDIR), which introduces a momentum method in the DR training process to train a domain classifier distinguishing source versus target, and then adversarially updates the DR encoder to learn domain invariant representations. Our experiments show that MoDIR robustly outperforms its baselines on 10+ ranking datasets from the BEIR benchmark in the zero-shot setup, with more than 10% relative gains on datasets with enough sensitivity for DR models' evaluation. Source code of this paper will be released.

Rather than matching texts in the bag-of-words space, Dense Retrieval (DR) methods first encode texts into a dense embedding space (Lee et al., 2019b; Xiong et al., 2021) and then conduct text retrieval using efficient nearest neighbor search (Chen et al., 2018; Guo et al., 2020; Johnson et al., 2021).
With pre-trained language models and dedicated fine-tuning techniques, the learned representation space has significantly advanced the first stage retrieval accuracy of many language systems, including web search (Xiong et al., 2021), grounded generation, open domain question answering (Izacard & Grave, 2020), etc.

Purely using the learned embedding space for retrieval has raised concerns about generalization ability, especially in scenarios without the luxury of dedicated supervision signals. Many have observed diminishing advantages of DR models on various datasets if they are not fine-tuned with task-specific labels, i.e., in the zero-shot setup (Thakur et al., 2021). However, in many scenarios outside commercial web search, zero-shot is the norm. Obtaining training labels is difficult and sometimes infeasible, for example, in the medical domain, where annotation requires strong expertise or is even prohibited because of privacy constraints. The lack of zero-shot ability hinders the democratization of advancements in dense retrieval from data-rich domains to everywhere else. Many equally if not more important real-world search scenarios still rely on unsupervised exact match methods like BM25, which were developed decades ago (Robertson & Jones, 1976).

Even within the search system, the generalization ability of first stage DR models is notably worse than that of subsequent reranking models (Thakur et al., 2021). Reranking models, similar to many classification models, only require a decision boundary between relevant and irrelevant query-document pairs (q-d pairs) in the representation space. In comparison, DR needs good local alignments in the entire space to support nearest neighbor matching, which is much harder for representation learning. In Figure 1, we use t-SNE (van der Maaten & Hinton, 2008) to illustrate this challenge.
We show learned representations of a BERT-based reranker (Nogueira & Cho, 2019) and a BERT-based dense retriever (Xiong et al., 2021), in zero-shot transfer from the web domain (Bajaj et al., 2016) to the medical domain (Voorhees et al., 2021). The representation space learned for reranking yields two manifolds with a clear decision boundary; data points in the target domain naturally cluster with their corresponding classes from the source domain, leading to good generalization. In comparison, the representation space learned for DR is more scattered. Target domain data points are grouped separately from those of the source domain; it is nearly impossible for the learned nearest neighbor locality to generalize from the source to the isolated target domain region.

Figure 1: t-SNE plots of the embedding space of a BERT reranker for q-d pairs and of the ANCE dense retriever for queries/documents. All models are trained on web search as the source domain and applied to medical search as the target domain. Legend entries: Source Positive, Source Negative, Target Positive, Target Negative, Source Query, Target Query.

In this paper, we present Momentum Adversarial Domain Invariant Representations learning (MoDIR), to improve the generalization ability of zero-shot dense retrieval (ZeroDR). We first introduce an auxiliary domain classifier that is trained to discriminate source embeddings from target ones. Then the DR encoder is not only updated to encode queries and relevant documents together in the source domain, but also trained adversarially to confuse the domain classifier and to push for a more domain invariant embedding space. To ensure stable and efficient adversarial learning, we propose a momentum method that trains the domain classifier with a momentum queue of embeddings saved from previous iterations.

Our experiments evaluate the generalization ability of dense retrieval with MoDIR using 15 retrieval tasks from the BEIR benchmark (Thakur et al., 2021).
On these retrieval tasks from various domains including biomedical, finance, scientific, etc., MoDIR significantly improves the zero-shot accuracy of ANCE (Xiong et al., 2021), a recent state-of-the-art DR model trained with web search data. Without using any target domain training labels, the improvements from MoDIR are stable, robust, and also significant on tasks where evaluation labels have sufficient coverage for DR (Thakur et al., 2021). Our studies also verify the necessity of our momentum approach, without which the domain classifier fails to capture the domain gaps and the adversarial training does not learn domain invariant representations, resulting in little improvement in ZeroDR.

Our further analyses reveal several interesting behaviors of MoDIR and its learned embedding space. During the adversarial training process, the target domain embeddings are gradually pushed towards the source domain and eventually absorbed as a subgroup of the source. In the learned representation space, our manual examinations find various cases where a target domain query is located close to source queries with similar information needs. This indicates that ZeroDR's generalization ability comes from the combination of information overlaps between the source and target domains and MoDIR's ability to identify the right correspondence between them.

The rest of this paper is organized as follows: the next section presents how MoDIR learns domain invariant representations for ZeroDR; Section 3 and Section 4 discuss our experimental settings and evaluation results; we recap related work in Section 5 and conclude in Section 6.

In this work, we aim to improve the zero-shot ability of DR in the unsupervised domain adaptation (UDA) setting (Long et al., 2016): given a source domain with sufficient training signals, the goal is to transfer the DR model to a target domain, with access to its data but not to any labels.
This is the common case when applying DR in real-world scenarios: in target domains (e.g., medical), example queries and documents are available but relevance annotations require domain expertise, while in the source domain (e.g., web search), training signals are available at large scale (Ma et al., 2020; Thakur et al., 2021).

The standard design of DR is to use a dual-encoder model (Lee et al., 2019b), where an encoder $g$ takes as input a query/document and encodes it into a dense vector, and then the relevance score of a query-document pair $x = (q, d)$ is computed using a simple similarity function:

$$\mathrm{rel}(x; \theta_g) = \mathrm{sim}\big(g(q; \theta_g),\, g(d; \theta_g)\big),$$

where $\theta_g$ is the collection of parameters of $g$ and $\mathrm{sim}$ is a similarity function that supports efficient nearest neighbor search (Johnson et al., 2021), for example, cosine similarity or dot product.

The training of DR uses labeled q-d pairs in the source domain, $x^s = (q^s, d^s)$. With a relevant q-d pair denoted $x^{s+}$ and an irrelevant pair denoted $x^{s-}$, the DR encoder $g$ is trained to minimize the ranking loss $L_R$:

$$\theta_g^* = \arg\min_{\theta_g} \sum_{x^{s+}} \sum_{x^{s-}} L_R\big(\mathrm{rel}(x^{s+}; \theta_g),\, \mathrm{rel}(x^{s-}; \theta_g)\big),$$

where $L_R$ is a standard ranking loss function. In this paper, without loss of generality, we inherit the settings of ANCE (Xiong et al., 2021), which samples negatives $x^{s-}$ using the DR model being trained. Other components are also kept the same as in ANCE: $g$ is fine-tuned from RoBERTa-base (Liu et al., 2019) and outputs the embedding of the last layer's [CLS] token, $L_R$ is the Negative Log Likelihood (NLL) loss, and $\mathrm{sim}$ is the dot product.

To capture the source and target domain differences and enable adversarial learning for domain invariance, MoDIR introduces a domain classifier $f$ on top of the DR model's query/document embeddings to predict their probability of being source or target. We simply use a linear layer on top of a data embedding $e$ as the model architecture of $f$:

$$f(e) = \mathrm{softmax}(W_f\, e).$$

The linear layer is often sufficient to distinguish both domains in the high-dimensional representation space. The challenge is mainly on the training side.
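As a rough illustration of the components described so far, the following NumPy sketch scores a q-d pair with a dot product, computes an NLL ranking loss for one positive against sampled negatives, and applies a linear domain classifier. It is only a sketch under stated assumptions: the encoder g is replaced by toy vectors, the NLL is taken over a softmax of the positive and negative scores, the classifier is assumed to be a softmax over a hypothetical 2 x dim weight matrix W_f, and all function names are illustrative rather than taken from any released code.

```python
import numpy as np

def score(q_emb, d_emb):
    """Dot-product relevance score between query and document embeddings."""
    return np.sum(q_emb * d_emb, axis=-1)

def nll_ranking_loss(pos_score, neg_scores):
    """NLL of the positive pair under a softmax over positive and negative scores.

    Assumption: one positive score (float) and a 1-D array of negative scores.
    """
    logits = np.concatenate([[pos_score], neg_scores])
    logits = logits - logits.max()               # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]

def domain_probs(W_f, e):
    """Linear domain classifier; returns [p(source), p(target)] for embedding e.

    Assumption: W_f has shape (2, dim) and the output is a softmax.
    """
    z = W_f @ e
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()
```

With equal positive and negative scores the ranking loss equals log 2 for a single negative, and an all-zero W_f predicts 50%-50% for any embedding, which is the state the adversarial training later pushes towards.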
As illustrated in Figure 1, DR's representation space focuses more on locality than on forming manifolds, and therefore it is more difficult to learn the domain boundary in this case. Learning $f$ using a large number of data points enumerated after each DR model update is costly, while updating $f$ per data batch results in an unstable estimation of the domain boundary given the scattered representation space.

As shown in Figure 2, we introduce momentum learning to balance the efficiency and robustness of the domain classifier learning. We maintain a momentum queue $Q$ that includes embeddings from multiple past batches as the training data for $f$. Specifically, for each source domain training pair $x^s$, we sample a q-d pair $x^t$ from the target domain, and add the embeddings of $x^s$ and $x^t$ to $Q$. The momentum queue $Q$ at step $k$ includes embeddings from source and target for all recent $n$ batches:

$$Q_k = \{\, e_{q^s},\, e_{d^s},\, e_{q^t},\, e_{d^t} \mid (x^s, x^t) \in B_{k-n+1:k} \,\},$$

where $B_{k-n+1:k}$ are the data from the past $n$ batches, with $n$ as the momentum step. We ensure a 1:1 ratio between source and target data and also a 1:1 ratio between positive and negative source data. Note that $e$ is the detached embedding, for example, of the query $q^s$:

$$e_{q^s} = \Phi\big(g(q^s; \theta_g)\big),$$

where $\Phi$ is the stop-gradient operator, i.e., gradients of $e_{q^s}$ will not be back-propagated to $\theta_g$. This enables efficient momentum learning since only $W_f$ requires gradients in the process.

At each iteration, the domain classifier is updated by minimizing the following discrimination loss:

$$W_f^* = \arg\min_{W_f} \sum_{e \in Q_k} L_D\big(f(e),\, y_e\big),$$

where $y_e$ is the domain label (source or target) of $e$ and $L_D$ is a standard classification loss. In this way, the domain classifier is trained with signals from multiple batches, leading to a faster and more robust estimation of the domain boundary.

With an estimated domain boundary from the domain classifier $f$, MoDIR adversarially trains the encoder $g$ to generate domain invariant representations that $f$ cannot distinguish, by minimizing an adversarial loss $L_M$.
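The momentum queue and the classifier update described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the class and function names are hypothetical, the detached embeddings are stood in for by copied arrays of random toy data, the classification loss is assumed to be cross-entropy over a softmax, and the 1:1 balancing of positive and negative source pairs is not modeled.

```python
from collections import deque
import numpy as np

class MomentumQueue:
    """Keeps the embeddings of the most recent n batches (hypothetical helper)."""

    def __init__(self, n_batches):
        # deque with maxlen drops the oldest batch automatically
        self.batches = deque(maxlen=n_batches)

    def push(self, src_emb, tgt_emb):
        # Copies stand in for detached embeddings: no gradient reaches the encoder.
        self.batches.append((np.array(src_emb, copy=True),
                             np.array(tgt_emb, copy=True)))

    def data(self):
        src = np.concatenate([s for s, _ in self.batches])
        tgt = np.concatenate([t for _, t in self.batches])
        X = np.concatenate([src, tgt])
        y = np.concatenate([np.zeros(len(src), dtype=int),   # 0 = source
                            np.ones(len(tgt), dtype=int)])   # 1 = target
        return X, y

def classifier_step(W_f, X, y, lr=0.1):
    """One gradient step on softmax cross-entropy; only W_f is updated."""
    logits = X @ W_f.T
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(probs[np.arange(len(y)), y]))
    probs[np.arange(len(y)), y] -= 1.0           # gradient of loss w.r.t. logits
    grad = probs.T @ X / len(y)
    return W_f - lr * grad, loss

# Toy usage: momentum step n = 2, three batches of 4 source and 4 target embeddings.
rng = np.random.default_rng(0)
queue = MomentumQueue(n_batches=2)
for _ in range(3):
    queue.push(rng.normal(+1.0, 1.0, (4, 8)), rng.normal(-1.0, 1.0, (4, 8)))
X, y = queue.data()
W_f = np.zeros((2, 8))
W_f, loss_before = classifier_step(W_f, X, y)
_, loss_after = classifier_step(W_f, X, y)
```

Because the queue holds several batches, each classifier step sees more data than a single batch would provide, while the cost stays low since the stored embeddings are not recomputed.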
For $L_M$, we choose the widely used confusion loss (Tzeng et al., 2017):

$$L_M(x) = -\sum_{u \in \{q, d\}} \sum_{c \in \{s, t\}} \tfrac{1}{2} \log f\big(g(u; \theta_g)\big)_c,$$

where $x \in \{x^s, x^t\}$ is a q-d pair $(q, d)$ from either source or target and $f(\cdot)_c$ is the predicted probability of domain $c$, as the confusion loss aims to push for random classification probability for any data point. It reaches its minimum when the embeddings are domain invariant and the domain classifier predicts 50%-50% probability for all data. To push for domain invariance, we freeze the domain classifier and update the parameters of the encoder:

$$\theta_g^* = \arg\min_{\theta_g} \sum_{x \in \{x^s, x^t\}} L_M(x).$$

We use the hyperparameter $\lambda$ to balance the learning of DR ranking in the source domain (the ranking loss $L_R$) and the learning of domain invariance (the adversarial loss $L_M$). To summarize, for each training batch in the source domain, the domain classifier $f$ and the encoder $g$ are optimized by:

$$W_f^* = \arg\min_{W_f} L_D, \qquad \theta_g^* = \arg\min_{\theta_g} \big( L_R + \lambda L_M \big),$$

where $f$ is trained to estimate the boundary between source and target, and $g$ is trained to provide domain invariant representations while capturing the relevance matches in the source domain.

Datasets. We choose the MS MARCO passage dataset (Bajaj et al., 2016) as the source domain dataset and choose the 15 publicly available datasets gathered in the BEIR benchmark (Thakur et al., 2021) as target domain datasets. These datasets cover a wide variety of domains, including biomedical, finance, scientific, etc. We treat each target domain dataset separately and produce an individual model for each of them, following the standard unsupervised domain adaptation setup (Long et al., 2016). Details of the datasets can be found in Appendix A.

Evaluation for DR. Target domain datasets do not always have ideal coverage of relevance labels. The annotation procedure of many datasets requires some retrieval models to generate candidates for labeling, and these were mainly sparse models at the time of construction. Therefore, the evaluation on these datasets is not only biased towards sparse models but also less sensitive to dense models.
High hole rates (a hole is a predicted q-d pair without annotation) are often observed for dense models (Xiong et al., 2021; Thakur et al., 2021). In fact, ANCE underperforms sparse methods such as BM25 on TREC-COVID with the original annotation, but after adding extra labels based on ANCE's predictions, its scores greatly improve, achieving the state of the art (Thakur et al., 2021). Nevertheless, TREC-COVID is the dataset with the lowest hole rates for DR models since participating systems include dense ones, and is one of the best for measuring the progress of ZeroDR.

In the ZeroDR setting, there is no access to relevance labels in the target domain during training/validation. Therefore, choosing the optimal hyperparameters is impossible without directly tuning on the test set. In our experiments, most of our hyperparameters are kept the same as in ANCE. We also use exactly the same experimental setting and evaluate checkpoints after a fixed number of training steps (10k) for all target domain datasets. This evaluation setup may not yield the optimal empirical results for MoDIR, but it is the closest to ZeroDR in the real world. Please refer to Appendix B for detailed hyperparameters.

Baselines. As a first stage retrieval method, MoDIR's baselines include BM25 (Robertson & Jones, 1976), DPR, and ANCE (Xiong et al., 2021). The original DPR is trained on NQ (Kwiatkowski et al., 2019), and we train another DPR model on MARCO to eliminate training dataset differences. The nDCG scores of BM25, DPR-NQ, and ANCE are taken from the BEIR paper (verified to be consistent with our runs); those of DPR-MARCO and MoDIR are from our own evaluation. BEIR also reports results of other retrieval methods, such as docT5query, TAS-B (Hofstätter et al., 2021), GenQ, ColBERT (Khattab & Zaharia, 2020), etc.
However, they are not directly comparable with MoDIR since they may include stronger supervision signals, data augmentation, and/or expensive late interaction; they are orthogonal to MoDIR and can be combined with it for better empirical results. Our main baseline is ANCE, which MoDIR is built upon and which is also shown to be the state of the art on TREC-COVID (Thakur et al., 2021).

This section evaluates the effectiveness of MoDIR, its momentum training, and the benefits of domain invariant representations.

Table 1 shows the overall ZeroDR accuracy of MoDIR and baselines on the BEIR benchmark (Thakur et al., 2021). MoDIR improves ANCE's overall effectiveness in the ZeroDR setting. On datasets with low hole rates (good label coverage), the gains are significant; on datasets with high hole rates, which are less sensitive to DR model improvement, the gains are less significant but still stable. Moreover, the results of MoDIR are obtained without hyperparameter tuning or checkpoint selection, and therefore present a fair comparison in the realistic ZeroDR setting.

Our ablation studies evaluate the importance of the momentum method and the effects of other experimental setups. We use the two datasets with the best label coverage, TREC-COVID and Touché, and show the results in Table 2. MoDIR's default setting is underlined. Firstly, we evaluate the accuracy of MoDIR without the momentum method, i.e., we do not maintain the momentum queue, but simply update the domain classifier with embeddings of the current batch. Without momentum, MoDIR's improvement over ANCE diminishes. Secondly, we evaluate MoDIR with two other choices of the adversarial loss $L_M$: Minimax and GAN (Tzeng et al., 2017). The GAN loss is less stable, as expected (Tzeng et al., 2017), while Minimax performs comparably to Confusion. This shows that MoDIR can also be applied with other domain adaptation training methods. Thirdly, we vary the momentum step $n$ without changing the rest of the experimental settings.
We find that $n$ mainly impacts the balance between learning nearest neighbor locality and learning the domain invariance, so it is an important hyperparameter for MoDIR.

In this subsection we evaluate the impact of momentum in adversarial training. To quantify domain invariance, we use Domain Classification Accuracy (Domain-Acc), which includes two measurements based on the choice of domain classifier: (1) The domain classifier is trained globally on source and target embeddings until convergence, which leads to Global Domain-Acc. (2) We take the domain classifier used in MoDIR's training ($f$ in Section 2.2), and record its accuracy when it is applied on a new batch, which leads to Local Domain-Acc. Global Domain-Acc measures the real degree of domain invariance: it is lower when the embeddings of the two domains are not easily separable. Local Domain-Acc is an approximation provided by the domain classifier $f$. A large gap between Local and Global accuracy means that the domain boundary estimated by $f$ is inaccurate.

In Figure 3, we compare Global and Local Domain-Acc on the TREC-COVID dataset with and without momentum. With momentum, Local Domain-Acc quickly increases to be comparable with Global Domain-Acc. The domain classifier $f$ (used in MoDIR's training) converges quickly and Global Domain-Acc starts to decrease. Embeddings from the two domains become less separable as the result of effective adversarial training. Note that Local Domain-Acc does not decrease because $f$ has seen and memorized almost all data, while Global Domain-Acc's domain classifier is always tested on unseen data. This shows that momentum helps with the balance of adversarial training, ensuring its convergence towards a domain invariant representation space.

On the other hand, when momentum is not used, there exists a long-lasting gap between Local and Global Domain-Acc, showing that $f$ does not capture the domain boundary well.
As a result, the two domains remain (almost) linearly separable in the embedding space, as shown by the fact that Global Domain-Acc does not decrease, and the model fails to learn domain invariance.

The results are shown in Table 3. With momentum, both KNN-Source% and nDCG gradually increase as training proceeds. This shows that when target domain embeddings are pushed towards the source domain, ranking performance on the target domain also improves. On TREC-COVID, MoDIR eventually reaches a state-of-the-art 0.724 for first stage retrievers. On the other hand, without momentum, KNN-Source% and nDCG scores hardly increase.

We also use t-SNE (van der Maaten & Hinton, 2008) to visualize the learned representation space at different training steps in Figure 4. Before training with MoDIR, the two domains are well separated in the representation space learned by ANCE. With more MoDIR training steps, the target domain is pushed towards the source domain and gradually becomes a subset of it. Without momentum, the two domains remain separated, as observed in Table 3.

We study the correlation between ZeroDR accuracy and domain invariance. We use Global Domain-Acc as the indicator of domain invariance and plot it with the corresponding ZeroDR accuracy during training in Figure 5. Global Domain-Acc starts at near 100%, showing that source and target embeddings are linearly separable with the one-layer domain classifier. It decreases as training proceeds, and as the learned representation space becomes more domain invariant, the ZeroDR accuracy in the target domain improves alongside. This shows that domain invariance is the source of the improvements in ZeroDR's effectiveness. We also record that the DR accuracy on the source domain (MARCO) decreases by no more than 0.5%. This indicates that the high-dimensional embedding space has sufficient capacity to learn domain invariant representations while maintaining relevance matching in the source domain.

Table 4.
In the first case, MoDIR pays more attention to "transmission", and potentially retrieves more documents about transmission of diseases, thereby improving the nDCG score; documents about "coronavirus" are also likely to be retrieved by MoDIR since it is a very noticeable word. In the second case, it focuses on "mRNA" more than "vaccine". However, since the mRNA vaccine is relatively new, with few appearances in the MARCO dataset, the shift in focus fails to improve MoDIR's effectiveness for this query.

These examples help reveal the source of generalization ability on ZeroDR. For DR models to be able to generalize, the source domain itself needs to include information that covers the relevance needs of the target domain; if there is no such information, as in the second example, generalization becomes a challenge. Where the source domain has such coverage, MoDIR is able to align target queries to source ones with similar information needs in its domain invariant representation space, and such alignments enable DR models to generalize.

In December 2020, the Pfizer-BioNTech COVID vaccine became the first approved mRNA vaccine, according to https://en.wikipedia.org/wiki/MRNA_vaccine.

In this section, we recap related work in dense retrieval and adversarial domain adaptation.

Dense Retrieval. Compared to conventional sparse methods for first stage retrieval, dense retrieval (DR) with Transformer-based models (Vaswani et al., 2017) such as BERT and RoBERTa (Liu et al., 2019) conducts retrieval in the dense embedding space (Lee et al., 2019a; Guu et al., 2020; Luan et al., 2021). Compared with its sparse counterparts, DR improves retrieval efficiency and also provides comparable or even superior effectiveness on in-domain datasets.

One of the most important research questions for DR is how to obtain meaningful negative training instances. DPR uses BM25 to find stronger negatives in addition to in-batch random negatives.
RocketQA (Qu et al., 2021) uses cross-batch negatives and also filters them with a strong reranking model. ANCE (Xiong et al., 2021) uses an asynchronously updated negative index of the DR model being trained to retrieve global hard negatives.

Recently, the challenges of DR models' generalization in the zero-shot setting have attracted much attention (Thakur et al., 2021; Zhang et al., 2021; Li & Lin, 2021). One way to improve ZeroDR is synthetic query generation (Liang et al., 2020), which first trains a doc2query model that learns to generate queries in the source domain given their relevant documents, and then applies this generation model to target domain documents to generate queries. The target domain documents and generated queries form weak supervision labels in the target domain to train DR models. Our method differs from these approaches and focuses on directly improving the generalization ability of the learned representation space.

Adversarial Domain Adaptation. Unsupervised domain adaptation (UDA) has been studied extensively for computer vision applications. For example, maximum mean discrepancy (Long et al., 2013; Tzeng et al., 2014; Sun & Saenko, 2016) measures domain difference with a pre-defined metric and explicitly minimizes the difference. Following the advent of GAN (Goodfellow et al., 2014), adversarial training for UDA was proposed: an auxiliary domain classifier learns to discriminate source and target domains, while the main classifier model is adversarially trained to confuse the domain classifier (Ganin & Lempitsky, 2015; Bousmalis et al., 2016; Tzeng et al., 2017; Luo et al., 2017). The adversarial method does not require pre-defining the domain difference metric, allowing more flexible domain adaptation. MoDIR builds upon the success of these UDA methods and introduces a new momentum learning technique that is necessary to learn domain invariant representations in the ZeroDR setting.
In this paper, we present MoDIR, a new representation learning method that improves the zero-shot generalization ability of dense retrieval models. We first show that dense retrieval models differ from classification models in their emphasis on locality in the representation space. Then we present a momentum-based adversarial training method that robustly pushes text encoders to provide a more domain invariant representation space for dense retrieval. Our experiments on ranking datasets from the BEIR benchmark demonstrate robust and significant improvements of MoDIR on the zero-shot accuracy of ANCE, a recent state-of-the-art DR model.

We conduct a series of studies to show the effects of our momentum method in learning domain invariant representations. Without momentum, the adversarial learning is unstable as the inherent variance of the DR embedding space hinders the convergence of the domain classifier. With momentum training, the model is able to fuse the target domain data into the source domain representation space, and thus discovers related information from the source domain and improves generalization, without requiring any target domain training labels.

We view MoDIR as an initial step of zero-shot dense retrieval, an area demanding democratization of the rapid advancements to many real-world scenarios. Our approach inherits the success of domain adaptation techniques and upgrades them by addressing the unique challenges of ZeroDR. How to better understand the dynamics of representation learning for DR and further improve its effectiveness, robustness, and generalization ability is a future research direction with potential impacts in both representation learning research and real-world applications.

We provide the following information to ensure our proposed method is reproducible:
• All datasets are publicly available and details can be found in Section 3 and Appendix A.
• Detailed experimental setups can be found in Appendix B.
• Model validation and evaluation details are discussed in Section 3.
• Source code and model checkpoints will be made public when the paper is published.

Target domain datasets used in our experiments are from the following domains:
• General-domain (Wikipedia): DBPedia (Hasibi et al., 2017), HotpotQA (Yang et al., 2018), FEVER (Thorne et al., 2018), and NQ (Kwiatkowski et al., 2019).
• Bio-medical: TREC-COVID (Voorhees et al., 2021), NFCorpus (Boteva et al., 2016), and BioASQ (Tsatsaronis et al., 2015).
• Finance: FiQA (Maia et al., 2018).
• Controversial arguments: Touché (Bondarenko et al., 2020) and ArguAna (Wachsmuth et al., 2018).
• Duplicate questions: Quora (Thakur et al., 2021) and CQADupStack (Hoogeveen et al., 2015).
• Scientific: SciFact (Wadden et al., 2020), SCIDOCS (Cohan et al., 2020), and Climate-FEVER (Diggelmann et al., 2020).

Appendix B. We follow the design of ANCE for the DR encoder's modeling and training. We initialize the encoder with the publicly released ANCE checkpoint, and randomly initialize the domain classifier. Detailed hyperparameter choices are shown in Table 5. We also use an exponential decay routine for the hyperparameter λ to improve training stability, where the value is reduced continuously and shrunk to a half every 10k steps.
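The decay routine for λ can be written as a one-line schedule. The sketch below assumes a continuous exponential decay with a half-life of 10k steps, as described above; the function name and the initial value used in the comment are placeholders, since the actual starting value is listed in Table 5 and is not reproduced here.

```python
def lambda_at_step(lambda_0, step, half_life=10_000):
    """Exponentially decayed weight of the adversarial loss.

    lambda_0 is the (hypothetical) initial value; the result halves
    every `half_life` training steps and decays continuously in between.
    """
    return lambda_0 * 0.5 ** (step / half_life)

# Example with a placeholder initial value of 1.0:
# step 0 gives 1.0, step 10,000 gives 0.5, step 20,000 gives 0.25.
```

A schedule of this shape lets the adversarial term dominate less as training proceeds, which is consistent with the stated goal of improving training stability.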
A human generated machine reading comprehension dataset
Overview of Touché 2020: Argument Retrieval
A full-text learning to rank dataset for medical information retrieval
Domain separation networks
Pre-training tasks for embedding-based large-scale retrieval
SPTAG: A library for fast approximate nearest neighbor search
SPECTER: Document-level representation learning using citation-informed transformers. Annual Meeting of the Association for Computational Linguistics
BERT: Pre-training of deep bidirectional transformers for language understanding
CLIMATE-FEVER: A dataset for verification of real-world climate claims
Unsupervised domain adaptation by backpropagation
Generative adversarial nets
Accelerating large-scale inference with anisotropic vector quantization
REALM: Retrieval-augmented language model pre-training
DBpedia-Entity v2: A test collection for entity search
Efficiently teaching an effective dense retriever with balanced topic aware sampling
CQADupStack: A benchmark data set for community question-answering research
Leveraging passage retrieval with generative models for open domain question answering
Billion-scale similarity search with GPUs
Dense passage retrieval for open-domain question answering
Efficient and effective passage search via contextualized late interaction over BERT
Natural questions: A benchmark for question answering research
Latent retrieval for weakly supervised open domain question answering
Latent retrieval for weakly supervised open domain question answering
Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks
Encoder adaptation of dense passage retrieval for open-domain question answering
Embedding-based zero-shot retrieval through query generation
A robustly optimized BERT pretraining approach
Transfer feature learning with joint distribution adaptation
Unsupervised domain adaptation with residual transfer networks
Sparse, dense, and attentional representations for text retrieval
Label efficient learning of transferable representations acrosss domains and tasks
Zero-shot neural retrieval via domain-targeted synthetic query generation
Zero-shot neural passage retrieval via domain-targeted synthetic question generation
WWW'18 open challenge: Financial opinion mining and question answering
Passage re-ranking with BERT
Document ranking with a pretrained sequence-to-sequence model
RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering
Relevance weighting of search terms
Deep CORAL: Correlation alignment for deep domain adaptation
BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models
FEVER: a large-scale dataset for fact extraction and VERification
Sergios Petridis, Dimitris Polychronopoulos, et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition
Deep domain confusion: Maximizing for domain invariance
Adversarial discriminative domain adaptation
Visualizing data using t-SNE
Attention is all you need
Constructing a pandemic information retrieval test collection. SIGIR Forum
Retrieval of the best counterargument without prior topic knowledge
Fact or fiction: Verifying scientific claims
Approximate nearest neighbor negative contrastive learning for dense text retrieval
HotpotQA: A dataset for diverse, explainable multi-hop question answering
TyDi: A multi-lingual benchmark for dense retrieval