1 Introduction

It is well known that the effectiveness of IR pipelines increases with larger models [2, 22, 24, 25, 28]. For instance, multi-billion parameter rankers and dense models achieve top positions on leaderboards of IR benchmarks and competitions [10, 11, 12]. These large models leverage increased representation capacity, enabling them to encode features that might elude smaller models. However, deploying these large models is not without its challenges. The computational costs are substantial, often requiring specialized hardware such as GPUs or TPUs to operate in latency-critical applications. The high cost stems directly from the large number of parameters these models contain: they require hardware with high memory and compute capacity, and inference latency scales almost linearly with parameter count. In a production environment, this means higher operating costs and reduced scalability.

To address these challenges, there have been efforts to create more efficient models without significantly reducing effectiveness. One such approach is model distillation [17]. Distilled models, such as MiniLM [32], use a teacher or an ensemble of larger models to transfer knowledge to a smaller student model. Rosa et al. [30] show that MiniLM surpassed the zero-shot effectiveness of monoT5-base, which is a seq2seq model trained for binary classification, in IR tasks despite being an order of magnitude smaller in size. This shows that knowledge transfer via model distillation is not only feasible but also effective. However, most distillation techniques have been geared towards optimizing effectiveness on specific benchmark tasks and do not focus on out-of-domain effectiveness. Rosa et al. also show that while smaller models are capable of achieving high in-domain results, similar to their larger counterparts, the disparity in effectiveness becomes evident in out-of-domain scenarios. As the concept of out-of-domain is subjective, we define it as a test distribution that is significantly different from the training distribution. A straightforward example of an out-of-domain scenario is when a model is trained on chemistry-related data and tested on legal data. However, we recognize that this distinction blurs in many scenarios.

Fig. 1. Effectiveness on the BEIR benchmark [31]. All models are based on monoT5 [25], applying different fine-tuning methods.

Usually, training a retrieval model requires human-annotated hard labels informing which passage is relevant for each query. However, with the advance of Large Language Models (LLMs), it has become possible to generate synthetic queries for passages, providing a feasible approach for data augmentation [1, 2, 4, 19, 26]. Our work introduces a method for generating synthetic data specifically designed for distilling rankers that increases their out-of-domain effectiveness. We present InRanker, a distilled model derived from monoT5-3B [25], that uses the teacher's predictions directly on both real query-document pairs and synthetic ones generated from an out-of-domain corpus. Effectively, this approach converts any corpus to be in-domain, since the model will be trained using queries from the target domain. As a result, this approach leads to smaller models with improved out-of-domain effectiveness, as presented in Fig. 1. The methodology, results, and ablation experiments are presented in detail in the following sections.

Fig. 2. Pipeline for generating the synthetic triples <query, passage, soft label> for the InRanker model.

2 Related Work

The research community has been using LLMs in a variety of tasks aimed at increasing the availability of data and improving the effectiveness of existing systems. Magister et al. [20] employed synthetic text generated by PaLM 540B [8] and GPT-3 175B [5] to transfer knowledge to smaller models such as T5. Fu et al. [15] successfully specialized student models in multi-step reasoning using FlanT5 [9] and code-davinci-002 as teachers. However, all these works rely on training the student models using synthetic text rather than directly using the soft labels. Furthermore, Muhamed et al. [21] distilled cross-attention scores of a language model for click-through-rate prediction, achieving better results when exposed to contextual features such as tabular data. Wang et al. [32] distilled the self-attention module, which is a crucial part of transformers, and successfully transferred knowledge to a variety of tasks.

Previous studies have also explored training a student from soft labels produced by a teacher: Hofstätter et al. [18] proposed a cross-architecture knowledge distillation approach using the MarginMSE loss. Similarly, Formal et al. [14] used the MarginMSE loss to distill knowledge to sparse neural models. Finally, Hashemi et al. [16] proposed a method for generating synthetic data for domain adaptation of dense passage retrievers. This approach involves creating new queries and a target collection, along with pseudo-labels extracted using a BERT cross-encoder. However, they did not evaluate the model’s effectiveness on datasets to which it was not domain-adapted. The existing research has mainly focused on in-domain evaluation, where the goal has been to increase the effectiveness of the student model on test datasets whose domain is similar to the datasets it was trained on. Our study also focuses on the robustness of the student and its ability to perform well even in out-of-domain scenarios, similar to the abilities of the larger teacher model.

3 Methodology

Our proposed method consists of two key phases of distillation, each designed with specific objectives to maximize the model's zero-shot effectiveness. The first phase uses real-world data to familiarize the student model with the ranking task, while the second phase uses synthetic data designed to improve zero-shot generalization as well as the model's effectiveness on a specific dataset. The dataset used to distill InRanker consists of {query, passage, logits} triplets, where the logits (soft labels) originate from a teacher model that has been trained for the relevance task. For the first stage, we chose to use query-document pairs from the MS MARCO [23] dataset, given their variety, the large number of annotated pairs, and their demonstrated effectiveness in enhancing retrieval effectiveness [29]. Next, we source the synthetic queries from InPars [2], which used an LLM to generate queries for the datasets in BEIR in a few-shot manner.

Distilling rerankers involves using the Mean Squared Error (MSE) loss to match the logits of the teacher and the student, as part of a two-phase pipeline illustrated in Fig. 2. The first phase consists of two steps: (1) generating the teacher logits given a query and either a positive (relevant) or a negative (non-relevant) passage, where the negatives are randomly sampled using BM25 on the top-\(k=1000\) candidates, and the positives are sampled from the human-annotated pairs; and (2) training InRanker given the queries and passages as input, using the MSE loss to match the student logits to those of the teacher, which remains frozen during training. This approach can be beneficial as it removes the need for making hard decisions about a passage's relevance, i.e., determining a threshold to obtain binary relevance labels, and instead focuses on a soft target objective aimed at aligning the student's perception of relevance with that of the teacher.

The second phase, with a focus on zero-shot effectiveness, uses the same two steps. However, instead of employing real queries sourced from a costly human-annotation process, it uses synthetic queries generated by an LLM based on randomly sampled documents from the corpus. In this scenario, the positive document is the one used to create the query, and the negatives are collected using the same top-k sampling approach as before.
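The construction of the {query, passage, logits} triples is shared by both phases; only the source of queries and positives changes. A minimal sketch of this assembly step follows, where `teacher_score` (the frozen teacher's scoring function) and `bm25_top_k` (a BM25 candidate retriever) are hypothetical stand-ins for the actual components:

```python
import random

def build_triples(queries, positives, bm25_top_k, teacher_score, n_negatives=1):
    """Assemble {query, passage, logits} distillation triples.

    `positives[q]` is the passage paired with query `q` (a human-annotated
    pair in phase 1, the query's source document in phase 2); `bm25_top_k(q)`
    returns BM25 candidates from which negatives are randomly sampled; and
    `teacher_score(q, p)` returns the frozen teacher's (true, false) logits.
    """
    triples = []
    for q in queries:
        pos = positives[q]
        candidates = [p for p in bm25_top_k(q) if p != pos]
        negatives = random.sample(candidates, min(n_negatives, len(candidates)))
        for passage in [pos] + negatives:
            triples.append({"query": q, "passage": passage,
                            "logits": teacher_score(q, passage)})
    return triples
```

The student is then trained on these triples with the MSE objective, never seeing the binary relevance labels directly.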

We also perform zero-mean normalization on the teacher logits for each query-document pair, independent of the overall dataset distribution. This approach intends to make the data distribution symmetric for each query-document pair, thereby minimizing the bias that InRanker is required to learn. Formally:

$$\begin{aligned} \begin{aligned} {L'}_\textrm{true} &= L_\textrm{true} - \frac{L_\textrm{true} + L_\textrm{false}}{2} \\ {L'}_\textrm{false} &= L_\textrm{false} - \frac{L_\textrm{true} + L_\textrm{false}}{2} \end{aligned} \end{aligned}$$
(1)

with \(L_\textrm{true}\) and \(L_\textrm{false}\) denoting the teacher’s logits for the relevant and non-relevant classes, respectively, and \(L'\) being the normalized values. This results in the following loss for each training example:

$$\begin{aligned} \mathcal {L}_\textrm{MSE} = ([Y_\textrm{true} - {L'}_\textrm{true}]^2 + [Y_\textrm{false} - {L'}_\textrm{false}]^2) \end{aligned}$$
(2)

with \(Y_\textrm{true}\) and \(Y_\textrm{false}\) representing the logits of the student.
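The per-example normalization of Eq. (1) and the loss of Eq. (2) can be sketched in a few lines of Python; the function names are illustrative, and the logit pairs correspond to \(L_\textrm{true}, L_\textrm{false}\) (teacher) and \(Y_\textrm{true}, Y_\textrm{false}\) (student):

```python
def zero_mean(l_true, l_false):
    """Zero-mean normalize one teacher logit pair (Eq. 1).

    The mean of the pair is subtracted from each logit, making the
    target distribution symmetric for every query-document pair.
    """
    mean = (l_true + l_false) / 2.0
    return l_true - mean, l_false - mean

def mse_loss(y_true, y_false, l_true, l_false):
    """Distillation loss for one training example (Eq. 2)."""
    t_true, t_false = zero_mean(l_true, l_false)
    return (y_true - t_true) ** 2 + (y_false - t_false) ** 2
```

For instance, a teacher pair (3.0, -1.0) is normalized to (2.0, -2.0), and a student that outputs exactly those values incurs zero loss.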

Due to the training objective described in Eq. (2), the model no longer determines the relevance of passages and instead focuses on replicating the teacher's output, thus eliminating the need for tuning a relevance threshold that would be needed to produce a binary label. With this approach, we can easily expand the out-of-domain knowledge of distilled models by generating new queries for documents using an LLM and fine-tuning the distilled model using the teacher's logits. In the experiments section, we demonstrate the effectiveness of this approach in enhancing the student model's effectiveness across 16 datasets of BEIR simultaneously. We present the hyperparameters used for training and the dataset curation in Appendix A, and we discuss variations of the training loss in Appendix C.

4 Experiments

4.1 English Knowledge Distillation Results

We distilled monoT5-3B to models with parameters ranging from 60M to 3B, using combinations of the following configurations:

Human Hard: representing the common approach for training rankers with human-annotated hard (i.e., binary) labels from the MS MARCO passage ranking dataset. In this case, a vanilla cross-entropy loss is used:

$$\begin{aligned} \mathcal {L}_\textrm{CE} = -\log P_\textrm{relevant} - \log P_\mathrm {non-relevant} \end{aligned}$$
(3)

where \(P_\textrm{relevant}\) and \(P_\mathrm {non-relevant}\) are the probabilities assigned by the model to the relevant and non-relevant query-document pair, respectively. Non-relevant pairs are sampled from the top-1000 retrieved by BM25.
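Since a monoT5-style ranker produces a (true, false) logit pair, the probabilities in Eq. (3) come from a softmax over those two logits. A minimal sketch, with illustrative function names:

```python
import math

def relevance_prob(l_true, l_false):
    """Softmax over the two-class (true/false) logits of a monoT5-style ranker."""
    e_t, e_f = math.exp(l_true), math.exp(l_false)
    return e_t / (e_t + e_f)

def ce_loss(pos_logits, neg_logits):
    """Hard-label cross-entropy (Eq. 3): the model should assign a high
    probability of relevance to the annotated positive pair and a high
    probability of non-relevance to the BM25-sampled negative pair."""
    p_relevant = relevance_prob(*pos_logits)
    p_non_relevant = 1.0 - relevance_prob(*neg_logits)
    return -math.log(p_relevant) - math.log(p_non_relevant)
```

A model that is maximally uncertain (equal logits for both pairs) incurs a loss of \(2\log 2\), while confident, correct predictions drive the loss towards zero.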

Human Soft: representing a distillation step for matching the logits of a teacher and a student model, using real (human-generated) queries from the ranking dataset as inputs, but without the binary relevance judgments for targets.

Synthetic Soft: representing a distillation step for matching the logits of the two models, similar to the previous configuration, but using exclusively synthetic queries generated from the corresponding BEIR corpora with InPars [2, 19].

Table 1. Distillation results (nDCG@10) on 16 BEIR datasets. The model marked with * represents the teacher model. We did not train InRanker-3B on human soft labels due to computational constraints.

From Table 1, we see that both distillation steps were essential for improving the average nDCG@10 score compared to the model trained solely using human hard labels from MS MARCO. As a result, InRanker-60M (row 3) and InRanker-220M (row 6), despite being 50x and 13x smaller than the teacher model, were able to improve their effectiveness on the BEIR benchmark significantly. Moreover, models trained exclusively on MS MARCO soft labels (rows 2 & 5) saw an increase in effectiveness compared to training solely on hard labels (rows 1 & 4), corroborating findings from previous studies regarding the effectiveness of soft labels [14, 16, 17, 18]. Furthermore, we observed an increase in effectiveness even in self-distillation training (row 8), where the student learns soft labels generated by itself. We hypothesize that the improvement stems from the extra knowledge provided by the language model used to generate the synthetic queries. We did not provide results for the 3B model trained on both human soft and synthetic soft labels due to computational costs.

Furthermore, in Table 2, we present an effectiveness comparison between InRanker, Promptagator [13], and RankT5 [33]. Although we used monoT5-3B as a teacher for our experiments, which has a lower effectiveness on average when compared to Promptagator or RankT5-3B, our method is model-agnostic, and thus one could use a stronger teacher model and anticipate even stronger results. Nonetheless, InRanker remains competitive in both model groups of 220M and 3B parameters, outperforming the other two baselines in 6 out of the 10 evaluated datasets, despite the average score not reflecting this due to Promptagator and RankT5 attaining significantly higher scores on two datasets: ArguAna and Touché.

Table 2. Comparison of the effectiveness for various reranking models, measured by nDCG@10 on the BEIR benchmark. The model marked with * represents the teacher model used for training InRanker. Bolded scores correspond to the best effectiveness on a specific dataset for a given model size, while underlined scores indicate the best effectiveness overall.

4.2 Portuguese Knowledge Distillation Results

To further assess the efficacy of the technique in different languages, we evaluated InRanker on a Portuguese dataset for information retrieval: QUATI [6]. Instead of using the same T5 model, we started from PTT5, a Portuguese fine-tuned version of T5 [7, 27]. We used the same two-step training approach as before, but with a strategy that allowed us to distill a Portuguese model using an English teacher (monoT5-3B). Given the availability of a translated version of MS MARCO in Portuguese [3], the first step (human soft) involved training the model using the Portuguese text, while matching the soft labels generated by the teacher using the original English text. This approach enabled us to leverage a stronger model that is not available in Portuguese for the distillation process. In the second step, involving synthetic soft labels from BEIR, we trained using the English text, as there is no translated version of BEIR available in Portuguese.
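This cross-lingual pairing, i.e., the student reading Portuguese text while its targets come from the English teacher, can be sketched as a small data-assembly step. The function name and the shape of `parallel_pairs` are illustrative assumptions; `teacher_score` stands in for the frozen English monoT5-3B:

```python
def cross_lingual_triples(parallel_pairs, teacher_score):
    """Build distillation triples for a Portuguese student from an English teacher.

    `parallel_pairs` holds aligned (english_query, english_passage,
    portuguese_query, portuguese_passage) tuples, e.g., from the translated
    MS MARCO [3]. The teacher scores the English side, while the student is
    trained on the Portuguese text with those soft labels as targets.
    """
    triples = []
    for en_q, en_p, pt_q, pt_p in parallel_pairs:
        triples.append({"query": pt_q, "passage": pt_p,
                        "logits": teacher_score(en_q, en_p)})
    return triples
```

This decoupling of the scored text from the trained text is what lets a teacher unavailable in the target language still supervise the student.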

Table 3. InRanker results on QUATI, a Portuguese evaluation dataset for information retrieval using PTT5-v2 [27]. All synthetic soft labels were generated using the BEIR datasets.

The results of this evaluation are presented in Table 3. We conclude that, similarly to English, the distillation process was able to improve the effectiveness of models in a zero-shot manner, as the models were trained using only real data from MS MARCO and synthetic data from BEIR. Remarkably, InRanker-740M surpassed the effectiveness of mT5-3.7B on QUATI. Note that for the QUATI evaluation we used the same prompt presented in the paper to annotate all unjudged documents using gpt-4-turbo. Therefore, all results are presented with a judged@10 of 100%. Further results using synthetic data from QUATI and mixed training data are presented in Appendix E.

4.3 Ablations

In this section, we present our ablation experiments aimed at validating the best configuration for distilling monoT5-3B into smaller T5-based models, as well as assessing their zero-shot capabilities. The initial experiments we conducted focused on evaluating how distillation would affect the model's effectiveness on novel dataset distributions that were not seen during training, i.e., we did not generate synthetic queries for them. To achieve this, we created two subsets, each containing 8 datasets randomly selected from the 16 BEIR datasets, which we named sample sets 1 and 2, and used only one set for training per experiment. The datasets that were used for training are designated as the "in-domain" category, while the remaining datasets, i.e., the other 8 datasets that are not part of the training set, represent the "out-of-domain" (O.O.D.) category.

Table 4. Comparison of the in-domain vs out-of-domain effectiveness of our method, measured by nDCG@10. The model marked with * represents the teacher model used for the knowledge distillation process.

Impact of Soft Knowledge Distillation on O.O.D. Effectiveness. Our first ablation experiment focused on evaluating the initial distillation process using the MS MARCO dataset with soft labels. To accomplish this, we generated logits with monoT5-3B and trained both T5-base and T5-small models for 10 epochs. As shown in Table 4, rows 1–2 & 5–6, both models demonstrated an improvement in their nDCG@10 scores compared to the baseline, which was trained using the hard labels from MS MARCO. Remarkably, the overall score increased in both scenarios, even though the models were not exposed to any BEIR passages during this phase.

Adding Soft Synthetic Targets as a Second Distillation Phase. For the next experiment, we applied a second distillation step with synthetic soft labels on top of the model that we acquired from the last phase (monoT5 w/ soft human labels). For that, we used the 100K synthetic queries generated by InPars for each dataset indicated as “in-domain” and trained for 10 epochs. As shown in Table 4, rows 3 & 7, while it was expected that the in-domain datasets would have an increase in their nDCG@10 scores, we observe that the out-of-domain datasets also had improvements, suggesting that the model’s generalization capabilities were enhanced.

Using Hard Human Targets for the First Distillation Phase. Finally, we investigated the impact of skipping the first phase of distillation on MS MARCO logits, and instead starting from a model that was trained on hard human labels (monoT5-small and monoT5-base) and directly training using the synthetic soft BEIR targets. As we can see in Table 4, rows 3–4 & 7–8, when comparing with the model that was trained using the soft human targets, the overall effectiveness was reduced. From this, we conclude that the distillation step that includes the soft human targets on MS MARCO is beneficial, as it improves the model’s effectiveness in both in-domain and out-of-domain scenarios.

Upper Bound for Soft Distillation. To estimate the upper bound of the effectiveness that these models could attain through distillation, we repeated the process using real queries from BEIR (i.e., the validation queries) instead of the synthetic ones. Results presented in Table 5 show that for both model sizes, there was an increase in effectiveness for the in-domain datasets, as the model was exposed to the evaluation queries during training. However, we also observed an increase in effectiveness for out-of-domain datasets, indicating that the synthetic queries used for training could be improved.

Table 5. Upper bound effectiveness (nDCG@10) using real queries from BEIR for the distillation datasets. Bold indicates the best between using synthetic and real queries.

5 Conclusion

This paper introduces a method for distilling the knowledge of information retrieval models and improves upon previous work on how to better use synthetic data, aiming to increase the out-of-domain effectiveness of student models. The study reveals that, through this knowledge distillation process, smaller models can achieve results comparable to the teacher, even in the context of multilingual knowledge transfer. This approach is particularly significant for applications where computational resources are limited, in production environments, or for languages that lack available models to serve as a teacher. The methodology involves two steps of distillation: (1) using a human-curated corpus, and (2) using synthetic data generated by an LLM. Consequently, our work shows that it is possible to improve a reranker's capabilities in specific domains without requiring additional human-annotated labels. Finally, we observe that synthetic query generation could be improved, since the real queries achieved better out-of-domain effectiveness compared to the model trained solely on synthetic ones. However, the presented method has limitations. Specifically, it is not clear how to adapt the proposed loss to train dense retrievers, which typically use a contrastive loss.