key: cord-0461186-jvcckgr2 authors: Muennighoff, Niklas title: SGPT: GPT Sentence Embeddings for Semantic Search date: 2022-02-17 journal: nan DOI: nan sha: aa9abef5d3d6c845acbfb5ce47d53eb6d2223133 doc_id: 461186 cord_uid: jvcckgr2

GPT transformers are the largest language models available, yet semantic search is dominated by BERT transformers. We present SGPT-BE and SGPT-CE for applying GPT models as Bi-Encoders or Cross-Encoders to symmetric or asymmetric search. SGPT-BE produces semantically meaningful sentence embeddings by contrastive fine-tuning of only bias tensors and a novel pooling method. A 5.8 billion parameter SGPT-BE outperforms the best available sentence embeddings by 6%, setting a new state-of-the-art on BEIR. It outperforms the concurrently proposed OpenAI Embeddings of the 175B Davinci endpoint, which fine-tunes 250,000 times more parameters. SGPT-CE uses log probabilities from GPT models without any fine-tuning. A 6.1 billion parameter SGPT-CE sets an unsupervised state-of-the-art on BEIR. It beats the supervised state-of-the-art on 7 datasets but loses significantly on others; we show how this can be alleviated by adapting the prompt. SGPT-BE and SGPT-CE performance scales with model size, yet increased latency, storage and compute costs should be considered. Code, models and result files are freely available at https://github.com/Muennighoff/sgpt.

Figure 1: Given a query q and documents d1−3, SGPT ranks the documents with scores s1−3.

Semantic search consists of two parts: search refers to finding the top k answers from a document corpus given a query; semantic refers to understanding the documents and queries beyond keywords. Transformers [27] are the dominant semantic architecture [4, 26], competing with non-semantic models like BM25 [24]. However, they have been limited to BERT-like encoder transformers [5, 23, 6, 19]. Meanwhile, GPT-like decoder transformers [20] have been the focus of recent scaling efforts of up to 530 billion parameters [13]. Yet, it remains unclear how to extract competitive GPT sentence embeddings and use them for semantic search.

In this work, we investigate how to apply decoder transformers to semantic search and make use of their scale to outperform current methods. We distinguish four settings along two axes: Cross-Encoder vs Bi-Encoder, and symmetric vs asymmetric search. See Figure 1 and Section 2.

In the Bi-Encoder setting, we propose SGPT-BE, which uses position-weighted mean pooling and contrastive fine-tuning of only bias tensors (BitFit [32]). BitFit is within +2 to -2% of full fine-tuning performance for SBERT [23] and SGPT despite changing <0.1% of pre-trained parameters. When controlling for size, SGPT is within +1 to -3% of SBERT performance. When scaling up, SGPT-BE-5.8B sets state-of-the-art results on BEIR and USEB for asymmetric and symmetric search.

In the Cross-Encoder setting, we propose SGPT-CE, which uses log probability extraction from pre-trained GPT models. It can be used for symmetric or asymmetric search by changing the prompt. Unsupervised SGPT-CE-6.1B is 6% worse than supervised SGPT-BE-5.8B on BEIR.

In summary, our contributions are three-fold:

• For SGPT-BE in Section 4, we develop a new pooling method and show the usefulness of bias-only fine-tuning for embeddings. At 5.8B parameters, it produces the best natural language embeddings available by a margin of 7% for the example of semantic search.

• For SGPT-CE in Section 3, we show how to use GPT for search via log probabilities without fine-tuning.
At 6.1B parameters, it has the best unsupervised performance on BEIR by a margin of 8%.

• We provide free, more performant alternatives to OpenAI Search or Similarity Embeddings and the OpenAI Search endpoint, available at https://github.com/Muennighoff/sgpt.

In this section, we explain two dimensions fundamental to our work: Cross-Encoders vs Bi-Encoders, and symmetric vs asymmetric search.

Cross-Encoders encode query and document at the same time. BERT is used as a Cross-Encoder by separating the query from the document with a [SEP] token [5]; they are then passed through the transformer network together. Each new query requires k forward passes given a corpus of k documents.

Bi-Encoders encode query and document separately. SBERT [23] extends BERT to the Bi-Encoder setting via supervised fine-tuning and a pooling operation across the sequence output. The resulting document vectors can be cached. A new query requires only one forward pass through the transformer to produce the query vector, which can then be scored against the cached document vectors with a similarity function. Embeddings from Bi-Encoders can also be used for non-search tasks such as clustering or as input features of machine learning models.

Cross-Encoders tend to outperform Bi-Encoders [25] but are slower, as vectors cannot be cached. To balance the trade-offs, multi-stage architectures have been proposed [14, 26]. In a two-stage re-ranking setup, the first model processes the entire corpus and the second model is only used on the top k documents returned by the first. In Section 3, we use Bi-Encoder (BM25) + Cross-Encoder re-ranking.

Asymmetric search means queries and documents are not interchangeable. Finding answers given a question is an asymmetric search problem, and documents are commonly much longer than queries [26]. We evaluate asymmetric search on BEIR [26], a recently proposed benchmark consisting of 19 asymmetric search datasets.

Symmetric search means queries and documents are interchangeable. Finding duplicate questions, where both queries and documents are questions, is a symmetric search problem. We evaluate symmetric search on USEB [29], Quora from BEIR [26] and STS-B. In Quora, queries are question titles and documents are question texts. They are often the same, with average word lengths of 9.53 and 11.44, respectively [26]. Hence, we consider it more of a symmetric search task and include Quora in both the symmetric and asymmetric experiments.

Given a query q and a document corpus D, we are interested in the most likely document d*. Using Bayes' Theorem this can be expressed as

d^* = \arg\max_{d \in D} P(d \mid q) = \arg\max_{d \in D} \frac{P(q \mid d)\,P(d)}{P(q)}.

Note that P(q) is irrelevant, as it is the same for all documents when taking the arg max over D. Due to variable document lengths and contents, it is easier to compare P(q|d) than P(d|q). We hence compute the joint probability of the query tokens q_i, ..., q_n given the document tokens embedded in a prompt P, i.e. p(q_i, ..., q_n | p_1, ..., p_{i-1}). As long as P(d) does not vary excessively across the corpus D, this simplification should produce reasonable scores. In practice, we use log probabilities [3, 21], computed via the log of the softmax of the model output. To keep the query length n + 1 - i constant and avoid abrupt text changes, documents are truncated from the left until the input fits the model's maximum sequence length.

We apply these methods to re-rank the top k documents returned by BM25 [24]. While re-ranking on top of BM25 bounds performance by the documents BM25 retrieves, it speeds up experiments. It is not a necessary part of the architecture and is therefore not depicted in Figure 1.
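The following is a minimal sketch of this scoring procedure using the Hugging Face transformers library. The checkpoint and the prompt template are illustrative placeholders (the prompts we actually evaluate, including P_G, are listed in Table 7); only the scoring logic follows the description above: sum the log probabilities of the query tokens given the document embedded in the prompt, truncating the document from the left.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any decoder-only (GPT-style) model works the same way.
MODEL_NAME = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

# Placeholder prompt template split around the document; the query tokens are
# appended at the end. The prompts actually evaluated are in Table 7.
BEFORE_DOC = "Documents are searched to find matches with the same content.\nDocument: "
AFTER_DOC = "\nQuery:"

@torch.no_grad()
def rerank_score(query: str, doc: str, before_doc: str = BEFORE_DOC,
                 after_doc: str = AFTER_DOC, max_len: int = 2048) -> float:
    """Sum of log probabilities of the query tokens given the document
    embedded in the prompt, i.e. p(q_i, ..., q_n | p_1, ..., p_{i-1})."""
    q_ids = tokenizer(" " + query, return_tensors="pt").input_ids[0]
    pre_ids = tokenizer(before_doc, return_tensors="pt").input_ids[0]
    post_ids = tokenizer(after_doc, return_tensors="pt").input_ids[0]
    d_ids = tokenizer(doc, return_tensors="pt").input_ids[0]
    # Truncate the document from the left so that the query length stays constant.
    overflow = len(pre_ids) + len(d_ids) + len(post_ids) + len(q_ids) - max_len
    if overflow > 0:
        d_ids = d_ids[overflow:]
    input_ids = torch.cat([pre_ids, d_ids, post_ids, q_ids]).unsqueeze(0)
    logits = model(input_ids).logits[0]
    # The logits at position t predict token t + 1.
    log_probs = F.log_softmax(logits[:-1], dim=-1)
    targets = input_ids[0, 1:]
    token_scores = log_probs[torch.arange(len(targets)), targets]
    return token_scores[-len(q_ids):].sum().item()

def rerank(query: str, bm25_top_k: list) -> list:
    """Re-rank the top-k documents returned by BM25 for a single query."""
    return sorted(bm25_top_k, key=lambda d: rerank_score(query, d), reverse=True)
```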
We experiment with publicly available pre-trained decoder transformers with 125M, 1.3B, 2.7B and 6.1B parameters [1, 28].

Table 2: Parameter estimates for the OpenAI model names [3]. In brackets are numbers for cpt-text models recently provided in [17]; they likely differ due to removing the language modeling head.

We sub-select 6 small datasets from BEIR [26] and perform a search over 12 prompts. The prompts and results are in Table 7 and Table 8. We select the prompt with the best average score, P_G.

In Table 1, we benchmark the resulting SGPT-CE (SGPT Cross-Encoder). We compare with OpenAI's Search endpoint, which is to be distinguished from their Embeddings endpoint; please refer to Table 6 in the Bi-Encoder section for a benchmark with the OpenAI Embeddings endpoint. We provide parameter estimates for the OpenAI model names in Table 2. We also compare with the current state-of-the-art on BEIR [26], a BERT-based Cross-Encoder: BM25+CE consists of a pre-trained BERT model that is further fine-tuned on MS-MARCO [18] in a supervised fashion [26]. SGPT-CE consists solely of the pre-trained GPT model. However, SGPT-CE-6.1B has almost 15x more parameters than BM25+CE, significantly increasing latency. In the Re-rank Top 100 setting, the top 100 documents returned by BM25 are re-ranked by the respective model. While SGPT-CE-6.1B wins on more datasets than the encoder-based state-of-the-art, its average score is worse. This can be alleviated by not using the same prompt P_G for all datasets: we show in Section 3.2 that SGPT-CE-6.1B can beat BM25+CE on Quora by changing the prompt.

Figure 2: Scaling of SGPT-CE re-ranking performance with model size; full results are in Table 9. Dataset labels are ordered by the Max Re-rank=100 6.1B performance. The higher on the y-axis, the more the Cross-Encoder is bottlenecked by BM25's performance.

Table 3: SGPT-CE symmetric search results on Quora. The sum of log probabilities of {query} is used as the re-rank score. Overflowing tokens are truncated from the left of {doc}. Scores are nDCG@10.

In Figure 2, we investigate how performance scales with model size. As we are in a re-ranking setup, the Cross-Encoder performance is bounded by the documents returned by BM25. We provide the BM25 bounds and additional model results in Table 9. In a Re-rank Top 10 setting, the model is significantly bottlenecked by BM25; SGPT-CE-6.1B reaches around 80% of the maximum possible performance. We hence observe large jumps in performance for datasets like HotpotQA or TREC-COVID as we move to the top 100. In fact, the 0.791 nDCG@10 on TREC-COVID in Table 1 is not possible in a Re-rank Top 10 setting, as the bound is at 0.750 (see Table 9). From these results, we infer that performance scales both as we re-rank more documents and as we increase model size.

We use the same methods outlined in Section 3.1.1, but adapt the prompt for symmetric search. We show this on the example of Quora in Table 3. In Section 2, we explained why Quora is closer to symmetric than asymmetric search. By adapting the prompt, SGPT-CE-6.1B improves by 6%, outperforming all Quora results in Table 1.
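To make the prompt adaptation concrete, the snippet below swaps the framing of the template while reusing the rerank_score function from the earlier sketch. Both template pairs here are hypothetical placeholders, not the exact prompts from Table 7 or Table 3.

```python
# Hypothetical prompt pairs; the asymmetric prompt P_G and the symmetric Quora
# prompt actually used are listed in Table 7 and Table 3, respectively.
ASYM_BEFORE, ASYM_AFTER = (
    "Documents are searched to answer the question.\nDocument: ",
    "\nQuestion:",
)
SYM_BEFORE, SYM_AFTER = (
    "Questions are searched to find duplicate questions.\nQuestion: ",
    "\nDuplicate question:",
)

# The scoring function is unchanged; only the prompt differs.
asym_score = rerank_score("What causes rainbows?",
                          "Rainbows are caused by refraction of light in water droplets.",
                          before_doc=ASYM_BEFORE, after_doc=ASYM_AFTER)
sym_score = rerank_score("What causes rainbows?",
                         "How do rainbows form?",
                         before_doc=SYM_BEFORE, after_doc=SYM_AFTER)
```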
Table 4: SGPT parameter overview. Due to the removal of the final language modeling head, SGPT-BE-5.8B has 206M fewer parameters than SGPT-CE-6.1B or GPT-J-6.1B. GPT-Neo models tie the language modeling head weights with the input embeddings, hence there is no weight difference. Transformer (T.): SBERT-Base uses BERT; SGPT-125M, SGPT-1.3B and SGPT-2.7B use GPT-Neo; SGPT-5.8B uses GPT-J.

Like in Section 3.1.1, we first experiment with decoder transformers that have only gone through unsupervised pre-training. In the Bi-Encoder setting, a pooling operation is commonly applied to the model's hidden states to reduce them to a vector whose size is independent of sequence length. SBERT [23] showed that a MEAN pooling mechanism outperforms [CLS] and MAX strategies for a BERT encoder. Due to the causal attention mask in an auto-regressive decoder transformer, tokens do not attend to future tokens as they do in an encoder transformer; hence, only the last token has attended to all tokens in a sequence. To account for this information mismatch, we propose to give later tokens a higher weight using a position-weighted mean pooling method:

v = \sum_{i=1}^{S} w_i h_i, \quad w_i = \frac{i}{\sum_{j=1}^{S} j},

where S is the sequence length, h_i the i-th hidden state and v the query or document embedding. We compare weighted mean pooling with last token pooling, where the hidden state of the final token is the embedding, and with regular mean pooling.

We follow recent work [9, 12, 17] and perform supervised contrastive learning with in-batch negatives. Given matching query-document pairs {(q_i, d_i)}_{i=1}^{M}, we optimize the cost function

J(\theta) = -\frac{1}{M} \sum_{i=1}^{M} \log \frac{\exp(\tau \, \sigma(f_\theta(q_i), f_\theta(d_i)))}{\sum_{j=1}^{M} \exp(\tau \, \sigma(f_\theta(q_i), f_\theta(d_j)))},

where f_\theta is the SGPT model outputting a fixed-size vector, \sigma cosine similarity and \tau a temperature parameter set to 20 in our experiments. We use GradCache [8] to train with large batch sizes in a limited memory setting. We train on SNLI [2] and MNLI [31] and limit the model sequence length to 75 tokens during both training and inference.

For large models, we fine-tune only bias parameters and freeze the rest of the model. This has recently been proposed as BitFit [32] for BERT encoders and has been shown to be competitive with full fine-tuning in various scenarios [11, 30, 15]. Table 4 shows the number of parameters trained for BitFit models. Since gradients and optimizer states are only needed for the bias parameters, BitFit significantly reduces the GPU memory and time required per step. Further, adding a BitFit checkpoint to an instance with an existing full model only requires storing the different biases: an instance already serving a 22.5GB fp32 GPT-J-6B model requires an additional 22MB of storage to serve an SGPT-5.8B-bitfit model.
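The sketch below ties these pieces together: position-weighted mean pooling over the hidden states of a decoder, bias-only (BitFit-style) fine-tuning, and the contrastive loss with in-batch negatives. The checkpoint name is an illustrative placeholder, the BitFit selection is a simple approximation (every parameter whose name contains "bias"), and GradCache as well as the exact training hyperparameters are omitted.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-125M"  # illustrative placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # GPT tokenizers ship without a pad token
model = AutoModel.from_pretrained(MODEL_NAME)

# BitFit-style fine-tuning: only bias tensors stay trainable.
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name

def weighted_mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Position-weighted mean pooling: token i gets weight i / sum_j j, so later
    tokens, which have attended to more of the sequence, count more."""
    positions = torch.arange(1, hidden.size(1) + 1, device=hidden.device).float()
    weights = positions.unsqueeze(0) * mask          # zero out padding positions
    weights = weights / weights.sum(dim=1, keepdim=True)
    return (hidden * weights.unsqueeze(-1)).sum(dim=1)

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, max_length=75,
                      return_tensors="pt")
    hidden = model(**batch).last_hidden_state
    return weighted_mean_pool(hidden, batch["attention_mask"].float())

def contrastive_loss(queries, docs, tau: float = 20.0) -> torch.Tensor:
    """In-batch negatives: the i-th document is the positive for the i-th query;
    every other document in the batch serves as a negative."""
    q, d = embed(queries), embed(docs)
    sims = tau * F.cosine_similarity(q.unsqueeze(1), d.unsqueeze(0), dim=-1)
    labels = torch.arange(len(queries))
    return F.cross_entropy(sims, labels)
```

Only the parameters left trainable above, the biases, need to be stored as a separate checkpoint, which is what keeps the SGPT-5.8B-bitfit delta at roughly 22MB on top of a full GPT-J instance.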
Figure 3 shows average precisions on USEB [29] across different methods and layers. In the unsupervised setting, decoder transformers strongly underperform encoders. However, after fine-tuning on the same dataset with the same hyperparameters, decoders (SGPT) with 125M parameters closely trail the 110M parameter encoder (SBERT) at the 12th layer. When increasing SGPT size ten-fold, the last layer performance rises beyond that of the SBERT models. Weighted mean pooling outperforms the other pooling mechanisms for decoders.

Table 5: Results on USEB, Quora and STS-B (out-of-domain, to be contrasted with the in-domain numbers in [29]; fragments may nonetheless be in-domain due to the large pre-training data of the transformer models). SGPT-0.1B-weightedmean-nli performs 2% worse than SBERT-base-nli-v2 on USEB, but improves on Quora by 1%; note that there is still a size difference of 14% between the two models. ♦: Results from [29] except when marked with †. CQADupstack and SciDocs differ from the same-name datasets in BEIR. We provide the code for running the OpenAI similarity endpoints on USEB; feel free to message the author if you would like to fund them (around 300 USD for Curie).

Table 5 provides performance on the individual USEB datasets, Quora and STS-B. STS-B scores should not be the focus of comparison due to the drawbacks highlighted in [29]. Despite training on less than 0.1% of parameters, BitFit models are within +2 to -2% of fully fine-tuned ones. We investigate gradients in Figure 4. BitFit degrades performance more for decoders than for encoders. This could be due to the missing bias parameters (see Table 4): [32] highlights the importance of the query bias vector for BERT, which is not present in SGPT models. SGPT-5.8B-weightedmean-nli-bitfit sets an out-of-domain state-of-the-art on USEB, but is outperformed by models trained in-domain in [29]. We observed performance gains from increasing the training batch size: SGPT-5.8B-weightedmean-nli-bitfit is trained with a batch size of 1024, while other models use lower batch sizes, which partly explains the performance jump to the 5.8B model. In Table 12, we provide results for an SGPT-5.8B-weightedmean-nli-bitfit trained with a lower batch size of 48.

If not otherwise specified, we follow the same setup as in Section 4.1.1. For asymmetric search, we train on MS-MARCO [18] and limit the model sequence length to 300 tokens during both training and inference. We follow concurrent work [17] and add enclosing brackets to help the model distinguish between query and document: we embed the tokens of a query q in square brackets as [q_{0-n}] and use curly brackets for documents: {d_{0-n}}. We add the token ids of the brackets to the already tokenized text to avoid the bracket tokens intermingling with the text tokens. We refer to these special brackets as specb (see the sketch at the end of this subsection).

Table 6 benchmarks SGPT-BE-5.8B (SGPT-5.8B-weightedmean-msmarco-specb-bitfit) on BEIR [26] against:
(a) BM25 [24], a fast non-semantic baseline;
(b) SGPT-CE-6.1B from Section 3;
(c) BM25+CE [26], the current overall state-of-the-art on BEIR;
(d) TAS-B [10], the original Bi-Encoder state-of-the-art on BEIR;
(e) Contriever [12], a training scheme similar to [17] but using an encoder transformer;
(f) GTR-XXL [19], the current Bi-Encoder state-of-the-art on BEIR, with 4.8 billion parameters and the BERT-like encoder transformer of T5 [22];
(g) cpt-text, a GPT-like decoder transformer architecture concurrently proposed in [17].
Corresponding parameter estimates are in Table 2.

SGPT-5.8B achieves the best average nDCG@10 both on the BEIR subset selected in [17] and on the full BEIR benchmark. It outperforms the roughly same-sized cpt-text-L and the 30x larger cpt-text-XL by 8.1% and 4.2%, respectively. Yet, the cpt-text models have gone through an additional unsupervised training stage [17] and are fully trained. SGPT-BE-5.8B fine-tunes just 700K parameters, 0.0004% of the parameters fine-tuned for cpt-text-XL [17]; see Table 2 for sizes. We suspect much of the difference comes from the cpt-text models' inferior last token pooling, as shown in Figure 3. SGPT-BE-5.8B improves on the overall state-of-the-art, a Cross-Encoder, by 3%. It improves on the previously best sentence embeddings (Bi-Encoder) on BEIR, GTR-XXL, by 7%. However, these improvements come at a significant cost. GTR-XXL has 20% fewer parameters and its embeddings have 768 dimensions, while SGPT-BE-5.8B produces embeddings with 4096 dimensions, hence requiring about 5x more storage. It took the model six days on one Nvidia A100 GPU to encode the entire BioASQ corpus of 15M documents averaging 200 words each [26]. Its comparatively low performance on BioASQ may be improved by increasing the sequence length limit beyond 300 tokens, at the cost of additional compute. For SGPT-CE-6.1B, the sequence length limit was 2048 for the combined prompt on all datasets.
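A minimal sketch of how the resulting Bi-Encoder is used for asymmetric search follows: queries and documents are wrapped in the specb bracket token ids after tokenization, documents are embedded once and cached, and each incoming query needs a single forward pass plus a cosine-similarity scan. It reuses the tokenizer, model and weighted_mean_pool from the earlier pooling sketch; the bracket handling and helper names are illustrative assumptions, not the exact implementation in the repository.

```python
import torch
import torch.nn.functional as F

# specb: bracket token ids are appended after tokenization so that they never
# merge with adjacent text tokens. The exact bracket tokens are an assumption.
QUE_BOS, QUE_EOS = tokenizer.encode("["), tokenizer.encode("]")
DOC_BOS, DOC_EOS = tokenizer.encode("{"), tokenizer.encode("}")

@torch.no_grad()
def specb_embed(text: str, is_query: bool, max_length: int = 300) -> torch.Tensor:
    ids = tokenizer.encode(text, truncation=True, max_length=max_length - 2)
    ids = (QUE_BOS + ids + QUE_EOS) if is_query else (DOC_BOS + ids + DOC_EOS)
    input_ids = torch.tensor([ids])
    mask = torch.ones_like(input_ids, dtype=torch.float)
    hidden = model(input_ids=input_ids, attention_mask=mask).last_hidden_state
    return weighted_mean_pool(hidden, mask)[0]

# Documents are embedded once and cached; each query is a single forward pass.
corpus = ["The Eiffel Tower is in Paris.", "COVID-19 is caused by SARS-CoV-2."]
doc_embeddings = torch.stack([specb_embed(d, is_query=False) for d in corpus])

query_embedding = specb_embed("what virus causes covid?", is_query=True)
scores = F.cosine_similarity(query_embedding.unsqueeze(0), doc_embeddings, dim=-1)
print(corpus[int(scores.argmax())])
```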
The high performance of SGPT models on TREC-COVID could be due to the different pre-training datasets. The SGPT pre-training dataset, The Pile [7], contains data until mid-2020, which may give the models an information advantage on COVID-19.

Lastly, we highlight that on Quora, SGPT-BE-5.8B-msmarco is outperformed by SGPT-BE-5.8B-nli from Table 5. Given our classification of Quora as a symmetric search task in Section 2, this supports our overall distinction between asymmetric and symmetric search. We advise users of our models to classify their task as symmetric or asymmetric and use the appropriate model. For embedding tasks that cannot be clearly classified, both may work, but we recommend experimenting with embeddings from the symmetric models in Section 4.1 first.

This work presented SGPT. Building on SBERT, we proposed modifications to GPT models to use them as Cross- or Bi-Encoders for semantic search.

SGPT-BE uses position-weighted mean pooling and fine-tuning of only bias tensors. At scale, it produces new state-of-the-art sentence embeddings. The model can be used for semantic search or other embedding tasks. We recommend using SGPT-BE-5.8B when compute and storage are highly available and maximum performance is desired.

SGPT-CE extracts log probabilities from pre-trained GPT models to produce unsupervised state-of-the-art search results. The setup presented can only be used for semantic search. Storage can be limited, but compute should be highly available for SGPT-CE-6.1B. The prompt and the max re-rank parameter can be adjusted depending on performance and latency requirements.

Future research could fine-tune a GPT Cross-Encoder on MS-MARCO, similar to the BM25+CE model. We suspect that this should outperform the presented non-fine-tuned SGPT-CE model, as well as SGPT-BE if enough documents are re-ranked. Further, the combination of SGPT with GPT for generative search results could be interesting: SGPT embeddings could possibly be injected into GPT models to generate answers. Lastly, a detailed study of the disadvantages of the missing biases in large GPT models could help decide whether to include them in the training of future large language models.

Table 11: SGPT-BE experiments on a subset of BEIR: the 5 smallest datasets by corpus size. The best checkpoint for all models is displayed. specb=special brackets. bs=batch size. bitfitwte=BitFit + Word Token Embeddings are trained; the idea was to help the model learn the special role of the brackets, but it did not help. asym=Two-tower model with separate transformers for queries and documents. SGPT-125M-weightedmean-msmarco-specb performs 3% worse than SBERT-base-msmarco on average. SGPT-125M-weightedmean-msmarco-specb-bitfit performs 1% worse than SBERT-base-msmarco-bitfit on average. Interestingly, Curie beats SGPT-5.8B on this subset, but not on the bigger subset in Table 6. Scores are nDCG@10.

Table 12: Additional results on USEB, Quora and STS-B. Metrics are average precision for USEB, nDCG@10 for Quora and Spearman correlation for STS-B. bf=BitFit. bs=batch size. OOD=Out-of-domain, to contrast these numbers with the in-domain numbers in [29]; however, fragments may be in-domain due to the large pre-training data of the transformer models. CQADupstack and SciDocs differ from the same-name datasets in BEIR.
References

Tri Songz, Phil Wang, and Samuel Weinbach. 2021. GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch.
A large annotated corpus for learning natural language inference.
Universal sentence encoder.
BERT: Pre-training of deep bidirectional transformers for language understanding.
SPLADE v2: Sparse lexical and expansion model for information retrieval.
The Pile: An 800GB dataset of diverse text for language modeling.
Scaling deep contrastive learning batch size under memory limited setup.
SimCSE: Simple contrastive learning of sentence embeddings.
Efficiently teaching an effective dense retriever with balanced topic aware sampling.
LoRA: Low-rank adaptation of large language models.
Towards Unsupervised Dense Information Retrieval with Contrastive Learning.
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model.
ColBERT: Efficient and effective passage search via contextualized late interaction over BERT.
Cutting down on prompts and parameters: Simple few-shot learning with language models.
WWW'18 open challenge: Financial opinion mining and question answering.
Text and Code Embeddings by Contrastive Pre-Training.
MS MARCO: A human generated machine reading comprehension dataset.
Improving language understanding by generative pre-training.
Language models are unsupervised multitask learners.
Exploring the limits of transfer learning with a unified text-to-text transformer.
Sentence-BERT: Sentence embeddings using siamese BERT-networks.
The probabilistic relevance framework: BM25 and beyond.
Augmented SBERT: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks.
Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models.
Attention is all you need.
GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model.
Nils Reimers and Iryna Gurevych. 2021. TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning.
Ahmed Hassan Awadallah and Jianfeng Gao. 2021. LiST: Lite Self-training Makes Efficient Few-shot Learners.
A broad-coverage challenge corpus for sentence understanding through inference.
BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models.

We thank Constantin Eichenberg and Samuel Weinbach for insightful discussions and valuable feedback throughout the project. We thank Robert Baldock, Marco Bellagente and Koen Oostermeijer for reading drafts of this paper. This work has been supported by OpenAI under the academic access program.