key: cord-0210362-w3k0s1ga authors: Meng, Zaiqiao; Liu, Fangyu; Shareghi, Ehsan; Su, Yixuan; Collins, Charlotte; Collier, Nigel title: Rewire-then-Probe: A Contrastive Recipe for Probing Biomedical Knowledge of Pre-trained Language Models date: 2021-10-15 journal: nan DOI: nan sha: fff229da5a4fea06f99210e846bd28ecabe859b8 doc_id: 210362 cord_uid: w3k0s1ga Knowledge probing is crucial for understanding the knowledge transfer mechanism behind pre-trained language models (PLMs). Despite the growing progress of probing knowledge for PLMs in the general domain, specialised areas such as the biomedical domain are vastly under-explored. To catalyse research in this direction, we release a well-curated biomedical knowledge probing benchmark, MedLAMA, which is constructed from the Unified Medical Language System (UMLS) Metathesaurus. We test a wide spectrum of state-of-the-art PLMs and probing approaches on our benchmark, reaching at most 3% acc@10. While highlighting various sources of domain-specific challenges that contribute to this underwhelming performance, we illustrate that the underlying PLMs have a higher potential for probing tasks. To demonstrate this, we propose Contrastive-Probe, a novel self-supervised contrastive probing approach that adjusts the underlying PLMs without using any probing data. While Contrastive-Probe pushes acc@10 to 28%, the performance gap remains notable. Our human expert evaluation suggests that the probing performance of Contrastive-Probe is still under-estimated, as UMLS does not cover the full spectrum of factual knowledge. We hope MedLAMA and Contrastive-Probe facilitate further development of better-suited probing techniques for this domain.

Pre-trained language models (PLMs; Devlin et al. 2019) have orchestrated incredible progress on a myriad of few- or zero-shot language understanding tasks, by pre-training model parameters in a task-agnostic way and transferring knowledge to specific downstream tasks via fine-tuning (Brown et al., 2020; Petroni et al., 2021). To better understand the underlying knowledge transfer mechanism behind these achievements, many knowledge probing approaches and benchmark datasets have been proposed (Petroni et al., 2019; Jiang et al., 2020a; Zhong et al., 2021). This is typically done by formulating knowledge triples as cloze-style queries with the objects being masked (see Table 1) and using the PLM to fill the single (Petroni et al., 2019) or multiple (Ghazvininejad et al., 2019) [Mask] token(s) without further fine-tuning. In parallel, it has been shown that specialised PLMs (e.g., BioBERT; Lee et al. 2020, BlueBERT; Peng et al. 2019 and PubMedBERT; Gu et al. 2020) substantially improve performance on several biomedical tasks (Gu et al., 2020). The biomedical domain is an interesting testbed for investigating knowledge probing because of its unique challenges (including vocabulary size and multi-token entities) and the practical benefit of potentially disposing of the expensive knowledge base construction process. However, research on knowledge probing in this domain is largely under-explored. To facilitate research in this direction, we present a well-curated biomedical knowledge probing benchmark, MedLAMA, that consists of 19 thoroughly selected relations. Each relation contains 1k queries (19k queries in total, with at most 10 answers each), which are extracted from the large UMLS (Bodenreider, 2004) biomedical knowledge graph and verified by domain experts.
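To make the query construction concrete, the following is a minimal sketch of how a knowledge triple plus a relation prompt can be rendered as a cloze-style query of the kind shown in Table 1. The template strings, relation names and example entities below are illustrative placeholders, not the actual MedLAMA prompts (those are listed in Table 2).

```python
from typing import Dict, List

# Hypothetical prompt templates, one per relation; [X] is the subject slot and
# [Y] the object slot that gets masked. The real MedLAMA prompts differ.
PROMPTS: Dict[str, str] = {
    "may_treat": "[X] may be used to treat [Y].",
    "disease_has_normal_tissue_origin": "The normal tissue origin of [X] is [Y].",
}

def build_query(subj: str, relation: str, answers: List[str],
                mask_token: str = "[MASK]") -> dict:
    """Render a (subject, relation, objects) triple as a cloze-style probing query."""
    template = PROMPTS[relation]
    query = template.replace("[X]", subj).replace("[Y]", mask_token)
    return {"query": query, "answers": answers}

# Example with made-up entities:
q = build_query("aspirin", "may_treat", ["headache"])
print(q["query"])   # -> "aspirin may be used to treat [MASK]."
```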
We use automatic metrics to identify hard examples based on how readily their query tokens expose the answers. See Table 1 for a sample of easy and hard examples from MedLAMA. A considerable challenge of probing in the biomedical domain is handling the multi-token encoding of answers (e.g., in MedLAMA only 2.6% of the answers are single-token, whereas in the English set of mLAMA (Kassner et al., 2021) 98% are single-token), a setting in which all existing approaches (i.e., mask predict (Petroni et al., 2019), retrieval-based, and generation-based (Gao et al., 2020)) struggle to be effective. (Prompt-based probing approaches such as AutoPrompt (Shin et al., 2020a), SoftPrompt (Qin and Eisner, 2021), and OptiPrompt (Zhong et al., 2021) need additional labelled data for fine-tuning prompts; we restrict the scope of our investigation to methods that do not require task data.) For example, the mask predict approach (Jiang et al., 2020a), which performs well in probing multilingual knowledge, achieves less than 1% accuracy on MedLAMA. To address the aforementioned challenge, we propose a new method, Contrastive-Probe, that first adjusts the representation space of the underlying PLMs using a retrieval-based contrastive learning objective (akin to 'rewiring' the switchboard to the target appliances; Liu et al. 2021c) and then retrieves answers based on the similarity of their representations to the queries. Notably, our Contrastive-Probe does not require the MLM heads during probing, which avoids the vocabulary bias across different models. Additionally, retrieval-based probing is effective at addressing the multi-token challenge, as it avoids the need to generate multiple tokens from the MLM vocabulary. We show that Contrastive-Probe achieves absolute improvements of up to ∼5% and ∼21% on acc@1 and acc@10 probing performance compared with the existing approaches. We further highlight that the knowledge elicited by Contrastive-Probe is not gained from the additional random sentences but from the original pre-trained parameters, which echoes the previous findings of Liu et al. (2021b); Glavaš and Vulić (2021); Su et al. (2021, 2022). Additionally, we demonstrate that different state-of-the-art PLMs and transformer layers are suited to different types of relational knowledge, and that different relations require different depths of tuning, suggesting that both the layers and the tuning depth should be considered when infusing knowledge over different relations. Furthermore, expert evaluation of PLM responses on a subset of MedLAMA highlights that expert-crafted resources such as UMLS still do not include the full spectrum of factual knowledge, indicating that the factual information encoded in PLMs is richer than what is reflected by the automatic evaluation. The findings of our work, along with the proposed MedLAMA and Contrastive-Probe, highlight both the unique challenges of the biomedical domain and the unexploited potential of PLMs. We hope our research sheds light on what domain-specialised PLMs capture and how it can be better resurfaced, at minimum cost, for probing.

To facilitate research on knowledge probing in the biomedical domain, we create the MedLAMA benchmark based on the largest biomedical knowledge graph, UMLS (Bodenreider, 2004). UMLS is a comprehensive metathesaurus containing 3.6 million entities and more than 35.2 million knowledge triples over 818 relation types, which are integrated from various ontologies, including SNOMED CT, MeSH and the NCBI taxonomy.
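As a rough illustration of how name triples of the kind used in MedLAMA can be pulled out of a local UMLS installation, here is a hedged sketch based on the standard UMLS RRF distribution (MRCONSO.RRF for concept names, MRREL.RRF for relations). The column positions are assumptions that should be verified against the documentation of the specific UMLS release, and the grouping step simply mirrors the at-most-10-answers constraint described below.

```python
from collections import defaultdict

def load_english_names(mrconso_path: str) -> dict:
    """Map each CUI to one English name (assumed columns: CUI=0, LAT=1, STR=14)."""
    names = {}
    with open(mrconso_path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("|")
            cui, lang, name = cols[0], cols[1], cols[14]
            if lang == "ENG" and cui not in names:
                names[cui] = name
    return names

def extract_triples(mrrel_path: str, names: dict, wanted_relations: set):
    """Yield (subject, relation, object) names (assumed columns: CUI1=0, CUI2=4, RELA=7)."""
    with open(mrrel_path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("|")
            cui1, cui2, rela = cols[0], cols[4], cols[7]
            if rela in wanted_relations and cui1 in names and cui2 in names:
                yield names[cui1], rela, names[cui2]

def group_queries(triples, max_answers: int = 10) -> dict:
    """Group 1-to-N triples into queries and keep only those with at most 10 answers."""
    grouped = defaultdict(list)
    for subj, rel, obj in triples:
        grouped[(subj, rel)].append(obj)
    return {key: objs for key, objs in grouped.items() if len(objs) <= max_answers}
```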
Creating a LAMA-style (Petroni et al., 2019) probing benchmark from such a knowledge graph poses its own challenges: (1) UMLS is a collection of knowledge graphs built from more than 150 ontologies constructed by different organisations with very different schemata and emphases; (2) a significant number of entity names (from certain vocabularies) are not natural language (e.g., t(8;21)(q22;q22), denoting an observed karyotypic abnormality) and can hardly be understood by existing PLMs, whose tokenisation is tailored to natural language; (3) some queries (constructed from knowledge triples) can have up to hundreds of answers (i.e., 1-to-N relations), complicating the interpretation of probing performance; and (4) some queries may expose the answer within themselves (i.e., the answer appears inside the query), making it challenging to interpret relative accuracy scores.

Selection of Relationship Types. In order to obtain high-quality knowledge queries, we conducted multiple rounds of manual filtering at the relation level to exclude uninformative relations, or relations that are only important in the ontological context but do not carry interesting semantics as natural language (e.g., taxonomy and measurement relations). We also excluded relations with insufficient triples/entities. Then, we manually checked the knowledge triples of each relation to filter out those containing unnatural-language entities and to ensure that their queries are semantically meaningful. Additionally, for 1-to-N relations where there are multiple gold answers for the same query, we constrained all queries to contain at most 10 gold answers. These steps resulted in 19 relations, each containing 1k randomly sampled knowledge queries. Table 2 shows the detailed relation names and their corresponding prompts.

Easy vs. Hard Queries. Recent works (Poerner et al., 2020; Shwartz et al., 2020) have discovered that PLMs are overly reliant on the surface form of entities to guess the correct answer of a knowledge query. The PLMs "cheat" by detecting lexical overlaps between the query and answer surface forms instead of exercising their ability to predict factual knowledge. For instance, PLMs can easily deal with a triple whose answer string is contained in the query. To mitigate such bias, we also create a hard query set for each relation by selecting a subset of its corresponding 1k queries using token-matching metrics (i.e., exact matching and ROUGE-L; Lin and Och, 2004). For more details see the Appendix. We refer to the filtered and original queries as the hard sets and full sets, respectively.

While the pioneering works on PLM knowledge probing mainly focused on single-token entities, many recent works have started exploring solutions for the multi-token scenario (Jiang et al., 2020a; De Cao et al., 2021). These knowledge probing approaches can be categorised, based on their answer search space and their reliance on the MLM head, into three categories: mask predict, generation-based, and retrieval-based.

Table 3: Categorisation of probing approaches by type (MP: mask predict; GB: generation-based; RB: retrieval-based) and answer search space.
Fill-mask (Petroni et al., 2019) | MP | PLM vocabulary
X-FACTR (Jiang et al., 2020a) | MP | PLM vocabulary
Generative PLMs | GB | PLM vocabulary
Mask average | RB | KG entities
Contrastive-Probe (Ours) | RB | KG entities

Mask predict. This family of approaches probes knowledge in masked PLMs (e.g., BERT) by using the MLM head to fill a single mask token in a cloze-style query, with the output token restricted to the PLM vocabulary (Petroni et al., 2019).
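The following is a minimal sketch of this single-[MASK] fill-mask probe using the HuggingFace transformers API; the model name and the example query are placeholders, and accuracy computation against the gold answers is omitted.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-uncased"   # or any biomedical BERT that ships an MLM head
tok = AutoTokenizer.from_pretrained(model_name)
mlm = AutoModelForMaskedLM.from_pretrained(model_name).eval()

@torch.no_grad()
def topk_single_token(query: str, k: int = 10):
    """Return the k most probable vocabulary tokens for the single [MASK] slot."""
    inputs = tok(query, return_tensors="pt")
    logits = mlm(**inputs).logits                      # (1, seq_len, vocab_size)
    mask_pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero(as_tuple=True)[0][0]
    top = logits[0, mask_pos].softmax(dim=-1).topk(k)
    return [tok.decode([int(i)]).strip() for i in top.indices]

# Illustrative query (not a MedLAMA prompt):
print(topk_single_token(f"Aspirin may be used to treat {tok.mask_token} ."))
```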
Since many real-world entity names are encoded with multiple tokens, the mask predict approach has also been extended to predict multi-token answers using a conditional masked language model (Jiang et al., 2020a; Ghazvininejad et al., 2019). Figure 2(a) shows the prediction process. Specifically, given a query, the probing task is formulated as: 1) filling masks in parallel independently (Independent); 2) filling masks from left to right autoregressively (Order); or 3) filling tokens greedily in order of maximum confidence (Confidence). After all mask tokens are replaced with the initial predictions, the predictions can be further refined by iteratively modifying one token at a time until convergence or until the maximum number of iterations is reached (Jiang et al., 2020a). For example, Order+Order denotes that the answers are initially predicted by Order and then refined by Order. In this paper we examine two of these approaches, i.e. Independent and Order+Order, based on our initial exploration.

Generation-based. Recently, many generation-based PLMs have been presented for text generation tasks, such as BART (Lewis et al., 2020) and T5 (Raffel et al., 2020). These generative PLMs are trained with a de-noising objective to restore the original form of a corrupted input autoregressively (Lewis et al., 2020; Raffel et al., 2020). Such an autoregressive generation process is analogous to the Order probing approach; thus, generative PLMs can be directly used to generate answers for each query. Specifically, we use the cloze-style query with a single [Mask] token as the model input. The model then predicts the answer entities that correspond to the [Mask] token in an autoregressive manner. An illustration is provided in Figure 2(b).

Retrieval-based. Mask predict and generation-based approaches use the PLM vocabulary as their search space for answer tokens, and may therefore generate answers that are not in the answer set. In particular, when probing masked PLMs through their MLM heads, the predicted result might not be a good indicator of the amount of knowledge captured by these PLMs. This is mainly because the MLM head is eventually dropped during downstream task fine-tuning, while it normally accounts for more than 20% of the total PLM parameters. Alternatively, retrieval-based probing is applied to address this issue. Instead of generating answers from the PLM vocabulary, the retrieval-based approach finds answers by ranking the candidate entities of the knowledge graph based on the query and entity representations, or on entity generation scores. To probe PLMs on MedLAMA, we use mask average, an approach that ranks candidates by the average log-probability of an entity's individual tokens. Retrieval-based approaches address the multi-token issue by restricting the output space to the valid answer set and can be used to probe knowledge in different types of PLMs (e.g., BERT vs. fastText). However, previous works only report results based on a type-restricted candidate set (e.g., entities of the queried relation), a setting whose performance we observed to decay drastically under the full entity set.
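To make the mask-average baseline above concrete, here is a simplified sketch under stated assumptions: the query's single mask placeholder is expanded to as many mask tokens as the candidate entity has word pieces, and the candidate is scored by the average log-probability of its tokens at those positions. The model name, query and candidate list are placeholders, and this is a simplified reading of the approach rather than a reference implementation.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def mask_average_score(query: str, candidate: str) -> float:
    """Average log-probability of the candidate's tokens in the expanded mask slots."""
    cand_ids = tok(candidate, add_special_tokens=False)["input_ids"]
    expanded = query.replace(tok.mask_token, " ".join([tok.mask_token] * len(cand_ids)))
    enc = tok(expanded, return_tensors="pt")
    log_probs = mlm(**enc).logits.log_softmax(dim=-1)[0]
    positions = (enc["input_ids"][0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
    scores = [log_probs[pos, tid].item() for pos, tid in zip(positions, cand_ids)]
    return sum(scores) / len(scores)

def rank_candidates(query: str, candidates: list, k: int = 10) -> list:
    """Rank knowledge-graph entity names by their mask-average score."""
    return sorted(candidates, key=lambda c: mask_average_score(query, c), reverse=True)[:k]

# Illustrative usage with made-up candidates:
query = f"Aspirin may be used to treat {tok.mask_token} ."
print(rank_candidates(query, ["headache", "myocardial infarction", "asthma"], k=3))
```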
To address the above challenges, we propose Contrastive-Probe, which pre-trains on a small number of sentences sampled from the PLM's original pre-training corpora with a contrastive self-supervised objective, inspired by Mirror-BERT (Liu et al., 2021b). Our contrastive pre-training does not require the MLM head or any additional external knowledge, and can be completed in less than one minute on 2 × 2080Ti GPUs.

Self-supervised Contrastive Rewiring. We randomly sample a small set of sentences (e.g., 10k; see §5.2 for a stability analysis of Contrastive-Probe over several randomly sampled sets) and replace their tail tokens (e.g., the last 50%, excluding the full stop) with a [Mask] token. These transformed sentences are then taken as the queries of a cloze-style self-retrieving game. For example, a sentence whose tail span "reduces coronavirus infections" is replaced by a [Mask] token becomes a cloze-style query, and the removed span "reduces coronavirus infections" is marked as the positive answer of that query. Given a batch, the cloze-style self-retrieving game asks the PLM to retrieve the positive answer from among all the queries and answers in the same batch. Our Contrastive-Probe tackles this by optimising an InfoNCE objective (Oord et al., 2018):

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\big(\cos(f(x_i), f(x_p))/\tau\big)}{\sum_{x_j \in \mathcal{N}_i}\exp\big(\cos(f(x_i), f(x_j))/\tau\big)},$$

where cos(·, ·) is the cosine similarity, f(·) is the PLM encoder (with the MLM head chopped off and [CLS] used as the contextual representation), N is the batch size, x_i and x_p form a query-answer pair (i.e., x_i and x_p come from the same sentence), N_i contains the queries and answers in the batch, and τ is the temperature. This objective encourages f to create similar representations for query-answer pairs from the same sentence and dissimilar representations for queries/answers belonging to different sentences.

Retrieval-based Probing. At the probing step, a query is created from the prompt-based template of each knowledge triple (with the object replaced by a [Mask] token), and the rewired PLM retrieves answers by ranking all candidate entity names according to the similarity between their representations and the query representation.
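A condensed sketch of the two-step recipe follows, under stated assumptions: the sentences, hyperparameters and model name are placeholders, only the other answers in the batch are used as negatives (a simplification of the objective above, whose negative set also includes the other queries), and all entity names are encoded in one go rather than in batches.

```python
import random
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # placeholder PLM
encoder = AutoModel.from_pretrained("bert-base-uncased")   # MLM head is not loaded

def make_pair(sentence: str, mask_ratio: float = 0.5):
    """Split a sentence into a masked query (head + [MASK]) and its positive answer (tail)."""
    words = sentence.rstrip(".").split()
    cut = max(1, int(len(words) * (1 - mask_ratio)))
    return " ".join(words[:cut]) + f" {tok.mask_token}.", " ".join(words[cut:])

def cls_embed(texts):
    enc = tok(list(texts), padding=True, truncation=True, return_tensors="pt")
    return encoder(**enc).last_hidden_state[:, 0]           # [CLS] vectors

def rewire(sentences, steps=200, batch_size=32, tau=0.04, lr=2e-5):
    """Self-supervised contrastive 'rewiring' with an in-batch InfoNCE-style loss."""
    opt = torch.optim.AdamW(encoder.parameters(), lr=lr)
    for _ in range(steps):
        queries, answers = zip(*(make_pair(s) for s in random.sample(sentences, batch_size)))
        q = F.normalize(cls_embed(queries), dim=-1)
        a = F.normalize(cls_embed(answers), dim=-1)
        sim = q @ a.t() / tau                                # (B, B); positives on the diagonal
        loss = F.cross_entropy(sim, torch.arange(sim.size(0)))
        opt.zero_grad(); loss.backward(); opt.step()

@torch.no_grad()
def probe(query: str, entity_names, k: int = 10):
    """Retrieve answers by cosine similarity between the query and all KG entity names."""
    q = F.normalize(cls_embed([query]), dim=-1)
    e = F.normalize(cls_embed(entity_names), dim=-1)
    top = (q @ e.t()).squeeze(0).topk(k)
    return [entity_names[int(i)] for i in top.indices]
```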
In this section we conduct extensive experiments to verify whether Contrastive-Probe is effective for probing biomedical PLMs. First, we experiment with Contrastive-Probe and existing probing approaches on the MedLAMA benchmark (§5.1). Then, we conduct an in-depth analysis of the stability and applicability of Contrastive-Probe in probing biomedical PLMs (§5.2). Finally, we report an evaluation by a biomedical expert on the probing predictions and highlight our findings (§5.3).

Contrastive-Probe Rewiring. We train our Contrastive-Probe on 10k sentences randomly sampled from PubMed texts (a PubMed corpus used in the pre-training of BlueBERT; Peng et al., 2019), using a mask ratio of 0.5. The best hyperparameters and their tuning options are provided in the Appendix.

Probing Baselines. For the mask predict approach, we use the original implementation of X-FACTR (Jiang et al., 2020a) and set the beam size and the number of masks to 5. Both mask predict and retrieval-based approaches are tested with general-domain and biomedical-domain BERT models, i.e., BERT-base-uncased (Devlin et al., 2019), BlueBERT (Peng et al., 2019), BioBERT (Lee et al., 2020) and PubMedBERT (Gu et al., 2020). (The MLM head of PubMedBERT is not publicly available, so PubMedBERT cannot be evaluated with X-FACTR or mask average.) For generation-based baselines, we test five PLMs, namely BART-base (Lewis et al., 2020), T5-small and T5-base (Raffel et al., 2020), which are general-domain generation PLMs, and SciFive-base & SciFive-large (Phan et al., 2021), which are pre-trained on large biomedical corpora.

Comparing Various Probing Approaches. Table 4 shows the overall results of the various probing baselines on MedLAMA. It can be seen that the performance of all the existing probing approaches (i.e., generative PLMs, X-FACTR and mask predict) is very low (<1% acc@1 and <4% acc@10) regardless of the underlying PLM, and is therefore not an effective indicator of the knowledge captured. In contrast, our Contrastive-Probe obtains absolute improvements of up to ∼5% and ∼21% on acc@1 and acc@10 respectively compared with the three existing approaches, which validates its effectiveness at measuring knowledge probing performance. In particular, the PubMedBERT model obtains the best probing performance (5.71% accuracy) on these biomedical queries, confirming that it captures biomedical knowledge more effectively than the other PLMs (i.e., BERT, BlueBERT and BioBERT).

Benchmarking with Contrastive-Probe. To further examine the effectiveness of PLMs in capturing biomedical knowledge, we benchmarked several state-of-the-art biomedical PLMs (including purely pre-trained and knowledge-enhanced models) on MedLAMA using our Contrastive-Probe. Table 5 shows the probing results over the full and hard sets. In general, we observe that these biomedical PLMs always perform better than the general-domain PLM (i.e., BERT). We also observe a decay in performance for all these models on the more challenging hard-set queries. While PubMedBERT performs best among the purely pre-trained models, SapBERT (Liu et al., 2021a) and CoderBERT (Yuan et al., 2020), which are knowledge-infused variants of PubMedBERT, further push performance to 8% and 30.41% on the acc@1 and acc@10 metrics respectively, highlighting the benefits of knowledge-infusion pre-training.

Comparison per Answer Length. Since different PLMs use different tokenizers, we use the character length of the query answers to split MedLAMA into bins and test the probing performance over various answer lengths. Figure 3 shows the result. We can see that the performance of the retrieval-based probing in Contrastive-Probe increases as the answer length increases, while the performance of mask predict drops significantly. This result shows that our Contrastive-Probe (retrieval-based) is more reliable at predicting longer answers than the mask predict approach, since the latter heavily relies on the MLM head.

Since our Contrastive-Probe involves many hyperparameters and stochastic factors during self-retrieving pre-training, it is critical to verify whether it behaves consistently under (1) different randomly sampled sentence sets; (2) different types of relations; and (3) different pre-training steps.

Stability of Contrastive-Probe. To conduct this verification, we sampled 10 different sets of 10k sentences from the PubMed corpus and probed the PubMedBERT model using our Contrastive-Probe on the full set. Figure 4 shows the acc@1 performance over the top 9 relations and the micro-average performance over all 19 relations. We can see that the standard deviations are small and the performance over different sets of samples shows a similar trend. This further highlights that the probing success of Contrastive-Probe is not due to the selected pre-training sentences. Intuitively, the contrastive self-retrieving game (§4) is equivalent in formulation to the cloze-style filling task; hence, tuning the underlying PLMs makes them better suited to the knowledge elicitation needed during probing (like 'rewiring' the switchboards). Additionally, from Figure 4 we also observe that different relations exhibit very different trends over the pre-training steps of Contrastive-Probe and peak at different steps, suggesting that different types of relational knowledge need to be treated with different tuning depths when infusing knowledge. We leave further exploration of this to future work.

Probing by Relations.
To further analyse the probing variance over different relations, we also plot the probing performance of various PLMs over the different relations of MedLAMA in Figure 5. We observe that different PLMs exhibit different performance rankings over different types of relational knowledge (e.g., BlueBERT peaks at relation 12 while PubMedBERT peaks at relation 3). This result demonstrates that different PLMs are suited to different types of relational knowledge. We speculate this to be reflective of their training corpora.

Probing by Layer. To investigate how much knowledge is stored in each Transformer layer, we chopped off the last layers of the PLMs and applied Contrastive-Probe to evaluate the probing performance based on the first L ∈ {3, 5, 7, 9, 11, 12} layers on MedLAMA. In general, we can see in Figure 6 that model performance drops significantly after chopping off the last 3 layers, while accuracy remains high when dropping only the last layer. In Figure 7, we further plot the layer-wise probing performance of PubMedBERT over different relations. Surprisingly, we find that different relations do not show the same probing performance trends over layers. For example, with only the first 3 layers, PubMedBERT achieves its best accuracy (>15%) on relation 11 queries. This result demonstrates that both relation types and PLM layers are confounding variables in capturing factual knowledge, which helps to explain the differences in optimal training steps across relations in Figure 4. This result also suggests that layer-wise and relation-wise training could be the key to effectively infusing factual knowledge into PLMs.

To assess whether the actual probing performance could be higher than what is reflected by the commonly used automatic evaluation, we conducted a human evaluation on the prediction results. Specifically, we sample 15 queries, predict their top-10 answers using Contrastive-Probe based on PubMedBERT, and ask the assessor to rate the predictions on a scale of [1, 5]. Figure 8 shows the confusion matrices. We observe the following: (1) There are 3 UMLS answers annotated with a score in the 1-4 range (precisely, level 3), which indicates that UMLS answers might not always be perfect answers. (2) There are 20 annotated perfect answers (score 5) in the top 10 predictions that are not marked as gold answers in UMLS, which suggests that UMLS does not include all the expected gold knowledge. (3) In general, PubMedBERT achieves an 8.67% (13/150) acc@10 under the gold answers, but under the expert annotation the acc@10 is 22% (33/150), which means the probing performance is higher than what is measured using the automatically extracted answers.

During the writing of this work, we noticed a concurrent work to ours that also released a biomedical knowledge probing benchmark, called BioLAMA (Sung et al., 2021). In Table 6, we compare MedLAMA with LAMA (Petroni et al., 2019) and BioLAMA in terms of data statistics. We found that there is only one overlapping relation (i.e., may treat) between BioLAMA and MedLAMA, and no overlap in queries. We can see that, without additional training data from biomedical knowledge facts, Contrastive-Probe reaches a promising performance compared with the OptiPrompt approach, which requires further training data. Additionally, since Mask Predict and OptiPrompt require the MLM head, it is impossible to evaluate models whose MLM head is not released (e.g., PubMedBERT). In contrast, our Contrastive-Probe not only provides a good indicator for comparing these models in terms of their captured knowledge, but also makes layer-wise knowledge probing possible.
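As an illustration of the layer-wise set-up behind Figures 6 and 7, the following is a minimal sketch of keeping only the first L transformer layers of a BERT-style encoder before applying Contrastive-Probe. The attribute names follow the HuggingFace BERT implementation (model.encoder.layer), and the model identifier is the commonly used PubMedBERT checkpoint name on the model hub; both are assumptions to verify for other architectures.

```python
from transformers import AutoModel

def truncate_encoder(model, num_layers: int):
    """Keep only the first `num_layers` transformer layers of a BERT-style encoder."""
    model.encoder.layer = model.encoder.layer[:num_layers]   # nn.ModuleList slice
    model.config.num_hidden_layers = num_layers
    return model

pubmedbert = AutoModel.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")
first_9_layers = truncate_encoder(pubmedbert, 9)
# first_9_layers can now be rewired and probed in the same way as the full model.
```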
How to early stop? For a fair comparison of different PLMs, we currently use checkpoints after contrastive tuning for a fixed number of steps (200, specifically). However, we have noticed that different models and different probing datasets have different optimal training steps. To truly 'rewire' the most knowledge out of each PLM, we need a unified validation set for checkpoint selection. What this validation set should be and how to guarantee its fairness require further investigation.

Performance not very stable. We have noticed that using different contrastive tuning corpora as well as different random seeds can lead to a certain variance in probing performance (see Table 5). To mitigate this issue, we use the average performance of 10 runs on 10 randomly sampled corpora. Improving the stability of Contrastive-Probe and investigating its nature is a future challenge.

Knowledge Probing Benchmarks for PLMs. LAMA (Petroni et al., 2019), which started this line of work, is a collection of single-token knowledge triples extracted from sources including Wikidata and ConceptNet (Speer et al., 2017). To mitigate the problem of information leakage from the head entity, Poerner et al. (2019) propose LAMA-UHN, a hard subset of LAMA with fewer token overlaps between head and tail entities. X-FACTR (Jiang et al., 2020a) and mLAMA (Kassner et al., 2021) extend knowledge probing to the multilingual scenario and introduce multi-token answers. They each propose decoding methods that generate multi-token answers, which we have shown to work poorly on MedLAMA. BioLAMA (Sung et al., 2021) is a concurrent work that also releases a benchmark for biomedical knowledge probing.

Probing via Prompt Engineering. Knowledge probing is sensitive to the prompt that is used (Jiang et al., 2020b). To bootstrap probing performance, Jiang et al. (2020b) mine more prompts and ensemble them during inference. Later works parameterised the prompts and made them trainable (Shin et al., 2020b; Fichtel et al., 2021; Qin and Eisner, 2021). We have opted out of prompt-engineering methods that require training data in this work, as tuning the prompts essentially amounts to tuning an additional (parameterised) model on top of the PLMs. As pointed out by Fichtel et al. (2021), prompt tuning requires large amounts of training data from the task. Since task training data is used, the additional model parameters are exposed to the target data distribution and can solve the test set by overfitting to such biases. In our work, by adaptively fine-tuning the model on a small set of raw sentences, we elicit knowledge from the PLMs without exposing them to the data biases of the benchmark (MedLAMA).

Biomedical Knowledge Probing. Nadkarni et al. (2021) train PLMs as KB completion models and test them on the same task to understand how much knowledge is in biomedical PLMs. BioLAMA focuses on the continuous prompt learning method OptiPrompt (Zhong et al., 2021), which also requires ground-truth training data from the task. Overall, compared to BioLAMA, we provide a more comprehensive set of probing experiments and analyses, including a novel probing technique and human evaluations of model predictions.

In this work, we created a carefully curated biomedical probing benchmark, MedLAMA, from the UMLS knowledge graph.
We illustrated that state-of-the-art probing techniques and biomedical pre-trained language models (PLMs) struggle to cope with the challenging nature (e.g., multi-token answers) of this specialised domain, reaching only an underwhelming 3% acc@10. To reduce the gap, we further proposed a novel contrastive recipe which rewires the underlying PLMs without using any probing-specific data, and illustrated that with lightweight pre-training their accuracies can be pushed to 24%. Our experiments also revealed that different layers of transformers encode different types of information, reflected by their individual success at handling certain types of prompts. Additionally, using a human expert, we showed that the existing evaluation criteria can over-penalise the models, as many valid responses that PLMs produce are not in the ground-truth UMLS knowledge graph. This further highlights the importance of having a human in the loop to better understand the potential and limitations of PLMs in encoding domain-specific factual knowledge. Our findings indicate that the real lower bound on the amount of factual knowledge encoded by PLMs is higher than we estimated, since such a bound can be continuously improved by optimising both the encoding space (e.g., using our self-supervised contrastive learning technique) and the input space (e.g., using prompt optimisation techniques; Shin et al., 2020a; Qin and Eisner, 2021). We leave further exploration of integrating the two possibilities to future work.

In this paper, we use two automatic metrics to distinguish hard and easy queries. In particular, we first filter out easy queries with an exact matching metric (i.e., exactly matching all the words of an answer within its query). Since MedLAMA contains multiple answers per query, we use a threshold on the average exact matching score, i.e., avg-match > 0.1, to filter out easy examples, where avg-match is calculated as: avg-match = Count(matched answers) / Count(total answers). This metric removes all queries that contain the whole answer string. However, some common sub-strings between queries and answers are also prone to revealing answers, particularly benefiting retrieval-based probing approaches. Therefore, we further calculate the ROUGE-L score (Lin and Och, 2004) between each query and its answers, and filter out queries with ROUGE-L > 0.1.
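A minimal sketch of this easy/hard filtering follows. The substring check used for avg-match and the plain LCS-based ROUGE-L below are simplified stand-ins for the exact metrics, and the 0.1 thresholds are taken from the description above.

```python
def avg_match(query: str, answers) -> float:
    """Fraction of answers whose full string appears inside the query."""
    q = query.lower()
    return sum(a.lower() in q for a in answers) / len(answers)

def rouge_l(query: str, answer: str) -> float:
    """F-measure of the longest common subsequence between query and answer tokens."""
    q, a = query.lower().split(), answer.lower().split()
    dp = [[0] * (len(a) + 1) for _ in range(len(q) + 1)]
    for i, q_tok in enumerate(q, 1):
        for j, a_tok in enumerate(a, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if q_tok == a_tok else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(q)
    return 2 * precision * recall / (precision + recall)

def is_hard(query: str, answers, threshold: float = 0.1) -> bool:
    """A query is 'hard' if neither metric suggests the answer leaks from the query."""
    if avg_match(query, answers) > threshold:
        return False
    return all(rouge_l(query, a) <= threshold for a in answers)
```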
We train our Contrastive-Probe on 10k sentences randomly sampled from the original pre-training corpora of the corresponding PLMs. Since most of the biomedical BERTs use PubMed texts as their pre-training corpora, for all biomedical PLMs we sampled random sentences from a version of the PubMed corpus used by the BlueBERT model (Peng et al., 2019), while for BERT we sampled sentences from its original Wikitext corpora. For the hyperparameters of our Contrastive-Probe, Table 8 lists our search options and the best parameters used in this paper. To further investigate the impact of the mask ratio on probing performance, we also test our Contrastive-Probe based on PubMedBERT over different mask ratios ({0.1, 0.2, 0.3, 0.4, 0.5}) on the 10 random sentence sets; the results are shown in Figure 9. We can see that, for each mask ratio, Contrastive-Probe reaches its best performance at certain pre-training steps. Moreover, the performance curves of the different mask ratios differ between the full and hard sets, suggesting that different mask ratios favour different types of queries, although all achieve generally good performance when the mask ratio is 0.5.

References
Publicly available clinical BERT embeddings
SciBERT: A pretrained language model for scientific text
The Unified Medical Language System (UMLS): Integrating biomedical terminology
Language models are few-shot learners
Knowledgeable or educated guess? Revisiting language models as knowledge bases
Autoregressive entity retrieval
BERT: Pre-training of deep bidirectional transformers for language understanding
Static embeddings as efficient knowledge bases
Prompt tuning or fine-tuning: Investigating relational knowledge in pre-trained language models
Making pre-trained language models better few-shot learners
Mask-Predict: Parallel decoding of conditional masked language models
Is supervised syntactic parsing beneficial for language understanding tasks? An empirical investigation
Domain-specific language model pretraining for biomedical natural language processing
X-FACTR: Multilingual factual knowledge retrieval from pretrained language models
How can we know what language models know? (TACL)
Multilingual LAMA: Investigating knowledge in multilingual pretrained language models
BioBERT: A pretrained biomedical language representation model for biomedical text mining
BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension
Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics
Self-alignment pre-training for biomedical entity representations
Fast, effective, and self-supervised: Transforming masked language models into universal lexical and sentence encoders
MirrorWiC: On eliciting word-in-context representations from pretrained language models
RoBERTa: A robustly optimized BERT pretraining approach
UmlsBERT: Clinical domain knowledge augmentation of contextual embeddings using the Unified Medical Language System Metathesaurus
Scientific language models for biomedical knowledge base completion: An empirical study
Representation learning with contrastive predictive coding
Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets
Language models as knowledge bases? (EMNLP)
SciFive: A text-to-text transformer model for biomedical literature
E-BERT: Efficient-yet-effective entity embeddings for BERT
Learning how to ask: Querying LMs with mixtures of soft prompts
Exploring the limits of transfer learning with a unified text-to-text transformer
AutoPrompt: Eliciting knowledge from language models with automatically generated prompts
Eliciting knowledge from language models using automatically generated prompts
"You are grounded!": Latent name artifacts in pre-trained language models
ConceptNet 5.5: An open multilingual graph of general knowledge
A contrastive framework for neural text generation
TaCL: Improving BERT pre-training with token-aware contrastive learning
Can language models be biomedical knowledge bases?
CODER: Knowledge-infused cross-lingual medical term embedding for term normalization
Factual probing is [MASK]: Learning vs. learning to recall
Nigel Collier and Zaiqiao Meng kindly acknowledge grant-in-aid support from the UK ESRC for project EPI-AI (ES/T012277/1). Table 9: Example predictions of different probing approaches. The human answers are annotated based on the Contrastive-Probe predictions.