key: cord-0191485-7kewgve6 authors: Tam, Leo K.; Wang, Xiaosong; Xu, Daguang title: Transformer Query-Target Knowledge Discovery (TEND): Drug Discovery from CORD-19 date: 2020-11-28 journal: nan DOI: nan sha: ed0baba465a625e9fadf5d6e6357a01d096281ea doc_id: 191485 cord_uid: 7kewgve6 Previous work established skip-gram word2vec models could be used to mine knowledge in the materials science literature for the discovery of thermoelectrics. Recent transformer architectures have shown great progress in language modeling and associated fine-tuned tasks, but they have yet to be adapted for drug discovery. We present a RoBERTa transformer-based method that extends the masked language token prediction using query-target conditioning to treat the specificity challenge. The transformer discovery method entails several benefits over the word2vec method including domain-specific (antiviral) analogy performance, negation handling, and flexible query analysis (specific) and is demonstrated on influenza drug discovery. To stimulate COVID-19 research, we release an influenza clinical trials and antiviral analogies dataset used in conjunction with the COVID-19 Open Research Dataset Challenge (CORD-19) literature dataset in the study. We examine k-shot fine-tuning to improve the downstream analogies performance as well as to mine analogies for model explainability. Further, the query-target analysis is verified in a forward chaining analysis against the influenza drug clinical trials dataset, before adapted for COVID-19 drugs (combinations and side-effects) and on-going clinical trials. In consideration of the present topic, we release the model, dataset, and code. The COVID-19 literature has experienced exponential growth and analysis tools have arisen to digest the literature (Brainard, 2020) . To mine the literature, word embedding methods (Huang et al., 2012; Mikolov et al., 2013) operate in a high-dimensional space where semantic relationships are exposed 1 https://www.kaggle.com/leotam/ novel-drug-discovery-from-clinical-trials-on-dgx-2 through interrogation with metrics such as cosine similarity. Advances in transformer attentionbased architectures treat one possible weakness in word embedding learning approaches by conditioning on contextual information for a downstream task (Vaswani et al., 2017) . The pretrain-finetune paradigm, whereby a language model is trained on a large corpus through a task such as masked cloze or autoregressive variants and then trained again (fine-tuned) for a specific application encountered success (Dai and Le, 2015; Collobert and Weston, 2008) . While word2vec implementations focused on analogies evaluation as a downstream task indicative of semantic learning, transformer architectures surveyed a collection of downstream tasks such as sentiment, sentence similarity, natural language (NL) inference, question and answering, and reading comprehension, etc. present in the GLUE and super GLUE benchmark (Wang et al., 2019b,a) . Word2vec discovery methods (Tshitoyan et al., 2019) constitute a return to analogy evaluation, though eschewing generic semantic analogies to highlight analogies in materials science. It is hypothesized that resolving analogies may form the basis for higher reasoning such as advancing the limits of domain-specific research (Kuniyoshi et al., 2020; Hansson et al., 2020) and measuring scholastic achievement (Khot et al., 2017) . 
For applications in the medical domain (Lee et al., 2020), transformer pretrain-finetune performance depends on the quality and in-domain nature of the source and target datasets. Similar to novel materials discovery, drug discovery is an intensive and arduous process requiring trial and error. Drug discovery is the problem of allocating resources to numerous promising candidates, and thus ranking promising candidates assists the discovery process in a top-down fashion. To date, only nine influenza drugs have received full approval for use globally (De Clercq and Li, 2016). Where the Tshitoyan et al. (2019) method uses a rigid prediction method to mine associative analogies for undiscovered materials, advances in tokenization (Sennrich et al., 2016) have relaxed exact vocabulary registration, though this complicates the method. The present work examines a query-target (QT) token conditioning method, which extends masked language modeling (MLM) inference. The QT method is used to rank predicted associations with clinical trial efficacy, treating a specificity problem that arises when moving from a fixed vocabulary to whole-language tokenization. Moreover, the RoBERTa-large model trained on the CORD-19 literature dataset (29,500 articles provided at the time of this work, since updated to over 200,000) can be enhanced for drug analogies evaluation via a k-shot method (Raffel et al., 2019; Brown et al., 2020).

Previous work has examined drug and side-effect relationships in a bipartite graph, focusing on literature in a four-year range (Jeong et al., 2020). While Jeong et al. (2020) considered 169,766 PubMed abstracts across drugs, Hansson et al. (2020) considered solely type II diabetes related drugs, clustering them around heuristic expert topics. Scoring was conducted via co-mention weighting and a five-year look-back literature mention weighting. Both previous methods incorporate expert information in a semi-supervised manner, the first through database registration of drug-side effect pairs and the second through semantic concept selection. The present method is unsupervised, except for physician review of the auxiliary analogies task. Further, the method uses a majority of full texts in a narrow focus (Kohlmeier et al., 2020). Results from Alsentzer et al. (2019) and Lee et al. (2020) revealed how the detail and relevance of the dataset influence the final result. We corroborate that domain-specific datasets improve domain-specific analogy performance.

The overview of our method is presented in Fig. 1, which depicts both training and inference modes. During training, the MLM task is held without modification from Liu et al. (2019): 13.5% of tokens are targeted for replacement, with 90% replaced with the <mask> token and 10% corrupted with a random token. The MLM task is chosen as it closely mirrors the function in Tshitoyan et al. (2019) of predicting a target word given the context around the word. The RoBERTa transformer method is a pure application of MLM, removing the next sentence prediction task while scaling to longer sequences with dynamic masking. A cross-entropy loss is used for prediction, the RoBERTa 50K byte-pair encoding tokenization is used, and hyper-parameters are left at the default settings from Wolf et al. (2019). During training, the inputs are the CORD-19 dataset described in Sec. 3.1, dynamically masked ten times. The MLM training from scratch runs across 16 NVIDIA V100 GPUs in a single-node (DGX-2) configuration for 100,000 steps over approximately 36 hours.
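A minimal sketch of this MLM pretraining stage is given below, written against the Hugging Face transformers and datasets APIs rather than the released training scripts. The file names (cord19_train.txt, cord19_test.txt), batch size, and sequence length are assumptions for illustration; the RoBERTa-large architecture, dynamic masking, and 100,000-step budget follow the description above.

```python
# Minimal MLM pretraining sketch (assumptions flagged in the lead-in above).
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, RobertaConfig,
                          RobertaForMaskedLM, RobertaTokenizerFast, Trainer,
                          TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-large")
# Training from scratch: instantiate the roberta-large architecture with fresh weights.
model = RobertaForMaskedLM(RobertaConfig.from_pretrained("roberta-large"))

# Hypothetical text dumps of the CORD-19 full texts with a 20% held-out split.
raw = load_dataset("text", data_files={"train": "cord19_train.txt",
                                       "test": "cord19_test.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: the collator re-samples masked positions every time a batch
# is drawn (library default split of mask/random/unchanged replacements).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

args = TrainingArguments(output_dir="cord19-roberta-mlm", max_steps=100_000,
                         per_device_train_batch_size=8, fp16=True,
                         save_steps=10_000, logging_steps=500)

Trainer(model=model, args=args, data_collator=collator,
        train_dataset=tokenized["train"], eval_dataset=tokenized["test"]).train()
```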
The MLM prediction is used for analogy evaluation via the structure "A is to B as C is to <mask>", as suggested by Raffel et al. (2019). The complete set of analogies is provided with the code. For the word2vec implementation, the procedure followed Tshitoyan et al. (2019), including vocabulary generation and evaluation. The QT inference mode is discussed in Sec. 3.2.

The CORD-19 dataset is the largest machine-readable full-text COVID-19 literature dataset, curated by Kohlmeier et al. (2020). Statistics such as record sourcing and the license split of CORD-19 are presented in Tab. 1. The quality of the dataset may be attributed to full-text access with a narrow focus, which was not available in previous medical fine-tuning studies (Alsentzer et al., 2019; Lee et al., 2020; Tshitoyan et al., 2019). On disk as raw text, the dataset is 875 MB, with 20% of the dataset reserved for testing during language modeling. The United States Food and Drug Administration (FDA) approved drugs and global clinical trials data are drawn from De Clercq and Li (2016) and USGov (2020) respectively. In Tab. 2, counts such as the number of trials and drugs trialed per year's end are collected. Namely, we use the influenza clinical trials summarized in Fig. 2. For analogy evaluation, the sets of language (grammar) analogies and drug analogies are drawn from relationships in Tshitoyan et al. (2019) and De Clercq and Li (2016) respectively. A list of the analogy categories is presented in Tab. 3. For k-shot training, a random set of k=5 analogies from each category is used as additional pretraining for MLM (Brown et al., 2020).

During inference, a query and target phrase are selected based on the relationships of interest. We examine the relationship with the RoBERTa training objective, adopting the formulation from Yang et al. (2019) for the objective as

$$\max_\theta \; \log p_\theta(\bar{x} \mid \hat{x}) \approx \sum_{t=1}^{S} \delta_t \log p_\theta(x_t \mid \hat{x}) = \sum_{t=1}^{S} \delta_t \log \frac{\exp\!\left(H_\theta(\hat{x})_t^\top e(x_t)\right)}{\sum_{x'} \exp\!\left(H_\theta(\hat{x})_t^\top e(x')\right)}, \quad (1)$$

where $\delta_t$ is 1 if $t$ indicates a masked token and 0 otherwise, $x = [x_1, \cdots, x_S]$ is a text sequence, $\hat{x}$ represents the corrupted sequence, $\bar{x}$ represents the masked tokens, $e(x)$ is the embedding of the sequence, and $H_\theta$ is the RoBERTa-large architecture with parameters $\theta$ that maps an $S$-length text sequence into a sequence of hidden vectors. Optimizing the training objective results in accurate MLM inference. For MLM inference (Devlin et al., 2019), the $K$ masked tokens in the query $q = [q_1, \cdots, q_K]$ are targeted for token prediction, i.e.

$$p_\theta(q \mid \hat{x}) = \prod_{k=1}^{K} p_\theta(q_k \mid \hat{x}). \quad (2)$$

For QT prediction, we condition the masked token targets on the query targets $y = [y_1, \cdots, y_L]$:

$$p_\theta(q \mid \hat{x}, y) = \prod_{k=1}^{K} p_\theta(q_k \mid \hat{x}, y), \quad (3)$$

which follows by the independence assumption in Eqn. 1 and is therefore contained in the training objective (i.e. accurate QT prediction is implied). When $q = y$, we observe that the QT conditioning decomposes to the MLM task prediction. The QT conditioning is more focused than reformulating a span prediction method such as in Devlin et al. (2019), due to the rejection of extraneous tokens that would be admitted in a dot-product formulation. Once the QT prediction has been formed, the analogy MLM task may be permuted with the QT terms using "Q is to T as Q is to <mask>" to analyze the top-k related terms without conditioning. For rank prediction, tokens with positive and negative associations are intentionally not mixed, as they are for visualization purposes.

Typically, transformer attention visualization examines the sequence-to-sequence attention (Vaswani et al., 2017; Huang et al., 2019), namely plotting the per-token attention

$$\alpha_{s,t} = \operatorname{softmax}_t\!\left(\frac{q_s^\top k_t}{\sqrt{d_k}}\right), \quad (4)$$

where $q_s$ and $k_t$ are the query and key projections of the hidden states at positions $s$ and $t$. For QT visualization, the token attention with respect to the $l$-th query target is plotted, namely

$$a_{l,t} = \operatorname{softmax}_t\!\left(\frac{q_{y_l}^\top k_t}{\sqrt{d_k}}\right),$$

where $y_l$ indexes the position of the $l$-th query target token.

Figure 3: A passage from De Clercq and Li (2016) is highlighted on a per-sentence basis using the target term "efficacy".
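To make the QT conditioning of Eqn. 3 concrete, the sketch below ranks candidate drugs by the summed log-probability the masked LM assigns to the target phrase when its tokens are masked alongside the candidate. The template sentence, the use of the public roberta-large checkpoint as a stand-in for the CORD-19-trained model, and the exact scoring details are illustrative assumptions, not the released implementation.

```python
# Query-target style ranking sketch; see lead-in above for assumptions.
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-large")
model = RobertaForMaskedLM.from_pretrained("roberta-large").eval()

def qt_score(drug: str, target: str = "clinical trials efficacy") -> float:
    """Sum of log p(target token | corrupted sequence) over the masked target positions."""
    target_ids = tokenizer(" " + target, add_special_tokens=False)["input_ids"]
    masks = " ".join([tokenizer.mask_token] * len(target_ids))
    # Hypothetical template linking the candidate drug to the masked target phrase.
    enc = tokenizer(f"{drug} is associated with {masks}.", return_tensors="pt")
    with torch.no_grad():
        log_probs = model(**enc).logits.log_softmax(dim=-1)[0]
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    return sum(log_probs[p, t].item() for p, t in zip(mask_pos, target_ids))

candidates = ["remdesivir", "oseltamivir", "hydroxychloroquine"]
for drug in sorted(candidates, key=qt_score, reverse=True):
    print(f"{drug}: {qt_score(drug):.2f}")
```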
To preserve the causal nature of the time series data, the rank calculation from Eqn. 3 is performed on the year-limited data in Tab. 2. The target query is set as "clinical trials efficacy" and the candidate drugs are drawn from the number specified in column 2 of Tab. 2. The candidate drugs are a subset of the total drugs tested, as the trials cover additional diseases (column 4, Tab. 2).

After training for 100,000 steps, the MLM task reached a perplexity of 2.4696 on the held-out test data. The attention relationships for self sequence-to-sequence and QT are visualized in Fig. 3 as per Eqns. 3 and 4. While the QT scoring may be adapted to sentence highlighting (Fig. 3), a comparison with the span extraction or abstractive summarization methods in Devlin et al. (2019) is beyond the scope of the current work. While negation handling (Fig. 3) is an expected result (Devlin et al., 2019; Wang et al., 2019a), it represents an advancement over the word2vec scoring method. The analogies evaluation is collected in Tab. 4. Although a comparison on simple grammar analogies can be conducted, a simple extension cannot be performed, as the 600,000+ word vocabulary built for word2vec from standard procedures does not adequately capture the phrases in the drug analogies. In the categories where RoBERTa-large can be compared to word2vec (opposites, comparatives, superlatives), significant improvement is observed (83.0% vs. 50.4% accuracy). The few-shot and semi-supervised learning approaches are critical to performance, generating 23.8% and 18.8% improvements in top-1 accuracy for grammar and antiviral analogies respectively.

While synthetic analogies can be captured to some degree by the CORD-19 RoBERTa-large model, is the model relevant for forward predictions as in Tshitoyan et al. (2019)? Fig. 4 shows the forward chaining (FC) analysis for the period where clinical trials data is reliably available. Below the FC figure, a ranking of drugs under current clinical trials is presented. Shortly after the analysis was issued ([anon URL]), the antiviral remdesivir entered emergency FDA approval, reflecting Fig. 4. As a possible failure mode, hydroxychloroquine was ranked as a distant third and was later shown to have no correlation with positive or negative outcomes (Geleris et al., 2020). In Fig. 4 (bottom), the permuted MLM task mines relationships that mirror the relationship of remdesivir with clinical trials efficacy. Inverting the analogy mining operation (not pictured) does not recover the QT function, as the predicted terms are too generic to focus on candidate drugs. While further experiments on negative terms (side effects) and drug combinations are collected in Fig. 5, a reliable method to test and verify these results has not been established.

The transformer QT conditioning specializes the discovery method on a narrow literature dataset to predict clinical trials approval, as verified by FC analysis, real-time prediction, and relationship mining. The conditioning operation is a straightforward calculation at inference time for the transformer language model, permitted by the independence assumption during pretraining. For language models where independence is not assumed, such as the permutation language objective (Yang et al., 2019), conditioning would be performed via estimation of the posterior distribution, e.g. via a Metropolis-Hastings algorithm. The ranking task can be used to determine per-sentence passage highlighting (Fig. 3) with a specific query.
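For the attention-based highlighting of Eqn. 4, one way to extract head-averaged per-token attention for a target term with the transformers API is sketched below. The example passage, the choice of the final layer, and the use of the public roberta-large checkpoint in place of the CORD-19-trained model are assumptions for illustration.

```python
# Head-averaged attention extraction for target-term highlighting (Eqn. 4 sketch).
import torch
from transformers import RobertaModel, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-large")
model = RobertaModel.from_pretrained("roberta-large", output_attentions=True).eval()

passage = ("In randomized trials, oseltamivir showed efficacy in reducing the "
           "duration of influenza symptoms.")
enc = tokenizer(passage, return_tensors="pt")
with torch.no_grad():
    attentions = model(**enc).attentions       # one (1, heads, S, S) tensor per layer

attn = attentions[-1][0].mean(dim=0)           # final layer, averaged over heads: (S, S)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])

# Rows belonging to the target term's sub-tokens give its attention over the passage.
target_ids = set(tokenizer(" efficacy", add_special_tokens=False)["input_ids"])
for i, tid in enumerate(enc["input_ids"][0].tolist()):
    if tid in target_ids:
        top = torch.topk(attn[i], k=5).indices.tolist()
        print(tokens[i], "->", [tokens[j] for j in top])
```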
The scope of the QT method is bounded, since q, y ∈ X, the set of all statements in the corpus, and only finite sets of query-target pairs can be generated (though enumerating them directly is unwieldy). For further validation, the field of online learning may offer independent verification through the marginal contribution to accuracy of each datum (Jia et al., 2019). Besides the accessible resource of clinical drug trials, other quantitative methods of determining drug function are feasible given detailed dataset formulation. Such methods could focus on canonical measures such as the inhibitory constant (K_i), the effective dose at 95% (ED95), or the number needed to treat (NNT). Still further are works examining protein receptor binding, but the connection to literature machine learning methods is unclear and would also require specialized dataset expertise. Due to the relatively limited number of successes for antiviral drugs, the analysis suffers from sample bias. Further comparison on the materials dataset was not possible, as the dataset remained unavailable after request. Despite these limitations, the study suggests transformer language models are a flexible tool for mining the literature.

References

Brainard (2020). New tools aim to tame pandemic paper tsunami.
Collobert and Weston (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning.
Dai and Le (2015). Semi-supervised sequence learning.
De Clercq and Li (2016). Approved antiviral drugs over the past 50 years.
Devlin et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding.
Geleris et al. (2020). Observational study of hydroxychloroquine in hospitalized patients with Covid-19.
Hansson et al. (2020). Semantic text mining in early drug discovery for type 2 diabetes.
Huang et al. (2012). Improving word representations via global context and multiple word prototypes.
Huang et al. (2019). ClinicalBERT: Modeling clinical notes and predicting hospital readmission.
Jeong et al. (2020). Examining drug and side effect relation using author-entity pair bipartite networks.
Jia et al. (2019). Towards efficient data valuation based on the Shapley value.
Khot et al. (2017). Answering complex questions using open information extraction.
Kohlmeier et al. (2020). COVID-19 Open Research Dataset.
Kuniyoshi et al. (2020). Annotating and extracting synthesis process of all-solid-state batteries from scientific literature.
Lee et al. (2020). BioBERT: A pre-trained biomedical language representation model for biomedical text mining.
Liu et al. (2019). RoBERTa: A robustly optimized BERT pretraining approach.
Mikolov et al. (2013). Efficient estimation of word representations in vector space.
Raffel et al. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer.
Sennrich et al. (2016). Neural machine translation of rare words with subword units.
Tshitoyan et al. (2019). Unsupervised word embeddings capture latent knowledge from materials science literature.
Vaswani et al. (2017). Attention is all you need.
Wang et al. (2019a). SuperGLUE: A stickier benchmark for general-purpose language understanding systems.
Wang et al. (2019b). GLUE: A multi-task benchmark and analysis platform for natural language understanding.
Wolf et al. (2019). HuggingFace's Transformers: State-of-the-art natural language processing.
Yang et al. (2019). XLNet: Generalized autoregressive pretraining for language understanding.