Predicting Hydroxyl Mediated Nucleophilic Degradation and Molecular Stability of RNA Sequences through the Application of Deep Learning Methods

Ankit Singhal
2020-11-09

Abstract: The synthesis and efficient implementation of mRNA strands has been shown to have wide utility, especially recently in the development of COVID vaccines. However, the intrinsic chemical instability of mRNA poses a challenge due to the presence of 2'-hydroxyl groups in its ribose sugars. The -OH group in the backbone structure enables a base-catalyzed nucleophilic attack by the deprotonated hydroxyl on the adjacent phosphorus and consequent self-hydrolysis of the phosphodiester bond. As expected for in-line hydrolytic cleavage reactions, the chemical stability of mRNA strands is highly dependent on external environmental factors, e.g. pH, temperature, oxidizers, etc. Predicting this chemical instability with a computational model would reduce the number of sequences that must be synthesized and tested by identifying the most promising candidates, aiding the development of mRNA-related therapies. This paper proposes and evaluates three deep learning models (Long Short Term Memory, Gated Recurrent Unit, and Graph Convolutional Networks) as methods to predict the reactivity and risk of degradation of mRNA sequences. The Stanford OpenVaccine dataset of 6034 mRNA sequences was used in this study. The training set consisted of 3029 of these sequences (length of 107 nucleotide bases) while the testing dataset consisted of 3005 sequences (length of 130 nucleotide bases), in structured (Lowest Entropy Base Pair Probability Matrix) and unstructured (Nodes and Edges) forms. The stability of mRNA strands was accurately predicted, with the Graph Convolutional Network being the best predictor of reactivity ($RMSE = 0.249$) while the Gated Recurrent Unit Network was the best at predicting risks of degradation ($RMSE = 0.266$). Combining all target variables, the GRU performed the best with 76% accuracy. These results suggest that such models can be applied to understand and predict the chemical stability of mRNA in the near future.

Over the last two decades, there has been increasing interest in RNA-based technologies for the creation of prophylactic vaccines, among other applications such as cancer immunotherapies, induction of pluripotent stem cells, and in vivo delivery of mRNA to replace or supplement proteins. They are widely regarded as plausible alternatives to conventional approaches due to their high potency, capacity for quick and low-cost manufacturing, and relatively safe administration. They are currently being researched for many diseases, with the Pfizer-BioNTech vaccine against SARS-CoV-2 being the first to be approved for human use [18, 29, 20]. One of the primary barriers to the creation of such therapeutics is the fragility of the RNA molecule; RNA is susceptible to rapid degradation within minutes to hours [4] and as such needs to be freeze-dried or incubated at low temperatures in order to be kept stable. The primary mechanism of this degradation is RNA hydrolysis, in which the phosphodiester bond in the sugar-phosphate backbone is broken as a result of a base-catalyzed nucleophilic attack by the 2'-hydroxyl group (refer to Figure 1).
In the current pandemic, vaccines are seen as the most promising means to control the novel coronavirus, but with the current limitations of mRNA vaccines, efficient in vivo delivery of such mRNA molecules seems improbable; they would likely only reach a fraction of the population, mostly relegated to countries with a higher level of infrastructural development [23]. Research into the stability and degradation of RNA molecules has therefore received continued interest, to date largely consisting of traditional statistical approaches and biophysical models. However, it remains unclear exactly which parts of RNA molecules are more prone to spontaneous degradation, and it is thus difficult to accurately predict the reactivity and degradation of mRNA [25]. Experimentation, an incredibly time-consuming process, is therefore the default method for determining these values.

The aim of this manuscript is to present three possible deep learning approaches to this problem through the usage of the Stanford OpenVaccine dataset. Two variants of Recurrent Neural Networks (RNNs) are employed, Long Short Term Memory networks (LSTMs) and Gated Recurrent Unit networks (GRUs), along with a variant of a Graph Neural Network (GNN), the Graph Convolutional Network (GCN). These models are applied and compared to assess whether machine learning methods can provide helpful results in predicting the reactivity and degradation of mRNA molecules. This could save significant resources in the development of mRNA vaccines by helping identify the most promising candidates using in silico methods.

The majority of this work was done in the Python programming language using a TensorFlow back end with the high-level Keras API. Refer to Table 1 for all the software/packages used throughout the creation and testing of the models, along with related information [24, 1, 3, 19, 26, 9, 11, 16, 5].

The dataset used in this manuscript to evaluate the models is the "Stanford OpenVaccine" dataset [7]. It consists of 6034 RNA sequences. The training/validation set consisted of 3029 of these sequences, each 107 nucleotide bases long; the testing dataset consisted of 3005 sequences, each 130 nucleotide bases long. Due to experimental limitations in collecting the data (according to the researchers at Stanford [7]), measurements on the final bases in the sequences could not be carried out; therefore, only the first 68 bases (for the sequences of length 107) or the first 91 bases (for the sequences of length 130) have experimental data associated with them. Three predictors were associated with each sequence: the sequence itself (described in A, G, C, and U bases), the expected structure of the molecule, and the predicted loop type (derived from a simulation of the secondary structure of the RNA molecule). A Base Pair Probability Matrix (BPPM) was also provided for each individual sequence, indicating the probability of certain base-pair interactions. Five experimentally determined sets of values (henceforth referred to as 'target values') were also given for the first 68 or 91 bases in each sequence: reactivity values, degradation values at pH 10, degradation values at pH 10 with added magnesium, degradation values at 50 °C, and degradation values at 50 °C with added magnesium.
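To make the data layout concrete, the following is a minimal sketch of loading and inspecting such a dataset. The file name and column names follow the public Kaggle release of the OpenVaccine data and are assumptions, not taken from the paper; adjust them to match the files actually in use.

```python
import pandas as pd

# A sketch of loading the OpenVaccine data. "train.json" and the column
# names below are assumptions based on the public Kaggle release.
train = pd.read_json("train.json", lines=True)

TARGETS = ["reactivity", "deg_pH10", "deg_Mg_pH10", "deg_50C", "deg_Mg_50C"]

print(train.shape)                    # roughly (3029, ...) for training
print(train["seq_length"].unique())  # 107 for the training sequences
# Each target column holds a list of per-base values covering only the
# first 68 scored bases (91 for the 130-base test sequences).
print(len(train.loc[0, TARGETS[0]]))  # expect 68
```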
Refer to Table 2 for a summary of these features. The task of the algorithms presented in this paper is to take the sequence and other structural features of an RNA molecule (features marked as 'inputs') and predict its stability (through the five target values). As mentioned earlier, each base of the sequence has five target values and two further structural features associated with it. This is represented as a feature matrix for all three ML models. Of particular interest, however, is the Base Pair Probability Matrix associated with each sequence, which takes the form N × N, where N is the number of bases in the sequence. This probability matrix was generated through a computational calculation, optimizing for the lowest possible entropy configuration of the secondary structure. It can be represented as a standard matrix (refer to Figure 2 for a visualization) or as a graph (refer to Figure 3) with nodes and edges (compare a 'normal' visualization of such a secondary structure in Figure 4). The higher the probability assigned to a pair of bases, the higher the likelihood of a non-covalent interaction (hydrogen bonding) occurring between the two nucleotides. This distinction between representations is important: the former form is used for the two RNN architectures, whereas the latter is used for the GCN.

Data from the dataset was first mapped to vector matrices and one-hot encoded, so that it could be processed numerically. Essentially, each feature was mapped onto a vector of shape 1 × n, where n is the number of possible options for that feature. For example, since there are four nitrogenous bases, the vector denoting a single base has shape 1 × 4, and the total sequence vector for one strand of RNA takes the shape 1 × 4 × 107 (refer to Table 2 for information about the number of features). A sketch of this encoding step is shown below.

After splitting the training and testing data according to the specifications laid out in Section 2.1.2, the training data was augmented using the ARNiE package, the same package used to create the data in the first place. Data augmentation is a technique to artificially create new training data from existing data to improve the performance of ML models [27]. The part of the package originally used to predict the RNA structures for the sequences, Vienna2, is based on thermodynamic experiments and finds the structure by modelling the lowest-entropy configuration. To augment the data, an alternate method, CONTRAfold, which uses statistical methods to predict the structure, was employed. For every sequence, new structures and predicted loop types were therefore generated, effectively allowing the training data to be doubled. Finally, the resulting data was used to create the models using the architectures explained in Figure 5, and the models were evaluated on the testing data.

In terms of hyperparameter optimization, the number of epochs was set to 50 and the batch size to 64. K-fold cross validation was also conducted with k = 4. These values were determined through a simple grid search to find the optimal values, beyond which returns were negligible. An Adam optimizer was used with the learning rate set to a standard $1.0 \times 10^{-3}$, the TensorFlow default.
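As an illustration of the encoding step described above, here is a minimal one-hot encoding sketch; the `BASES` mapping and the function name are illustrative, not taken from the paper's code.

```python
import numpy as np

# One-hot encoding of an RNA sequence: each base maps to a 1x4 vector,
# so a 107-base strand becomes a 107x4 matrix (equivalently 1 x 4 x 107).
BASES = {"A": 0, "G": 1, "C": 2, "U": 3}

def one_hot(sequence: str) -> np.ndarray:
    """Encode an RNA sequence as an (L, 4) one-hot matrix."""
    encoded = np.zeros((len(sequence), 4), dtype=np.float32)
    for i, base in enumerate(sequence):
        encoded[i, BASES[base]] = 1.0
    return encoded

x = one_hot("GGAAAAGCUCUAAUAACAGG")
print(x.shape)  # (20, 4)
```

The same scheme extends to the structure and loop-type features, each with its own n (the number of symbols that feature can take).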
Although terms such as 'Artificial Intelligence', 'Machine Learning', 'Deep Learning', and 'Neural Networks' have become popularized as buzzwords and may seem synonymous with one another, there are important distinctions between them. Artificial Intelligence is the broadest term, denoting any computational technique that allows an algorithm to mimic human behavior by making decisions based on logic. It was first coined by John McCarthy in 1956 [22], and over the past 60 years fields within AI have undergone rapid advancement, including but not limited to searching algorithms, evolutionary computation, and robotics [15, 17, 12]. Machine Learning is one such subfield of AI, referring to techniques that give computers the ability to learn and carry out tasks without explicitly programmed instructions. This usually entails a program performing statistical operations on a large dataset and discovering underlying patterns. For example, a Support Vector Machine (a type of ML algorithm) might analyze labeled patient data and then categorize unlabeled data into risk classes for a certain disease. Finally, Deep Learning is an even more specific term, referring to the subset of ML which employs multi-layer neural network algorithms to learn [8]. For certain tasks, such as in Computer Vision or Natural Language Processing, such networks can recognize patterns that may be hidden from simpler forms of ML, thereby improving performance.

Neural networks are computational models loosely based on the human brain. Neurons in the brain are connected to many others and continuously receive electrical impulses at their dendrites, which eventually reach the cell body. There they are integrated in some way and, crudely speaking, once some threshold is reached, the neuron 'fires' and an impulse is transmitted to other neurons through a synapse. Nodes (or neurons) in an artificial neural network operate in a similar fashion. They receive data from nodes in a previous layer, perform a learnable computation on them, then output the result to the next layer in the network. A general computation taking place in a single node follows the form:

$$n^{l+1} = \sigma\left(\sum_{i} W_i^{l} n_i^{l} + b^{l}\right) \quad (1)$$

where $n^{l+1}$ is the node being updated, $n_i^{l}$ are the previous nodes connected to $n^{l+1}$, $W_i^{l}$ are the weights, $b^{l}$ is the bias, and $\sigma$ is the activation function. These weights and biases are learnable, i.e. they are the values that the computer tries to optimize through backpropagation and gradient descent. The activation function is analogous to the binary threshold that determines whether a neuron in the brain will fire; in modern networks, however, more sophisticated functions are used, such as the sigmoid, ReLU, or hyperbolic tangent functions [8, 21, 6]. Figure 6 shows a visual comparison of biological and artificial neural networks.

The type of neural networks discussed in this paper are known as Recurrent Neural Networks (RNNs). They are, roughly speaking, repeating ANNs that allow for the processing of data in which spatial or temporal ordering is important [28]. For example, an algorithm in a messaging app that aims to predict the next word in a sentence needs information not only about the preceding words in the sentence, but also about the order in which they appear.
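As a toy illustration of the node update in equation (1), the following NumPy sketch computes one forward step through a small layer with a ReLU activation; the layer sizes and random values are arbitrary.

```python
import numpy as np

# Illustrating n^{l+1} = sigma(sum_i W_i n_i + b) for a whole layer at
# once, with ReLU as the activation function sigma.
def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
n_prev = rng.normal(size=3)    # outputs of the previous layer's 3 nodes
W = rng.normal(size=(4, 3))    # learnable weights (4 nodes in this layer)
b = rng.normal(size=4)         # learnable biases

n_next = relu(W @ n_prev + b)  # one forward step through the layer
print(n_next)
```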
A Long Short Term Memory network (LSTM) is a type of RNN introduced by Hochreiter & Schmidhuber in 1997 [10] and further popularized by many subsequent works. Like other RNNs, it essentially consists of copies of the same network in which the output of the previous network (or cell) informs the next one, allowing information to persist. A single cell in an LSTM consists of four distinct neural network layers, three of which make 'decisions' about the persistence of information, as shown in equations 2-4:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad (2)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad (3)$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad (4)$$

Here the output of the previous cell, $h_{t-1}$, along with the new input $x_t$, is passed through a basic neural layer with a non-linear activation function (traditionally sigmoid functions were used, but in this paper all activation functions denoted by $\sigma$ were ReLU). In parallel, candidate values for the cell state are also produced, as shown by equation 5:

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \quad (5)$$

where the activation function in this case is a hyperbolic tangent (compressing the values between -1 and 1). Using these output matrices, one can then update the cell state as given by equation 6 and produce the new output as shown in equation 7:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad (6)$$
$$h_t = o_t \odot \tanh(C_t) \quad (7)$$

This output is then fed into the next cell. This model was used with the normal vector-matrix BPPM as represented in Figure 2.

Introduced by Cho et al. in 2014 [2], the Gated Recurrent Unit network is a variation of the traditional LSTM with the primary advantage of faster computation (due to fewer neural networks in each cell). As in the LSTM, decision matrices are produced through two neural networks with non-linear activation functions, as shown in equations 8 and 9:

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z) \quad (8)$$
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r) \quad (9)$$

Candidate values are also generated through a hyperbolic-tangent-based network; in this case, however, it is done directly for the output, as there is no cell state, unlike in an LSTM. The candidates are then chosen by the 'decision' matrices to produce the output, as shown in equations 10 and 11:

$$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t] + b) \quad (10)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad (11)$$

Like in any other form of RNN, this output is then passed to the next cell.

Graph Convolutional Networks were introduced by Kipf & Welling in 2017 [14] and provide a novel way to analyze arbitrarily structured data in the form of a graph. A GCN is not a form of RNN, although both are connectionist models. A GCN operates on a graph defined by G = (V, E), where V is the set of nodes and E the set of edges. Nodes in the graph aggregate the features of the surrounding nodes and themselves, and use the following layer-wise rule (refer to equation 12) to generate an output which is then assigned to the node:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \quad (12)$$

where $\tilde{A} = A + I$, $A$ being the adjacency matrix (taking the form N × N) representing the graph (in this case, the BPPM; see Figure 2), $\tilde{D}$ is the diagonal node degree matrix of $\tilde{A}$, and $H^{(0)}$ is the initial N × F^{(0)} feature matrix. After an output is generated, this process can be repeated, the output of the nodes propagating outwards each time. Due to the localized nature of this problem, only two repetitions were employed; further increases had negligible influence.

The models are evaluated based on the error produced in the prediction of the target values. The two loss measures employed in this paper are Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), described in equations 13 and 14:

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \quad (13)$$
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \quad (14)$$

The difference between the two metrics is that MAE weights all errors equally when averaging, whereas RMSE, having a quadratic term, punishes larger individual errors more than smaller ones.
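To ground the discussion, below is a minimal Keras sketch of a GRU-based sequence-to-sequence regressor trained with an RMSE loss (equation 14). The layer sizes, the use of bidirectionality, and the feature count of 14 (4 one-hot base channels, 3 structure channels, 7 loop-type channels) are illustrative assumptions; the paper's exact architectures are specified in Figure 5.

```python
import tensorflow as tf

# A sketch of a GRU sequence model that emits one regression output per
# base and per target value. Layer widths here are assumptions.
def build_gru_model(seq_len=107, n_features=14, n_targets=5):
    inputs = tf.keras.Input(shape=(seq_len, n_features))
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(128, return_sequences=True))(inputs)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(128, return_sequences=True))(x)
    outputs = tf.keras.layers.Dense(n_targets)(x)
    return tf.keras.Model(inputs, outputs)

def rmse(y_true, y_pred):
    # Equation 14: root mean square error.
    return tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))

model = build_gru_model()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=rmse)
model.summary()
```

Swapping the two `GRU` layers for `LSTM` layers gives the corresponding LSTM variant; the GCN instead operates on the BPPM as an adjacency matrix via equation 12.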
After being built using the specifications discussed earlier in this manuscript, the models were trained on the 'training' dataset discussed in Section 2.1.2. For k-fold cross validation, k = 4 was used, and training was initially repeated three times to improve accuracy. The data was split into four groups for cross validation, as larger values of k were found to have negligible influence on the results. Training and validation loss (RMSE) for the three models are shown in Figure 7 and Table 3. As can be seen, after the initial training and validation phase the LSTM maintains the best performance. The GRU scores were similar to those of the LSTM; the GCN, however, was in a distant third place, averaging about 0.05-0.06 higher RMSE values than the other two.

The performance of the three machine learning models on the training dataset is presented in Figures 8-12. The GRU maintained the lowest loss values for all target values, the only exceptions being reactivity (see Figure 8) and degradation at pH 10 (only the RMSE; see Figure 9), for which the GCN had the best predictions. Notably, in Figures 10, 11, and 12, the GCN had lower RMSE values than the LSTM, while the opposite was true for the MAE values, suggesting that the GCN is less prone to large individual errors. Mean loss across the 20 trials for each individual target was calculated, along with that of all targets combined, for each model. Results are shown in Figure 13.

Although the creation of stable mRNA molecules remains difficult, datasets of sequences and corresponding measurements are becoming more popular and widely available. Through the use of deep learning architectures, reasonable predictions of structural features can be obtained, as demonstrated in this manuscript, with mean RMSE values ranging from 0.24 to 0.31. The usage of such techniques has the capacity to increase the speed and efficiency of mRNA vaccine discovery and has further implications in other related fields of research.

However, such methods are not without their limitations. It is important to note that while reasonably accurate results are produced, the error, when the scale of the values being predicted (0.5 to 4) is taken into account, is not insignificant; in fact, it equates to a 24% Mean Absolute Percentage Error. This means the models could incorrectly predict the stability of an important molecule, which might then be overlooked during the discovery phase. Thus, as an extension to the models in their current state, a simple binary classification step is suggested at the end of each model, one that aims to minimize the false negative rate; this would effectively allow the models to be used as a screening test to remove highly unstable sequences, rather than as full-fledged research tools (a sketch of this idea follows below). Another limitation of this method is the length of the mRNA sequences used. The lengths used in this paper ranged from 107 to 130 bases, while an actual Covid-19 vaccine would likely be 3000-4000 bases long [29]. Thus, as more expansive and complete databases are released, further research exploring the reliability of such algorithms on longer sequences is encouraged. Despite these limitations, the results of this work show that such prediction algorithms are feasible and have the potential to save time during the research process, an especially valuable commodity during disease outbreaks.
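The suggested screening extension could look something like the following sketch, which collapses the per-base regression outputs into a single stability score and flags a strand as unstable using a deliberately conservative threshold; the aggregation rule and threshold value are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Screening sketch: flag a sequence as unstable whenever its mean
# predicted degradation exceeds a low threshold, so that false
# negatives (unstable strands labelled stable) are minimized. Both the
# mean aggregation and the 0.5 threshold are illustrative assumptions.
def flag_unstable(predicted_degradation: np.ndarray,
                  threshold: float = 0.5) -> bool:
    """predicted_degradation: (n_bases,) predictions for one strand."""
    return float(predicted_degradation.mean()) > threshold

preds = np.array([0.3, 0.9, 1.2, 0.4])
print(flag_unstable(preds))  # True: screened out as likely unstable
```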
In the long term, such techniques may also help researchers better understand the reasoning behind the stability of certain RNA molecules and aid in the development of related technologies. It is hoped that this work will be of some use to other chemical researchers in creating better prediction models for this field.

References

TensorFlow: Large-scale machine learning on heterogeneous systems
Learning phrase representations using RNN encoder-decoder for statistical machine translation
Stanford biochemist works with gamers to develop COVID-19 vaccine
CONTRAfold: RNA secondary structure prediction without physics-based models
An introductory review of deep learning for prediction models with big data
OpenVaccine: mRNA vaccine degradation prediction
Deep Learning
Array programming with NumPy
Long short-term memory
Matplotlib: A 2D graphics environment
A systematic review of artificial intelligence and robots in value co-creation: current status and future research avenues
Debate on the relationship between neural network and the brain
Semi-supervised classification with graph convolutional networks
Machine learning: a review of classification and combining techniques
ViennaRNA Package 2.0. Algorithms for Molecular Biology
Evolutionary computation: An overview
mRNA vaccines - a new era in vaccinology
Scikit-learn: Machine learning in Python
Safety and efficacy of the BNT162b2 mRNA COVID-19 vaccine
Deep learning in neural networks: An overview. CoRR, abs/1404
How Pfizer vaccine could be cold comfort for some Asian nations
Python 3 Reference Manual
Theoretical basis for stabilizing messenger RNA through secondary structure design
Data Structures for Statistical Computing in Python
Understanding data augmentation for classification: when to warp
A review of recurrent neural networks: LSTM cells and network architectures
Advances in mRNA vaccines for infectious diseases