key: cord-0815219-o6ggyb4b
authors: Oluwaseun, Alakanse Suleiman; Yinka, Joel Ireoluwa; Ambrose, George Oche; Olamide, Adigun Temidayo; Adenike, Sulaiman Faoziyat; Nkechinyere, Ohanaka Judith; Mukhtar, Idris; Abiodun, Yekeen Abeeb; Durojaye, Olarewaju Ayodeji
title: Identification of lead inhibitors of TMPRSS2 isoform 1 of SARS-CoV-2 target using neural network, random forest, and molecular docking
date: 2022-01-14
journal: Data Science for COVID-19
DOI: 10.1016/b978-0-323-90769-9.00021-9
sha: e61e1ed3dcbdbdd09ea2af2f202beeb251c4c35d
doc_id: 815219
cord_uid: o6ggyb4b

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was responsible for over 4 million confirmed cases of severe acute respiratory syndrome, of which more than 300,000 cases were confirmed to be dead as of May 2020. The virulent endocytotic activities of SARS-CoV-2 have been associated with angiotensin-converting enzyme 2 (ACE2) and transmembrane protease serine 2 (TMPRSS2). Previous studies on the viral activation of TMPRSS2 focused most often than not on the isoform 2 of TMPRSS2, but the isoform 1 (529 residues) has also been shown to be expressed in target cells and contribute to viral activation in host. The inhibition of TMPRSS2 has been reported to grossly reduce the pathogenic effects of SARS-CoV-2 endocytotic activities. In this study therefore, we developed two machine learning models using random forest classifier (RFC) and neural networks (NNs) based on 2251 serine protease inhibitors to screen a database of 21,000,000 virtual compounds. We screened the hit compounds using absorption, distribution, metabolism, and excretion (ADME) properties and finally docked the filtered compounds into the predicted binding site of TMPRSS2 isoform 1 homology model to determine their corresponding binding affinity and plausible molecular interactions. One (ASONN) and four (ASOIRFC1–4) lead compounds were obtained from the ADME-NN and RFC filtered hits, respectively, having better binding affinity and lead-likeness properties than those of camostat; this could be due to extensive hydrogen and hydrophobic interactions.

Sever acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS) have been linked to SARS coronavirus (SARS-CoV) and MERS-coronavirus (MERS-CoV), respectively [1,2]. As of September 1, 2021, a total of 218,946,836 laboratory-confirmed cases of SARS infection by SARS-CoV-2 was recorded; of which, 4,539,724 were recorded to have resulted in death [3] .

SARS-CoV-2, a novel coronavirus detected late 2019 and closely related to SARS-CoV, has been etiologically implicated in the new pulmonary disease [4e6] . The viral endocytotic activities of coronaviruses and specifically those of SARS-CoV-2 in the targeted host cells are aided by the S1 unit of the spike protein (S), thus enhancing viral attachment and entry into the host cells via angiotensin-converting enzyme 2 (ACE2), which can be blocked via transmembrane protease serine 2 (TMPRSS2) inhibitors [7] .

In addition, the priming of S protein for SARS-CoV-2 viral entry is also aided by TMPRSS2 [8e10] and endosomal cysteine proteases cathepsin B and L (CatB/L).

A total of 2253 inhibitors Simplified Molecular Input Line Entry System (SMILES) of TMPRSS2 isoform 1 was downloaded from the Chemical European Molecular Biology Laboratory (CHEMBL) database with their corresponding Half-maximal inhibitory concentration (IC50) values. A python script using the pandas module was written to clean up the downloaded data (i.e., remove SMILES with missing IC50 values and duplicates and extract only SMILES and their corresponding IC50).

Molecular Operating Environment (MOE) [15] was used to calculate 2D descriptors for the SMILES. A total of 206 descriptors were calculated.

Exploratory data analysis was carried out using a python script (Pandas, Feature Selector, Matplotlib, and Seaborn). IC50 values were converted to the negative log of the IC50 value when converted to molar (pIC50) using this formula: 9-log10[IC50]. The pIC50 values were converted into categorical values of active (1) and nonactive (0), and the activity threshold was set at 8.0 (active ! 8.0, nonactive < 8.0). Correlation within descriptors was carried out using the python Feature Selector module, and the correlation threshold was set at 0.75. All intercorrelated descriptors were removed.

A python script implementing TPOT analysis [16] was carried out to investigate the best machine learning algorithms and their corresponding hyperparameters for our modeling task. TPOT classifier class was used, early stopping parameter was set to true, verbose was set to 4, and cross-validation with shuffling was set to 10-fold. All other parameters were left at default values.

In order to reduce the number of training features, we used a python script implementing the Feature Selector module. The module uses the XGBoost algorithm [17] to identify zero-and low-importance features. The cumulative importance value was set at 0.95, the task value was set to the classification, and the evaluation metric used was l2. All other parameters were left at default. The number of features was reduced to 43. The 43 descriptors were normalized using Scikit-Learn RobustScaler [18] .

The RFC was the best according to our TPOT analysis, hence an RFC algorithm was used to build a model based on our 43 descriptors using Scikit-Learn RFC class [18] . The following hyperparameters were used: bootstrap: False, criterion: entropy, max_features: 0.2, min_samples_leaf: 1, min_samples_split: 15, n_estimators: 100, and class weight was set at 0: 0.60, 1: 0.4. All other parameters were left at default. A deep NN model was built using a python TensorFlow module [19] in order to find the best values for NN hyperparameters, namely, epoch, batch size, and optimization function. We wrote a python script implementing Scikit-learn Keras wrapper and Grid Search CV class [18] . Cross-validation was set to 10 verbose 4, an NN with three layers of 100 units each, ReLU as the activation function.

After finding the best values for the abovementioned hyperparameters the NN for our modeling task was constructed. The NN architecture had the following parameters: input dimension: 44, three dense layers with 100 neurons (units) each, an output dense layer with 2 units (2 units to accommodate for both active(1) and nonactive(0)), each of the dense layers used ReLU activation function and an l2(0.01) regularizer, and the output layer used sigmoid activation function. Finally, the network was compiled using binary cross-entropy loss function, Adam optimizer, categorical accuracy as metric evaluation, early stopping and model check point as the callback, epoch: 20, batch size: 5, verbose: 2, and class weight: 0: 60, 1: 40. All other parameters were left at default.

SCUBIDOO a database [20] with 21,000,000 compounds was screened against both models. SCUBIDOO has sample representations of the database: 9994 (S); 99,977 (M); and 999,794 (L) compounds. We used the M sample to screen against our models, and hits were filtered using two criteria: synthesis probability (! 0.9) and drug-likeness (¼ 1). This criterion was calculated using MOE software [15] , while the filtration was executed using a python script. The hit compounds from this filtration process were further subjected to ADMET property verification and further reduced.

2.8 Sequence retrieval, alignment, and homology modeling of transmembrane protease serine 2 isoform 1 target protein

The crystal structure of TMPRSS2 of either isoform 1 or 2 was not available in the PDB database; thus the homology model was generated for this study. The protein FASTA query sequence of TMPRSS2 isoform 1 was retrieved from the NCBI database with the accession number NP_001128571.1. The template 5ce1.1.A was selected for the homology modeling using the SWISS-MODEL webserver [21] .

The generated homology model of TMPRSS2 isoform 1 was uploaded on the 3Drefine webserver [22] ; this makes use of iterative optimization of hydrogen bonding network in addition to atomic-level energy minimization on the optimized protein model using a composite physics and knowledge-based force field for efficient protein structure refinement. The output of 3Drefine webserver was further refined using GalaxyRefine webserver [23] ; this works by rebuilding side chains, repacking of side chain, and relaxing overall protein structure using molecular dynamic simulation (MDS) [23] .

2.10 Validation and quality estimation of optimized transmembrane protease serine 2 isoform 1 modeled protein

The optimized TMPRSS2 isoform 1 model was validated using RAMPAGE and PROSESS [24] . Quality estimation of the modeled protein was carried out using SAVES server and ProSA-web [25] . Ramachandran plot quality estimation was done using WHAT IF webserver. The resolution of the optimized modeled protein was calculated using ResProx server [26] and visualized using Discovery Studio 3.0 [27] .

Physiochemical characterization of transmembrane protease serine 2 isoform 1

Using the ProtParam webserver, the individual percentage of amino acid residues, molecular weight, theoretical pI, atomic composition, extinction coefficient, and instability index of the model protein were determined [28] .

The recognition of protein-ligand binding sites is of great importance in drug discovery. The binding site of the proposed lead compound(s) was predicted using P2Rank, which is a template-free machine learning algorithm embedded in PrankWeb [29, 30] . This works based on local chemical neighborhood ligandability unified on junctures placed on a solvent-accessible protein surface. Junctures or points with increased ligandability score are then clustered to form the resulting ligand binding sites [29] . 

The use of machine learning algorithms to virtually screen databases of compounds with unknown biological activity for hit compounds has become an established protocol in drug discovery [31] . Machine learning algorithms can be classified into various categories, including logic-based algorithms (decision trees), statistical algorithms (support vector machine, Bayesian network, K-nearest neighbor), and perceptron-based algorithms (NNs) [32] , but for the sake of this study, we classified them into two: NNs and non-NNs (logic-based and statistical algorithms).

In this study, we built models based on both types of machine learning algorithms and screened for hit compounds.

A total of 2251 inhibitors of protease inhibitors were extracted from CHEMBL, with pIC50 ranging from 3 to 11 ( Fig. 28 .1). 2D descriptors (206) were calculated using MOE software, descriptors with a correlation greater than 0.75 were removed ( Fig. 28 .1), and descriptors with zero or low importance were removed using the Feature Selector python package. A total of 44 descriptors were carried forward for modeling. The pIC50 values were converted into categorical data-type of active and nonactive using a python script, and the activity threshold was set to pIC50 8 and above.

There are huge arrays of non-NN algorithm, and each has its array of hyperparameters that must be tuned to achieve high-performance models. Selecting from these models could be a daunting task, therefore we used a python package, TPOT [16] , to assist in this task.

TPOT is a genetic programming AutoML (Auto machine learning) protocol with the primary aim of maximizing classification or regression accuracy. TPOT classifier class was trained on the compounds and validated (training set, 1576; test set, 675). The analysis produced RFC as the best model for our modeling task with an accuracy of 0.802. The RFC was trained on a training set of 1576 and validated on a test set of 675 compounds. The RFC model was subjected to a 10-fold cross-validation and was evaluated using various categorical classification evaluation metrics (Table 28 .1). A confusion matrix ( Fig. 28. 2) was constructed from which the classification evaluation metrics were calculated.

Precision metric measures the positive predictive value rate; that is, it measures how well the model identifies true positives as against false positives. The RFC model has a precision score of 0.85 (Table 28 .1). The recall metric, also known as sensitivity score, is a measure of true positive rate, i.e., evaluates how well the model classifies true positives from false negatives. F1 is the harmonic mean of both precision and accuracy. Specificity measures the true negative rate; that is, it measures how well a model classifies true negatives correctly. In order to get a full picture of how well the model is identifying true positive (sensitivity) and true negative (specificity), we calculate another metric call balanced accuracy (Table 28 .1). Balanced accuracy is the mean of both sensitivity and specificity, and it enables us to evaluate the model's ability to identify active and nonactive inhibitors. The RFC model had a balanced accuracy of 0.808. The ROC_AUC score measures how often the model picks a true positive ahead of a false positive (Fig. 28.3 ). The RFC model had an ROC_AUC score of 0.84 (Note that all the metric values ranged from 0 to 1, with 1 being the best value and 0 the worst value possible).

Having evaluated this RFC model with the abovementioned metrics the results, therefore, suggest that the model is robust and reliable for the screening of compounds with unknown biological activity.

Machine learning models provide an immerse opportunity in predicting unknowns, but they come with a challenge of interpretability. Most of the best performing algorithms are black boxes in nature and how they come about their prediction with the imputed features is not known [9] . It is therefore imperative to provide some sort of model explanation for every selected model. We used the Skater python library to provide a suitable model explanation. Skater explains machine learning models using two major plots: feature importance plot and partial dependence plot (PDP). The feature importance plot ranks features in the order of their importance to the model. It revealed SLogP_VSA4, SMR_VSA4, vsa_other, apol, and GCUT_SLOGP_0 as the top descriptors important to the model (Fig. 28.4) .

The PDP investigates the marginal effect of different values of a descriptor on a predicted outcome of the machine learning model [33] . For categorical models, PDP investigates the marginal effect of the descriptor on the prediction probability of a class. For our study, the class in consideration is the active class. The PDP suggests that increasing the values of SLogP_VSA4 has the highest marginal effect on the predicted outcome (i.e., active class) of the model. SMR_VSA4 had the next highest marginal effect; however, this effect was only visible within a narrow range of 0.8e1.0. Further increase in values of SMR_VSA4 resulted in decreased effects, which were stable afterward (Fig. 28.5) . VSA_other, apol, and GCUT_SLOGP_0 had an equal marginal effect on the active class predictions (Fig. 28.5 ). 

In this study, NN was used to build a model on the inhibitors of TMPRSS 2. The NN model was built using the TensorFlow Python library [20] . The training set consists of 1576 compounds, test set for evaluation consisted of 675 compounds, and validation data was 10% of the training set.

As noted in non-NN, hyperparameters of NNs are also quite extensive; in order to solve this problem, python Scikit-learn Keras wrapper and GridSearchCV library were used to search for the optimum values: epoch, batch size, and the optimization function.

The optimum values obtained from this exhaustive search after a 10-fold crossvalidation are epoch: 20, batch size: 5, and optimizer: Adam. These values were therefore used to build the NN.

Another common problem of NN is overfitting of the model to the training data, in order to avoid this a regularization kernel (l2 regularize :0.01) was set for each layer of the neural architecture and finally an early stopping callback was set when compiling the neural architecture.

The model was trained over eight epochs (although the epoch parameter was set to 20 but was stopped early at the eighth epoch). The model training loss and training categorical accuracy, validation loss, and categorical accuracy were measured at each epoch and plotted (Fig. 28.6 ). 

The eighth epoch had a training loss of 0.26 and training categorical accuracy of 0.80, validation loss was 0.25, and validation categorical accuracy was 0.81 (Table 28. 2). The eighth epoch was the best and saved. The model was finally evaluated with the test set with a categorical accuracy of 0.84 and a loss of 0.46 (Table 28. 2).

A confusion matrix was constructed (Fig. 28.7) and categorical metrics evaluated (Table 28 .2) (it should be noted, however, that evaluation is on the eighth epoch only).

The NN model had the following metric score: precision 0.79, recall (sensitivity) 0.82, F1 0.80, specificity 0.85, and balanced accuracy 0.84 (Table 28 .2). The results, therefore, suggest that the NN model is reliable and can be used for extrapolations. 

As stated earlier, model interpretation is important in understanding machine learning model predictions. The Skater python library was also used to provide explanations for the NN model (Feature Importance, PDP) (Figs. 28.8 and 28.9). Of the 43 descriptors used to build the NN model, SLogP_VSA4, PEOE_VSA_FHYD, a_nf, and SMR_VSA4 were the top important features (Fig. 28.8) . A PDP showed the marginal effect of these descriptors on the predicted outcome (active class). Of these top descriptors, the a_nf descriptor had the highest marginal effect on the active class prediction probability, while PEOE_VSA_FHYD had the lowest contribution to the marginal effect on the active class prediction probability (Fig. 28.9 ). However, increasing the values of these descriptors had an increasing marginal effect on the model predictions ( Fig. 28.9 ). 

Although the primary aim is to compare both models, we evaluated both models on some performance metrics to see which model was the best performer. From this comparison (Table 28. 3), NN had the lowest error rate and the highest balanced accuracy, also the sensitivity of NN to true positive is about a fold higher than RFC. This, therefore, suggests that although both NN and RFC had a high level of accuracy and equal level of true negative identification (specificity), NN would most likely screen the database with a high level of sensitivity to compounds with inhibitory activities.

Having developed and validated these two models (RFC and NN), we screened the SCUBIDOO database [20] sample of 100,000 compounds. The random forest model classified over 80,000 compounds as active and the NN model classified 3000 compounds as active (Fig. 28.10) . The high number of predicted active compounds by RFC was however not surprising, which is due to the random forest low sensitivity score (Table 28. 3). However, in order to filter down this number, a python script was written to filter down the base of the compound on two criteria: synthesis probability and druglikeness. The script reduced the RFC model hits to 1600 compounds and the NN model hits to 250 compounds. Furthermore, using ADMET properties on SwissADME [34] the compounds were further screened down to 784 and 70, respectively.

The TMPRSS2 isoform 1 physiochemical properties were predicted and analyzed using ProtPara webserver. The protein sequence for TMPRSS2 isoform 1 consisted of 346 amino acid residues, with Ser (8.7%) and Met and Phe (2.6%) amino acid residues having the highest and lowest composition, respectively, as indicated in Fig. 28 .11. The computed pI for the amino acid residues for TMPRSS2 isoform 1, i.e., the isoelectric pH, In experimental studies like isoelectric focusing and 2D electrophoresis, the isoelectric pH of a protein plays a critical role. The stability of the protein was computed by analyzing its instability index. The index for this protein was 35.18. This protein may be stable because its predicted value is within the range of 40 [35, 36] .

The Hum-mPLoc3 webserver [36] was used to predict the subcellular localization of TMPRSS2 isoform 1. The analysis revealed that the human TMPRSS2 isoform 1 is predominantly a plasma membrane protein (Fig. 28.12 ).

Quaternary structures with complex interactions and their physiologic roles are necessary for detailed comprehension of the human system. Experimental elucidation of protein structures using either nuclear magnetic resonance spectroscopy or X-ray crystallography is more realistic but time consuming and occasionally unsuccessful in case of membrane proteins; thus these necessitate the use of homology modeling techniques. Serine protease hepsin of template 5ce1.1.A with a 33.82% sequence identity and 2.5 Å resolution was used in the homology modeling of TMPRSS2 isoform 1. Global Model Quality Estimate helps in the identification and selection of an optimal template in the modeling process. For this study, the score was 0.48. The sequence similarity to the query sequence was 0.38, which covers the range of 183e528 of the amino acid residues. The quality of the built model was assessed by the QMEAN scoring function, which uses potentials of mean force to generate global and per residue quality estimates. The QMEAN score [37] , which is the best of the model generated, was À1.39, as indicated in Fig. 28.13D . The quality of the model developed was assessed using SAVES [38] , PROSESS [24] , and PROSA webservers [25] . The percentage of residues in the most favored regions and the percentage of Phi/Psi pairs in the favored regions of the Ramachandran plot of the nonoptimized protein was 86% and 92% compared to the 93.08% and 97.60% expected values, respectively, as indicated in Fig. 28.13A .

The overall quality factor of the model was assessed further using ERRAT and was observed to be 92.923%. A good model of high resolution is expected to have ERRAT quality values around 95% higher. Thus the model developed has a lower resolution, as indicated in Fig. 28 .13B.

One of the computational limitations of protein modeling is the variation of the experimental and predicted model from the native true structure of a protein, thus necessitating the need for refinement and optimization of protein models. In this study, an improvement in the model was made using 3Drefine [22] , which led to the generation of five refined protein models from an initially input SWISS model. 3Drefine works by optimizing hydrogen bonding network in addition to atomic-level energy minimization. The best model of the five generated, as indicated in Fig. 28 .14D, has the lowest potential energy according to 3Drefine force field. Quality check was carried out on the 3Drefine model having the lowest potential energy. An improvement in the quality of the model was observed compared to the unrefined model, as indicated in Fig. 28.13D . The percentage of residues in the most favored regions and the percentage of Phi/Psi pairs in the favored regions of the Ramachandran plot of the 3Drefineoptimized protein was 89% and 96.51% compared to the 86% and 92% of the unrefined model, respectively. But this improvement was below the expected values 93.08% and 97.60%. The ERRAT and Z-score qualities also improved and were 95.5% and À8.68, respectively, as indicated in Figs. 28.14 and 28.15.

The quality of the 3Drefine model was further optimized using GalaxyWEB [39] , which works on the basis of rebuilding the side chain and overall protein relaxation using MDS. The best of the five models generated was model one. Quality check and validation was carried out on this model also with a dramatic increase in the quality of the protein.

The percentage of residues in the most favored regions of the Ramachandran plot was 92%, which exceeded the expected by 2%. The ERRAT quality factor score was also improved by 97.01%, which also exceeded the expected by 2%. The quality of the model assessed by PROSA also indicated an improvement in Z-score, as indicated in Fig. 28 .15C. Thus compared to the unrefined and 3Drefine, the GalaxyWEB generated the best model as corroborated by the quality and validation scores. 3.9 Binding pocket prediction of transmembrane protease serine 2 isoform 1

Using a template-free machine learning algorithm, P2Rank embedded in PrankWeb [29, 30] was used to predict the binding site of ligand on TMPRSS2. P2Rank works based on points situated on solvent-accessible protein surface from local chemical neighborhood ligandability. The resulting binding sites are formed via the cluster of points with high ligandability score. The predicted binding sites are indicated in Table 28 .1.

The set of computational methodology used in the analysis of large databases with the aim of identifying potential hit candidates is referred to as virtual screening [40e42].

Ligand-based and structure-based (or receptor-based) are the two main types of virtual screening methods used in potential hit identification. Both the methods were carried out. In this study, through ligand-based virtual screening, 1600 and 250 compounds obtained from RFC and NN models were screened to 70 and 784 compounds using SwissADME, respectively. In ligand-based virtual screening of the compounds obtained from RFC and NN using SwissADME the compounds were screened to 70 and 784, respectively; all the compounds passed the Lipinski's [43] , Ghose's [44] , Oprea's [45] , Veber's [46] , Varma's [47] , Egan's [48] , and Muegge's [49] rules filters for drug-likeness evaluation. In addition to these, no PAINS (pan assay interference compounds) [50] and Brenk alert [51] were recorded and all passed the lead-likeness rules. These compounds were then used in receptor-based virtual screening so as to obtain potential leads compared to camostat; the established standard drug was used as an inhibitor of TMPRSS2. The leads obtained from the receptor-based virtual screening of the compounds are indicated in Figs. 28.16 and 28.17AeD. The binding affinities were À8.7, À8.4, À8.4, À8.5, and À8.5 kcal/mol, respectively, compared to À7.4 kcal/mol obtained for camostat, as indicated in Fig. 28 .18. All the amino acids involved in bound interactions are within the range of predicted binding pockets, as indicated in Table 28 .4.

The compound ASOINN indicated in Fig. 28 .16 was the lead obtained from the structural virtual screening of the 70 compounds. The lead forms nine hydrophobic bond interactions, which included two HIS 333 , four TRP 498 , two LYS 379 , and LEU 456 . In addition to these, two SER 497 bound interactions were formed.

Compounds ASOIRFC1e4 as indicated in Fig. 28 .17AeD were obtained from TMPRSS2 isoform 1 virtual screening of 784 compounds, which were obtained from the SwissADME-screened RFC 1600 compounds.

As indicated in Fig. 28 .17A, ASOIRFC1 formed six hydrogen bond interactions, which included SER 478 , SER 497 , SER 473 , two GLY 499 , and GLY 501 , as well as five hydrophobic bonds (three TRP 498 , LEU 456 , and LYS 379 ); these are in contrast to the nine hydrophobic interactions and two hydrogen bonds formed in ASOINN. As indicated in Fig. 28 .17C, ASOIRFC3 formed four hydrogen bond interactions, which included ALA 423 , GLY 422 , ASP 477 , and GLY 296 . In addition to these, two hydrophobic interactions, including ALA 503 and ALA 437 , were formed. ASOIRFC3 so far formed the lowest number of interactions.

ASOIRFC4 as indicated in Fig. 28 .17D formed seven bond interactions, which included two hydrogen bonds, three hydrophobic interactions, pi-lone pair, and pisulfur type bonds.

Camostat, the standard drug used as an inhibitor of TMPRSS2, forms seven hydrogen bonds and one electrostatic and hydrophobic bond, as indicated in Fig. 28.18 .

Hydrogen bonds are the prevailing directional intermolecular interactions in biological complexes, and the predominant contribution to the specificity of molecular recognition. In drug design, hydrogen bonds are exploited to gain specificity owing to their strict distance and geometric constraints. Furthermore, previous studies indicated that, contribution of hydrogen bond to free energy is dependent on local environment: a solvent-exposed hydrogen-bond contributes significantly less to net interaction energy than the same hydrogen-bond in a buried hydrophobic pocket [51a,51b] . To validate the docking protocol, we redocked on of the lead (ASOINN) compound into the predicted binding pocket of TMPRSS2. As indicated in Fig. 28 .19, the redocked pose overlapped almost completely with the previous orientation, thus indicating the reliability of the docking protocol and the scores obtained. Hydrogen bonds are nonbonded interactions; they are specific, directional, and in a short range. They occur via a covalent interaction between hydrogen atoms and electronegative atoms, which usually involves N, S, or O. The strength of hydrogen bonds is optimal with alignment of the atoms involved in bond formation, and this occurs especially when the H donor points directly at the electron acceptor pair. However, the strength of hydrogen bonds is weaker than that of covalent or ionic bonds, contributing to the specificity of molecular recognition [52, 53] . Heavy atoms in NeH/O, NeH/N, and OeH/O hydrogen bonds were all found to be separated by similar median distances of approximately 3.0 Å in previous works. In this study, the bond distances for majority of the lead complexes are within the required threshold of median distance 3.0 Å [58] . Hydrophobic contacts are by far the most common interactions in proteineligand complexes, accounting for 66,772 contacts at a distance cut-off of 4.0 between a carbon and a carbon, halogen, or sulfur atom. The group formed by an aliphatic carbon in the receptor and an aromatic carbon in the ligand is the most populous, accounting for over 42,000 interactions [59] . This suggests that aromatic rings are common in small molecule inhibitors. In fact, one or more aromatic rings are present in 76 percent of marketed drugs, with benzene being by far the most commonly encountered ring system [60] . Not surprisingly, leucine side-chains are the most frequently involved in hydrophobic interactions, followed by valine, isoleucine, and alanine side-chains [58] . Based on the requirements of absorption and permeation of drug molecules, the number of hydrogen bonds has been shown to be limited. The Lipinski rule-of-five, for example, states that drug molecules with hydrogen bond acceptors greater than 10 and hydrogen bound donors greater than 5 are most likely to have poor absorption or permeation properties [62] . A single hydrogen bond formed either by intra-or intermolecular interaction is weaker than ionic or covalent bond but stronger than van der Waals interaction, thus unable to aid a drug-receptor interaction alone. Moreover, the formation of multiple drugreceptor hydrogen bond interactions conferred stability, which is an essential feature of drug-receptor interactions [57] .

In this study, the lead compounds, ASOINN and ASORFC1e4, are within the range of accepted number of hydrogen bounds in terms of acceptors and donors; thus they all may have good absorption or permeation properties. In addition to these, although all the compounds are within the range of accepted number of hydrogen, only ASOIRF1 has the highest number of hydrogen bonds (6) compared to camostat (7); thus ASOIRF1 may be more stable than ASOINN and ASOIRF2e4.

Previous studies have shown that synergistic receptor-ligand H-bond pairings potentiate high-affinity binding, which correspond to an increase in binding affinity [54] . In addition to these, hydrogen bonding and optimized hydrophobic interactions have been shown to both stabilize the ligands at the target site and help alter binding affinity and drug efficacy [55, 56] . The recorded increase in binding affinities of ligands ASOINN and ASORFC1e4 in contrast to camostat may be due to the observed increase in the hydrophobic and hydrogen bound interactions, in addition to electrostatic and pinteractions.

The search for the therapeutic treatment of SARS-CoV-2 infection (coronavirus disease 2019 ) is not only of utmost importance but also time sensitive. Hence a fast method to screen for plausible therapeutic drug against different targets in SARS-CoV-2 has been employed all over the world. In this study, we selected TMPRSS2 isoform 1 as a therapeutic target for SARS-CoV-2 and employed the power of machine learning to develop two models (random forest and NNs) based on 2251 inhibitors of serine proteases downloaded from CHEMBL. These models have been used to screen a sample of SCUBIDOO database (M: 99,977), which is a database of 21,000,000 virtual compounds. We have therefore identified five possible lead compounds having shown good ADMET properties, binding affinity, and molecular interaction with TMPRSS2.

To further improve this work, the five lead compounds can be used to search for more similar compounds in the SCUBIDOO database and finally validated experimentally.

13 (AeD) Homology model protein, quality, and validation check of transmembrane protease serine 2 (TMPRSS2) isoform 1 before optimization

14 (AeD) Homology model protein, quality, and validation check of transmembrane protease serine 2

Detection of 2019 novel coronavirus (2019-nCoV) by realtime RT-PCR

Middle East respiratory syndrome: emergence of a pathogenic human coronavirus

Coronavirus disease (COVID-19)

Clinical features of patients infected with 2019 novel coronavirus in

A novel coronavirus outbreak of global health concern

A novel coronavirus from patients with pneumonia in China

A pneumonia outbreak associated with a new coronavirus of probable bat origin

Evidence that TMPRSS2 activates the severe acute respiratory syndrome coronavirus spike protein for membrane fusion and reduces viral control by the humoral immune response

Efficient activation of the severe acute respiratory syndrome coronavirus spike protein by the transmembrane protease TMPRSS2

A transmembrane serine protease is linked to the severe acute respiratory syndrome coronavirus receptor and activates virus entry

TMPRSS2 contributes to virus spread and immunopathology in the airways of murine models after coronavirus infection

Clinical isolates of human coronavirus 229E bypass the endosome for cell entry

Wild-type human coronaviruses prefer cell-surface TMPRSS2 to endosomal cathepsins for cell entry

TMPRSS2 isoform 1 activates respiratory viruses and is expressed in viral target cells

The Molecular Operating Environment), software available from Chemical Computing Group Inc., 1010 Sherbrooke Street West, Suite 910

TPOT: a tree-based pipeline optimization tool for automating machine learning

XGBoost: reliable large-scale tree boosting system

Scikit-learn: machine learning in Python

Tensorflow: a system for large-scale machine learning

SCUBIDOO: a large yet screenable and easily searchable database of computationally created chemical compounds optimized toward high likelihood of synthetic tractability

SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information

3Drefine: an interactive web server for efficient protein structure refinement

GalaxyRefine: protein structure refinement driven by side-chain repacking

PROSESS: a protein structure evaluation suite and server

ProSA-web: interactive web service for the recognition of errors in three-dimensional structures of proteins

Resolution-by-proxy: a simple measure for assessing and comparing the overall quality of NMR protein structures

Version 3.0

MFPPIeMulti FASTA ProtParam interface

PrankWeb: a web server for ligand binding site prediction and visualization

P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure

Applications of machine learning in drug discovery and development

Supervised machine learning: a review of classification techniques

Causal interpretations of black-box models

Enzymatic detoxification of cyanide: clues from Pseudomonas aeruginosa Rhodanese

Half a century of Ramachandran plots

Hum-mPLoc 3.0: prediction enhancement of human protein subcellular localization through modeling the hidden correlations of gene ontology and functional domain features

SWISS-MODEL: homology modelling of protein structures and complexes

Verification of protein structures: patterns of nonbonded atomic interactions

GalaxyWEB server for protein structure prediction and refinement

What has virtual screening ever done for drug discovery?

Molecular fingerprint similarity search in virtual screening

Hierarchical virtual screening approaches in small molecule drug discovery

Poor aqueous solubilitydan industry wide problem in drug discovery

Knowledge-based, central nervous system (CNS) lead selection and lead optimization for CNS drug discovery

Current trends in lead discovery: are we looking for the appropriate properties?

Overview on the rule of five

Biopharmaceutic classification system: a scientific framework for pharmacokinetic optimization in drug research

Predicting ADME properties in drug discovery

Pharmacophore features of potential drugs

PAINS: relevance to tool compound discovery and fragment-based screening

Lessons learnt from assembling screening libraries for drug discovery for neglected diseases

No free energy lunch

Computational Models for Drug Design and Delivery (Doctoral dissertation

The role of functional groups in drugereceptor interactions

The CH/p hydrogen bond in chemistry. Conformation, supramolecules, optical resolution and interactions involving carbohydrates

Regulation of protein-ligand binding affinity by hydrogen bond pairing

Optimized hydrophobic interactions and hydrogen bonding at the target-ligand interface leads the pathways of drug-designing

Increasing the binding affinity of VEGFR-2 inhibitors by extending their hydrophobic interaction with the active site: design, synthesis and biological evaluation of 1-substituted-4-(4-methoxybenzyl) phthalazine derivatives

Receptors and drug action, in: Foye's Principles of Medicinal Chemistry

A systematic analysis of atomic proteineligand interactions in the PDB

SwissADME: a free web tool to evaluate pharmacokinetics, druglikeness and medicinal chemistry friendliness of small molecules