key: cord-0982551-61kohuyx authors: Lin, Hui-Heng; Zhang, Qian-Ru; Kong, Xiangjun; Zhang, Liuping; Zhang, Yong; Tang, Yanyan; Xu, Hongyan title: Machine learning prediction of antiviral-HPV protein interactions for anti-HPV pharmacotherapy date: 2021-12-21 journal: Sci Rep DOI: 10.1038/s41598-021-03000-9 sha: d3fa9de7d8bf47311092eb9c135ce6f032c4d266 doc_id: 982551 cord_uid: 61kohuyx

Persistent infection with high-risk types of human papillomavirus (HPV) can cause diseases including cervical and oropharyngeal cancers. Nonetheless, there is so far no effective pharmacotherapy for infection with high-risk HPV types, which therefore remains a severe threat to women's health. Following a drug repositioning strategy, we trained and benchmarked multiple machine learning models to predict potentially effective antiviral drugs for HPV infection. We optimized the models and measured their predictive performance on a dataset of 182 antiviral-target interaction pairs, all involving antivirals approved by the United States Food and Drug Administration. Benchmarking showed that, among the Support Vector Machine, Random Forest, AdaBoost, Naïve Bayes, K-Nearest Neighbors, and Logistic Regression classifiers, the optimized Support Vector Machine and K-Nearest Neighbors classifiers were the two best predictors by precision (0.80 and 0.85, respectively). We applied these two predictors together and predicted 57 antiviral-HPV protein interactions from 864 antiviral-HPV protein association pairs. Our work provides drug candidates for anti-HPV drug discovery. To our knowledge, this is the first HPV-oriented computational drug repositioning study of this kind.
3D structural conformations have been revealed through in silico simulation 9 and structural biology 10 . Other studies have tested, discussed, and reviewed the in vitro effects of existing drugs, i.e., Human Immunodeficiency Virus (HIV) protease inhibitors, on HPV proteins and HPV-infected cells [11] [12] [13] [14] [15] . These reports on existing drugs for HPV treatment suggest that, compared with de novo drug discovery, repositioning existing drugs is a better and quicker strategy. Nonetheless, the drug efficacies reported were moderate, and no further progress has been seen in later stages, e.g., in clinical contexts. Hence, this research is still far from identifying drug candidates with good therapeutic anti-HPV potential. Two reasons could explain this limitation. One is that inappropriate compounds or drug candidates were chosen for testing. The other is that too few drug candidates were tested, which restricts the probability of identifying appropriate ones. To meet the urgent need for effective anti-HPV drug discovery, and following a target-oriented drug repositioning strategy, we collected and analyzed 96 antiviral drugs for a relatively large-scale in silico screen against the 9 HPV-16 proteins, so as to computationally identify antivirals with good potential for targeting HPV proteins. Briefly, we constructed, benchmarked, and selected machine learning predictive models (also known as predictors) to predict antivirals that could interact with HPV proteins, because drug-target interactions are a vital prerequisite of molecular therapeutic mechanisms.
Through benchmarking, we selected the high-precision K-Nearest Neighbor (KNN) 16 and Support Vector Machine (SVM) 17 predictors to detect high-confidence antiviral-HPV protein interaction pairs. Many researchers have predicted drug targets, compound-protein interactions, or protein-protein interactions using machine learning or other computational methods [18] [19] [20] [21] . However, to the best of our knowledge, no prior study has focused on the relationships between antiviral drugs and HPV proteins.

Research question formulation. In theory, a therapeutic target and its drug molecule bind to each other. Identifying potential HPV protein targets of antivirals can therefore be treated as a binary classification task, i.e., classifying the HPV proteome into two classes: HPV proteins that potentially interact with a drug molecule, and HPV proteins that do not. Machine learning is a state-of-the-art method for solving such binary classification tasks (Fig. 1) . Because known antiviral drug-target interaction pairs were available to serve as a known-label validation dataset, we chose supervised (machine) learning methods for this study.

Figure 1. Research framework of this study. Predicting antiviral drug-HPV protein interactions can be considered a binary classification task, for which machine learning is a good method. In this work, the features of antiviral drug-target pairs were transformed into vectors for constructing machine learning predictors. Through benchmarking, the best predictors were selected to predict antiviral-HPV protein interactions.

25 and Therapeutic Targets 26 databases (As of 19th July 2020).
Drug-target interaction pairs containing United States Food and Drug Administration (FDA)-approved antiviral drugs were used as the validation dataset, because a validation set of FDA-approved antivirals better reflects the real-world application value of our models. The remaining drug-target interaction pairs were used as the training dataset. In this classification task, an interacting pair of an antiviral drug and a protein was defined as a positive instance, and a non-interacting pair of antiviral and protein as a negative instance. To balance the data for binary classification, we randomly generated non-interacting drug-target pairs so that the ratio of positive to negative instances was 1:1. In more detail, we first constructed a complete bipartite drug-target network, in which each antiviral was connected to every target protein in the network. Removing the known antiviral-target interaction pairs left the non-interacting drug-target pairs. We then randomly drew non-interacting pairs without replacement, treating them as negative instances, until the ratio of positive to negative instances reached 1:1. Next, we combined the proteome of the high-risk HPV-16 subtype (9 proteins in total) with all the antiviral drugs to form the drug-protein interaction prediction dataset. See Supplementary Table S1 for the training dataset of antiviral drug-target interaction pairs, Supplementary Table S2 for the drug-target interaction pairs used in validation, and Supplementary Table S3 for UniProt's HPV-16 proteome, i.e., 9 proteins. All antivirals' molecular structures were then analyzed using ChemmineR 27 , and 1024-dimensional chemical fingerprints were generated through R scripting 28 .
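The negative-instance sampling described above (complete bipartite network, removal of known interactions, random draws without replacement up to a 1:1 ratio) could be sketched roughly as follows. This is a minimal illustration, not the authors' in-house script: the function name `sample_negatives`, the toy drug and protein identifiers, and the fixed seed are all assumptions made for the example.

```python
import random
from itertools import product


def sample_negatives(drugs, targets, positives, seed=0):
    """Draw as many non-interacting (drug, target) pairs as there are positives."""
    positive_set = set(positives)
    # Complete bipartite network: every drug is paired with every target.
    all_pairs = product(drugs, targets)
    # Removing the known interactions leaves the candidate negative pairs.
    candidates = [pair for pair in all_pairs if pair not in positive_set]
    if len(candidates) < len(positive_set):
        raise ValueError("not enough non-interacting pairs for a 1:1 ratio")
    rng = random.Random(seed)
    # random.sample draws without replacement, so no negative pair repeats.
    return rng.sample(candidates, len(positive_set))


# Hypothetical toy data; the real pairs are in Supplementary Tables S1 and S2.
drugs = ["drug_A", "drug_B", "drug_C"]
targets = ["prot_1", "prot_2", "prot_3", "prot_4"]
positives = [("drug_A", "prot_1"), ("drug_B", "prot_2"), ("drug_C", "prot_4")]

negatives = sample_negatives(drugs, targets, positives)
print(len(positives), len(negatives))
```

One caveat of this scheme, which the sketch shares, is that an unrecorded but real interaction can be drawn as a "negative", so the negative labels are presumed rather than verified.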
All proteins were analyzed using ProtR 29 , and 10,784 high-dimensional protein descriptor features were generated (Table 1) . The descriptors were protein structural and physicochemical properties. These descriptors have been widely used in in silico studies of protein-protein and protein-ligand interactions, where they worked well 29 . All datasets were integrated, scaled, and normalized in the R computing environment 28 .

Machine learning and prediction. Briefly, the machine learning process followed these general steps. First, the training dataset was loaded into different machine learning algorithms, and fivefold cross-validation with grid search was applied during training to identify the model parameters giving the best predictive power. Next, predictors with good performance were applied to classify the validation dataset with known labels. Lastly, the verified best predictors were used to predict antiviral-HPV protein interaction pairs. Many supervised learning algorithms exist for different purposes. Among them, the Support Vector Machine, Random Forest 30 , Logistic Regression 31 , etc., are classic algorithms for binary classification. We chose 6 types of classifiers suited to binary classification for building predictive models: Support Vector Machine, Random Forest, AdaBoost 32 , Logistic Regression, Naïve Bayes 33 , and K-Nearest Neighbor. Among them, the K-Nearest Neighbor classifier and AdaBoost have shown good performance in predicting miRNA-disease associations 34, 35 , and Chen et al. developed a Random Forest-based model, RFMDA, which had good predictive power for multiple kinds of human complex diseases 36 . These studies support our choice of the aforementioned predictors.
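The workflow outlined above (six classifier families, fivefold cross-validation, grid search) might look roughly like the following in scikit-learn. This is a hedged sketch on synthetic data, not the study's code: the feature matrix from `make_classification`, the choice of `GaussianNB` as the Naïve Bayes variant, the `precision` scoring, and the small KNN parameter grid are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the balanced drug-target pair feature vectors.
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=10, random_state=0
)

# The six classifier families named in the text, with default parameters.
classifiers = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),  # assumed variant; the text does not specify
    "KNN": KNeighborsClassifier(),
}

# Fivefold cross-validated precision for each default-parameter classifier.
cv_precision = {
    name: cross_val_score(clf, X, y, cv=5, scoring="precision").mean()
    for name, clf in classifiers.items()
}

# Grid search with fivefold cross-validation; this grid is hypothetical.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [5, 15, 35], "weights": ["uniform", "distance"]},
    cv=5,
    scoring="precision",
)
grid.fit(X, y)

for name, score in cv_precision.items():
    print(f"{name}: mean CV precision = {score:.2f}")
print("best KNN parameters:", grid.best_params_)
```

Scores obtained on this synthetic matrix say nothing about the values reported in the paper; the block only shows how the benchmarking and tuning steps fit together.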
With default parameters, the 6 predictors first went through a quick check of training and performance measurement. At this early stage, as expected, none of the predictors performed well. Subsequently, to identify better parameters, grid search with fivefold cross-validation and performance benchmarking were conducted. The predictive performance of the 6 predictors with improved parameters was tested on the known-label validation dataset. After comparing the predictors, we selected the optimized K-Nearest Neighbors classifier and SVM, which had the highest precision scores and were therefore the most appropriate predictors for identifying high-confidence drug-protein interaction pairs among the 864 antiviral-HPV protein association pairs. The data processing and machine learning computations were done with in-house Python 37 and R 28 scripts. Libraries and modules used were scikit-learn 38 , Pandas 39 , NumPy 40,41 , SciPy 42 , Bioconductor 43,44 , and BioMart 45 . We acknowledge the authors and developers of these computational tools. Specifically, the KNN parameter set finally used for predicting antiviral-HPV protein interaction pairs was: number of neighbors set to 65, "weights" set to "distance", and "leaf_size" set to 60. For SVM, gamma was set to 0.001, C (the regularization parameter) was set to 0.0002, and a polynomial kernel with degree = 3 was used. The remaining parameters were left at the defaults of the Python library scikit-learn 38 .

Dataset overview. The antiviral drugs and their associated targets were retrieved and analyzed as described in the Methods section. Table 2 provides a summary of our dataset. In total, 61 antiviral drugs formed 284 antiviral-target interaction pairs with their targets.
To measure the predictors' performance, the antiviral-target interactions were split into two sets: 102 pairs were used for training (fitting) the machine learning predictors, and the remaining 182 pairs were used for validating their predictive performance. We also combined the 9 proteins of HPV (its complete proteome) with 96 antiviral drugs to form 864 antiviral-HPV protein association pairs ( Table 2 ).

Performances of machine learning models. Initially, we chose 6 types of machine learning models and fitted them to the antiviral-target interaction training dataset using fivefold cross-validation. This preliminary benchmark of the 6 predictors is shown in Table 3 . Briefly, and as expected, none of the predictors performed satisfactorily. SVM with default parameters (RBF kernel) performed the worst among the 6 predictors across the metrics. The AdaBoost classifier had the best precision score but the lowest recall score. The F1-measure is the harmonic mean of precision and recall. The highest F1-measure, 0.63, came from the Random Forest classifier. Its other metrics were not high either: all were close to each other, at around 0.65. The highest accuracy and AUC (area under the receiver operating characteristic curve) among the 6 predictors were 0.66 and 0.68, respectively, both also from Random Forest. The metrics of KNN with default parameters were all around 0.6, indicating unsatisfactory performance in fivefold cross-validation as well. Naïve Bayes likewise did not perform well, and for both KNN and Naïve Bayes the recall score was higher than their other metrics (Table 3) .
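Since the benchmark above rests on precision, recall, and the F1-measure (their harmonic mean), a short worked example may help show how a predictor can score high on precision while its F1 stays modest. The confusion-matrix counts below are hypothetical and chosen only for the arithmetic; they are not taken from Tables 3 or 4.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    # Precision: share of predicted positives that are true positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were recovered.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts: few false positives, but many missed positives.
p, r, f1 = precision_recall_f1(tp=17, fp=3, fn=33)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

With these illustrative counts, precision is 17/20 while recall is 17/50, so the harmonic mean is pulled toward the lower value. This is the kind of tradeoff that motivates choosing a metric according to the goal of the screen.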
Table 2. Summary of the antiviral-target and antiviral-HPV protein interaction datasets used in the machine learning processing of this study. a The validation set consisted of U.S. FDA-approved antiviral drugs and these drugs' binding target proteins. b 9 proteins of HPV-16. c The ratio of positive to negative instances was 1:1. d The validation set was larger than the training set because (1) more FDA-approved antivirals were desired for validating the real-world application value of our machine learning models; (2) the generalization performance of machine learning models could be reflected using a smaller training set but a larger validation set.

Next, we tuned the predictors' parameters through grid search with fivefold cross-validation, and tested how combinations of parameters affected predictive performance on the known-label validation dataset. At first, we focused on optimizing predictors for better values of comprehensive metrics, such as the F1-measure, accuracy, or AUC. Despite a great number of attempts, no high scores on these metrics were obtained. Given that a high precision score indicates few false positives among the predictions, and a high recall score indicates few false negatives, we changed our strategy to precision-oriented optimization. The purpose of this work was to identify antivirals that interact with HPV proteins, and with a high-precision predictor, the predicted positive instances contain fewer false positives. Therefore, we preferred precision over recall when selecting predictors for antiviral-HPV protein interactions (positive instances). Benchmarking showed that the optimized SVM and KNN predictors had better precision scores than the others.
The precision of the SVM was 0.80 and that of the KNN classifier was 0.85 (Table 4 ). We hence used both for the prediction task and took the intersection of their predictions as the final result.

Predicted antiviral-HPV protein interaction pairs. After selecting the high-precision predictors, we applied them to predict antiviral-HPV protein interactions. We took the intersection of the two predictors' results as the final prediction, i.e., we considered an antiviral-HPV protein association pair to have a potential interaction only if both predictors predicted the pair to be positive (interacting). Of the 864 antiviral-HPV protein association pairs, most were predicted to be negative, i.e., the antiviral drug does not interact with the HPV protein. Only a small portion, 57 pairs, were predicted to interact. Prediction results are summarized in Table 5 in HPV protein-oriented form, together with brief descriptions of example antiviral drugs, their protein targets, and relevant therapeutic indications. Full prediction results can be found in Supplementary Table S3 . Here we take Docosanol as an example. Docosanol was predicted by our high-precision predictors to interact with the HPV-16 protein E7. Docosanol is a U.S. FDA-approved antiviral drug targeting envelope glycoproteins GP350 and GP340 of Epstein-Barr virus (EBV, also known as human herpesvirus 4 or HHV-4), and it is used to treat fever blisters, etc. Interestingly, a literature survey found a recently published clinical case report claiming that combined use of Docosanol, curcumin, and other drugs successfully treated HPV infection and vaginal warts in a patient 46 . This could be evidence supporting our prediction about Docosanol and HPV protein.
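The consensus rule used for the final predictions (a pair counts as interacting only if both the KNN and the SVM predictor label it positive) can be illustrated as below. The hyperparameters are those reported in the Methods; everything else is an assumption for the sketch, including the synthetic features from `make_classification`, the 204/96 split sizes, and the variable names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data; the real features are fingerprints plus descriptors.
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=10, random_state=0
)
X_train, y_train = X[:204], y[:204]   # plays the role of the labeled pairs
X_new = X[204:]                       # plays the role of unlabeled HPV pairs

# Hyperparameters as reported in the Methods; other settings left at defaults.
knn = KNeighborsClassifier(n_neighbors=65, weights="distance", leaf_size=60)
svm = SVC(kernel="poly", degree=3, gamma=0.001, C=0.0002)
knn.fit(X_train, y_train)
svm.fit(X_train, y_train)

# Boolean masks of pairs each predictor labels as positive (interacting).
knn_pos = knn.predict(X_new) == 1
svm_pos = svm.predict(X_new) == 1

# Intersection: keep a pair only when both predictors agree it is positive.
consensus = knn_pos & svm_pos
consensus_idx = np.flatnonzero(consensus)
print(f"KNN: {knn_pos.sum()}, SVM: {svm_pos.sum()}, both: {consensus.sum()}")
```

Intersecting two predictors can only shrink the set of predicted positives, which trades recall for confidence. On synthetic data such as this, the intersection may well be small or empty, so the counts printed here carry no meaning beyond the illustration.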
While our results remain to be validated by in vitro assays, in this work we constructed machine learning models and predicted antiviral-HPV protein interactions so as to identify potential drug candidates targeting HPV proteins. The high-risk types of HPV are not limited to HPV-16; there are others, such as HPV-18. The research framework of this study can be applied to predict potential drug candidates not only for the proteomes of other HPV subtypes, but also for other pathogenic and infectious microbes.

Reviewing the current study, we found several points that could help us better prepare for further work. First, although we tried our best to collect more antiviral drugs, their limited availability left us with a small dataset for machine learning. This could be one reason why we did not obtain predictors with high F1-measure, accuracy, or AUC scores. Compared with antivirals, other types of drugs, e.g., cancer drugs or antibiotics, are more numerous. Thus, in future studies, we would consider using other types of drugs for repositioning. Second, the final predictors selected did not have high F1-measure, accuracy, or AUC, and because current machine learning processes are black boxes that are difficult to interpret, the causes are hard to diagnose. Instead, considering the tradeoff between precision and recall, we chose the intersection of predictions from two high-precision predictors in order to obtain higher-confidence antiviral-HPV protein interactions. In future work, we would try to apply state-of-the-art explainable machine learning methods. In that case, we may be able to find the reasons for low performance and obtain guidance for optimizing models and building more powerful predictors.
Another interesting idea for extending the current work is to predict synergistic antiviral drug combinations for HPV infection pharmacotherapy. As with "cocktail" treatment for HIV infections and synergistic treatment for fungal infections, synergistic drug combinations may work for treating HPV infections, too. A good example to draw on is NLLSS 47 , a well-performing algorithm for predicting synergistic antifungal drug combinations. It is likewise a computational, machine learning-based work, so its research ideas and methodology could serve as references.

Motivated by the needs of anti-HPV drug discovery, and drawing on drug repositioning and computational analytics, we designed this research project and constructed machine learning models to predict possible antiviral-HPV protein interactions so as to identify potential pharmacotherapy for HPV infection. As a result, we optimized the predictors and identified 57 antiviral-HPV protein interaction pairs. To the best of our knowledge, this is the first HPV-oriented computational antiviral repositioning study; no similar study has been found so far. Our work therefore provides insights to virologists, medicinal chemists, gynecologists, clinical microbiologists, and others interested in the treatment of HPV infections. Also, drug candidates pre-selected by computational screening could have a lower probability of ineffectiveness than those that did not go through computational analysis. This could save resources, and the antivirals we identified could be good candidates for further in vitro and in vivo tests. In this way, this work contributes to drug development for HPV infections. Moreover, our predicted antiviral-HPV protein interaction pairs also offer insights for fundamental biomedical research on drug-protein interactions and molecular interaction mechanisms.
Last but not least, the research framework of this study, i.e., machine learning-based compound-protein interaction prediction, could also be applied to preliminary drug repositioning or drug discovery for diseases or infectious microbial pathogens lacking effective pharmacotherapy, e.g., Norovirus and COVID-19.

Data of this study were included in the supplementary materials.

HPV-associated diseases
Human papillomavirus and overall survival after progression of oropharyngeal squamous cell carcinoma
Epidemiologic classification of human papillomavirus types associated with cervical cancer
Human Papillomavirus (HPV) and cervical cancer
Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries
Quadrivalent human papillomavirus vaccine: recommendations of the Advisory Committee on Immunization Practices
Human papillomavirus E6 and E7 oncoproteins as risk factors for tumorigenesis
The E7 oncoprotein is translated from spliced E6*I transcripts in high-risk human papillomavirus type 16- or type 18-positive cervical cancer cell lines via translation reinitiation
Molecular modeling simulation studies reveal new potential inhibitors against HPV E6 protein
Structural basis for hijacking of cellular LxxLL motifs by papillomavirus E6 oncoproteins
Repositioning HIV protease inhibitors as cancer therapeutics
Using HIV drugs to target human papilloma virus
Specific HIV protease inhibitors inhibit the ability of HPV16 E6 to degrade p53 and selectively kill E6-dependent cervical carcinoma cells in vitro
Raman chemical mapping reveals site of action of HIV protease inhibitors in HPV16 E6 expressing cervical carcinoma cells
A metabolomics investigation into the effects of HIV protease inhibitors on HPV16 E6 expressing cervical carcinoma cells
KNN model-based approach in classification
What is a support vector machine
Machine learning for drug-target interaction prediction
Recent advances in the machine learning-based drug-target interaction prediction
Machine learning approaches for protein-protein interaction hot spot prediction: Progress and comparative assessment
Classification and prediction of protein-protein interaction interface using machine learning algorithm
DrugBank 5.0: A major update to the DrugBank database for
A comparison of results reporting for new drug approval trials
PubChem 2019 update: Improved access to chemical data
UniProt: A hub for protein information
Update of TTD: Therapeutic target database
ChemmineR: A compound mining framework for R
R: A language for data analysis and graphics
Protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences
Random forest classifier for remote sensing classification
Logistic regression diagnostics
Soft margins for AdaBoost
A 'non-parametric' version of the naive Bayes classifier
RKNNMDA: Ranking-based KNN for MiRNA-Disease Association prediction
Adaptive boosting-based computational model for predicting potential miRNA-disease associations
Novel human miRNA-disease association inference based on random forest
Python for scientific computing
Scikit-learn: Machine learning in Python
Pandas: A foundational Python library for data analysis and statistics
Array programming with NumPy
The NumPy array: A structure for efficient numerical computation
Bioconductor: Open software development for computational biology and bioinformatics
BioMart and bioconductor: A powerful link between biological databases and microarray data analysis
BioMart-biological queries made easy
An alternative treatment for vaginal cuff wart: A case report
NLLSS: predicting synergistic drug combinations based on semi-supervised learning

We appreciate the kind assistance of Mr. Zhu Yifan during the revision stage. We are also thankful to Mr. Yang Wei from Guangdong Zhongsheng Pharmaceutical Co., Ltd, for his kind help and advice during the revision stage.
The authors declare no competing interests.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1038/s41598-021-03000-9.

Correspondence and requests for materials should be addressed to H.-H.L. or H.X.

Reprints and permissions information is available at www.nature.com/reprints.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.