key: cord-0840927-zlgghxco
authors: Lim, Hocheol; Jeon, Hyeon-Nae; Lim, Seungcheol; Jang, Yuil; Kim, Taehee; Cho, Hyein; Pan, Jae-Gu; No, Kyoung Tai
title: Evaluation of protein descriptors in computer-aided rational protein engineering tasks and its application in property prediction in SARS-CoV-2 spike glycoprotein
date: 2022-01-31
journal: Comput Struct Biotechnol J
DOI: 10.1016/j.csbj.2022.01.027
sha: d581879af00d6bc956c733e8bcf9ebdfe96bab37
doc_id: 840927
cord_uid: zlgghxco

The importance of protein engineering in the research and development of biopharmaceuticals and biomaterials has increased. Machine learning in computer-aided protein engineering can markedly reduce the experimental effort in identifying optimal sequences that satisfy the desired properties from a large number of possible protein sequences. To develop general protein descriptors for computer-aided protein engineering tasks, we devised new protein descriptors, one sequence-based descriptor (PCgrades), and three structure-based descriptors (PCspairs, 3D-SPIEs_5.4 Å, and 3D-SPIEs_8Å). While the PCgrades and PCspairs include general and statistical information in physicochemical properties in single and pairwise amino acids respectively, the 3D-SPIEs include specific and quantum–mechanical information with parameterized quantum mechanical calculations (FMO2-DFTB3/D/PCM). To evaluate the protein descriptors, we made prediction models with the new descriptors and previously developed descriptors for diverse protein datasets including protein expression and binding affinity change in SARS-CoV-2 spike glycoprotein. As a result, the newly devised descriptors showed a good performance in diverse datasets, in which the PCspairs showed the best performance ([Formula: see text] for protein expression and [Formula: see text] for binding affinity). As a result, the newly devised descriptors showed a good performance in diverse datasets, in which the PCspairs showed the best performance. Similar approaches with those descriptors would be promising and useful if the prediction models are trained with sufficient quantitative experimental data from high-throughput assays for industrial enzymes or protein drugs.

Protein engineering is a progressive process to design or develop proteins with properties valuable for scientific, industrial, or medical applications [1] . Protein amino acid sequences determine protein properties and functions, including expression level and catalytic activity [1] . Protein engineering involves a premeditated process to investigate the relationship between the amino acid sequence and protein function and to identify amino acid sequences with improved function. As a powerful protein engi-neering technique, directed evolution has been successful in producing enzymes and binding proteins by emulating the natural evolution process in the laboratory [2] . Directed evolution leads to an accumulation of beneficial mutations through an iterative process that is coupled to sequence diversification methods and selection strategies [1, 2] . However, this approach is timeconsuming and resource-intensive due to multiple highthroughput iterations [2] . In directed evolution, it is difficult to learn lessons from failure because valuable information regarding unimproved sequences is discarded [1] .

Machine learning can utilize the information of unimproved sequences to differentiate protein properties. Prediction models with machine learning can speed up the evolution and optimization of protein properties by evaluating and selecting new variants to screen [1] . The models can guide the design of future experi- mental rounds to escape local optima by learning efficiently and synthesizing the most promising variants [1, 3] . Even in the cases where the underlying biophysical mechanisms are not well explained, machine learning models can be applied and be powerfully predictive [1] . Machine learning models in protein engineering require descriptors, also known as features, that are suitable for obtaining information in proteins [3] .

Many protein descriptors have been developed and applied to diverse protein engineering tasks [3, 4] . The descriptors are generally based on mutation indicators, protein sequences, and protein structures. Mutation indicator (MutInd) is a binary vector of elements 0 or 1, indicating whether specific mutations in sequence exist or not [3] . The MutInd can directly utilize experimental values, but it is too simple to explain all protein functions and has a limitation on extrapolation for new single mutations.

Attempts to utilize protein sequences have been made because they may possess valuable information regarding protein expression, binding affinity, stability, and other properties. In a bottomup approach, many single amino acid property descriptors were developed and most of them are listed in the AA-index database [5] . In a top-down approach, attempts were made to study representations from raw sequences. Some examples of this approach include the word embedding model 'doc2vec' and natural language processing-based continuous vector representation 'BioVec' [6, 7] . Recently, a statistical unified representation (UniRep) was developed to summarize protein sequences not equal in length to fixed-length vectors via recurrent neural network methods [4, 8, 9] . UniRep may connote fundamental features of protein sequences because clustering with UniRep can distinguish biophysical single amino acid properties, secondary structural helixsheet properties, and evolutional proteome properties [4] . However, protein sequence descriptors have a sequence-function gap because protein sequences must be translated to the accurately folded protein structures for protein functions.

Protein structures form the basis for the structure-activity relationship (SAR) and the analysis of SAR enables a prediction for protein activity of new mutated proteins in protein engineering. Protein structures can be generally made use of topological and biophysical descriptors. As a popular topological descriptor in protein structure, the distance matrix between amino acids in a protein structure can be used to extract the spatial arrangement in a protein structure. A structure-based descriptor derived from amino acid pairwise contact potentials (sPairs) used the distance matrix to filter the amino acid pairs and employed the AA-index amino acid pairwise contact potential descriptors [3] . Recently, graph convolutional networks (GCN) are developed to generalize convolutional operations on the graph-like molecular representation of protein structures [10] . DeepFRI is a GCN-based model for predicting protein functions by leveraging sequence features extracted from a protein language model and protein structures [11] .

In biophysical descriptors, it is important to simulate molecular phenomena and accurately describe their physical, chemical, and biological properties. Energy calculations in molecular simulation have been used to predict the properties of biomolecules especially in the field of computer-aided drug discovery. Although quantum mechanics (QM) based molecular orbital calculation methods provide an accurate description of molecular phenomena, QM methods require huge computational costs and could not be easily applied to large biological systems [12] . Fragment molecular orbitals (FMO) method was developed for QM calculations of large molecular systems in 1999 [13] . The FMO method dramatically reduced computational cost without compromising the accuracy compared to the traditional QM method, which has been successfully applied to the protein-ligand interactions and protein-protein interactions [12, [14] [15] [16] [17] . Compared to traditional QM methods, the FMO method provides inter-fragment interaction energies, the map of which contains secondary structural and stability information in protein structure [18] . A 3-dimensional scattered pair interaction energies (3D-SPIEs) method extracts significant pair interactions in the map, which can be used to find a hot spot region in the protein-protein interface of the spike glycoprotein from severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [12, 14] .

SARS-CoV-2 has been categorized as a human pathogen that caused the global pandemic of coronavirus disease that began in December 2019 [19, 20] . The receptor binding domain (RBD) of the spike glycoprotein in SARS-CoV-2 binds to human angiotensinconverting enzyme 2 (hACE2), allowing the viral membrane of SARS-CoV-2 to fuse with the host cell membrane [21] . Analysis of the quantitative deep mutational scanning data obtained using yeast-surface display methods showed how RBD amino acid mutations affect the binding affinity with hACE2 and protein expression of RBD [22] . These experimental validations for single amino acid variants of RBD are valuable data for assessing whether the viral mutation is likely to be deleterious. The ongoing evolution of SARS-CoV-2 variants has provided critical insights for preparing for and preventing future outbreaks [23] . Accurately predicting the effects of amino acid changes on the ability of RBD, such as protein expression and binding affinity, can help assess the implications for public health in the ongoing evolution of SARS-CoV-2.

In this work, we devised one sequence-based (PCgrades) and three structure-based protein descriptors (PCspairs, 3D-SPIEs_5.4 Å, and 3D-SPIEs_8Å). And then we applied the descriptors to make prediction models for seven datasets. To evaluate the protein descriptors, we compared the performance of the prediction models trained with the newly devised descriptors and the known three protein descriptors (PCscores, sPairs, and UniRep fusion). The PCscores and sPairs were the top-ranked descriptors in the comparison of protein descriptors in diverse datasets by Xu et al [3] . The UniRep fusion is a newly devised descriptor with natural language processing methods and has the potential for diverse protein engineering tasks [4] . Because the PCscores, sPairs, and UniRep fusion can be applied to general property prediction in diverse datasets, we made baselines with the PCscores, sPairs, and UniRep fusion to evaluate the newly devised general descriptors in this work.

The seven datasets in this work are summarized in Table 1 . Reference protein sequences in all datasets were obtained from Uni-Prot [24] . In the absorption wavelength shift dataset, Gloeobacter violaceus bacteriorhodopsin (GR) was used (UniProt ID: Q7NP59) and the dataset was from Engqvist et al [25] . In the enantiomeric selectivity dataset, Aspergillus niger epoxide hydrolase (ANEH) was used (UniProt ID: Q9UR30) and the dataset was from Gumulya et al and Reetz et al [26] [27] [28] . In the enantiomeric excess dataset, Rhodothermus marinus nitric oxide dioxygenase (RmaNOD) was used (UniProt ID: D0MGT2) and the dataset was from Wu et al and Wittmann et al [29, 30] . In binding affinity and protein expression data sets, the spike glycoprotein of SARS-CoV-2 was used (UniProt ID: P0DTC2) and the datasets were from Starr et al [22] . In all datasets, only the substitution mutations were selected. The insertion-deletion mutations were removed.

Single amino acid property descriptors infer various physicochemical and biochemical properties of single amino acids [5] .

AA-index is a large database of numerical indices of single amino acids and pairs [5] . The zScales and VHSE are based on principal component analysis to represent the scalar values of amino acids. While the zScales has 5 rows and it is based on the amino acid position information (size, hydrophobicity, charge, and so on.) [31] , the VHSE has 50 rows and is vectors of hydrophobic, steric, and electronic properties [32] . The PCscores and PCgrades were used as a representative of protein single amino acid property descriptors (Fig. 1) . The min-max scaling was introduced in principal component analysis. While the PCscores is the first 11 principal components of the 533 sets of single amino acid descriptors in the AA-index by Xu et al [3] , we used the 553 sets in the AA-index to make the PCscores in this work. The PCgrades is firstly introduced in this work and is the first 11 principal components of the 606 sets of single amino acid descriptors in the AA-index, VHSE, and zScales [31, 32] . The explained variance ratios of principal components in the PCscores and PCgrades are shown in Fig. S1 , in which the first 11 principal components of PCscores accounted for about 91.3% of variances and the 11 components of PCgrades accounted for about 91.5% of variances.

UniRep is based on a 1900-unit multiplicative long-/short-term memory recurrent neural network (mLSTM/RNN) model and was trained with approximately 24 million protein sequences (Uni-Ref50) by Alley et al [4] . We used globally pre-trained weights with UniRef50 and calculated all descriptors with the performant reimplementation of UniRep in Jax [33] . The mLSTM/RNN-1900-unit model in UniRep provided all three representations of protein sequence: average hidden state, final hidden state, and the last internal cell state of the single 1900-dimensional layer. Alley et al devised the UniRep fusion by concatenating the three representations and they made prediction models with the UniRep fusion [4] . We also concatenated the three representations to obtain the UniRep fusion in this work (Fig. S2 ).

All experimental protein structures were collected from the Protein Data Bank (PDB) [34] . In the GR dataset, we used chain A of a wild-type GR (PDB ID: 6NWD), the resolution of which is 2.00 Å [35] . In the ANEH dataset, we used chain A of a wild-type ANEH (PDB ID: 1QO7), the resolution of which is 1.80 Å [36] . In the RmaNOD dataset, we used chain A of a wild-type RmaNOD (6WK3), the resolution of which is 2.45 Å [30] . In the SARS-CoV-2 expression dataset, we used an unbound form of RBD in chain A of SARS-CoV-2 glycoprotein (PDB ID: 6ZGE), the resolution of which is 2.60 Å [37] . In the SARS-CoV-2 binding affinity dataset, we used a complex form of RBD in chain A of SARS-CoV-2 glycoprotein with chain B of hACE2 (PDB ID: 7KMS), the resolution of which is 3.64 Å [38] . All the missing side chains of the protein structures were filled with Prime implemented in the Schrödinger program [39] . Hydrogen atoms were added to the protein structures at pH 7.0 and their positions were optimized with the PROPKA implemented in the Schrödinger program [40] . The restrained energy minimization was performed with OPLS3 in the Schrödinger program within 0.3 Å root-mean-squared deviation [41] . Each mutant structure was generated with residue scanning, in which the mutated side-chain rotamers were searched for all mobile residues with Prime implemented in the Schrödinger program [42] . The residue scanning method is to predict the structures of residues in mutants with homology modeling, which adjusts the sidechain rotamers for repacking and minimizes the side-chain atoms. In this step, the backbone minimization of the mutated residues was not performed. Rather, the predicted mutant structures were re-prepared with the same protocol, and all hydrogen atoms are removed and re-added at pH 7.0. In generating all structurebased descriptors, one data point shares one protein structure in seven datasets.

AA-index database includes amino acid pairwise contact potentials for statistical analysis of protein sequences and protein structures [5] . The generation workflow for the sPairs and PCspairs is shown in Fig. 1 . The sPairs is a structure-based descriptor employing the AA-index amino acid pairwise contact potential, which used statistical contact potential derived from 25 X-ray protein structures (TANS760101) [3] . While only a single 3D protein structure was used to make sPairs in Xu et al [3] , we used each mutant structure to make sPairs for each mutant in this work. The PCspairs is firstly introduced in this work, and is the first principal component of the 135 sets of amino acid pairwise contact potentials (Fig. 1) . The min-max scaling was introduced in principal component analysis. The explained variance ratios of the first principal components in each amino acid are shown in Fig. S3 , in which each first principal component accounted for more than 50% of the variance in 135 sets. The 135 sets of amino acid potentials include the substitution matrices, contact potentials in X-ray structures, the transfer energy of amino acids from water to the protein environment, and potentials in the protein-protein interfacial regions [5] . The PCspairs can include effective information not only on contact potentials but also water-mediated and protein-protein interactions. In paired positions of two residues within 8 Å in the structure, the values were derived from the contact potential but otherwise were set to zero. The distance between two residues was measured with a single-linkage distance.

2.6. Quantum mechanical energy descriptors (3D-SPIEs_5.4 Å and 3D-SPIEs_8Å)

All FMO calculations were performed with the version of Feb 14, 2018 GAMESS [43] , and with FMO2-DFTB3/D/PCM level. They include the two-body FMO method (FMO2), a self-consistentcharge density-functional tight-binding method derived via a third-order expansion (DFTB3) with the 3OB parameter set [44, 45] , the UFF-type dispersion correction (D) [46, 47] , and implicit polarizable continuum model (PCM) [44] . The two-body FMO calculation consists of four steps with fragmentation, fragment self-consistent field calculation, fragment-pair self-consistent field calculation, and total property evaluation [12] . Firstly, all input files were prepared in compliance with the hybrid orbital projection (HOP) scheme fragmentation [48] . In the HOP scheme, each residue was defined as one fragment, and two cysteine residues forming the disulfide bridge were defined as one fragment. In the GR, the retinal and its covalently bound lysine were defined as one fragment. In RmaNOD, we removed the heme group and the heme-coordinated residues were terminated with hydrogen atoms because the 3OB parameter does not support the 'Fe' atom type. Secondly and thirdly, all the molecular orbitals in fragment and fragment-pair are optimized by self-consistent field theory in the whole electrostatic field from the other fragments. The difference between the second and third steps is just the size of the fragment, where the fragment pair is the combination of two fragments. Fourthly, all results from the second and third steps are pieced together to generate the whole picture of the system. In this step, FMO provides the pair interaction energies between two frag-ments. All 3D-SPIEs results were generated with a similar protocol in previous studies (Fig. 1) [12, 14] . In this work, we did not apply the energy cut-off criteria and used all values within the specific distance (5.4 Å and 8.0 Å). The distance between two fragments was measured with a single-linkage distance.

Predictions on test sets were evaluated using R 2 , RMSE, MAE, and Spearman's rank correlation. R 2 is a square of a measure of linear correlation between predicted and observed values. RMSE is a root-mean-square-error and a measure of the difference between the predicted and observed values. MAE is a mean absolute error and a measure of the absolute errors between the predicted and observed values. Spearman's rank correlation is a nonparametric measure of rank correlation. The Scikit-learn package was used to calculate R 2 , RMSE, and MAE [49] , and the SciPy package in Python was used to calculate Spearman's rank correlation coefficients (SCC) [50] .

For all supervised tasks in this study, we prepared a split with 80% training and 20% test sets in Python using the Scikit-learn package with a fixed random seed [49] . In training, we used 10-fold cross-validation with GridSearchCV in the Scikit-learn package [49] . To improve machine learning, we removed the columns with all same values in each descriptor feature and introduced a minmax scaling by fitting the scaler with training data and applying the scaler on training and test data, respectively.

Prediction models were constructed using the random forest (RF) regression model and the extreme gradient boosting (XGB) model. The RF regression model is an ensemble method based on many classifying decision trees, and it uses averaging to improve stability and accuracy by reducing variance and avoiding overfitting problems [3] . Decision trees use a flowchart-like tree structure and observe features in descriptors to provide a useful continuous output. The XGB is an ensemble learning method based on gradient tree boosting, which builds each tree sequentially and each tree fits the residuals of predictions of all previous trees [3] .

The hyperparameter tuning procedure is summarized in Table S1 . Some hyperparameters in RF were incorporated for tuning the models; the number of decision trees (n_estimators) and the number of features to consider when looking for the best split (max_features) [49] . Some hyperparameters in XGB were incorporated for tuning the models; the number of the gradient boosted trees (n_estimators), the maximum tree depth for based learners (max_depth), the boosting learning rate (learning_rate), the minimum loss reduction required to make a further partition on a leaf node of the tree (gamma), the minimum sum of instance weight needed in a child (min_child_weight), and the subsample ratio of the training instance (subsample) [51] . The 10-fold crossvalidation was used for hyperparameter tuning of all machine learning algorithms. Grid-search was performed to find optimal values under each set of descriptors. The final model was selected with the best performance of R 2 in the validation set of crossvalidation. The optimal hyperparameter values are summarized in Table S2 .

Predicting protein properties is important in computer-aided rational protein engineering tasks. To improve prediction performance, we devised new protein descriptors, one sequence-based descriptor (PCgrades) and three structure-based descriptors (PCspairs, 3D-SPIEs_5.4 Å, and 3D-SPIEs_8Å). The generation workflow of protein descriptors is shown in Fig. 1 . To evaluate the new protein descriptors, we collected seven datasets and made prediction models from the combination of two machine learning models (RF and XGB) and seven protein descriptors (PCscores, PCgrades, sPairs, PCspairs, UniRep fusion, 3D-SPIEs_5.4 Å, and 3D-SPIEs_8Å).

All data sets in this study are summarized in Table 1 and illustrated in Fig. 2 . Hyperparameter tuning and 10-fold crossvalidation of all machine learning algorithms were performed with grid-search. The performance metrics of mean R 2 in training sets and 10-fold cross-validation test sets are summarized in Table S3 and Table 2 . The performance metrics of R 2 , RMSE, MAE, and Spearman's rank correlation in test sets are summarized in Table 3 and  Tables S4-S6 . The correlation plots from the test set predictions of the best model in each dataset are shown in Fig. 3 .

GR is a valuable engineering target protein in light-harvesting to capture photons of solar light in the bioenergy production and bio-sensing industry [25] . The maximum absorption wavelength levels of various mutants of GR were obtained from the study by Engqvist et al [25] . In the GR dataset, we made prediction models separately for wavelength (GR-wave) and wavelength shift from wild-type (GR-shift).

In GR-wave, the prediction model from XGB and 3D-SPIEs_8Å showed the best performance in the test set prediction (R 2 ¼ 0:947). The best model was trained with optimal hyperparameter (learning_rate = 0.01, max_depth = 10, and n_estimators = 1000). In the test set of the best model, the RMSE is 7.948, the MAE is 6.609, and the SCC is 0.950. The second-best model is the RF/ PCgrades and XGB/3D-SPIEs_5.4 Å, the performance of which in the test set is R 2 ¼ 0:934. The best prediction model by Xu et al showed R 2 ¼ 0:934 with the VHSE and multilayer perceptron method [3] .

In GF-shift, the prediction model from XGB and PCspairs showed the best performance in test prediction (R 2 ¼ 0:950). The best model was trained with optimal hyperparameter (learning_r ate = 0.05, max_depth = 15, and n_estimators = 500). In the test set of the best model, the RMSE is 10.554, the MAE is 8.972, and the SCC is 0.968. The second-best model is the RF/PCgrades, the performance of which in the test set is R 2 ¼ 0:934.

Enantiomeric selectivity of an enzyme is the ability of an enzyme to selectively distinguish one enantiomer from its counter isomer in an enzymatic reaction. This property plays a critical role in fine chemical production and bioindustry. The enantiomeric selectivity levels of various ANEH mutants were obtained from the studies by Gumulya et al and Reetz et al [26] [27] [28] . ANEH has hydrolytic kinetic activity with glycidyl phenyl ether, and wildtype ANEH has selectivity in favor of the (S)-glycidyl phenyl ether (DDG z ¼ À0:85). In the ANEH dataset, we made prediction models separately for e-value (ANEH-evalue) and dd G z (ANEH-ddG).

In ANEH-evalue, the prediction model from RF and PCspairs showed the best performance in the test set prediction (R 2 ¼ 0:859). The best model was trained with optimal hyperparameter (max_features = 'sqrt' and n_estimators = 500). In the test set of the best model, the RMSE is 12.862, the MAE is 8.628, and the SCC is 0.956. The second-best model is the RF/PCgrades, the performance of which in the test set is R 2 ¼ 0:848. The best prediction model by Xu et al showed R 2 ¼ 0:754 with sPairs and elastic-net regularized generalized linear model method [3] .

In ANEH-ddG, the prediction model from RF and PCgrades showed the best performance in the test set prediction (R 2 ¼ 0:938). The best model was trained with optimal hyperparameter (max_features = 'log2 0 and n_estimators = 1000). In the test set of the best model, the RMSE is 0.143, the MAE is 0.108, and the SCC is 0.935. The second-best model is the RF/PCscores, the performance of which in the test set is R 2 ¼ 0:935.

Enantiomeric excess is a measure of the purity of one enantiomer in a sample with mixed enantiomers. The enantiomeric excess levels of various mutants of RmaNOD were obtained from the study by Arnold et al [29, 30] . RmaNOD catalyzes carbonsilicon bond formation between ethyl 2-diazopropanoate and phenyldimethyl silane and produces two possible product enantiomers of carbine Si-H insertion reaction [29] . In the RmaNOD dataset, we used enantiomeric excess as an objective variable (RmaNOD-ee).

In RmaNOD-ee, the prediction model from RF and PCscores showed the best performance in the test set prediction (R 2 ¼ 0:723). The best model was trained with optimal hyperparameter (max_features = 'sqrt' and n_estimators = 1500). In the test set of the best model, the RMSE is 0.194, the MAE is 0.133, and the SCC is 0.838. The second-best model is the RF/PCspairs, the performance of which in the test set is R 2 ¼ 0:718. The best prediction model by Xu et al showed R 2 ¼ 0:288 with PCscores and XGB method [3] .

Protein expression is affected by protein stability and solubility, which are the primary common causes of protein production failure. The mean protein expression levels of various mutants of RBD were obtained from the study by Starr et al [22] . In the SARS-CoV-2 protein expression dataset, we used mean protein expression levels as an objective variable (SARS2-expr).

In SARS2-expr, the prediction model from RF and PCspairs showed the best performance in the test set prediction (R 2 ¼ 0:783). The best model was trained with optimal hyperparameter (max_features = 'log2 0 and n_estimators = 1500). In the test set of the best model, the RMSE is 0.490, the MAE is 0.355, and the SCC is 0.891. The second-best model is the XGB/PCspairs, the performance of which in the test set is R 2 ¼ 0:743. The general protein expression prediction model showed the prediction performance (R 2 ) between 0.504 and 0.698 [52] .

Binding affinity is a measure of the strength of the interaction between a protein and a ligand. The mean binding affinity levels of various mutants of RBD with hACE2 were obtained from the study by Starr et al [22] . In the SARS-CoV-2 binding affinity dataset, we used mean binding affinity levels as an objective variable (SARS2-bind).

In SARS2-bind, the prediction model from RF and PCspairs showed the best performance in test prediction (R 2 ¼ 0:711). The best model was trained with optimal hyperparameter (max_features = 'sqrt' and n_estimators = 1500). In the test set prediction of the best model, the RMSE is 0.735, the MAE is 0.412, and the SCC is 0.873. The second-best model is the XGB/PCspairs, the performance of which in the test is R 2 ¼ 0:702. The general binding affinity prediction models have the prediction performance (Pearson's linear correlation) between 0.61 and 0.76 [53] [54] [55] [56] , which can be converted to the prediction performance (R 2 ) between 0.3721 and 0.5776. The specific binding affinity prediction models for SARS-CoV-2 and hACE2 have the performance (Pearson's linear correlation) of 0.73 and 0.82, which can be converted to the prediction performance (R 2 ) of 0.5329 and 0.6724 [57, 58] .

To compare evaluation metrics, the performance of 98 final models (the 98 combinations from seven protein descriptors, seven datasets, and two machine learning models) was ranked according to the median value in increasing order for RMSE and decreasing order for R-squared in the test prediction (Fig. 4) . The ranking results of RMSE and R-squared are almost identical. The PCspairs won both in RMSE and R-squared ranking, followed by PCgrades, PCscores, sPairs, 3D-SPIEs_8Å, 3D-SPIEs_5.4 Å, and UniRep fusion. In the model ranking, the combination of XGB and PCspairs won in all combinations, followed by RF/PCspairs, RF/PCgrades, RF/ PCscores, and XGB/PCgrades. 

Given the successful application of machine learning methods in directed protein evolution, machine learning methods have been broadly applied in protein engineering [3, 29] . The broader applications in protein engineering require sufficient data quantity, quality, and protein descriptors. Generating high quantity and quality data from high-throughput assays improves machine-learningguided protein engineering in principle. But, most cases of protein engineering tasks have a data shortage in the desired properties of target proteins. For example, phage display technologies are superb and powerful for therapeutic and industrial projects but require tedious optimization and time-consuming test cycles. Therefore, many good protein descriptors are inevitably required in supervised machine learning with small data sets from the feedback between machine learning prediction models and experimental assays.

Protein descriptors are usually based on mutation indicators, protein sequences, and protein structures. As valuable information in databases including UniProt [24] and PDB [34] has increased, the necessity of utilizing protein sequences and unlabeled structures has increased. For example, the UniRep was developed from unla-beled protein sequences and distinguished physicochemical, secondary structural, and evolutionary information [4] . The UniRep is the mLSTM/RNN based on the statistical descriptor to extract fundamental features in protein sequences from a large unlabeled protein sequence database (UniRef50). Although the UniRep fusion underperformed compared to other protein descriptors in this work, the UniRep fusion has the potential in comparing protein sequences not equal in length. Although mutation indicators and protein sequence-based descriptors are powerful and effective tools for constructing prediction models to achieve diverse desired properties in protein engineering, protein structural descriptors are also promising because of the close relationship between structure and activity.

In this work, we devised one sequence-based protein descriptor (PCgrades) and three structural protein descriptors (PCspairs, 3D-SPIEs_5.4 Å, and 3D-SPIEs_8Å). To evaluate the newly devised protein descriptors, we made prediction models with the newly devised descriptors and the previously developed descriptors (PCscores, sPairs, and UniRep fusion). In the seven datasets, the PCspairs generally showed a better performance than other protein descriptors. The PCgrades and 3D-SPIEs showed the best performance in the ANEH-ddG and GR-wave datasets, respectively. The PCgrades has more information on single amino acid descriptors than the PCscores, and PCspairs has more information on amino acid pairwise potential than the sPairs. Although the newly devised protein descriptors showed a good performance in the seven datasets, it still has a scope to improve the model performance for future studies. The combination of sequence and structure-based descriptors would be promising because they include valuable information from a different angle. For example, while sequencebased descriptors are easy to compare evolutional information, structure-based descriptors are easy to consider biophysical environments.

The 3D-SPIEs are based on quantum mechanical free energy calculations, and they can be improved with appropriate simulation systems, higher calculation levels, and more accurate homology models. The 3D-SPIEs only outperformed in GR-wave, because quantum mechanical simulations require more rigorous and detailed simulation systems. In GR-wave, we included a retinal molecule in FMO calculation, which improved the descriptor quality. Because molecular simulations mainly depend on appropriate simulation systems, it is important to construct specific simulation systems for the specific biological phenomena. Moreover, the molecular simulations also mainly depend on protein structures, so it is important to predict mutant protein structures accurately. As the homology modeling methods have been dramatically improved with deep learning methods [59, 60] , structure-based descriptors would be more accurate and powerful. New protein structure-based descriptors from rational protein analysis in the pharmaceutical industry would be promising in the future.

The newly devised protein descriptors (PCgrades, PCspairs, and 3D-SPIEs) contain different information in proteins. The PCgrades include the physicochemical property information of single amino acids in protein sequences and the PCspairs include the pairwise statistical potential information between two residues in protein structures. The two descriptors effectively compressed the information with principal component analysis and can be trained for the general and statistical properties of proteins. On the other hand, the 3D-SPIEs contain specific and mechanical information in protein structures. The 3D-SPIEs utilized quantum mechanical methods to quantify the interaction energies between two residues. Because the three descriptors can include different information in protein sequences and structures, they can be trained in response to diverse information from proteins. Therefore, similar approaches with those descriptors would be promising and useful for industrial enzymes and protein drugs.

In addition to the protein descriptors, various combinations with state-of-the-art machine learning algorithms and optimization of model architectures in deep learning would also improve the model performance [3] . Taken together, it has permitted the development of more accurate and powerful prediction models, which in turn would enable the computational exploration of enormous sequence space and suggestions for better variants in therapeutic or industrial research and development. In this work, we developed the prediction models for seven diverse datasets. The prediction models for the GR, ANEH, and RmaNOD would be applied to find optimal sequences satisfying the desired properties from many possible variants. Our prediction models in the three datasets outperformed the top-ranked models by Xu et al [3] . On the other hand, the prediction models for the protein expression and binding affinity of SARS-CoV-2 would be used to predict host adaption of SARS-CoV-2 variants with higher protein expression and binding affinity in the ongoing evolution of SARS-CoV-2. Our prediction models in the two datasets outperformed the generalpurpose prediction models in protein expression [52] and binding affinity [53] [54] [55] [56] and outperformed the specific prediction models for binding affinity in SARS-CoV-2 [57, 58] .

Protein engineering is a progressive process to find proteins with valuable properties in a tremendous possible protein sequence space. Machine-learning-guided protein engineering can speed up the identification of variants with optimal properties and has been expanded the predictions for diverse properties in many proteins. In this work, we developed one protein sequencebased and three structure-based descriptors and applied them to diverse protein engineering tasks. Similar approaches with those descriptors would be promising and useful if the prediction models are trained with sufficient quantitative experimental data from high-throughput assays for industrial enzymes or protein drugs. 

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Machine-learning-guided directed evolution for protein engineering

From directed evolution to computational enzyme engineering-a review

Deep dive into machine learning models for protein engineering

Unified rational protein engineering with sequence-based deep representation learning

AAindex: amino acid index database, progress report

International conference on machine learning

Continuous distributed representation of biological sequences for deep proteomics and genomics

Evaluating eUniRep and other protein feature representations for in silico directed evolution

Low-N protein engineering with data-efficient deep learning

Semi-supervised classification with graph convolutional networks

Structure-based protein function prediction using graph convolutional networks

Investigation of protein-protein interactions and hot spot region between PD-1 and PD-L1 by fragment molecular orbital method

Fragment molecular orbital method: an approximate computational method for large molecules

Hot spot profiles of SARS-CoV-2 and human ACE2 receptor protein protein interaction obtained by density functional tight binding fragment molecular orbital method

Investigation of hot spot region in XIAP inhibitor binding site by fragment molecular orbital method

Exploring chemistry with the fragment molecular orbital method

Electron-correlated fragment-molecular-orbital calculations for biomolecular and nano systems

Visualization analysis of inter-fragment interaction energies of CRP-cAMP-DNA complex based on the fragment molecular orbital method

A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster

Clinical features of patients infected with 2019 novel coronavirus in Wuhan

Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation

Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding

On the origin and evolution of SARS-CoV-2

UniProt: a hub for protein information

Directed evolution of Gloeobacter violaceus rhodopsin spectral properties

Many pathways in laboratory evolution can lead to improved enzymes: how to escape from local minima

Directed evolution of an enantioselective epoxide hydrolase: uncovering the source of enantioselectivity at each evolutionary stage

Constructing and analyzing the fitness landscape of an experimental evolutionary process

Machine learning-assisted directed protein evolution with combinatorial libraries

Diversity-oriented enzymatic synthesis of cyclopropane building blocks

New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids

A new set of amino acid descriptors and its application in peptide QSARs

Reimplementing Unirep in JAX

The protein data bank

X-ray crystallographic structure and oligomerization of Gloeobacter rhodopsin

Structure of Aspergillus niger epoxide hydrolase at 1.8 Å resolution: implications for the structure and function of the mammalian microsomal class of epoxide hydrolases

SARS-CoV-2 and bat RaTG13 spike glycoprotein structures inform on virus evolution and furin-cleavage effects

Cryo-EM structures of SARS-CoV-2 spike without and with ACE2 reveal a pH-dependent switch to mediate endosomal positioning of receptorbinding domains

On the role of the crystal environment in determining protein side-chain conformations

PROPKA3: consistent treatment of internal and surface residues in empirical p K a predictions

OPLS3: a force field providing broad coverage of drug-like small molecules and proteins

Applying physicsbased scoring to calculate free energies of binding for single amino acid mutations in protein-protein complexes

GAMESS As a Free Quantum-Mechanical Platform for Drug Research

The fragment molecular orbital method combined with density-functional tight-binding and the polarizable continuum model

Parameterization of DFTB3/3OB for sulfur and phosphorus for chemical and biological applications

An efficient a posteriori treatment for dispersion interaction in density-functional-based tight binding

UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations

Fragment molecular orbital method: application to polypeptides

Scikit-learn: Machine learning in Python. the

SciPy 1.0: fundamental algorithms for scientific computing in Python

XGBoost With Python: Gradient Boosted Trees with XGBoost and Scikit-Learn

Improving protein expression prediction using extra features and ensemble averaging

SSIPe: accurately estimating protein-protein binding affinity change upon mutations using evolutionary profiles in combination with an optimized physical energy function

SAAMBE-SEQ: a sequence-based method for predicting mutation effect on protein-protein binding affinity

MutaBind2: predicting the impacts of single and multiple mutations on protein-protein interactions

A topology-based network tree for the prediction of protein-protein binding affinity changes following mutation

Computational prediction of the effect of amino acid changes on the binding affinity between SARS-CoV-2 spike RBD and human ACE2

Rapid assessment of binding affinity of SARS-COV-2 spike protein to the human angiotensinconverting enzyme 2 receptor and to neutralizing biomolecules based on computer simulations

Highly accurate protein structure prediction with AlphaFold

Improved protein structure prediction using predicted interresidue orientations

This research was financially supported by the Ministry of Trade, Industry, and Energy (MOTIE), Korea, under the ''Infrastructure Support Program for Industry Innovation" (reference number P0014714) supervised by the Korea Institute for Advancement of Technology (KIAT).

Supplementary data to this article can be found online at https://doi.org/10.1016/j.csbj.2022.01.027.