key: cord-0692425-mfhafsku authors: Toropova, Alla P.; Raškova, Maria; Raška Jr., Ivan; Toropov, Andrey A. title: The sequence of amino acids as the basis for the model of biological activity of peptides date: 2021-01-22 journal: Theor Chem Acc DOI: 10.1007/s00214-020-02707-8 sha: e9eccbac952da09248bb986607708288dbdd9dd7 doc_id: 692425 cord_uid: mfhafsku
An algorithm is suggested for building a model of the biological activity of peptides as a mathematical function of the sequence of amino acids. The general scheme is as follows. The total set of available data is distributed into the active training set, passive training set, calibration set, and validation set. The training sets (both active and passive) and the calibration set together form the system that generates the model of biological activity, in which each amino acid obtains its own correlation weight. The numerical data on the correlation weights are calculated by the Monte Carlo method using the CORAL software (http://www.insilico.eu/coral). The target function is aimed at giving the best result for the calibration set (not for the training set). The final check of the model is carried out with data on the validation set (peptides that are not visible during the creation of the model). The described computational experiments confirm that the approach can be a tool for the design of predictive models for the biological activity of peptides (expressed as pIC50).
The history of mathematical chemistry contains contributions of many outstanding scientists, such as A.T. Balaban, M. Randić, I. Gutman, N. Trinajstić, S.C. Basak, and R. Carbó-Dorca, as well as many others [1-15]. Mathematical chemistry [1] is the area of research engaged in novel applications of mathematics to chemistry, biochemistry, and biology. It concerns itself principally with the mathematical modeling of complex molecular phenomena [2].
Major areas of research in mathematical chemistry include chemical graph theory, which deals with the development of topological descriptors that find application in quantitative structure-property relationships [3, 4], as well as chemical aspects of group theory, which find applications in stereochemistry and quantum chemistry [5, 6]. Chemoinformatics is a relatively young field of the natural sciences. By analogy with "in vivo" and "in vitro," the results of chemoinformatics are denoted "in silico" [7]. Noteworthy are the contributions of Prof. R. Carbó-Dorca to the development of chemoinformatics tools applied to quantum mechanical theoretical problems, which made it possible to solve chemical problems, such as catalysis and reactivity, by simple computational schemes [8-12]. Chemoinformatics is gradually being extended to tasks in theoretical chemistry, computational chemistry, and modeling [13-15]. Applying mathematical methods to the tasks of chemistry and biochemistry can be effective [16, 17]. Peptides are important objects of chemistry, biochemistry, and medicine. Much of the interest in proteins and peptides is caused by their application in drug design [18]. The amino acid residues of an epitope-peptide substrate interact with the SARS coronavirus main protease. Hence, the affinity of epitope-peptides for class I MHC (major histocompatibility complex) molecules can be used in the development of antiviral agents, e.g., against coronaviruses [18]. A widely accepted scientific principle for understanding complex systems is "Everything should be made as simple as possible, but no simpler" [19]. Perhaps the approach used here cannot be adequately evaluated using this principle, since a simpler method is not possible, or at least a simpler approach has not yet been described in the literature [20-23].
It would not be correct to call the approach "simpler" than "simple," since the approach gives quite good models [20-23]. The model of biological activity of peptides described here is based on sequences of amino acids represented by 1-letter codes (Table 1). The aim of the present study is to assess whether the CORAL software can provide a satisfactory model for the bioactivity of peptides. The representation of a peptide as a sequence of amino acids resembles the well-known simplified molecular input-line entry system (SMILES) [24]. Consequently, the CORAL software (www.insilico.eu/coral), which is oriented toward building quantitative structure-activity relationships (QSARs) from the SMILES representation, can be a tool for building a predictive model of the activity of peptides as a function of the sequence of 1-letter codes of the corresponding amino acids [25]. In effect, the sequences of amino acids represented by 1-letter codes are quasi-SMILES [20, 21]. The numerical data on the biological activity of epitope-peptides with class I MHC (major histocompatibility complex) molecules were taken from the literature [18]. The endpoint is expressed as the negative logarithm of the half-maximal inhibitory concentration IC50 (pIC50). Table 1 contains the sequences of amino acids that represent the epitope-peptides studied here. The available epitope-peptides were randomly distributed into the active training set (25%), passive training set (25%), calibration set (25%), and validation set (25%). Each of these sets has a defined task. The task of the active training set is to build up the correlation weights for the optimal descriptor. The task of the passive training set is to check whether the current correlation weights (and the optimal descriptor) are satisfactory for peptides that are not involved in the calculation of the correlation weights. The task of the calibration set is to detect the moment when overtraining begins.
The task of the validation set is the final estimation of the predictive potential of the model. The CORAL software provides models that are linear one-variable correlations obtained by the Monte Carlo method (http://www.insilico.eu/coral). The generalized representation of the model for the biological activity of peptides is the following one-variable correlation:
$$\mathrm{pIC}_{50} = C_0 + C_1 \times \mathrm{DCW}(T, N)$$
Here $\mathrm{DCW}(T, N)$ is the descriptor of correlation weights (DCW), $C_0$ and $C_1$ are regression coefficients, and $T$ and $N$ are parameters of the Monte Carlo optimization discussed below. The descriptor applied to the QSAR analysis is calculated as follows:
$$\mathrm{DCW}(T, N) = \sum_{k} \mathrm{CW}(A_k)$$
Here $A_k$ is the 1-letter code of an amino acid in the sequence and $\mathrm{CW}(A_k)$ is the correlation weight of $A_k$. $T$ is an integer threshold that defines two classes of amino acids: (i) rare and (ii) non-rare. If the frequency of $A_k$ in the active training set is less than $T$, then $A_k$ is rare and $\mathrm{CW}(A_k) = 0$ (i.e., $A_k$ is removed from the modeling process). Thus, the model is based solely on the correlation weights of amino acids that are non-rare in the active training set. $N$ is the number of iterations of the Monte Carlo optimization. $T = T^*$ and $N = N^*$ are the values that provide the best statistical quality of the model for the calibration set. The scheme of the Monte Carlo optimization is described in [23, 25]. The essence of this version of the optimization procedure is the application of the index of ideality of correlation (IIC). The models for the inhibitory activity of peptides are built here with two different target functions, TF1 and TF2. $R_{AT}$ and $R_{PT}$ are the correlation coefficients between the observed and predicted endpoints for the active training set and the passive training set, respectively. $\mathrm{IIC}_{CLB}$ is calculated with data on the calibration set, where "observed" and "calculated" are the corresponding values of pIC50. Figure 1 compares the histories of the Monte Carlo optimizations with target functions TF1 and TF2.
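The descriptor of correlation weights and the one-variable correlation described above can be illustrated with a minimal Python sketch. It is not the CORAL implementation: the function names, the reading of "frequency" as the total number of occurrences of a code in the active training set, and all numerical values (sequences, correlation weights, regression coefficients) are hypothetical and chosen only to show the arithmetic.

```python
from collections import Counter


def non_rare_codes(active_training, threshold):
    """Return the 1-letter codes that are non-rare in the active training set.

    Assumption: the frequency of a code is its total number of occurrences
    over all sequences of the active training set.
    """
    counts = Counter(code for peptide in active_training for code in peptide)
    return {code for code, n in counts.items() if n >= threshold}


def dcw(peptide, weights, allowed):
    """Sum the correlation weights along the sequence.

    Rare codes (not in `allowed`) and codes without a weight contribute 0.
    """
    return sum(weights.get(code, 0.0) for code in peptide if code in allowed)


def predict_pic50(peptide, weights, allowed, c0, c1):
    """One-variable linear model: pIC50 = C0 + C1 * DCW."""
    return c0 + c1 * dcw(peptide, weights, allowed)


# Hypothetical data, for illustration only.
active_training = ["WLE", "LEP", "WLA"]
allowed = non_rare_codes(active_training, threshold=2)  # P and A occur once, so they are rare
weights = {"W": 0.5, "L": 0.25, "E": -0.25, "P": 2.0}   # made-up correlation weights

descriptor = dcw("WLEP", weights, allowed)
prediction = predict_pic50("WLEP", weights, allowed, c0=1.0, c1=2.0)
print(descriptor, prediction)
```

In this toy example the code P is rare at the threshold of 2, so its weight does not enter the sum even though a value is stored for it. In the actual procedure the correlation weights come from the Monte Carlo optimization rather than from a hand-written dictionary.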
In Figure 1, TF2 seems preferable: with TF2 no decrease in the statistical quality for the calibration set and validation set is observed, whereas with TF1 such a decrease is observed. The domain of applicability of the CORAL model is defined according to the distribution of SMILES attributes in the active training set and calibration set, in two steps. Step 1: a statistical defect $d_k$ is defined for each SMILES attribute involved in building up the model, in terms of $P(A_k)$ and $P'(A_k)$, the probabilities of $A_k$ in the training and calibration sets, respectively, and $N(A_k)$ and $N'(A_k)$, the frequencies of $A_k$ in the training and calibration sets, respectively. Step 2: the statistical SMILES-defect $D_j$ is calculated for every substance:
$$D_j = \sum_{k=1}^{NA} d_k$$
where $NA$ is the number of non-blocked SMILES attributes in the SMILES. A substance falls in the domain of applicability if
$$D_j < 2 \times \bar{D} \qquad (11)$$
where $\bar{D}$ is the average of the statistical SMILES-defect for the training set. The same operation can be carried out with sequences of 1-letter codes of amino acids if $A_k$ is taken to be the 1-letter code of the corresponding amino acid instead of a SMILES attribute. Models were obtained with target functions TF1 and TF2 for three random splits into the training set (the union of the active training set, passive training set, and calibration set) and the validation set (Eqs. 12-17). Table 2 contains the statistical characteristics of the models calculated with Eqs. 12-17. One can see that the predictive potential of the models calculated using the IIC is better.
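The two-step definition of the domain of applicability can be sketched in Python for sequences of 1-letter codes. Several points are assumptions and not statements from the text: the defect is taken as d_k = |P(A_k) - P'(A_k)| / (N(A_k) + N'(A_k)), a form used in other CORAL publications; the frequency of a code is counted as its total number of occurrences in a set; the probability is that count divided by the total number of codes in the set; and codes absent from the training set are treated as blocked and skipped. The function names and the toy sequences are hypothetical.

```python
from collections import Counter


def code_stats(peptides):
    """Return (frequencies, probabilities) of the 1-letter codes in a set."""
    counts = Counter(code for peptide in peptides for code in peptide)
    total = sum(counts.values())
    probs = {code: n / total for code, n in counts.items()}
    return counts, probs


def statistical_defects(training, calibration):
    """Step 1 (assumed form): d_k = |P - P'| / (N + N') for each training code."""
    n_trn, p_trn = code_stats(training)
    n_clb, p_clb = code_stats(calibration)
    return {
        code: abs(p_trn[code] - p_clb.get(code, 0.0))
        / (n_trn[code] + n_clb.get(code, 0))
        for code in n_trn
    }


def peptide_defect(peptide, defects):
    """Step 2: D_j is the sum of d_k over the non-blocked codes of the peptide."""
    return sum(defects[code] for code in peptide if code in defects)


def in_domain(peptide, defects, training):
    """Applicability condition: D_j < 2 * (average D over the training set)."""
    mean_defect = sum(peptide_defect(p, defects) for p in training) / len(training)
    return peptide_defect(peptide, defects) < 2 * mean_defect


# Hypothetical toy sets, for illustration only.
training = ["AAB", "AB"]
calibration = ["AB"]
defects = statistical_defects(training, calibration)

print(defects)
print(in_domain("AB", defects, training))
print(in_domain("BBBBB", defects, training))
```

With these toy sets the short sequence "AB" satisfies the condition, whereas "BBBBB" accumulates a defect above twice the training average and falls outside the domain. How unseen codes should be penalized is left open in this sketch.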
From the numerical data on the correlation weights of different amino acids obtained in several runs of the optimization, one can detect amino acids of two classes: (1) amino acids with stable positive correlation weights, which are promoters of an increase in pIC50; and (2) amino acids with stable negative correlation weights, which are promoters of a decrease in pIC50. Thus, the approach gives a statistical mechanistic interpretation of the models. Table 3 contains a collection of amino acids that are promoters of an increase or decrease in pIC50. It should be noted that the prevalence of the corresponding amino acids should also be considered. Table 4 contains the experimental pIC50 values and those calculated with Eq. 17. Table 5 contains the numerical data on the correlation weights of amino acids used to calculate the model with Eq. 17. Table 6 contains an example of the calculation of DCW(1, 15) for the epitope-peptide "WLEPGPVTA" together with the calculation of the corresponding pIC50 using Eq. 17. Thus, the described approach can be a tool for building models of pIC50 for epitope-peptides.
The described approach gives a robust model for the biological activity of peptides (Table 4). The results are quite acceptable for three random splits into the training set and validation set. The approach obeys the OECD principles [26]. Once again, the possibility of building predictive models for endpoints related to complex molecular systems (peptides) is confirmed [5-8]. In addition, the described study confirms once more that applying the IIC improves the predictive potential of models [20, 25].
Table 4 Experimental pIC50 and pIC50 calculated with Eq. 17 for the model obtained with split 3 (the best model): "+" is the indicator for the active training set; "-" is the indicator for the passive training set; "#" is the indicator for the calibration set; and "*" is the indicator for the validation set
Mathematical chemistry, a new discipline
Mathematical concepts in organic chemistry
Mathematical chemistry
Advances in mathematical chemistry and applications
Basic overview of chemoinformatics
Application of molecular quantum similarity to QSAR
About the prediction of molecular properties using the fundamental quantum QSPR (QQSPR) equation
Modeling the structure-property relationships of nanoneedles: a journey toward nanomedicine
Construction of coherent nano quantitative structure-properties relationships (nano-QSPR) models and catastrophe theory
Six questions on topology in theoretical chemistry
Toward a universal quantum QSPR operator
Divagations about the periodic table: BOOLEAN hypercube and quantum similarity connections
Hypercubes defined on n-ary sets, the Erdös-Faber-Lovász conjecture on graph coloring, and the description spaces of polypeptides and RNA
Solutions to the quantum QSPR problem in molecular spaces
Geometric and electronic similarities between transition structures for electrocyclizations and sigmatropic hydrogen shifts
Peptide reagent design based on physical and chemical properties of amino acid residues
Simulating complex systems by cellular automata. Understanding Complex Systems
"Ideal correlations" for biological activity of peptides
Prediction of antimicrobial activity of large pool of peptides using quasi-SMILES
Utilization of the Monte Carlo method to build up QSAR models for hemolysis and cytotoxicity of antimicrobial peptides
QSAR modeling of endpoints for peptides which is based on representation of the molecular structure by a sequence of amino acids
SMILES, a chemical language and information system: 1: Introduction to methodology and encoding rules
Index of Ideality of correlation: new possibilities to validate QSAR: a case study
CORAL software: prediction of carcinogenicity of drugs by means of the Monte Carlo method
Acknowledgements AAT and APT are grateful for the contribution of the project LIFE-VERMEER contract (LIFE16 ENV/ES/000167) for financial support.
Conflict of interest The authors confirm they have no conflict of interest.