key: cord-0056117-050nb388
authors: Amin, Arqam; Awais, Muhammad; Sahai, Shalini; Hussain, Waqar; Rasool, Nouman
title: iDRP-PseAAC: Identification of DNA Replication Proteins Using General PseAAC and Position Dependent Features
date: 2021-02-08
journal: Int J Pept Res Ther
DOI: 10.1007/s10989-021-10170-7
sha: 20ca7428523b43b770b271d1b9a18ea3f25b6b8a
doc_id: 56117
cord_uid: 050nb388

DNA replication is an essential process in all living organisms, specifically eukaryotes. The emergence of DNA replication was a significant evolutionary transition at the beginning of life. DNA replication proteins are the proteins that support the process of replication, and they are also reported to be important in drug design and discovery. DNA replication proteins therefore play a very important role in the human body; however, to study their mechanism, they must first be identified. Experimental identification is time-consuming, costly, and laborious. To cope with this issue, a computational methodology is required for the prediction of these proteins; however, no prior method exists. This study describes the construction of a novel prediction model for this purpose. The prediction model is based on an artificial neural network and is trained by integrating position relative features and sequence statistical moments into PseAAC. The highest overall accuracy achieved through tenfold cross-validation and jackknife testing was 96.22% and 98.56%, respectively. The experimental results demonstrate that the proposed predictor surpasses existing models and can serve as a time- and cost-effective strategy for designing novel drugs against contemporary bacterial infections.

DNA replication is a fundamental process in eukaryotes.
The emergence of DNA replication is thought to have been a significant evolutionary transition at the beginning of life. By replicating their DNA content, organisms can pass genetic information on to future generations (Fragkos et al. 2015; Kurat et al. 2017; Vaz et al. 2016). Mutations arising during replication allow populations to evolve and adapt. Because DNA replication is central to such essential processes, understanding how the replication mechanism developed is important for understanding the evolution of life (Wang et al. 2004). DNA replication is the biological process that produces two identical copies of DNA from an original DNA molecule, and it occurs in all living organisms. Cell division requires DNA replication to be carried out (Beattie et al. 2017; Yeeles et al. 2017). Double-helical DNA consists of two complementary strands, which are separated during replication. Each strand of the original DNA molecule then serves as a template for the synthesis of its counterpart, a process known as semiconservative replication. As a result, each new double helix is composed of one original DNA strand and one newly synthesized strand. Cellular proofreading and error-checking mechanisms ensure near-perfect fidelity of DNA replication (Fragkos et al. 2015; Kurat et al. 2017). In cells, DNA replication begins at specific sites in the genome known as origins of replication. Unwinding of the DNA at the origin by the enzyme helicase, together with the synthesis of new strands, produces replication forks that grow in both directions from the origin. Many proteins bind to the replication fork to initiate and maintain DNA synthesis. Most importantly, DNA polymerase synthesizes the new strands by adding nucleotides complementary to each template strand. DNA replication occurs during the synthesis (S) phase of the cell cycle (Aze et al. 2016).
In the area of vaccine development, DNA replication proteins are of great interest because they are immunologically active and can induce an immune response (Hamzeh-Mivehroud et al. 2013). DNA replication proteins are also reported to be promising targets for antimicrobial agents (Eijk et al. 2017). DNA replication proteins therefore play a very important role in the human body. Transcription termination factor is one important example of a DNA replication protein (Fig. 1). However, to study their mechanism, these proteins must first be identified. Experimental identification is time-consuming, costly, and laborious. To cope with this issue, a computational methodology is required for the prediction of these proteins (Jiang et al. 2016; Li et al. 2016); however, no prior method exists. Computational methodologies and bioinformatic tools have been used extensively to provide insight and valuable information at the molecular and protein levels. These approaches include structural bioinformatics tools (Chou 2004), molecular packing (Chou et al. 1988), molecular docking (Wang et al. 2000; Zheng et al. 2007; Li et al. 2007; Zhang et al. 2006), protein subcellular location prediction (Chou and Shen 2007a,b), prediction of membrane proteins and their types (Chou and Shen 2007b), prediction of the classes and subclasses of functional enzymes (Chou and Shen 2007b), prediction of protease cleavage sites and signal peptides (Chou and Shen 2007a,b; Shen and Chou 2008), and QSAR models that predict specific activities of peptides and proteins for drug design.
The bioinformatic analysis provided by these computational approaches has led to remarkable advances in the understanding of nucleic acids and proteins, their interactions, and their modes of action, which in turn has helped in the development of a wide variety of novel drugs targeting an extensive range of microbial infections. Bioinformatic analysis of proteomic data has revealed the fundamental requirements for discriminating between DNA replication proteins and non-DNA replication proteins. Numerous algorithms have been proposed over the past decades for predicting protein structure (Xiao et al. 2008), protein classification (Li and Li 2008), protein superfamily, family, and subfamily classes (Li and Li 2008; Cai et al. 2005; Zhou et al. 2007), protein subnuclear and subcellular localization (Jiang et al. 2008; Lin 2008; Lin et al. 2008), and other cellular attributes of proteins (Chou 2001a). The algorithms used to predict these protein attributes include the support vector machine (SVM) (Lin 2008; Lin et al. 2008; Chen et al. 2008), K-nearest neighbor (KNN) (Shen and Chou 2005a,b; Yan et al. 2008), and the Fisher discriminant classifier (Ding et al. 2009). The pseudo amino acid composition (PseAAC) has been demonstrated to improve the quality of protein prediction by representing a protein or peptide sequence as a discrete model without losing its sequence-order information (Li and Li 2008; Chou 2001a; Du et al. 2014). In statistical prediction, the performance of a predictor is most often examined with validation methods including the independent dataset test, self-consistency test, subsampling test, K-fold cross-validation test, and jackknife test (Shen and Chou 2008; Liu et al. 2016a,b; Butt et al. 2016, 2017). The present study was conducted to construct a novel computational predictor for identifying DNA replication proteins and discriminating them from non-DNA replication proteins.
This model is intended to provide useful information for the successful prediction of DNA replication proteins. Chou's PseAAC can integrate the main attributes of amino acid composition as well as sequence-order correlation. A sequence-based statistical predictor of this kind is developed according to the 5-step rule (Chou 2020a,b,c,d,e,f; Fang et al. 2020; Lin et al. 2020; Liu and Chou 2020; Lu and Chou 2020; Xu et al. 2020; Zhang et al. 2020): (i) construct or select an effective benchmark dataset for training and testing the predictor, (ii) formulate the biological sequence samples with an effective mathematical expression that accurately reflects their intrinsic correlation with the target to be predicted, (iii) develop a powerful algorithm to operate the prediction, (iv) perform rigorous cross-validation tests to objectively assess the anticipated accuracy of the predictor, and (v) establish a user-friendly web server for the predictor that is accessible to the public. In this section, the first three steps of the 5-step rule are addressed. The proposed methodology is illustrated in the flowchart in Fig. 2. Organizing a precise set of data for training and testing is the first priority in creating a reliable predictor. A flawed or imprecise benchmark dataset results in unreliable predictor training, which leads to false verification and validation. That is why the collection of an exhaustive and non-redundant dataset is of prime importance.
In this study, the computational model was constructed in phases. In the first phase, protein sequences were retrieved from a well-known online protein database, UniProt, to form a comprehensive benchmark dataset. In the second phase, a numerical feature vector comprising the pertinent features was extracted from each sequence. In the next phase, a neural network was trained on the extracted features until convergence. For the trained predictor, the feature vector serves as the input, and the output specifies whether the given sequence is a DNA replication protein or a non-DNA replication protein. Finally, various verification tests were applied to the predictor, followed by validation on several test datasets to establish the accuracy and precision of the trained model, as illustrated in Fig. 3. In total, 5301 DNA replication proteins and 5250 non-DNA replication proteins were retrieved from UniProt. "DNA replication protein" was used as the keyword for retrieval of the protein sequences. Careful data collection was ensured by excluding ambiguous sequences and retrieving only sequences with specific attributes. These attributes include annotation of the sequence with terms such as "DNA replication", by similarity, probable, potential, and fragment. Moreover, only reviewed sequences were considered; non-reviewed sequences were excluded. All DNA replication proteins were retrieved from UniProt irrespective of length, to make the dataset rich and discriminative so that the trained model would generalize better. Similarly, only reviewed sequences were retrieved for the non-DNA replication proteins. Redundant proteins were removed using CD-HIT with a sequence identity cutoff of 60%.
A 60% cutoff means that all sequences showing more than 60% similarity were excluded from the dataset, to reduce redundancy in the dataset and overfitting of the model. This threshold was chosen because it is supported by various previously reported studies (Khan et al. 2019a).

The precise order in which amino acids are incorporated into the polypeptide chain defines the characteristic properties of the protein encoded by a specific gene. The structure or function of a protein can be altered by amino acid mutations or by the presence or absence of a single amino acid in the sequence. The relative placement of the constituent amino acid residues affects the behaviour of the protein far more substantially than the amino acid composition does. Even a small variation in the relative positioning of amino acids can markedly alter the characteristic biophysical properties of the protein (Liu et al. 2016a). These facts motivate the development of mathematical and computational models that extract characteristic features from the primary sequence of a protein with regard to the relative positioning of the amino acids, rather than only the composition of the protein.

Statistical moments are quantitative measures of a collection of data. Moments of different orders describe different attributes of the data: they characterize its size as well as its orientation and eccentricity. Various moments have been constructed on the basis of well-known distribution and polynomial functions. Several established moments are used to address the present problem, namely raw, central, and Hahn moments. Raw moments are used to compute the mean, variance, and asymmetry of the probability distribution; they are neither scale-invariant nor location-invariant (Butt et al. 2016; Khan et al. 2014).
Central moments provide information similar to the raw moments, but they are computed about the centroid of the data; they are therefore location-invariant with respect to the centroid while remaining scale-variant (Butt et al. 2016; Khan et al. 2014). Hahn moments are computed on the basis of Hahn polynomials and are neither scale-invariant nor location-invariant. The choice of moments depends on their sensitivity to sequence-order information, which is of prime importance. Therefore, scale-invariant moments were not used. The values computed by each of these methods describe the data differently. Moreover, differences between the moment values computed for arbitrary datasets imply differences in the features of the data source. In the present study, the bi-dimensional versions of these moments were applied by transforming the one-dimensional primary subsequence into a bi-dimensional notation. The protein sequence or subsequence R can be denoted by

$$R = (\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_k)$$

where $\alpha_i$ represents the ith amino acid residue in a primary subsequence comprising k residues. Further, let

$$m = \lceil \sqrt{k} \rceil$$

To accommodate all the amino acid residues of the protein R, a matrix R' of dimension m × m is formed:

$$R' = \begin{pmatrix} \beta_{11} & \beta_{12} & \cdots & \beta_{1m} \\ \beta_{21} & \beta_{22} & \cdots & \beta_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{m1} & \beta_{m2} & \cdots & \beta_{mm} \end{pmatrix}$$

As the bi-dimensional matrix R' is derived from the primary structure R, the sequence R can be transformed into R' using a mapping function ω:

$$R' = \omega(R)$$

where the ith element of R is placed at entry $\beta_{pq}$ of R', filling the matrix row by row so that $i = (p-1)m + q$. The contents of the 2D matrix R' are used to compute the moments up to degree 3. The raw moments are calculated as

$$ƙ_{ij} = \sum_{p=1}^{m} \sum_{q=1}^{m} p^{i} q^{j} \beta_{pq}$$

where i + j is the order of the moment. The moments were computed up to order 3 and are expressed as ƙ 00 , ƙ 01 , ƙ 10 , ƙ 11 , ƙ 12 , ƙ 21 , ƙ 30 and ƙ 03 . The centroid of the data is the point about which the data are evenly distributed in all directions in terms of the weighted average, and it can be calculated directly once the raw moments are known.
It is expressed as the point $(\bar{x}, \bar{y})$, where

$$\bar{x} = \frac{ƙ_{10}}{ƙ_{00}}, \qquad \bar{y} = \frac{ƙ_{01}}{ƙ_{00}}$$

The central moments were computed about the centroid using the following equation:

$$\eta_{ij} = \sum_{p=1}^{m} \sum_{q=1}^{m} (p - \bar{x})^{i} (q - \bar{y})^{j} \beta_{pq}$$

The square matrix R' was formed by transforming the one-dimensional notation R. This transformation is particularly useful for the Hahn moments, which require a square matrix as bi-dimensional input. Being orthogonal, discrete Hahn moments are reversible: the original data can be reconstructed using the inverse functions of the Hahn moments. It is therefore evident that the computed moments conserve the positional and compositional information of a primary sequence. The Hahn polynomial of order n is defined in terms of Pochhammer symbols, which are given by

$$(a)_k = a (a+1) \cdots (a+k-1)$$

and can be written compactly using the Gamma operator as

$$(a)_k = \frac{\Gamma(a+k)}{\Gamma(a)}$$

The raw values of the Hahn polynomials $h_n^{u,v}(r, N)$ were scaled using a weighting function $\rho(r)$ and a square norm $d_n^2$:

$$\tilde{h}_n^{u,v}(r, N) = h_n^{u,v}(r, N) \sqrt{\frac{\rho(r)}{d_n^2}}$$

Finally, for the bi-dimensional discrete data matrix, the orthogonal normalized Hahn moments were computed by the following equation:

$$H_{ij} = \sum_{q=0}^{N-1} \sum_{p=0}^{N-1} \beta_{pq}\, \tilde{h}_i^{u,v}(q, N)\, \tilde{h}_j^{u,v}(p, N)$$

For every primary sequence, bi-dimensional raw, central, and Hahn moments were computed up to third order and afterwards combined into the composite feature vector.

The sequence-order information encoded in the primary sequence of a protein, together with the relative positioning of its amino acid residues, forms the foundation of any mathematical model for predicting the attributes and characteristics of the protein. Quantifying the relative position of each amino acid within the polypeptide chain is therefore of prime importance.
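The mapping of a sequence onto a square matrix and the raw and central moments described above can be illustrated with a minimal sketch. The numeric encoding of residues (1 to 20 in alphabetical order of their one-letter codes), the zero padding of unused cells, and the function names are assumptions made for illustration only; the paper does not state which encoding was used, and Hahn moments are omitted here.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def to_square_matrix(seq):
    """Map a non-empty sequence of k residues to an m x m matrix, m = ceil(sqrt(k))."""
    m = math.ceil(math.sqrt(len(seq)))
    # Assumption: residues are encoded as 1..20 and unused cells are padded with 0.
    codes = [AMINO_ACIDS.index(res) + 1 for res in seq]
    codes += [0] * (m * m - len(codes))
    # Fill the matrix row by row.
    return [codes[row * m:(row + 1) * m] for row in range(m)]


def raw_moment(matrix, i, j):
    """Raw moment of order i + j, using 1-based row index p and column index q."""
    return sum((p ** i) * (q ** j) * value
               for p, row in enumerate(matrix, start=1)
               for q, value in enumerate(row, start=1))


def central_moment(matrix, i, j):
    """Central moment of order i + j, computed about the centroid."""
    total = raw_moment(matrix, 0, 0)
    x_bar = raw_moment(matrix, 1, 0) / total
    y_bar = raw_moment(matrix, 0, 1) / total
    return sum(((p - x_bar) ** i) * ((q - y_bar) ** j) * value
               for p, row in enumerate(matrix, start=1)
               for q, value in enumerate(row, start=1))


matrix = to_square_matrix("ACDE")
print(matrix)
print(raw_moment(matrix, 1, 0), central_moment(matrix, 1, 0))
```

A first-order central moment is zero by construction, which makes it a convenient sanity check for any implementation of these formulas.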
The position relative incidence matrix (PRIM), a matrix of 20 × 20 elements, captures comprehensive information about the relative positioning of amino acid residues within the polypeptide chain of the protein. The element i→j represents the sum of the relative positions of the jth residue with respect to the first occurrence of the ith residue. PRIM yields a large number of coefficients (400), which were reduced to 24 elements by computing the statistical moments with PRIM as the input.

The proficiency and precision of a machine learning algorithm depend on how thoroughly and carefully the most pertinent features are extracted from the data. Ambiguous patterns embedded within the data can be understood and uncovered through the self-adapting capability of a machine learning algorithm. While PRIM captures the relative positioning of amino acid residues within the polypeptide chain, the reverse position relative incidence matrix (RPRIM) was employed to uncover further hidden attributes of the primary sequence, which helps to resolve ambiguities among proteins with apparently similar sequences. Like PRIM, RPRIM is a 20 × 20 matrix that yields 400 coefficients, but it is computed on the reversed primary sequence; it was likewise reduced to 24 coefficients by computing the moments. The reverse position relative incidence matrix is given as follows:

$$R_{RPRIM} = \begin{pmatrix} R_{1\to1} & R_{1\to2} & \cdots & R_{1\to20} \\ R_{2\to1} & R_{2\to2} & \cdots & R_{2\to20} \\ \vdots & \vdots & \ddots & \vdots \\ R_{20\to1} & R_{20\to2} & \cdots & R_{20\to20} \end{pmatrix}$$

The number of times each amino acid residue occurs within the primary sequence of a protein is its frequency, and the frequency matrix (FM) was designed to measure the frequency distribution of the amino acid residues in the sequence. The frequency matrix is given as follows:

$$FM = (\tau_1, \tau_2, \tau_3, \ldots, \tau_{20})$$

where $\tau_i$ denotes the frequency of occurrence of the ith amino acid residue.
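The PRIM, RPRIM, and frequency descriptors described above can be sketched as follows. The textual definition of the PRIM element i→j leaves some details open, so this sketch adopts one possible reading: only occurrences of residue j located after the first occurrence of residue i contribute, each adding its offset from that first occurrence. This reading, the function names, and the alphabetical residue ordering are assumptions for illustration, not the authors' confirmed implementation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {res: k for k, res in enumerate(AMINO_ACIDS)}


def prim(seq):
    """20 x 20 position relative incidence matrix (one possible reading)."""
    size = len(AMINO_ACIDS)
    matrix = [[0] * size for _ in range(size)]
    # Record the first position at which each residue type occurs.
    first = {}
    for pos, res in enumerate(seq):
        first.setdefault(res, pos)
    for res_i, first_pos in first.items():
        i = INDEX[res_i]
        for pos, res_j in enumerate(seq):
            # Assumption: only residues after the first occurrence of i count.
            if pos > first_pos:
                matrix[i][INDEX[res_j]] += pos - first_pos
    return matrix


def rprim(seq):
    """Reverse PRIM: the same computation on the reversed sequence."""
    return prim(seq[::-1])


def frequency_vector(seq):
    """Number of occurrences of each of the 20 residues."""
    counts = [0] * len(AMINO_ACIDS)
    for res in seq:
        counts[INDEX[res]] += 1
    return counts


a, c = INDEX["A"], INDEX["C"]
print(prim("AAC")[a][c], rprim("AAC")[c][a], frequency_vector("AAC")[a])
```

In the paper, each 400-element matrix is subsequently reduced to 24 coefficients by computing raw, central, and Hahn moments on it; that reduction step is not shown here.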
The frequency matrix captures information about the composition of the protein, and computing it aims at extracting compositional evidence from the protein sequence. Because the frequency matrix only records how often each amino acid residue occurs, it carries no information about the positions of the residues in the protein. Hence, the accumulative absolute position incidence vector (AAPIV) was computed to capture the positions of the amino acids in addition to the composition of the protein. AAPIV consists of 20 elements, and each element is the sum of all the ordinal positions at which the corresponding amino acid occurs within the primary sequence of a protein. Let $p_1, p_2, p_3, \ldots, p_n$ denote the positions at which a specific amino acid residue $\alpha_i$ occurs. The accumulative absolute position incidence vector is then written as

$$AAPIV = (\mathrm{AAPIV}_1, \mathrm{AAPIV}_2, \ldots, \mathrm{AAPIV}_{20})$$

and, for an arbitrary ith element, AAPIV can be computed as

$$\mathrm{AAPIV}_i = \sum_{k=1}^{n} p_k$$

The reverse accumulative absolute position incidence vector (RAAPIV) was generated to extract deeper and less obvious patterns in the relative locations of amino acid residues in the protein sequence. RAAPIV is also a 20-element vector; it is generated by reversing the primary sequence and then extracting the accumulative absolute position incidence vector from the reversed sequence:

$$RAAPIV = (\mathrm{RAAPIV}_1, \mathrm{RAAPIV}_2, \ldots, \mathrm{RAAPIV}_{20})$$

Let $l_1, l_2, l_3, \ldots, l_n$ denote the ordinal positions at which the amino acid residue $\alpha_i$ occurs in the reversed sequence.
Therefore, for an arbitrary ith element, the reverse accumulative absolute position incidence vector (RAAPIV) can be computed as

$$\mathrm{RAAPIV}_i = \sum_{k=1}^{n} l_k$$

Decision problems can be addressed with a powerful technique known as a neural network, which is modelled on the human nervous system: the brain absorbs information from the environment and acts according to the situation by learning from experience. A neural network is structured on an analogous principle. During training, it receives labelled inputs, produces a prediction for each input, and accumulates knowledge from each of them. Two approaches are used to train neural networks, supervised and unsupervised. The former uses both inputs and outputs, with the network processing each input to produce the desired output, while the latter learns structure from the inputs without external supervision (Fig. 4). Once training is complete, the network is able to classify a given input with an adequate level of precision. Reducing the error is the prime objective of the learning procedure: the network adjusts its weights in each iteration to diminish the error. This eventually translates into improved learning and enhanced precision when predicting the class of an arbitrary input. An artificial neural network is a highly effective approach for building a supervised classifier that distinguishes DNA replication proteins from non-DNA replication proteins. The detail extracted from the raw data into the feature vector plays a decisive part, since a well-constructed feature vector is what allows the data to be discriminated reliably. The feature vector includes discriminating attributes such as FM, SVV, AAPIV, and RAAPIV.
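Before these descriptors enter the feature vector, the AAPIV and RAAPIV definitions given earlier can be made concrete with a short sketch. Each element is the sum of the 1-based positions at which a residue occurs, computed on the original sequence for AAPIV and on the reversed sequence for RAAPIV. The function names and the alphabetical ordering of the 20 residues are illustrative assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {res: k for k, res in enumerate(AMINO_ACIDS)}


def aapiv(seq):
    """Sum of the 1-based positions of each residue type in the sequence."""
    vector = [0] * len(AMINO_ACIDS)
    for position, res in enumerate(seq, start=1):
        vector[INDEX[res]] += position
    return vector


def raapiv(seq):
    """AAPIV computed on the reversed sequence."""
    return aapiv(seq[::-1])


forward = aapiv("AAC")
backward = raapiv("AAC")
print(forward[INDEX["A"]], forward[INDEX["C"]])    # positions 1 + 2, and 3
print(backward[INDEX["A"]], backward[INDEX["C"]])  # positions 2 + 3, and 1
```

Two sequences with identical composition but different residue order produce the same frequency vector yet different AAPIV and RAAPIV values, which is the motivation the text gives for adding these position-aware features.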
The feature vector also includes the raw, central, and Hahn moments of PRIM and RPRIM, along with those of the bi-dimensional primary structure. The resulting feature vector consists of a large number of coefficients that are used to discriminate DNA replication proteins from non-DNA replication proteins.

The fourth step of Chou's 5-step rule is discussed next. Estimating accuracy allows an objective assessment of a new prediction algorithm. To do so, it is important to decide which metrics are to be used and which test method will be used to score those metrics. Four metrics are used to measure the prediction of DNA replication proteins: the accuracy (Acc), which measures the overall prediction accuracy; the sensitivity (Sn); the specificity (Sp); and the Matthews correlation coefficient (MCC), which measures the overall stability of the predictor. The conventional definitions of these metrics are not intuitive, and many experimental researchers find them difficult to understand, especially the MCC (Chou 2001b). For this reason, Chou's symbols, as converted by Xu et al. (2013) and Chen et al. (2013), were used to express the metrics as follows:

$$
\begin{cases}
\mathrm{Sn} = 1 - \dfrac{\mathbb{N}^{+}_{-}}{\mathbb{N}^{+}}, & 0 \le \mathrm{Sn} \le 1 \\[2ex]
\mathrm{Sp} = 1 - \dfrac{\mathbb{N}^{-}_{+}}{\mathbb{N}^{-}}, & 0 \le \mathrm{Sp} \le 1 \\[2ex]
\mathrm{Acc} = 1 - \dfrac{\mathbb{N}^{+}_{-} + \mathbb{N}^{-}_{+}}{\mathbb{N}^{+} + \mathbb{N}^{-}}, & 0 \le \mathrm{Acc} \le 1 \\[2ex]
\mathrm{MCC} = \dfrac{1 - \left( \dfrac{\mathbb{N}^{+}_{-}}{\mathbb{N}^{+}} + \dfrac{\mathbb{N}^{-}_{+}}{\mathbb{N}^{-}} \right)}{\sqrt{\left( 1 + \dfrac{\mathbb{N}^{-}_{+} - \mathbb{N}^{+}_{-}}{\mathbb{N}^{+}} \right) \left( 1 + \dfrac{\mathbb{N}^{+}_{-} - \mathbb{N}^{-}_{+}}{\mathbb{N}^{-}} \right)}}, & -1 \le \mathrm{MCC} \le 1
\end{cases}
$$

where $\mathbb{N}^{+}$ is the total number of true DNA replication proteins investigated and $\mathbb{N}^{+}_{-}$ is the number of true DNA replication proteins that are incorrectly predicted as non-DNA replication proteins. Similarly, $\mathbb{N}^{-}$ is the total number of true non-DNA replication proteins investigated and $\mathbb{N}^{-}_{+}$ is the number of true non-DNA replication proteins that are incorrectly predicted as DNA replication proteins.
According to these equations, Sn = 1 when $\mathbb{N}^{+}_{-} = 0$, and Sp = 1 when $\mathbb{N}^{-}_{+} = 0$. The maximum accuracy and MCC, that is, Acc = 1 and MCC = 1, are achieved when $\mathbb{N}^{+}_{-} = \mathbb{N}^{-}_{+} = 0$, which means that not a single DNA replication protein in the positive dataset and not a single non-DNA replication protein in the negative dataset has been predicted incorrectly. If $\mathbb{N}^{+}_{-} = \mathbb{N}^{+}$ and $\mathbb{N}^{-}_{+} = \mathbb{N}^{-}$, all the DNA replication proteins in the positive dataset and all the non-DNA replication proteins in the negative dataset are predicted incorrectly, which gives MCC = −1 and Acc = 0. If $\mathbb{N}^{+}_{-} = \mathbb{N}^{+}/2$ and $\mathbb{N}^{-}_{+} = \mathbb{N}^{-}/2$, then MCC = 0 and Acc = 0.5, which is no better than a random guess as to whether a protein is a DNA replication protein or a non-DNA replication protein. In this way, Eq. (19) makes the meaning of specificity, sensitivity, overall accuracy, and stability more straightforward and intuitive, particularly for the MCC. These equations thus improve the intuitiveness and comprehensibility of the accuracy metrics, and many researchers have adopted them (Feng et al. 2019; Song et al. 2018a,b). The above set of equations is valid only for single-label systems, such as true/false classification; for multi-label systems, a different set of metrics is used, as defined in (Chou 2013).

Fig. 4 The architecture of the ANN for the proposed prediction model

The self-consistency test is a basic test for validating the efficiency and efficacy of a prediction model, in which the model is tested on the same dataset that was used to train it.
The results obtained from the self-consistency test were summarized with a confusion matrix. A confusion matrix is a well-established way to illustrate the precision of a predictor by comparing the prediction results against the actual data. A true positive (TP) is a DNA replication protein correctly identified as positive, whereas a false positive (FP) is a non-DNA replication protein that has been erroneously labelled as a DNA replication protein by the predictor. A true negative (TN) is a correctly recognized non-DNA replication protein, while a false negative (FN) is a DNA replication protein incorrectly marked as a non-DNA replication protein by the model. The values of the metrics were obtained by substituting these counts into the above equations. The self-consistency testing results for iDRP-PseAAC are shown in Table 1, while the ROC curve and confusion matrix are given in Figs. 5 and 6. Numerous approaches for validating a predictor have been described in the literature by various researchers (Butt et al. 2016). K-fold cross-validation and jackknife testing are among the most reliable methods for validating a prediction model. The data used in a typical validation process comprise training and test data. First, the training data were used to train the prediction model. After the model had been fully trained and had converged, unseen data were used to validate the accuracy of the trained predictor. The validation process is shown in Fig. 7. Researchers have employed distinct validation approaches using training or test data. For determining the performance of a predictor rigorously, jackknife testing and cross-validation are careful and thorough approaches (Liu et al. 2015, 2016a).
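The four metrics in Chou's notation can be computed directly from the confusion-matrix counts, since the number of positives predicted as negative is FN, the number of negatives predicted as positive is FP, the total number of positives is TP + FN, and the total number of negatives is TN + FP. The following minimal sketch uses hypothetical function and parameter names; returning 0.0 for the MCC when its denominator vanishes is a convention assumed here, not something stated in the paper.

```python
import math


def chou_metrics(n_pos, n_neg, pos_as_neg, neg_as_pos):
    """Sn, Sp, Acc and MCC in Chou's notation.

    n_pos, n_neg: total numbers of positive and negative samples.
    pos_as_neg:   positives wrongly predicted as negative (FN).
    neg_as_pos:   negatives wrongly predicted as positive (FP).
    """
    sn = 1 - pos_as_neg / n_pos
    sp = 1 - neg_as_pos / n_neg
    acc = 1 - (pos_as_neg + neg_as_pos) / (n_pos + n_neg)
    numerator = 1 - (pos_as_neg / n_pos + neg_as_pos / n_neg)
    denominator = math.sqrt(
        (1 + (neg_as_pos - pos_as_neg) / n_pos)
        * (1 + (pos_as_neg - neg_as_pos) / n_neg)
    )
    # Assumed convention: report 0.0 when the MCC is undefined.
    mcc = numerator / denominator if denominator else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


def from_confusion_matrix(tp, fp, tn, fn):
    """Same metrics, starting from TP, FP, TN and FN counts."""
    return chou_metrics(tp + fn, tn + fp, fn, fp)


print(chou_metrics(100, 100, 0, 0))      # perfect prediction
print(chou_metrics(100, 100, 50, 50))    # half of each class wrong
print(chou_metrics(100, 100, 100, 100))  # everything wrong
```

The three calls reproduce the boundary cases discussed in the text: all metrics equal to 1 for a perfect predictor, Acc = 0.5 with MCC = 0 when half of each class is misclassified, and Acc = 0 with MCC = −1 when every sample is misclassified.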
Cross-validation is a method for estimating the expected performance of the proposed model when no separate validation set is available. The available data were divided into K folds, where K is a constant. The partitions were disjoint; the model was tested on each partition in turn after being trained on the rest of the data, so the test was repeated K times, once for each partition. The cross-validation result was reported as the average of the accuracies obtained over all iterations. Let X be the complete set of samples, comprising DNA replication protein and non-DNA replication protein samples, given as:

$$X = \{x_1, x_2, x_3, \ldots, x_n\}$$

The dataset was split into k subsets $X_i$ of comparable size, each containing arbitrary DNA replication protein and non-DNA replication protein samples. To ensure comparable subset sizes, the samples were assigned arbitrarily, such that

$$X_i \cap X_j = \emptyset, \qquad i \ne j$$

where $X_i$ and $X_j$ represent distinct arbitrary subsets. The elements of $X_i$ were left out during a single iteration, and the predictor was trained on the rest of the data. The left-out data were then tested using the trained model to compute an accuracy rate $R_i$. The mean of the outcomes of the k iterations was calculated to obtain the overall cross-validation result, $R_o$:

$$R_o = \frac{1}{k} \sum_{i=1}^{k} R_i$$

Tenfold cross-validation was performed for the DNA replication proteins, with the data divided into test and training data. With the dataset partitioned into 10 folds, one partition was left out as test data in each iteration. The neural network was then sufficiently trained on the remaining data and simulated on the test data to determine its accuracy. The cross-validation test was performed repeatedly on ten datasets comprising DNA replication proteins and non-DNA replication proteins, and the average value was taken as the final accuracy of the predictor. The overall accuracy was estimated at 96.22%, as shown in Table 2, while the ROC curve and confusion matrix are shown in Figs. 8 and 9. Another study, proposed by Yang et al. (Yang et al.
2015), has been reported previously, which targeted the sequence-based identification of DNA replication proteins, and the model proposed there was validated through tenfold cross-validation. Therefore, a comparison with that study is also presented in Table 2.

Jackknife testing is among the most frequently used validation techniques. Other validation tests are based on arbitrarily selected or partitioned datasets, and no specific rule governs how the data are partitioned. The data can be partitioned in many ways, such that some partitions produce better results than others, so these methods may fail to produce a unique result. Jackknife testing, in contrast, always produces a unique result for a given dataset. It is an iterative technique that computes the accuracy of the model for all variations of the sample of size n − 1: in each iteration, one observation is left out of the dataset, the predictor is trained on the remaining observations, and the left-out observation is then tested. The outcomes of all iterations are averaged to produce a unique result for the dataset, which avoids the drawbacks of subsampling and independent dataset testing. Let S denote the complete sample comprising n elements, given as follows:

$$S = \{S_1, S_2, S_3, \ldots, S_n\}$$

Let $R_i$ denote the accuracy rate for the ith iteration of the jackknife test. To compute its value, the ith element is left out of the dataset, giving the set $S_i$:

$$S_i = \{S_1, S_2, S_3, \ldots, S_{i-1}, S_{i+1}, \ldots, S_n\}$$

The neural network trained with the feature vectors of the samples in $S_i$ was simulated to compute the accuracy rate, and the numbers of true positives and negatives, as well as false positives and negatives, were determined. The average of all the values of $R_i$ was computed as R′:

$$R' = \frac{1}{n} \sum_{i=1}^{n} R_i$$

where R′ is the overall average accuracy of the prediction model and n is the total number of observations.
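The K-fold and jackknife procedures described above differ only in the number of folds, since the jackknife is the special case in which every fold holds exactly one sample. The following sketch makes this explicit. The helper names and the toy one-dimensional midpoint classifier are hypothetical stand-ins chosen to keep the example self-contained; the paper itself trains a neural network on the full feature vector.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Split the indices 0..n-1 into k disjoint folds of comparable size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]


def cross_validate(features, labels, k, train, seed=0):
    """Mean accuracy over k folds; train(xs, ys) must return a predict function."""
    rates = []
    for test_idx in k_fold_indices(len(features), k, seed):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(features) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        predict = train(train_x, train_y)
        correct = sum(predict(features[i]) == labels[i] for i in test_idx)
        rates.append(correct / len(test_idx))
    return sum(rates) / len(rates)


def jackknife(features, labels, train):
    """Leave-one-out testing: k-fold cross-validation with k = n."""
    return cross_validate(features, labels, len(features), train)


def train_midpoint(xs, ys):
    """Toy stand-in classifier: threshold halfway between the class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0


features = [0.0, 0.1, 0.2, 0.3, 0.4, 1.0, 1.1, 1.2, 1.3, 1.4]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(cross_validate(features, labels, 5, train_midpoint))
print(jackknife(features, labels, train_midpoint))
```

Because the jackknife visits every possible leave-one-out split, its result does not depend on the random seed, which reflects the uniqueness property emphasized in the text; the K-fold result, by contrast, can vary with how the folds are drawn.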
The outcome of the model was computed for the left-out sample, since during each iteration one sample is excluded from the training dataset. The iteration was carried out over the entire dataset, and gathering all predicted results yielded an accuracy of 98.56%. Jackknife testing results for iDRP-PseAAC are shown in Table 3, while the ROC curve and confusion matrix are shown in Figs. 10 and 11.

$$S = \{S_1, S_2, S_3, \ldots, S_n\} \tag{28}$$

$$S_i = \{S_1, S_2, S_3, \ldots, S_{i-1}, S_{i+1}, \ldots, S_n\} \tag{29}$$

Jackknife testing is thus an iterative methodology that calculates the accuracy of the predictor for all variations of the sample of size n − 1.

This section describes the last step of Chou's 5-steps rule (Liu et al. 2016c; Zhang et al. 2016; Chou 2011), which is the development of a user-friendly web server for the convenience of users. As shown in a number of recent publications, user-friendly and freely available web servers represent the future direction for developing more useful prediction methods and computational analysis tools. They have substantially increased the impact of computational science on medical science (Chou 2015), driving medicinal science into an unprecedented revolution (Cheng et al. 2017). Accordingly, efforts were made to develop a web server for the prediction method described in this paper; for the time being, its standalone implementation is accessible at idrp-server.herokuapp.com/, which was built using Django 2.0.7. For the neural network, the sci-unit neural network package was used with Theano 1.0.0 as the backend.

A step-by-step guide to using the web server is given below. First, open the web server at idrp-server.herokuapp.com/; a top header menu with four tabs is displayed, i.e. Home, Prediction, About and Supplementary Data. The Home tab gives an overview of DNA replication proteins. The Prediction tab hosts the DNA replication protein prediction portal.
The About tab shows the reference to the relevant paper and related information. The Supplementary Data tab provides the supporting data for download. To predict proteins, click the Prediction tab, enter the protein sequence in the empty text box and click the Submit button to obtain the results. The results appear in the next window after some time, which depends on the length of the input. The relevant paper and the algorithm used to develop this server can be viewed in the About tab, together with the citation of the paper. The relevant or supplementary dataset can be downloaded for future experiments from the Supplementary Data tab.

Discriminating DNA replication proteins from non-DNA replication proteins is a crucial prerequisite for studying the mechanism of these proteins. The present study was conducted to classify given polypeptide sequences as DNA replication proteins or non-DNA replication proteins, following the fundamental steps of Chou's 5-steps rule. Features were calculated by incorporating statistical and position relative features into the amphiphilic pseudo amino acid composition. The results of the proposed predictor were validated by self-consistency testing, jackknife testing and cross-validation. The overall accuracy of the predictor was assessed using standard metrics, which indicated high accuracy for the model. The experimental results demonstrate that the proposed predictor is an accurate and precise approach that can support further research and offers a time- and cost-effective strategy for identifying DNA replication proteins.

Author Contributions All authors contributed equally.

Funding No funding was received for this study.
iPhosH-PseAAC: identify phosphohistidine sites in proteins by blending statistical moments and position relative features according to the Chou's 5-step rule and general pseudo amino acid composition
Centromeric DNA replication reconstitution reveals DNA loops and ATR checkpoint suppression
Frequent exchange of the DNA polymerase during bacterial chromosome replication
A treatise to computational approaches towards prediction of membrane protein and its subtypes
Predicting enzyme family classes by hybridizing gene product composition and pseudo-amino acid composition
Prediction of linear B-cell epitopes using amino acid pair antigenicity scale
Prediction of mucin-type O-glycosylation sites in mammalian proteins using the composition of k-spaced amino acid pairs
iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition
pLoc-mPlant: predict subcellular localization of multi-location plant proteins by incorporating the optimal GO information into general PseAAC
Prediction of protein cellular attributes using pseudo-amino acid composition
Using subsite coupling to predict signal peptides
Structural bioinformatics and its impact to biomedical science
Some remarks on protein attribute prediction and pseudo amino acid composition
Some remarks on predicting multi-label attributes in molecular biosystems
Impacts of bioinformatics to medicinal chemistry
The most important ethical concerns in science
The problem of Elsevier series journals online submission by using artificial intelligence
Other mountain stones can attack jade: the 5-steps rule
Using similarity software to evaluate scientific paper quality is a big mistake
Proposing 5-steps rule is a notable milestone for studying molecular biology
The development of Gordon life science institute: its driving force and accomplishments
MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM
Signal-CF: a subsite-coupled and window-fusing approach for predicting signal peptides
Energetics of the structure of the four-alpha-helix bundle in proteins
UbiSitePred: A novel method for improving the accuracy of ubiquitination sites prediction by using LASSO to select the optimal Chou's pseudo components
Using Chou's pseudo amino acid composition to predict subcellular localization of apoptosis proteins: an approach with immune genetic algorithm-based ensemble classifier
Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition
PseAAC-General: fast building various modes of general form of Chou's pseudo-amino acid composition for large-scale protein datasets
Reveal the molecular principle of coronavirus disease 2019 (COVID-19)
iDNA6mA-PseKNC: identifying DNA N6-methyladenosine sites by incorporating nucleotide physicochemical properties into PseKNC
DNA replication origin activation in space and time
Agaritine and its derivatives are potential inhibitors against HIV proteases
Phage display as a technology delivering on the promise of peptide drug discovery
SPrenylC-PseAAC: a sequence-based model developed via Chou's 5-steps rule and general PseAAC for identifying S-prenylation sites in proteins
SPalmitoylC-PseAAC: a sequence-based model developed via Chou's 5-steps rule and general PseAAC for identifying S-palmitoylation sites in proteins
iMethylK-PseAAC: improving accuracy of lysine methylation sites identification by incorporating statistical moments and position relative features into general PseAAC via Chou's 5-steps rule
iPPI-PseAAC (CGR): identify protein-protein interactions by incorporating chaos game representation into PseAAC
Using Chou's pseudo amino acid composition based on approximate entropy and an ensemble of AdaBoost classifiers to predict protein subnuclear location
BP neural network could help improve pre-miRNA identification in various species
An efficient algorithm for recognition of human actions
pSSbond-PseAAC: prediction of disulfide bonding sites by integration of PseAAC and statistical moments
iProtease-PseAAC (2L): a two-layer predictor for identifying proteases and their types using Chou's 5-step-rule and general PseAAC
Chromatin controls DNA replication origin selection, lagging-strand synthesis, and replication fork rates
Predicting protein subcellular location using Chou's pseudo amino acid composition and improved hybrid approach
Computational approach to drug design for oxazolidinones as antibacterial agents
Protein folds prediction with hierarchical structured SVM
The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition
Predicting subcellular localization of mycobacterial proteins by using Chou's pseudo amino acid composition
Use Chou's 5-steps rule to predict remote homology proteins by merging grey incidence analysis and domain similarity analysis
pLoc_Deep-mGneg: predict subcellular localization of gram negative bacterial proteins by deep learning
Identification of real microRNA precursors with a pseudo structure status composition approach
iMiRNA-PseDPC: microRNA precursor identification with a pseudo distance-pair composition approach
Identification of DNA-binding proteins by combining auto-cross covariance transformation and ensemble learning
iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework
iATC_Deep-mISF: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals by deep learning
pLoc_Deep-mAnimal: a novel deep CNN-BLSTM network to predict subcellular localization of animal proteins
pLoc_Deep-mPlant: predict subcellular localization of plant proteins by deep learning
Using optimized evidence-theoretic K-nearest neighbor classifier and pseudo-amino acid composition to predict membrane protein types
Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition
HIVcleave: a web-server for predicting human immunodeficiency virus protease cleavage sites in proteins
iProt-Sub: a comprehensive package for accurately mapping and predicting protease-specific substrates and cleavage sites
PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework
DNA replication proteins as potential targets for antimicrobials in drug-resistant bacterial pathogens
Metalloprotease SPRTN/DVC1 orchestrates replication-coupled DNA-protein crosslink repair
Holins: the protein clocks of bacteriophage infections
Role of DNA replication proteins in double-strand break-induced recombination in Saccharomyces cerevisiae
Using grey dynamic modeling and pseudo amino acid composition to predict protein structural classes
iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition
The topological entropy mechanism of coronavirus disease 2019
Discrimination of outer membrane proteins using a K-nearest neighbor method
A machine learning approach to identify DNA replication proteins from sequence-derived features
How the eukaryotic replisome achieves rapid and efficient DNA replication
Molecular modeling studies of peptide drug candidates against SARS
iOri-Human: identify human origin of replication by incorporating dinucleotide physicochemical properties into pseudo nucleotide composition
The chemical mechanism of pestilences or coronavirus disease 2019 (COVID-19)
Screening for new agonists against Alzheimer's disease
Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations