key: cord-0897873-55lkfivh
authors: Cho, Myeongji; Son, Hyeon Seok
title: Prediction of cross-species infection propensities of viruses with receptor similarity
date: 2019-04-23
journal: Infect Genet Evol
DOI: 10.1016/j.meegid.2019.04.016
sha: 5ec27b5a22980a4b6873be42dcb1458c97879a13
doc_id: 897873
cord_uid: 55lkfivh

Studies of host factors that affect susceptibility to viral infections have led to the possibility of determining the risk of emerging infections in potential host organisms. In this study, we constructed a computational framework to estimate the probability of virus transmission between potential hosts based on the hypothesis that the major barrier to virus infection is differences in cell-receptor sequences among species. Information regarding host susceptibility to virus infection was collected to classify the cross-species infection propensity between hosts. Evolutionary divergence matrices and a sequence similarity scoring program were used to determine the distance and similarity of receptor sequences. The discriminant analysis was validated with cross-validation methods. The results showed that the primary structure of the receptor protein influences host susceptibility to cross-species viral infections. Pair-wise distance, relative distance, and sequence similarity showed the best accuracy in identifying the susceptible group. Based on the results of the discriminant analysis, we constructed ViCIPR (http://lcbb3.snu.ac.kr/ViCIPR/home.jsp), a server-based tool to enable users to easily extract the cross-species infection propensities of specific viruses using a simple two-step procedure. Our sequence-based approach suggests that it may be possible to identify virus transmission between hosts without requiring complex structural analysis. Due to a lack of available data, this method is limited to viruses whose receptor use has been determined. However, the significant accuracy of predictive variables that positively and negatively influence virus transmission suggests that this approach could be improved with further analysis of receptor sequences.

Over the past 50 years, new infections caused by pathogens such as human immunodeficiency virus (HIV), Ebola virus, severe acute respiratory syndrome coronavirus (SARS-CoV), H5N1 avian influenza virus, antibiotic-resistant S. aureus, and antibiotic-resistant Mycobacterium tuberculosis have emerged worldwide (Snowden, 2008; Jones et al., 2008). The majority of these new infections are caused by infectious agents crossing species barriers and completing their life cycle with expanded host ranges, a process that is influenced by diverse parameters (Domingo, 2010; Woolhouse and Gowtage-Sequeria, 2005). Given the necessity and importance of early detection and response to potential threats, experts from various fields have attempted to address this problem, although it has yet to be solved. For decades, emergent viruses have been studied using both classical methods of virology and genome-based technologies. Recently, computational approaches, including bioinformatics, have also been used, such as genome sequencing, construction of databases and analysis systems, and development of models and software to predict emerging infections (Rappuoli, 2004; Haagmans et al., 2009; Pepin et al., 2010; Morse et al., 2012; Woolhouse et al., 2012). There have also been studies investigating genomic patterns of receptor proteins or receptor-binding domains that influence host susceptibility to viral infections, which have enabled the discrimination of infection propensities based on the primary structure of receptors without requiring complicated structural analysis (Rogers et al., 1983; Matrosovich et al., 2000; Graham and Baric, 2010; Bae and Son, 2011; Imai and Kawaoka, 2012). The cell-surface proteins used as receptors by viruses have been identified (Schneider-Schaulies, 2000; Dales, 1973; Grove and Marsh, 2011).
However, determining the biological parameters that influence virus-receptor interactions is problematic because the virus is different from the natural ligands or substrates of the receptors (Dimitrov, 2004; Skehel and Wiley, 2000). In addition, the details of the entry, penetration, and uncoating of some viruses are unclear. This is gradually being overcome by the increasing availability of virus genetic information and the ongoing advancements in computerized data-processing and sequence-analysis methods (Dales, 1973; Dimitrov, 2004). In the present study, the distance and similarity of receptor proteins that function as a species barrier to viral infection were calculated and used to predict the propensity for cross-species viral infections between host species. With the aim of estimating the capacity of viruses to affect the emergence of new viral diseases, we developed a computational framework to predict the cross-species infection propensities of viruses using receptor sequences. We postulated that a virus from a reservoir might be able to adapt to a new host only if the similarity between receptor proteins present in the potential host species is high enough for them to cross the species barrier. Evolutionary divergence matrices were used to calculate distance scores of target sequence pairs considering amino acid substitutions, and overall sequence similarities were computed using Java programming.

Receptor sequences of 98 species of source organisms were collected from the NCBI protein database. In total, 277 amino acid sequences for 18 types of receptor proteins were collected by choosing non-partial sequences. Accordingly, 18 receptor data sets consisting of orthologous amino acid sequences derived from different species were constructed (Table 1). Multiple sequence alignment was performed on the collected sequences of each receptor data set with MUSCLE using the default parameters (Edgar, 2004). Using the alignment results, an original data set was constructed to train, validate, and test a classifier. For each receptor data set, all possible sequence pairs were generated to construct the original data set, which is the source of variables for the classification model. Sequence variation in receptor proteins is relevant to conformational differences in virus-host interaction interfaces and to protein expression, which may strongly influence viral infection propensity. With the aim of confirming the effect of receptor similarity on host susceptibility to cross-species infection, evolutionary distances and sequence similarities were calculated and used to classify groups of infection propensity from the 50 sequence pairs of the training set. A test data set was constructed with 6 sequence pairs that were randomly split from the original data for each group ratio. Of the original data set (consisting of 56 sequence pairs), 50 pairs were used for model construction using discriminant analysis, and 6 were used for validation using the constructed models. Two groups were classified based on susceptibility to viral infection: Group 1, which contained sequence pairs with high similarity, which increases cross-species infection propensity, and Group 2, which contained sequence pairs with low similarity, which decreases cross-species infection propensity. 
The possibility of virus-host-cell interactions, which confer host susceptibility to viral infection, was determined using a database search and literature review for each virus whose specific receptor is known. Viruses with unknown or unspecified host receptors were excluded from the original data set, and both zoonotic and non-zoonotic viruses were included. Each receptor data set had at least one sequence derived from the natural host (primary reservoir). The numbers of sequence pairs in Groups 1 and 2 of the original data set were 36 and 20, respectively. The frequencies of the groups in the training data set were 32 and 18, and those in the test data set were 4 and 2, respectively.

Classification of host pairs into infectivity groups was performed using evolutionary distance and sequence similarity based on sequence alignment. Multiple sequence alignments and optimal pairwise sequence alignments were performed on the collected sequences of each of the 18 receptor protein sets using MUSCLE (Edgar, 2004). A scoring matrix was used to represent the evolutionary distance of a sequence from the other sequence within the pair. MEGA6 software was used to calculate the distance (Tamura et al., 2013). The distance of a residue within a sequence was measured as the substitution score from the amino acid of the relevant column in a matched sequence. The matrix showed substitution scores for all possible sequence pairs within a receptor data set. Total sequence similarity was calculated using Java programming based on the alignment results. As a result of the distance and similarity analysis, we parameterized the absolute distance (pairwise distance), relative distance, and overall sequence similarity for each host pair in the data set. All of the host-pair data used to generate predictive variables were examined for infection characteristics by literature review. The absolute distance (pairwise distance, $^{g}S_{i,1}$) is an estimate of the evolutionary divergence between sequences and is defined as the number of amino acid substitutions between aligned sequences. The Poisson correction model was used as a substitution model and the complete deletion method was used to process gaps/missing data (Zuckerkandl and Pauling, 1965). The relative distance ($^{g}S_{i,2}$) is the ratio of the pairwise distance to the maximum distance value calculated from the distance analysis results (pairwise distance ÷ maximum distance within the data set). 
The total similarity ($^{g}S_{i,3}$) is the result of a similarity analysis of all possible host pairs in the data set and is the ratio of the number of matched amino acids in an orthologue sequence to the total length of the aligned sequence including indels. These three independent variables were used to classify group members into the cross-species infectious or non-infectious group, and the decision coefficients obtained from the discriminant analysis were used to build a prediction model. The similarity of predicted interaction hotspots with some amino acid residues in the receptor sequence was also considered as a candidate predictive variable, but its significance in the discriminant analysis was low, and it was excluded from the prediction model. In the scores for the three variables, $^{g}S_{i,1}$ is the score for the distance of the $i$th row (indexed over the variable sets) of infectivity group $g$, $^{g}S_{i,2}$ is the score for the relative distance of the $i$th row of infectivity group $g$, and $^{g}S_{i,3}$ is the score for the similarity of the $i$th row of infectivity group $g$. The distance scores (absolute and relative) and the similarity scores of the collected sequence pairs were stored in a MySQL database. All decision coefficients for the three independent variables were calculated and used to construct a model for classifying group members using a discriminant function.
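The three predictive variables described above can be sketched in a few lines. This is a minimal illustration, not the authors' MEGA6/Java pipeline: the function names and the toy aligned sequences are hypothetical, the Poisson-corrected distance is taken as d = -ln(1 - p) with p the proportion of differing sites, and gap columns are dropped per pair, which simplifies MEGA's complete deletion (that option removes a column if any sequence in the whole alignment has a gap there).

```python
import math

GAP = "-"

def pair_features(seq_a, seq_b):
    """Return (Poisson-corrected distance, total similarity) for two aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    # Simplified gap handling (assumption): drop columns with a gap in either sequence.
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != GAP and b != GAP]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    p = sum(a != b for a, b in columns) / len(columns)
    if p >= 1.0:
        raise ValueError("Poisson correction is undefined when every site differs")
    distance = 0.0 if p == 0 else -math.log(1.0 - p)
    # Total similarity: matched residues over the full aligned length, indels included.
    matches = sum(a == b for a, b in columns)
    similarity = matches / len(seq_a)
    return distance, similarity

def relative_distances(distances):
    """Divide each pairwise distance by the maximum distance in the data set."""
    d_max = max(distances.values())
    if d_max == 0:
        return {pair: 0.0 for pair in distances}
    return {pair: d / d_max for pair, d in distances.items()}

# Hypothetical aligned receptor fragments, for illustration only.
alignment = {"host_a": "MKT-LLVA", "host_b": "MKTALLVA", "host_c": "MRT-LIVG"}
names = sorted(alignment)
pairs = [(x, y) for i, x in enumerate(names) for y in names[i + 1:]]
features = {pair: pair_features(alignment[pair[0]], alignment[pair[1]]) for pair in pairs}
relative = relative_distances({pair: f[0] for pair, f in features.items()})
```

Each host pair then carries one value per variable: `features[pair][0]` (pairwise distance), `relative[pair]` (relative distance), and `features[pair][1]` (total similarity).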

IBM SPSS Statistics 24.0 (Cor, 2016; Green et al., 1996) and XLSTAT (Addinsoft, 2017) software were used for discriminant analysis to classify susceptibility to cross-species infection based on receptor similarity. Discriminant analysis is a form of multivariate analysis in which distinct sets of observations are classified according to previously defined groups, and a model is built from predictive variables and objects to allocate new observations to pertinent groups (Mika et al., 1999). The discriminant analysis method classifies an individual object into a group using the discriminant scores from a linear discriminant function or the Mahalanobis distance from pooled covariance matrices of non-normally distributed sets (Mika et al., 1999). In this study, we used discriminant scores to estimate infection propensities, considering that the differences in probability of cross-species infection of viruses between the two groups (infectious and non-infectious) should be maximally reflected and quantitatively expressed through the same index. Considering that the adjustment for the scalar propensity score is sufficient to remove bias due to all observed covariates (Rosenbaum and Rubin, 1983), we calculated propensity scores of all sequence pairs to estimate cross-species infection propensities between host species through the adjustment of observed covariates and discriminant scores. The adjustment was performed by calibrating the discriminant scores of all sequence pairs using a first-order function derived from the discriminant score range and the centroid for each group. The z-score was used as a cut-off for classifying each group member into the infectious or non-infectious group. The discriminant coefficient is estimated to maximize the difference between the groups and is multiplied by each independent variable as a weight. 
The discriminant scores of all sequence pairs were calculated and the average discriminant scores were used to obtain the group centroid. The discriminant and propensity scores were stored in a MySQL database and used to generate a web application system.
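To make the roles of the pooled covariance matrix, the discriminant coefficients, and the group centroids concrete, the following is a minimal sketch of a two-group linear (Fisher) discriminant function in plain Python. It is not the SPSS or XLSTAT procedure used in the study. The toy rows of (pairwise distance, relative distance, total similarity) are hypothetical, and scaling the scores to unit pooled within-group variance with a zero grand mean is an assumed convention.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def pooled_covariance(groups):
    """Pooled within-class covariance matrix over several groups of rows."""
    dim = len(groups[0][0])
    pooled = [[0.0] * dim for _ in range(dim)]
    for rows in groups:
        mu = mean_vector(rows)
        for row in rows:
            dev = [x - m for x, m in zip(row, mu)]
            for i in range(dim):
                for j in range(dim):
                    pooled[i][j] += dev[i] * dev[j]
    dof = sum(len(rows) for rows in groups) - len(groups)
    return [[value / dof for value in line] for line in pooled]

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def fit_lda(group1, group2):
    """Return (coefficients, constant) of the linear discriminant function."""
    cov = pooled_covariance([group1, group2])
    m1, m2 = mean_vector(group1), mean_vector(group2)
    w = solve(cov, [a - b for a, b in zip(m1, m2)])
    dim = len(w)
    # Scale so that the pooled within-group variance of the scores is 1.
    var = sum(w[i] * cov[i][j] * w[j] for i in range(dim) for j in range(dim))
    w = [value / math.sqrt(var) for value in w]
    grand = mean_vector(group1 + group2)
    return w, -sum(wi * gi for wi, gi in zip(w, grand))

def score(row, w, const):
    """Discriminant score of one row of predictive variables."""
    return const + sum(wi * xi for wi, xi in zip(w, row))

# Hypothetical rows: (pairwise distance, relative distance, total similarity).
GROUP1 = [(0.05, 0.10, 0.95), (0.10, 0.30, 0.90), (0.08, 0.15, 0.93), (0.15, 0.25, 0.86)]
GROUP2 = [(0.60, 0.80, 0.55), (0.75, 1.00, 0.48), (0.55, 0.90, 0.60), (0.70, 0.70, 0.52)]

W, CONST = fit_lda(GROUP1, GROUP2)
CENTROID1 = sum(score(r, W, CONST) for r in GROUP1) / len(GROUP1)
CENTROID2 = sum(score(r, W, CONST) for r in GROUP2) / len(GROUP2)
```

With this convention the group centroid is simply the mean discriminant score of a group, as in the text, and the first group (high similarity) receives positive scores.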

In cross-validation, each case was classified by the functions derived from all cases other than that case. Leave-one-out cross-validation was performed to analyze the accuracy of the discriminant analyses, and the erroneous classification rates were verified (Lachenbruch and Mickey, 1968). In each round, a single object was omitted from training, the discriminant model was built from the remaining objects, and the omitted object was then classified using that model. A total of 50 rounds of class predictions were performed, and the calculated confusion matrices were assessed. For both the original and cross-validated grouped cases, the ratio of the number of correctly grouped members to total members was calculated. The performance of the discriminant model can be determined by calculating the sensitivity, specificity, total classification accuracy, and area under the receiver operating characteristic (ROC) curve (AUC). Performance scores were calculated from the confusion matrices for the training sample, validation samples, and cross-validation results (Foody, 2002). The sensitivity, specificity, total classification accuracy, and AUC were defined as follows:

$$\text{Accuracy} = \frac{\text{number of correctly predicted sequence pairs}}{\text{total number of sequence pairs}}$$

The accuracies of test data sets are presented as the ratio of the number of correctly predicted sequence pairs to the total number of test sequence pairs. Seven rounds of class predictions were performed in the test phase.
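As a worked illustration of these performance measures, the sketch below computes sensitivity, specificity, and total accuracy from actual and predicted group labels, plus a rank-based AUC from discriminant scores. The standard textbook definitions are assumed, the function names are hypothetical, and the example labels are invented: they are chosen only so that five of six test pairs are correct, matching an accuracy of 83.3%.

```python
def confusion_counts(actual, predicted, positive):
    """Return (TP, FN, FP, TN) with `positive` as the label of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fn, fp, tn

def performance(actual, predicted, positive):
    """Sensitivity, specificity, and total classification accuracy."""
    tp, fn, fp, tn = confusion_counts(actual, predicted, positive)
    return {
        "sensitivity": tp / (tp + fn),   # correctly predicted positives
        "specificity": tn / (tn + fp),   # correctly predicted negatives
        "accuracy": (tp + tn) / len(actual),
    }

def auc(positive_scores, negative_scores):
    """Probability that a positive case outscores a negative one (ties count 0.5)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positive_scores
        for n in negative_scores
    )
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical test round: 4 infectious (1) and 2 non-infectious (2) pairs,
# with one infectious pair misclassified.
actual = [1, 1, 1, 1, 2, 2]
predicted = [1, 1, 1, 2, 2, 2]
result = performance(actual, predicted, positive=1)
```

Here `result["accuracy"]` equals 5/6, and an AUC of 1 corresponds to discriminant scores that separate the two groups completely.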

The propensity scores calculated from the discriminant analysis using the original data set were used for the implementation of our web application system. We named the propensity score, which indicates the probability of viral infection between species, Infectindex. The calculated Infectindex values ranged from 0.001 to 99.994% in the total data set, from 0.001 to 36.946% in the non-infectious group, and from 68.294 to 99.994% in the infectious group.

We also collected and stored reference sequences for 24 virus genomes to generate the target sequence resources required to execute the search engine (Sequence Similarity Scoring System) according to the virus sequence query input. Accession numbers and viral species for the reference genome sequences used in the study are listed in Table 2.

As a web server-based analysis tool, the web interface of ViCIPR was programmed to operate with a MySQL-based database to search information efficiently according to query input and to output calculation results. The data fields are divided into class (classified group), casenum (case number), receptor, virus, reservoir, host1 (primary/donor host), host2 (secondary/recipient host), pairwise_distance, relative_distance, total_similarity, g_verified (verified group), disg_predicted (predicted group by discriminant score), disds_calculated (calculated discriminant score), disds_trimmed (adjusted discriminant score), and Infectindex (cross-species infection propensity) (Table 3). The values corresponding to each data item (sequence pair) were classified, assigned, calculated, and stored. Table 3 shows the stored data items, their types, and the values stored in each item.

Table 2 (note). Reference sequences for 24 virus genomes were collected and stored to generate a target sequence resource for the similarity search engine (Sequence Similarity Scoring System) in ViCIPR according to a virus sequence query input. In constructing the gene database and the protein database, 118 gene and protein sequences were collected and processed to construct target sequence resources by parsing the cds regions of the 24 reference genome sequences.

Table 3 (excerpt). g_verified (Green et al., 1996): predetermined group 1 for cases verified as the non-infectious group through literature review; predetermined group 2 for cases verified as the infectious group through literature review. disg_predicted, int (Green et al., 1996): score-based predicted group 1 for cases predicted as the infectious group by the discriminant model; score-based predicted group 2 for cases predicted as the non-infectious group by the discriminant model.

In this study, web programming was performed to implement the functionality provided by ViCIPR. To implement the sequence input window and an upload function for the user's sequence to perform a sequence homology search, sequence databases were generated as a target sequence resource. All collected sequences were stored in a web project as a FASTA file. We also developed a 'Sequence Similarity Scoring System' based on the Java programming language. HTML, JSP, and JavaScript were also used for web development that includes the functions of the program. The main analysis page was designed to enable users to input or upload a query sequence and select a primary (donor) host organism from the species available in the database. To present a selectable secondary (recipient) host species based on the virus genome sequence with the highest percent identity (%) among the target sequences (virus reference genome sequences), the web interface was designed to include a dynamic search function for multiple conditions, such as virus species and primary host species, in conjunction with the MySQL database. 
The information for potential secondary host species is provided on the web page, and a dynamic selection box is presented for confirmation. The web interface was configured to search, store, and output all information corresponding to virus name, primary and secondary host species, and Infectindex under a series of analysis processes to output the calculated infection propensities corresponding to the selected data item. For reliable data management and future updates, the pairwise distance, relative distance, total similarity, discriminant scores, and Infectindex values of all sequence pairs were stored in the MySQL database.
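The table layout and the dynamic search for secondary hosts can be sketched as follows. This is an illustration only: it uses Python's built-in `sqlite3` in place of MySQL, the table name `propensity`, the column types, and the helper names are assumptions, and only the field names and the three Infectindex values (taken from the results reported in this paper) come from the text. Columns whose values are not given here are left empty.

```python
import sqlite3

# Field names follow the data fields listed in the text; the types are assumed.
SCHEMA = """
CREATE TABLE propensity (
    "class"            INTEGER,
    casenum            INTEGER PRIMARY KEY,
    receptor           TEXT,
    virus              TEXT,
    reservoir          TEXT,
    host1              TEXT,
    host2              TEXT,
    pairwise_distance  REAL,
    relative_distance  REAL,
    total_similarity   REAL,
    g_verified         INTEGER,
    disg_predicted     INTEGER,
    disds_calculated   REAL,
    disds_trimmed      REAL,
    Infectindex        REAL
)
"""

# (virus, host1, host2, Infectindex) as reported in the results.
ROWS = [
    ("Nipah virus", "S. scrofa", "H. sapiens", 88.572),
    ("Nipah virus", "S. scrofa", "I. punctatus", 4.769),
    ("SARS-CoV", "F. catus", "H. sapiens", 98.495),
]

def build_demo_db():
    """Create an in-memory database holding the example rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO propensity (virus, host1, host2, Infectindex) VALUES (?, ?, ?, ?)",
        ROWS,
    )
    return conn

def candidate_hosts(conn, virus, host1):
    """List selectable secondary hosts and their Infectindex for a virus and donor host."""
    cursor = conn.execute(
        "SELECT host2, Infectindex FROM propensity "
        "WHERE virus = ? AND host1 = ? ORDER BY Infectindex DESC",
        (virus, host1),
    )
    return cursor.fetchall()

conn = build_demo_db()
nipah_hosts = candidate_hosts(conn, "Nipah virus", "S. scrofa")
```

A selection box in the web interface would be populated from a query of this kind once the virus species and the primary host are known.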

We organized a system development environment to use our analysis system as an open web resource. For the construction of a server-based calculation program and web interface, we built a server with Intel Xeon E5-240 v2 2.40GHz CPU, 8Gb RDIMM, 1600MT/s Memory, 300GB 15K RPM, and 6Gbps HDD specification. We used CentOS 6.6 as the operating system and MySQL 5.5.40 as the data management system for data storage in a Linux server environment. The Java programming language was used for data parsing and computing program development, and JSP, HTML, and JavaScript were used for the web pages. The web server program was based on Apache, and Tomcat v7.0.55 was used as a web container.

We propose a computational framework to estimate propensities of virus infection between host species. The framework largely consists of four stages: 1) amino acid sequence analysis and predictive variable selection, 2) construction of the classification model, 3) model-based prediction, and 4) calculation of propensity scores (Fig. 1). Based on the results of the discriminant analysis, three characteristics, which quantitatively indicate the evolutionary distance and similarity between two sequences, were selected as predictive variables and used to construct the model. The similarity of the interaction hotspots was excluded from the model structure due to low significance. We derived covariance matrices and pooled within-class covariance matrices for discriminant analysis. SPSS 24.0 was used to calculate the inverse matrix and discriminant coefficients to derive the discriminant model and evaluate the contributions of predictive variables. A training data set consisting of 50 sequence pairs was used for model construction, and the constructed classification model was used in model-based prediction using the test data set consisting of six sequence pairs. The z-scores for all sequence pairs were used to classify or predict the group of individual sequence pairs and to estimate the probability of infection between host species in each data set. In the final step, the propensity score for each sequence pair was calculated according to each centroid and z-score transformation equation, and the predicted infection propensity value was presented as the Infectindex.
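The classification stage can be illustrated with the group centroids reported in the Fig. 1 caption (2.428 for the infectious group and −4.316 for the non-infectious group) and the training group sizes (32 and 18). The nearest-centroid rule below is an assumption made for illustration; the paper states only that the z-score was used as a cut-off and does not give the exact rule or the transformation to the Infectindex, so neither is reproduced here.

```python
# Group centroids and training group sizes as reported in the paper.
CENTROIDS = {"infectious": 2.428, "non-infectious": -4.316}
TRAINING_SIZES = {"infectious": 32, "non-infectious": 18}

def classify(z):
    """Assign a discriminant score to the group with the nearest centroid (assumed rule)."""
    return min(CENTROIDS, key=lambda group: abs(z - CENTROIDS[group]))

def weighted_centroid_mean():
    """Size-weighted mean of the group centroids; near zero for centred scores."""
    total = sum(TRAINING_SIZES.values())
    return sum(CENTROIDS[g] * TRAINING_SIZES[g] for g in CENTROIDS) / total

# Extremes of the z-score range reported for the training set.
labels = [classify(3.526), classify(-5.804)]
```

The size-weighted mean of the two centroids, (32 × 2.428 − 18 × 4.316) / 50, is about 0.0002, which is consistent with discriminant scores centred on zero over the 50 training pairs.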

Based on the Infectindex, the infection properties were classified into two groups in the training data set: 32 infectious and 18 non-infectious properties. In the test data set, the infection properties were also classified into two groups: four infectious and two non-infectious properties. The discriminant model had 100% sensitivity, specificity, and total classification accuracy using the training and validation samples. The AUC was calculated as 1. The accuracy of the six test data sets was calculated as the ratio of the number of correctly predicted sequence pairs to the total number of test sequence pairs, and confirmed to be 100%. Despite the low level of mutations among the orthologous receptor sequences used as the current input data, the high sensitivity and specificity obtained in our results may be significant for prediction results produced by the statistical model based on evolutionary and genetic similarities between potential host species. Variables derived from evolutionary distance and sequence similarity appear to have acted as positive or negative factors influencing species-specific susceptibility. However, considering that this approach was limited to the currently available data, our results may be insufficient to obtain highly accurate predictions of cases where detailed and specific infection characteristics must be considered. This limitation serves as a disadvantage in predicting new cases, especially when the method is applied to exceptional cases where the similarity to the current input data is very low. As shown by the relationship among the predictive parameters and resulting propensity index (Table 4), input data with close evolutionary distance and high sequence similarity produce an output result of an infectious group and high Infectindex value. 
In this regard, the present model should be cautiously applied to new cases that may confer more diverse and subtle host specificities under the assumption that, as evolutionary distance increases and sequence similarity decreases, the probability of sharing sequence properties associated with susceptibility to cross-species infections also decreases. Furthermore, the level of variation among receptor sequences may have a significant impact on host susceptibility, but exceptional patterns can be observed, for example new cases that do not follow linear species distances due to subtle differences in host specificity. In this study, substitution matrices were calculated from target sequences, and the discriminant model consisting of the combined variables was evaluated.

In the leave-one-out cross-validation, all original grouped cases were correctly classified. However, the cross-validation result for the test dataset (83.3%) suggests that incomplete coverage of potential polymorphisms in receptor sequences, which may affect infection propensities, can compromise prediction accuracy. Therefore, it remains difficult to predict risk of infection based on the primary structure information of the receptor proteins. However, the discriminant coefficient (7.026) of similarity within the sequence pair, which had the most positive effect on host susceptibility to viral infection, confirmed the importance of high-accuracy sequence polymorphism in improving classification and prediction accuracy. These results suggest that our approach could be improved by including more receptor sequence data. Both SPSS 24.0 and XLSTAT software were used to conduct the discriminant analysis and accuracy evaluation. The z-scores for all cases in the training set ranged from −5.804 to 3.526, and the propensity scores ranged from 0.001 to 99.993%. The propensity score ranges were 0.001 to 36.946% and 68.294 to 99.993% for the non-infectious and infectious groups, respectively. These results confirmed that the model correctly classified the original cases into infectious and non-infectious groups with the pertinent propensity score range. From the constructed discriminant model, we devised a simple calculation process that can predict the infection propensity for new individual cases by deriving the first-order function formula using the z′-value, which is converted from the discriminant score using the group centroids. Table 4 shows cases where there is a high importance of cross-species host susceptibility to viruses in the classification results of the discriminant analysis with distance and similarity scores. The representative zoonotic viruses MERS-CoV and SARS-CoV have an Infectindex of 98.759% for C. dromedarius-H. sapiens and 98.495% for F. catus-H. sapiens. 
Both cases show a high probability of interspecies infection. Considering the actual host range of SARS-CoV (Martina et al., 2003; Holmes, 2005), these findings confirm that the difference in Infectindex value reflects the infection properties of the virus. In the case of Nipah virus, the interspecies infection propensity of S. scrofa-H. sapiens was 88.572%, while that of S. scrofa-I. punctatus was 4.769%. The results show that the infection properties of the viruses differed significantly among the host-pair data sets (Chua et al., 1999; AbuBakar et al., 2004) (Table 4). In the present study, the maximum calculated propensity value, which represents the greatest risk of cross-species infection, was used as the Infectindex for viruses that recognize different receptors. For example, the Infectindex between M. brandtii and H. sapiens for Nipah virus was 88.310 for ephrin B2 and 88.572 for ephrin B3; we defined the Infectindex as the highest index value, i.e. 88.572.

M. Cho and H.S. Son, Infection, Genetics and Evolution 73 (2019) 71-80
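The rule for viruses that recognize more than one receptor amounts to taking the maximum over the per-receptor values. A minimal sketch using the Nipah virus figures quoted above (the function name is hypothetical):

```python
def infectindex_for_pair(per_receptor):
    """Return the (receptor, value) with the highest propensity for one host pair."""
    receptor = max(per_receptor, key=per_receptor.get)
    return receptor, per_receptor[receptor]

# Nipah virus, M. brandtii to H. sapiens, values as reported in the text.
nipah = {"ephrin B2": 88.310, "ephrin B3": 88.572}
best_receptor, infectindex = infectindex_for_pair(nipah)
```

Taking the maximum reports the greatest estimated risk across receptors, which is the conservative choice described in the text.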

In this study, we constructed ViCIPR, a server-based tool to enable users to easily extract cross-species infection propensities of specific viruses using a simple two-step procedure. ViCIPR can be accessed from a web server at http://lcbb3.snu.ac.kr/ViCIPR/home.jsp.

ViCIPR provides search functions for homology and calculates results based on query sequences with an interface that allows users to select from a set of databases, including sequence data sets uploaded by the user. The target sequence resource for calculating the Infectindex was a sequence database collected from the NCBI protein database, processed, and stored. The similarity of a query sequence to database reference sequences is calculated, and the result is presented via the 'Sequence Similarity Scoring System'. Users can enter a query sequence by pasting directly into the query box or by uploading a sequence as a FASTA file from a local computer. Currently, our sequence similarity scoring system is performed based on three databases: ViCIPR, all nucleotides (genomes); ViCIPR, all cds nucleotides (genes); and ViCIPR, all cds proteins (proteins). Fig. 2 shows the main components and data flow of ViCIPR. ViCIPR presents a selectable secondary (recipient) host species based on similarity search results for the input data corresponding to the query sequence (gene or protein) and the host species, and presents the Infectindex calculation result at the same time as the options are selected (Fig. 2). Fig. 3 shows the progress and results of extracting predicted infection propensities of SARS-CoV using ViCIPR. As shown in Fig. 3, a simple two-step procedure makes it easy to obtain the Infectindex of two potential hosts for cross-species infection of a particular virus.
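The first step of the procedure, matching a query to the most similar reference sequence, can be sketched as below. The scoring method of the Java-based Sequence Similarity Scoring System is not described in the text, so this example substitutes the `ratio()` of Python's `difflib.SequenceMatcher` as a crude stand-in for percent identity. The reference names and sequences are hypothetical.

```python
from difflib import SequenceMatcher

def percent_similarity(query, reference):
    """Crude similarity in percent; a stand-in, not an alignment-based identity."""
    matcher = SequenceMatcher(None, query, reference, autojunk=False)
    return 100.0 * matcher.ratio()

def best_match(query, references):
    """Return (name, score) of the reference most similar to the query."""
    scored = {name: percent_similarity(query, seq) for name, seq in references.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

# Hypothetical reference fragments keyed by a made-up virus label.
REFERENCES = {
    "virus_A": "ATGGCGTACGTTAGC",
    "virus_B": "ATGCCCTTTGGGAAA",
}

name, similarity = best_match("ATGGCGTACGTTAGC", REFERENCES)
```

The virus label returned here would then drive the lookup of selectable secondary hosts for the chosen primary host.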

Fig. 1. A computational framework to estimate propensities for virus infection between host species. The framework largely consists of four stages: 1) amino acid sequence analysis and predictive variable selection, 2) construction of the classification model, 3) model-based prediction, and 4) calculation of propensity scores. Based on the results of the discriminant analysis, three measures were selected as predictive variables and used to construct the model. We derived covariance matrices and pooled within-class covariance matrices for the discriminant analysis. SPSS 24.0 software was used to calculate the inverse matrix and discriminant coefficients, to derive the discriminant model, and to evaluate the contributions of predictive variables. In this figure, $C_1$ and $C_2$ indicate the group centroid of each group used for z′-value computation, and α and β are the coefficients used to transform the z′-score for propensity estimation. Host1, original/donor host species; host2, alternative/recipient host species. $^{g}S_{i,1}$, $^{g}S_{i,2}$, and $^{g}S_{i,3}$, pairwise distance, relative distance, and total similarity, respectively. Groups were classified based on the discrimination scores (DSs) (1, infectious group; 2, non-infectious group). The DS was calibrated for correct classification and propensity calculation in the dataset. The group centroid of the discriminant function was 2.428 for the infectious group and −4.316 for the non-infectious group, and was used to calculate the Infectindex.

Infections caused by newly emerging viruses are serious threats to public health and have become a global concern. A variety of factors and their interactions may contribute to disease emergence. In this study, we focused on how viruses can be transmitted between an established reservoir species and a new host species and on what determines the potential of a virus to cross the barrier to a previously uninfected species. Based on the hypothesis that the major barrier to interspecies virus infection is the difference between cell-receptor sequences, we evaluated the genetic risk for cross-species infections between potential host species based on evolutionary distance and similarity of receptor sequences. Correct results from both the training and test data sets may have been a result of accurate measurement of possible sequence variations that affect cross-species infection susceptibility. The accuracy of predictive variables that positively and negatively influence virus transmission suggests that this approach could be improved with additional receptor sequence data. Among the three variables, sequence similarity was the most important for classification and prediction. The possibility of identifying virus transmission between hosts without requiring complex structural analysis suggests the importance of this sequence-based approach. Due to the lack of available data, this method is limited to viruses whose receptor use has been determined. In our method, it is necessary to calculate distance matrices and perform MSA for parameterization of the characteristics obtained from protein sequence analyses. Thus, accurate protein sequence data comprising sufficient sequence lengths should be available for various species. As a result, the Infectindex was estimated for ephrin B2 and ephrin B3 of the Nipah virus, respectively, by excluding from the analysis other receptor proteins whose sequence data for various species were insufficient. For the receptors used by other viruses, further analyses will enable calculation of the Infectindex as soon as sufficient data are available. We will continue to conduct additional studies to update the relevant data and prediction tools. In addition, our results are based on primary structural information from full-length protein sequences and are limited in that they do not reflect the detailed mechanisms involved in virus-host-cell interaction. These limitations can make it difficult to explain subtle differences in the susceptibility of host organisms to viral infections, which exhibit differences in host range at the strain level. 
Therefore, the statistical model and model-based discriminant values should be interpreted cautiously at the viral species level, and further experimental validation is required to improve applicability. For example, in the case of influenza virus, the type of sialic acid linkage (2,3- or 2,6-linkage), which is determined by the biochemical repertoire of the host cell, has been identified as an important factor in the preferential recognition of and binding to avian or human cells. In this study, these subtype-level properties were not used to generate parameters for model construction. To take such detailed features into account, additional problems must be solved, such as adjusting to the taxonomic level of the other virus species in the data sets and weighting specific amino acid residues that affect multiple receptor types. In addressing these problems, the accuracy of multivariate classification and prediction can be impaired if a sufficient amount of reliable data cannot be guaranteed. Therefore, we believe that further research based on our findings should seek more advanced methods that can be applied in special cases such as influenza virus. As discussed above, the propensity for interspecies infection, which is estimated statistically for each virus species, has limitations in reflecting subtle differences that may occur at lower virus classification levels.

Fig. 2. Main components and data flow of ViCIPR, a web-based prediction system. The data flow is shown together with the key components of the ViCIPR analytical system (http://lcbb3.snu.ac.kr/ViCIPR/home.jsp): the process of establishing a discriminative model and predictive protocol, a database capable of dynamic interaction, and a user-friendly web interface. As shown, ViCIPR was constructed on the basis of our own statistical protocol. The results of the protocols and prediction studies were stored in a database that can be used in ViCIPR. Operation of the ViCIPR analysis system is initiated by the input of query sequences (nucleotides or proteins) and selection of the primary (donor) host species. Next, a similarity search is performed using the query sequence and host information supplied by the user. Based on the similarity search result, the virus species with maximum similarity is given as the output, and a selectable secondary (recipient) host species is presented in connection with the MySQL database. The calculated Infectindex for the selected host pair of the corresponding virus species is displayed as soon as the secondary host species is selected.
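The two-step ViCIPR procedure described in the Fig. 2 caption (similarity search, then host-pair lookup) can be sketched as follows. This is a toy, in-memory illustration and not the server's implementation: the sequences, virus and host names, and index values are hypothetical placeholders rather than data from the study, `difflib.SequenceMatcher` stands in for the unspecified similarity search program, and a dictionary stands in for the MySQL database.

```python
from difflib import SequenceMatcher

# Hypothetical reference library: virus name -> sequence (placeholders).
VIRUS_LIBRARY = {
    "virus_X": "ATGGCTAGCTAGGCTA",
    "virus_Y": "TTTTCCCCGGGGAAAA",
}

# Hypothetical stored results: (virus, donor host, recipient host) -> index.
# The numbers are placeholders, not values from the paper.
INFECTINDEX_TABLE = {
    ("virus_X", "host_A", "host_B"): 0.75,
    ("virus_X", "host_A", "host_C"): 0.25,
}


def best_matching_virus(query):
    """Step 1: return the library virus most similar to the query sequence."""
    scores = {
        name: SequenceMatcher(None, query, seq).ratio()
        for name, seq in VIRUS_LIBRARY.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


def recipient_hosts(virus, donor):
    """List the recipient hosts stored for a given virus and donor host."""
    return sorted(r for (v, d, r) in INFECTINDEX_TABLE if v == virus and d == donor)


def lookup_infectindex(virus, donor, recipient):
    """Step 2: fetch the stored index for the chosen host pair, or None."""
    return INFECTINDEX_TABLE.get((virus, donor, recipient))
```

A query identical to the `virus_X` placeholder sequence matches `virus_X`, after which `recipient_hosts("virus_X", "host_A")` lists the selectable recipients and `lookup_infectindex` returns the stored value for the chosen pair.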

However, assuming that the sequence similarity variable itself captures variation in subtype-specific virus-cell interactions, which confer the receptor binding affinity that drives changes in host tropism, we expect that receptor sequence polymorphisms across a wide variety of potential host species can be measured accurately and that the resulting information about relevant residues can be utilized. In the present study, interaction hotspot similarity, which preliminary analysis had suggested would be an important variable, was excluded from the model because of its low significance. This is probably due to the limitations of using predicted data, given the lack of information on protein tertiary structure and on the amino acid residues important in viral attachment to host cells. Identifying the specific amino acid residues that participate in the virus-host interaction would enable more realistic simulation of the infection mechanism and, in turn, more precise prediction of the host ranges and infection propensities of newly emerging viruses. Considering how the secondary structure and the physical and chemical properties of the proteins involved in virus-host interactions affect host susceptibility to cross-species viral infections may also improve this method. In this study, predictive variables were determined from receptor sequence analyses, but we believe that the receptor-binding domains of each virus will provide significant information to improve the predictive power of the model. Further effort should be made to improve the flexibility of the method in handling subtle differences in host specificity that do not follow linear species distances, while ensuring high accuracy.
In this regard, we are currently parameterizing various protein properties to improve model prediction accuracy, and are working on docking simulations using virus and host protein information to produce reliable data for our future prediction studies. Complex factors that determine viral host tropism, such as the variety of viral transmission modes, the action of animal vectors or carriers, and the host immune response, could be included for better prediction. Although further refinements are needed, this approach may be useful as a basic tool for preliminary studies aimed at accurately predicting host susceptibility to new viral infections.
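The sequence-derived variables discussed above depend on pairwise comparison of aligned receptor sequences. A minimal sketch of such a comparison is given below, assuming the sequences have already been aligned to equal length by an MSA program. The p-distance (proportion of differing sites) is used here as a simple stand-in, because the distance model and similarity scoring program used in the study are not described in this section; the function names are illustrative.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap in either sequence are skipped (pairwise deletion).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity over the compared (ungapped) sites."""
    return 100.0 * (1.0 - p_distance(seq_a, seq_b, gap))


def distance_matrix(aligned):
    """Pairwise p-distances for a dict mapping species name -> aligned sequence."""
    names = list(aligned)
    return {(m, n): p_distance(aligned[m], aligned[n]) for m in names for n in names}
```

For two aligned fragments that differ at one of four ungapped sites, `p_distance` returns 0.25 and `percent_identity` returns 75.0.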

The authors declare no conflict of interest.

Fig. 3. A simple two-step procedure in the ViCIPR web interface. The process and results of extracting the Infectindex for SARS-CoV are shown. In the first step, a similarity search was performed against the viral genome sequence library among the target sequence resources in the ViCIPR genomic database, and the results were output. Using the built-in search function of ViCIPR, the maximum matching score, the virus species with the best and most relevant hits, and the virus species with hits (%) and selectable hosts corresponding to the significant sequence alignments were retrieved. In the second step, a list of selectable primary host species is presented once the user selects the virus, based on the virus species with the maximum percent identity among the viruses corresponding to the target sequences. Finally, the cross-species infection propensity of the host pair selected by the user is calculated and output to the result window. As shown, the ViCIPR database similarity search indicated that SARS-CoV was the virus species most similar to the query sequence. The secondary hosts selectable for the primary host F. catus, namely H. sapiens and M. putorius furo, are presented in the select box. The result of the Infectindex calculation is output to the box as soon as the secondary host species is selected.

Supplementary data to this article can be found online at https://doi.org/10.1016/j.meegid.2019.04.016.