key: cord-0470072-ucg27sht authors: Chrysostomou, Charalambos; Alexandrou, Floris; Nicolaou, Mihalis A.; Seker, Huseyin title: Classification of Influenza Hemagglutinin Protein Sequences using Convolutional Neural Networks date: 2021-08-09 journal: nan DOI: nan sha: 1214ffbd67f7d15e63f5d039c41ae6541628eb40 doc_id: 470072 cord_uid: ucg27sht

The Influenza virus can be considered one of the most severe viruses, as it can infect multiple species, often with fatal consequences for the hosts. The Hemagglutinin (HA) gene of the virus can be a target for antiviral drug development, realised through accurate identification of its sub-types and possibly the targeted hosts. This paper focuses on accurately predicting whether an Influenza type A virus can infect specific hosts, namely Human, Avian and Swine hosts, using only the protein sequence of the HA gene. In more detail, we propose encoding the protein sequences into numerical signals using the Hydrophobicity Index and subsequently utilising a Convolutional Neural Network-based predictive model. The Influenza HA protein sequences used in the proposed work are obtained from the Influenza Research Database (IRD). Specifically, complete and unique HA protein sequences were used for avian, human and swine hosts. The data obtained for this work comprised 17999 human-host proteins, 17667 avian-host proteins and 9278 swine-host proteins. Given this set of collected proteins, the proposed method yields as much as 10% higher accuracy for an individual class (namely, Avian) and 5% higher overall accuracy than an earlier study. It is also observed that the accuracy for each class in this work is more balanced than what was presented in this earlier study. As the results show, the proposed model can distinguish HA protein sequences with high accuracy according to whether the virus under investigation can infect Human, Avian or Swine hosts.

Pathogens, including viruses, can cause infectious diseases to spread within populations.
These pathogens can be transmitted in multiple ways, with high transmission rates in most circumstances. Identification and diagnosis of infectious diseases are vital, especially for novel pathogens, to prevent and control global pandemics, such as the current COVID-19 pandemic [1]. Influenza viruses are part of the Orthomyxoviridae family of viruses, which have negative-sense, single-stranded, segmented RNA genomes, with the majority of the virus burden being associated with influenza viruses type A and B [2]. Influenza viruses capable of infecting humans were introduced from birds and swine [3]. Their introduction to humans has led to global pandemics, including the 1918 "Spanish flu" and 2009 "Swine flu" pandemics. Influenza viruses are responsible for more than 500,000 deaths worldwide and affect around 5-15% of the population each year [4]. The evolution of influenza viruses enables them to infect individuals who have previously gained immunity through vaccination or previous infections. Furthermore, Influenza viruses can be effectively transmitted within populations through direct contact, respiratory droplets, and objects or materials in the environment, such as clothes, utensils, and furniture. Currently, vaccines are the most efficient and practical method of limiting the spread of the influenza virus and stopping Influenza epidemics. These vaccines must be renewed periodically to maintain their effectiveness against Influenza viruses.
The Influenza virus genome contains eight gene segments: PB2 (polymerase subunit), responsible for encoding RNA; PB1 (polymerase subunit), responsible for encoding RNA and including the PB1-F2 protein, which causes cell death; PA (polymerase subunit), responsible for encoding RNA and including the PA-X protein, which is responsible for the host transcription shutoff function; NP (nucleoprotein); M1 and M2 (matrix proteins); NS1 and NEP (non-structural proteins); NA (neuraminidase), which facilitates the release of new viruses from the infected host cell; and HA (hemagglutinin), responsible for binding and entry of the virus into the host cell. As this study's primary aim is to classify the capability of an Influenza virus to infect different hosts, we focus primarily on the HA gene. Influenza viruses are classified into subtypes based on the organisation of HA and NA glycoproteins on their surfaces. Currently, 18 HA subtypes and 11 NA subtypes exist in the wild [5], with the majority affecting Avian species. Only three subtypes are known to adapt and circulate in Humans: H1N1, H2N2 and H3N2. Seasonal epidemics are caused by two of these Influenza subtypes, H1N1 and H3N2.

In the literature, various methods have been developed and used for analysing and characterising protein sequences. More specifically, for the classification and characterisation of Influenza subtypes [6], [7], [8], [9], [10], [11], methods such as the complex resonant recognition model for analysing influenza subtype protein sequences [7], CISAPS (complex informational spectrum for the analysis of protein sequences) [10], and structural classification of protein sequences based on signal processing and support vector machines [11] have been proposed and developed.
In this paper, a new method is proposed based on deep learning methodologies, and more specifically Convolutional Neural Networks (CNN), to classify Influenza protein sequences by their capability to infect a specific host, utilising only the amino acid sequence of the protein to accomplish unprecedented accuracy. The paper is organised as follows: Section II presents the methods and

The Influenza HA protein sequences used in the proposed work are obtained from the Influenza Research Database (IRD) [12]. Specifically, complete and unique Hemagglutinin (HA) protein sequences were used for avian, human and swine hosts, which are the primary hosts affected by the virus. The data obtained for this work comprised 17999 human-host proteins, 17667 avian-host proteins and 9278 swine-host proteins. The complete list of proteins, including HA subclasses, used in this study can be found in Table I, and Figure 1 illustrates the percentage of HA proteins per class and species.

The proposed analysis was performed directly on the protein sequence, using a Hydrophobicity index [13] to encode the protein sequences from alphabetical to numerical characters. Before encoding the protein sequences, the amino acid index values were re-normalised to the range 0-1. Any character beyond the standard 20 amino acids was encoded using the value 0. The complete list of Hydrophobicity amino acid index values can be found in Table II. The collected protein sequences have variable lengths, with 576 being the maximum number of amino acids. To encode the sequences for use with the proposed model, proteins with fewer amino acids were padded with 0's at the end of the sequence to reach this length. The class labels were further encoded with one-hot encoding for the multiclass classification problem.

The proposed work is based on a Convolutional Neural Network (CNN).
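The encoding, padding and label steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the Kyte-Doolittle hydropathy scale is used purely as a stand-in because the values of Table II are not reproduced in this text, and the names `normalise`, `encode_sequence`, `one_hot` and the class order in `HOSTS` are hypothetical.

```python
# Stand-in hydropathy scale (Kyte-Doolittle). ASSUMPTION: the paper uses the
# index listed in its Table II, which may differ from these values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
MAX_LEN = 576                        # longest sequence reported in the text
HOSTS = ("human", "avian", "swine")  # class order is an assumption


def normalise(index):
    """Min-max rescale the index values to the range 0-1."""
    lo, hi = min(index.values()), max(index.values())
    return {aa: (value - lo) / (hi - lo) for aa, value in index.items()}


def encode_sequence(seq, index, max_len=MAX_LEN):
    """Map residues to index values (non-standard characters become 0),
    then pad with zeros at the end up to max_len."""
    if len(seq) > max_len:
        raise ValueError("sequence is longer than max_len")
    signal = [index.get(aa, 0.0) for aa in seq.upper()]
    return signal + [0.0] * (max_len - len(signal))


def one_hot(host):
    """One-hot encode the host label for the three-class problem."""
    if host not in HOSTS:
        raise ValueError(f"unknown host: {host}")
    return [1.0 if h == host else 0.0 for h in HOSTS]


NORMALISED = normalise(KYTE_DOOLITTLE)
signal = encode_sequence("MKAILX", NORMALISED)  # toy fragment, 'X' is non-standard
label = one_hot("avian")
print(len(signal), signal[:6], label)
```

One consequence of min-max normalisation worth noting: the amino acid with the lowest index value is mapped to 0, the same value used for padding and for non-standard characters.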
CNNs are a subtype of deep feed-forward artificial neural networks that were originally applied to the analysis of visual representations but have recently been used in other domains, including the classification and characterisation of protein sequences [14], [15], [16]. CNNs utilise a variation of multi-layer perceptrons designed to minimise the pre-processing required [17]. Compared with other classification methodologies, CNNs therefore require relatively little pre-processing of the data, which is a substantial advantage where prior knowledge of and expertise in a given problem are unavailable.

As shown in Figure 2, the proposed model structure consists of three convolutional layers of 32 3x3 kernels, followed by a max-pooling layer and a rectified nonlinear activation function (Leaky ReLU) [18], transforming the feature space from 32x576 to 32x72. The fourth block consists of a Flatten layer that transforms the feature space from 32x72 to 2304x1. The fifth and sixth layers are fully connected layers of 128 and 64 neurons, respectively, with Leaky ReLU as the activation function. The output layer consists of 3 neurons, one for each of the available species, and utilises the softmax activation function [17]. The model is trained using the "Adam" optimiser [19] and the "categorical cross-entropy" loss function [17]. The proposed model and hyperparameters were selected through a trial and error process to maximise the training and validation accuracy. The source code of the proposed methodology is available at https://gitlab.com/charalambos.chrysostomou/embc21_influenza.git

In this paper, a classification model is presented, based on Deep Learning and Convolutional Neural Networks (CNN), for the characterisation and classification of Influenza type A viruses by their ability to infect a specific host, more specifically human, avian and swine hosts, solely using the HA protein sequence.
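As a check on the layer dimensions quoted for the proposed architecture, the short sketch below traces the feature-map shapes through the network. It rests on assumptions that the text does not state explicitly: each of the three convolutional stages is taken to use length-preserving ("same") padding and to be followed by max-pooling with a factor of 2, which is one configuration that reproduces the reported reduction from 32x576 to 32x72. The function name `trace_shapes` and its parameters are hypothetical.

```python
def trace_shapes(seq_len=576, filters=32, n_blocks=3, pool=2,
                 dense_units=(128, 64), n_classes=3):
    """Return (layer name, output shape) pairs for the assumed layout."""
    shapes = [("input", (1, seq_len))]
    length = seq_len
    for block in range(1, n_blocks + 1):
        # ASSUMPTION: the convolution keeps the length ('same' padding),
        # and the pooling that follows divides it by `pool`.
        length //= pool
        shapes.append((f"conv_block_{block}", (filters, length)))
    shapes.append(("flatten", (filters * length,)))
    for i, units in enumerate(dense_units, start=1):
        shapes.append((f"dense_{i}", (units,)))
    shapes.append(("softmax_output", (n_classes,)))
    return shapes


for name, shape in trace_shapes():
    print(f"{name:15s} {shape}")
```

Under these assumptions the sequence length goes 576, 288, 144, 72, and flattening 32 channels of length 72 gives the 2304 features quoted in the text.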
To ensure that the proposed model is accurate and that the results generalise, the model was trained 100 times for 1000 epochs each, with different random subsets chosen uniformly and assigned to the training, validation and testing sets. For each training cycle, the best weights were saved based on the validation error. To evaluate the proposed method's efficacy, the total accuracy, Matthews correlation coefficient (MCC) [20] and F-score (F1) [21] were averaged over the test sets. The proposed method yields almost 10%, 5% and 2% higher accuracy for Avian, Human and Swine, respectively, than the earlier study [22]. It is also observed that the accuracy for each class presented in our work is more balanced than what was presented in this earlier study. The final overall accuracy is also found to be as much as 5% higher than that of the earlier study.

The paper presents a highly successful predictive model for classifying Influenza viruses by their capability to infect different hosts, namely Human, Avian and Swine hosts, as the comparison of the results with existing methods shows. For this study, the protein sequence of the HA gene was used; HA is responsible for attaching to the host's cell and can be considered a promising antiviral drug target. The collected protein sequences were encoded using a normalised hydrophobicity amino acid index. As published in the literature, more than 600 unique amino acid indices exist, each describing a physicochemical feature of the protein [10]. Future studies are needed to identify an alternative amino acid index, or set of indices, capable of better representing HA proteins and delivering even more reliable results. As the recent events of the COVID-19 pandemic have shown, a computational tool capable of identifying novel and potentially dangerous viruses in the wild that have acquired the capability to infect Human hosts will be crucial for predicting and controlling future outbreaks.
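The evaluation metrics named in the results (total accuracy, MCC and F1) can be computed from a three-class confusion matrix as sketched below. The text does not say how MCC and F1 were generalised to three classes, so this sketch assumes the multiclass generalisation of MCC and a macro-averaged F1; the function names are hypothetical and the labels are toy data, not results from the paper.

```python
import math


def confusion_matrix(y_true, y_pred, n_classes=3):
    """cm[t][p] counts samples of true class t predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def accuracy(cm):
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp
        fn = sum(cm[k]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


def multiclass_mcc(cm):
    """Multiclass generalisation of the Matthews correlation coefficient."""
    n = len(cm)
    s = sum(sum(row) for row in cm)                               # samples
    c = sum(cm[k][k] for k in range(n))                           # correct
    t = [sum(cm[k]) for k in range(n)]                            # true counts
    p = [sum(cm[i][k] for i in range(n)) for k in range(n)]       # predicted counts
    num = c * s - sum(pk * tk for pk, tk in zip(p, t))
    den = math.sqrt((s * s - sum(pk * pk for pk in p)) *
                    (s * s - sum(tk * tk for tk in t)))
    return num / den if den else 0.0


# Toy labels: 0 = human, 1 = avian, 2 = swine (class order is an assumption).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred)
print(accuracy(cm), macro_f1(cm), multiclass_mcc(cm))
```

In a repeated-split protocol such as the one described, these three values would be computed on the test set of each run and then averaged across runs.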
[1] The socio-economic implications of the coronavirus and COVID-19 pandemic: a review
[2] Epidemiology and pathogenesis of influenza
[3] Science forum: improving pandemic influenza risk assessment
[4] Influenza-who cares
[5] New world bats harbor diverse influenza A viruses
[6] Prediction of influenza A virus infections in humans using an artificial neural network learning approach
[7] Complex resonant recognition model in analysing influenza A virus subtype protein sequences
[8] Fuzzy rules for describing subgroups from influenza A virus using a multiobjective evolutionary algorithm
[9] Effects of windowing and zero-padding on complex resonant recognition model for protein sequence analysis
[10] CISAPS: complex informational spectrum for the analysis of protein sequences
[11] Structural classification of protein sequences based on signal processing and support vector machines
[12] Influenza research database: an integrated bioinformatics resource for influenza virus research
[13] Structural prediction of membrane-bound proteins
[14] Convolutional neural network architectures for predicting DNA-protein binding
[15] Protein secondary structure prediction using deep convolutional neural fields
[16] Identification of protein lysine crotonylation sites by a deep learning framework with convolutional neural networks
[17] Deep learning
[18] Rectifier nonlinearities improve neural network acoustic models
[19] Adam: a method for stochastic optimization
[20] Comparison of the predicted and observed secondary structure of T4 phage lysozyme
[21] The truth of the F-measure
[22] Classification of host origin in influenza A virus by transferring protein sequences into numerical feature vectors