key: cord-0695502-k3nj0pc6
authors: Pau, Tirthankar; Vainio, Seppo; Roning, Juha
title: Clustering and classification of virus sequence through music communication protocol and wavelet transform
date: 2020-10-16
journal: Genomics
DOI: 10.1016/j.ygeno.2020.10.009
sha: f9874d8c94b15a62cb050ef7f0a27b6d840f3816
doc_id: 695502
cord_uid: k3nj0pc6

The coronavirus pandemic has become a major risk to global public health. The outbreak is caused by SARS-CoV-2, a member of the coronavirus family. Although images of the virus are familiar to us, in the present study an attempt is made to hear the coronavirus by translating its spike protein into audio sequences. Musical features such as pitch, timbre, volume and duration are mapped from the coronavirus protein sequence. Three different viruses, Influenza, Ebola and Coronavirus, were studied and compared through their auditory sequences by applying the Haar wavelet transform. Sonification of the coronavirus helps in understanding the protein structures by enhancing hidden features. Further, it clearly distinguishes the representation of the coronavirus from that of other viruses, which will help in various research works related to virus sequences. This emerges as a simplified and novel alternative to the conventional computational methods.

There are a few studies based on protein music, in which a protein was translated into music. One study showed that the secondary structure of proteins could be converted into a musical sequence [19]. DNA music was also formed by mapping amino acids to different pitches: the 20 amino acids are assigned to 20 different notes, from which protein music can be created [20]. Such a musical transformation could identify differences in protein folding and amino acids that help to explain how cell processes are regulated [21].
DNA music can also be created from the short tandem repeats (STR) in a CODIS region, which makes the music unique to an individual [22]. The STR sequence and the STR frequency data were converted into musical elements and rendered in Musical Instrument Digital Interface (MIDI) format to make the music melodious. MIDI is a protocol commonly used, mostly in the music industry, for communication between a musical device and a computer. It was initially developed to create polyphonic music [23]. A musical translation of protein folds can also be a useful method for the comparative study of different genome sequences [24]. Musical mapping is not limited to biological data: mobile keystroke data have also been mapped into music to secure credentials [25]. Recently, a musical work on COVID-19 was created by translating protein into music at the Massachusetts Institute of Technology, USA [26].

Musical notes can be presented in several ways. The most common ways to present music in a musical scale are the octave scale, diatonic scale, tone scale and chromatic scale [15, 20, 21, 27, 28]. Our study presents an auditory representation of viruses. Sometimes the audio sequence sounds like music, but no musical scale was followed for the sonification. The audio sequence of the virus was therefore rendered as a piano sound [22, 29].

Biological studies demand high-performance data analysis. Manual analysis of genomes or long nucleotide sequences is not efficient for classification or clustering, so machine algorithms are preferred. Clustering is one such method and plays a very important role in bioinformatics studies. The most common way to show clustering among genomic sequences is phylogenetic analysis. An algorithm based on the occurrence and position of k-tuples in DNA sequences was introduced for phylogenetic clustering [30].
A similar method, the Accumulated Natural Vector (ANV), transforms a DNA sequence into eighteen data points, including the covariance and distribution of nucleotides [31]. ANV is an advanced version of the Natural Vector (NV) method, which translates the DNA into twelve points; ANV is claimed to be the most accurate method of clustering [31, 32]. Genomic datasets are large-scale collections of protein or nucleotide sequences. The complexity increases when a large set of N items has to be clustered into a large number K of clusters. The Linclust algorithm reduces the time complexity of clustering large datasets [33]. Another algorithm, MeShClust, was introduced for the classification of DNA sequences based on the mean shift algorithm from image processing [34]. A sequence comparison using the discrete wavelet transform (DWT) was performed by extracting the k-mers from the genome sequences. The k-mers were mapped and transformed by the DWT to obtain a numeric feature vector for clustering [35]. Liu et al. used a Haar wavelet filtering method to decompose sequences for detecting cancerous genomes [36]. The authors extracted statistical data from cancerous and non-cancerous genomes and classified them via machine learning [36]. The Haar wavelet is also capable of identifying short tandem repeats (STR) in a DNA sequence [37]. Another standard tool is the k-means algorithm, an easy-to-apply clustering algorithm for genome sequences [38, 39]. The above clustering algorithms and their significance are summarized in Table 1.

Data can also be classified using a support vector machine (SVM). Nucleotide sequences of different species were classified with an SVM model and partitioned by hyperplanes in Euclidean space [40]. An SVM model was applied to virus genome sequences and achieved a low mean error rate in classifying them [41].
Numerical representations of DNA-binding protein sequences were used to predict and classify the sequences with an SVM classifier based on protein properties and feature transformation methods [42].

Table 1
Ref.   Algorithm                          Significance                                                      Year
[32]   Natural Vector (NV)                DNA sequences into vectors of twelve statistical points           2011
[30]   mBKM with DMk                      The occurrence and position of k-tuples of DNA sequences          2012
[33]   Linclust                           Reduces the time complexity for large datasets                    2018
[34]   MeShClust                          Clustering method based on shift algorithm of image processing    2018
[35]   Haar wavelet filtering             Detecting cancerous and non-cancerous genomes                     2018
[31]   Accumulated Natural Vector (ANV)

In this study, we created auditory representations of coronaviruses, Influenza and Ebola virus. Our assumption is that the sound pattern could help researchers to find a virus protein by searching for an acoustic sequence. It is natural for human beings to be attracted to music. Many people in our society have a good knowledge of music, and the human brain has excellent power for analysing sound. Our minds can identify sound features such as pitch, timbre, volume, rhythm and melody. Translating a genome sequence into an audio sequence could open a new door to hidden features of genome data in the form of sound. The outline of the work is shown in Figure 1.

The RNA sequences of SARS-CoV-2 are available in the NCBI database [43]. The genome sequence of the virus is deposited in GenBank under accession number MN908947 [10]. The sequence of another member of the family, SARS-CoV, is deposited in GenBank under accession number AY278741 and has 29727 nucleotide bases [44]. Similarly, the Middle East respiratory syndrome (MERS) coronavirus can be found in GenBank under accession number KT006149 [45]. In this paper, the musical conversion was performed in the Matlab environment.
The script that maps the RNA sequence into music was written in Matlab. A MIDI toolbox was installed in Matlab to translate nucleotide data into musical elements [29]. Algorithm 1 is designed to a) download the RNA sequence, b) count the proteins present in the sequence and c) perform the MIDI musical mapping. The methodology is adopted from the authors' previous publication [22].

First, the nucleotide sequence was downloaded from the open-source NCBI database. The sequence was then converted into numerical and protein sequences for further processing in Matlab. A k-mer analysis was performed and the amino acids were mapped into music; the k-mer size is therefore three bases, i.e. one codon. A codon or amino acid may appear once or many times in the sequence. Each virus sequence has a distinct set of proteins (amino acids), and the number of each in the sequence differs with the type of virus. Each protein was coded according to its number of occurrences: it is replaced by a number that indicates how many times that particular protein is present in the sequence. The sequence length and the presence of each protein play a vital role in the numerical mapping of the protein sequence, and the later musical transformation makes a noticeable difference in the magnitude of the sequence. For example, in the protein sequence MADADDAAA the total number of 'M' is 1, of 'A' is 5 and of 'D' is 3, so the coded sequence is 153533555.

Music has several elements, such as pitch, volume, timbre, duration, form and texture. Here, pitch and duration were coded based on the physical presence of the amino acids, while volume, form and timbre were modulated according to the sequence. The communication message is sent from the computer to the musical devices in a data format known as the MIDI matrix [23]. This matrix has N × 6 elements, where N is the number of notes.
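The occurrence-based coding from the MADADDAAA example can be sketched in a few lines. This is a minimal Python illustration, not the authors' Matlab script; the function name `encode_by_occurrence` is hypothetical, and the input is assumed to be an already translated one-letter amino acid string.

```python
from collections import Counter


def encode_by_occurrence(protein: str) -> list[int]:
    """Replace each residue by the number of times it occurs in the whole sequence."""
    counts = Counter(protein)  # e.g. {'A': 5, 'D': 3, 'M': 1} for "MADADDAAA"
    return [counts[residue] for residue in protein]


coded = encode_by_occurrence("MADADDAAA")
coded_string = "".join(str(value) for value in coded)
print(coded_string)  # 153533555
```

Because every code depends on the counts over the whole sequence, a single substitution changes the code of every other occurrence of the two residues involved, which is why small sequence differences produce large differences in the coded signal.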
In the MIDI matrix, each of the N rows represents a note event, and the 6 columns define the features of that event: track number, MIDI channel, note value (MIDI pitch), volume, note start time and note end time. Here, the track and channel numbers for piano were set to 1. The volume was fixed at a constant 75 for all N note events. The third column holds the MIDI pitch, which is derived from the coded frequency that was mapped from the nucleotide sequence. Owing to the scaling factor, a small difference in coded frequency makes a prominent difference in MIDI pitch. The MIDI pitch conversion is given in Equation 1. The fifth column is the start time and the sixth is the end time of the note; together, the two columns represent the duration of the note. Music can be created from such a set of data, in the form of a note matrix, in Matlab with Ken Schutte's MIDI toolbox [29]. The MIDI matrix is defined in Equation 2.

$$ M = [\, \text{track} \;\; \text{channel} \;\; \text{pitch} \;\; \text{volume} \;\; \text{start time} \;\; \text{end time} \,] \qquad (2) $$

As described earlier, track = channel = 1 and volume = 75, while the pitch and the duration are mapped from the protein sequences, where duration = |end time - start time|. The matrix was processed through the matrix2midi module to generate the audio file of the protein sequences [29].

The pitch values from the MIDI file were used to find similarities among the virus sequences. For this purpose, Euclidean distances were measured to compare the audio signals. The signals must have the same length for the distance to be computed; however, sequences from different species rarely have the same length. In this study, three groups of viruses, Influenza, Ebola and Coronavirus, were examined. A total of 56 virus genome sequences (Influenza, Ebola and Coronavirus) were used, all downloaded from the NCBI database.
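The six-column note matrix and the pitch-based Euclidean distance can be illustrated as follows. This is a hedged Python sketch rather than the Matlab toolbox code: the helper names `note_matrix` and `euclidean_distance` are hypothetical, the pitch values are made up for illustration (the pitch conversion of Equation 1 is not reproduced), and a fixed note length is assumed, whereas the study derives durations from the sequence.

```python
import math


def note_matrix(pitches, note_length=0.5, volume=75):
    """Build an N x 6 note matrix: track, channel, pitch, volume, start time, end time."""
    rows = []
    for i, pitch in enumerate(pitches):
        start = i * note_length  # assumption: notes played back to back
        rows.append([1, 1, pitch, volume, start, start + note_length])
    return rows


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical pitch sequences standing in for two translated viruses.
matrix_a = note_matrix([60, 64, 62, 64])
matrix_b = note_matrix([60, 62, 62, 67])

# Compare only the pitch column (third column) of the two note matrices.
pitch_a = [row[2] for row in matrix_a]
pitch_b = [row[2] for row in matrix_b]
distance = euclidean_distance(pitch_a, pitch_b)
print(distance)
```

The explicit length check mirrors the constraint stated in the text: the distance is only defined for signals of equal length, which is why a different comparison is needed for viruses from different families.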
These three groups of viruses have different sequences and lengths, so their auditory translations were filtered through a discrete wavelet transform (DWT). The DWT used here, the Haar wavelet, is fast to compute, is a reversible lossless transform and, most importantly, is memory efficient for comparing and detecting genome sequences, as shown in previous studies [36, 37]. The Haar wavelet coefficients provide the low- and high-frequency content, as well as location information, of the signal in the form of approximation coefficients and detail coefficients, respectively [37]. The approximation and detail coefficients are given by Equations 3 and 4.

$$ y_{\mathrm{low}}[n] = \sum_{k} s[k]\, g[2n - k] \qquad (3) $$

$$ y_{\mathrm{high}}[n] = \sum_{k} s[k]\, h[2n - k] \qquad (4) $$

where $y_{\mathrm{low}}$ and $y_{\mathrm{high}}$ are the outputs of the low-pass and high-pass filters, respectively, s[k] is the protein sequence in numerical form, g(n) is the low-pass filter coefficient and h(n) is the high-pass filter coefficient. The mean value and standard deviation (SD) were obtained from the wavelet transformation to classify the viruses into different clusters. These statistical data, with the accession numbers of the 56 viruses, are given in the dataset section. A Bayesian optimization model was used to obtain a low cross-validation loss for the coronavirus and non-coronavirus data in Figure 2 [46]. The GenBank accession numbers of the genomes that were studied in this paper but not mentioned in the previous paragraphs are given in Table 2.

The acoustic output of a genome sequence may reveal many hidden features that cannot be seen in detail in microscopic images. The musical elements of coronavirus protein music may have a potential impact on our minds. Here, the musical conversions were played with a piano sound. The piano roll of the RNA sequence of SARS-CoV-2 is plotted in Figure 3a. Similarly, Figures 3b and 3c represent the music for SARS-CoV and MERS-CoV, respectively.

Figure 3. Piano roll plot of virus protein sequence: a. SARS-CoV-2, b. SARS-CoV, c. MERS-CoV.

The Euclidean distances were measured to show the similarities and dissimilarities among the three sequences (SARS-CoV, SARS-CoV-2 and MERS-CoV). The protein sequence distances and the music sequence distances are shown in the upper triangular matrix (marked in yellow) and the lower triangular matrix (marked in green) of Table 3, respectively. The distance matrix shows a distance of 224.0603 for the nucleotide sequence. For sequences of more than twenty-nine thousand bases, this Euclidean distance between the two virus sequences is negligible. On the other hand, the Euclidean distance increased to 1534 when the sequences were transformed into music. The same trend was found for MERS-CoV with SARS-CoV and SARS-CoV-2. The distance is much higher between the music sequences of the coronaviruses than between the protein sequences. As a result, the converted music sequences show a noticeable difference between two very similar nucleotide sequences. This distance can be measured for intra-family members, or for sequences whose lengths are almost the same, because two sequences need to be the same length for the Euclidean distance to be measured.

Influenza, Ebola and Coronavirus come from different families and have a wide range of sequence lengths. A Haar wavelet transformation was therefore applied to obtain statistical values of the sequences. The viruses were clustered on these statistical values with the K-means clustering algorithm. Figures 4 and 5 show the clustered output of the virus genome sequences and the virus audio sequences, respectively. In Figure 4, most of the virus sequences lie close to each other, and the clustering algorithm did not show significant variation among the different groups of viruses. In contrast, when the K-means algorithm was applied to the pitch value sequences of the viruses, it showed three distinct clusters of Influenza, Ebola and Coronavirus (Figure 5).
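The wavelet-plus-clustering pipeline described above, which reduces sequences of different lengths to a few statistics and then clusters them, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the study's Matlab code: a single decomposition level is used, odd-length signals are padded by repeating the last sample, the features are the mean and SD of the detail coefficients, and the k-means centroids are seeded with the first k points. The function names and the toy signals are hypothetical.

```python
import math
import statistics


def haar_level1(signal):
    """One level of the Haar DWT: returns (approximation, detail) coefficients."""
    signal = list(signal)
    if len(signal) % 2:  # assumption: pad odd lengths with the last sample
        signal.append(signal[-1])
    root2 = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    return approx, detail


def features(signal):
    """Length-independent features: mean and SD of the detail coefficients."""
    _, detail = haar_level1(signal)
    return (statistics.mean(detail), statistics.pstdev(detail))


def kmeans(points, k, iterations=100):
    """Plain Lloyd's k-means; the first k points seed the centroids (a simplification)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Toy signals of different lengths standing in for coded virus sequences.
signals = [
    [1, 2, 1, 3, 1, 2],
    [10, 1, 9, 1, 10, 2, 9, 1],
    [2, 3, 1, 3],
    [9, 1, 10, 1, 9, 2],
]
points = [features(s) for s in signals]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1]
```

Each pass of the loop computes n × k distances, so t iterations cost on the order of n·k·t distance evaluations for n feature points.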
The average detail coefficients from the Haar wavelet are plotted for the viruses in the boxplots in Figure 6 (a and b). For the genome sequences, the boxes lie at almost the same height (Figure 6a). In contrast, the box heights differ for the audio sequences of the viruses (Figure 6b). The audio sequence was created based on the physical features of the protein sequence. Therefore, a small difference in protein sequences creates a considerable change in the statistical values of the music sequences. The difference can also be seen in the classifier in Figure 7, where coronavirus and non-coronavirus audio data were classified with zero loss. In contrast, the loss for the optimized genome sequence data was 0.0377. The classification result thus improves from the genome sequences (Figure 7a) to the audio sequences (Figure 7b) of coronavirus and non-coronavirus. Therefore, the audio translation of the virus protein sequences enhances hidden features, which can be identified in the form of a sound signal.

Figure 7. Classification based on Haar wavelet coefficients: a. Genome Sequence, b. Audio Sequence.

This work suggests a way in which Influenza, Ebola and Coronavirus protein sequences can be represented as sound sequences instead of visual data. The auditory representation of the coronaviruses can help researchers to understand the protein structures in a different way. Sometimes the primary protein structures are too tiny to see, but they can be effectively heard in musical form. The virus music representation algorithm can be a beneficial tool for portraying small mutations within a family (here the coronavirus family) in the form of music. All three scenarios (among SARS-CoV, SARS-CoV-2 and MERS-CoV) show that the Euclidean distance of the musical data is much higher than that of the protein sequence data for intra-family members.
The pathogenic effect of a coronavirus may be enhanced or limited by a small mutation, which can be identified in the audio sequences of the virus. Moreover, in the inter-family scenario, the three different types of virus (Influenza, Ebola and Coronavirus) were classified through their translated audio sequences. The comparison therefore shows that a more pronounced difference is captured in the auditory representation of the spike proteins. The proposed algorithm is computationally efficient, with time complexity $O(nkt)$, where n is the sequence length, k is the number of clusters in the k-means algorithm and t is the number of iterations. The numerical mapping based on the physical presence of each protein and the length of the virus sequence played a dominant role in the audio translation, and the scaling factor made a noticeable difference in the magnitude of the audio sequence in the MIDI conversion. This algorithm will be a helpful tool for finding and classifying virus sequences into virus families and species, and for distinguishing a sequence from other members of the same family without studying it under laboratory conditions.

COVID-19: a new challenge for human beings
Epidemiological data from the COVID-19 outbreak, real-time case information
The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak
A novel numerical representation for proteins: Three-dimensional Chaos Game Representation and its Extended Natural Vector
Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison
Splice sites detection using chaos game representation and neural network
Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2, Science
Inhibition of SARS-CoV-2 (previously 2019-nCoV) infection by a highly potent pan-coronavirus fusion inhibitor targeting its spike protein that harbors a high capacity to mediate membrane fusion
A new coronavirus associated with human respiratory disease in China
A novel coronavirus from patients with pneumonia in China
A pneumonia outbreak associated with a new coronavirus of probable bat origin
An emerging coronavirus causing pneumonia outbreak in Wuhan, China: calling for developing therapeutic and prophylactic strategies
The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2
The all pervasive principle of repetitious recurrence governs not only coding sequence construction but also human endeavor in musical composition
Basically musical
All Greek to me
Peer revue
Bacterial problems
A Physiological Approach to DNA Music
Musical Synthesis of DNA Sequences
Life Music: The Sonification of Proteins
Conversion of amino-acid sequence in proteins to classical music: Search for auditory patterns
Music translation of tertiary protein structure: Auditory patterns of the protein folding
Towards personalised, DNA signature derived music via the short tandem repeats (STR)
MIDI-based controller of electrical drives
Melody discrimination and protein fold classification
Authentication by mapping keystrokes to music through melody
Music composition using genetic evolutionary algorithms
The Musical Gene: Generating Harmonic Patterns from Sequenced DNA of E. coli Bacteria to Compose Music
Visited on 13th
A novel hierarchical clustering algorithm for gene sequences
A novel approach to clustering genome sequences using internucleotide covariance
A novel method of characterizing genetic sequences: Genome space with biological distance and applications
Clustering huge protein sequence sets in linear time
MeShClust: an intelligent tool for clustering DNA sequences
SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform
Automated detection of cancerous genomic sequences using genomic signal processing and machine learning
Haar wavelet based approach for Short Tandem Repeats (STR) Detection
DNA approach to solve clustering problem based on a mutual order
A Partitional Approach for Genomic-Data Clustering Combined with K-Means Algorithm
Classification of nucleotide sequences using support vector machines
Virus genome sequence classification using features based on nucleotides, words and compression
An improved sequence based prediction protocol for DNA-binding proteins using SVM and comprehensive feature analysis
Visited on 13th
Identification of a novel coronavirus in patients with severe acute respiratory syndrome
Complete genome sequence of Middle East respiratory syndrome coronavirus (MERS-CoV) from the first imported MERS-CoV case in China
SVM kernel based on particle swarm optimized vector and Bayesian optimized SVM in atmospheric particulate matter forecasting
Part II - The Positive Sense Single Stranded RNA Viruses Family Coronaviridae
Whole-Genome Sequences of Influenza A(H1N1)pdm09 Virus Isolates from Kerala

This research work was supported by Infotech Oulu Doctoral Program.