key: cord-0889507-13ir7swr authors: Touati, Rabeb; Haddad-Boubaker, Sondes; Ferchichi, Imen; Messaoudi, Imen; Ouesleti, Afef Elloumi; Triki, Henda; Lachiri, Zied; Kharrat, Maher title: Comparative genomic signature representations of the emerging COVID-19 coronavirus and other coronaviruses: High identity and possible recombination between Bat and Pangolin coronaviruses date: 2020-07-06 journal: Genomics DOI: 10.1016/j.ygeno.2020.07.003 sha: e6ad7e975c1db0ded4d6e6409243ab8001b9adc0 doc_id: 889507 cord_uid: 13ir7swr

Abstract

Coronaviruses are responsible for respiratory diseases in animals and humans. The combination of numerical encoding techniques and digital signal processing methods is becoming increasingly important in handling large genomic data. In this paper, we propose to analyze the SARS-CoV-2 genomic signature using a combination of different nucleotide representations and signal processing tools, with the aim of identifying its genetic origin. The sequence of SARS-CoV-2 was compared with 21 relevant sequences including bat, yak and pangolin coronavirus sequences. In addition, we developed a new algorithm to locate the nucleotide modifications. The results show that the bat and pangolin coronaviruses were the most closely related to SARS-CoV-2, with 96% and 86% identity, respectively, along the whole genome. Within the S gene sequence, the pangolin sequence locally presents the highest nucleotide identity. These findings suggest the genesis of SARS-CoV-2 through evolution from bat and pangolin strains. This study offers new ways to automatically characterize viruses.

In December 2019, an unidentified human pneumonia emerged from a local fresh seafood market in Wuhan. The emerging virus was later identified as a coronavirus called SARS-CoV-2, responsible for Coronavirus Disease 2019 (COVID-19) [1] [2]. It has spread to more than 216 countries around the world, causing a pandemic [3].
It is considered the third major zoonotic human coronavirus outbreak of this century, after the two earlier outbreaks caused by severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV) [4] [5]. Although COVID-19 had a death rate of 2.8% as of April, the 8,844,171 confirmed cases and 465,460 confirmed deaths recorded in a few months (December 8, 2019 to June 22, 2020) are alarming. Indeed, this virus is highly contagious: the number of infected people can double in less than seven days, with a basic reproductive number (R0) of 2.2-2.7 [6].

Phylogenetic studies indicate a complex evolutionary history of coronaviruses, originating from an ancient common ancestor (around 10,000 years ago) [8]. These viruses present high rates of mutation and recombination that enable great plasticity and highly dynamic genome evolution. The mutation rate of coronaviruses varies from 1/1,000 to 1/10,000 nucleotides during replication by the RNA-dependent RNA polymerase (RdRP) [9]. Coronaviruses are also known to use template switching.

Scientists are still trying to understand how SARS-CoV-2 emerged and infected humans, and different hypotheses have been proposed. Recently, analysis of the whole genomes of two viruses (HKU-SZ-002a and HKU-SZ-005b) confirmed that SARS-CoV-2 belongs to lineage B (Sarbecovirus) of Betacoronavirus and demonstrated the existence of a novel coronavirus genetically closer to the bat SARS-like coronaviruses bat-SL-CoVZXC21 (MG772934) and bat-SL-CoVZC45 (MG772933) [10] [11]. Another study [12], using phylogenetic and similarity plot analysis, reported that SARS-CoV-2 has the highest similarity (96.3%) with the bat coronavirus RaTG13 along the whole genome. Others suggest that human SARS-CoV-2 could also have evolved from yak coronaviruses [13] or through recombination with pangolin coronaviruses [14] [15].
The organism's genomic signature is an important graphical representation that helps in understanding intragenic variations [16] [17] [18], especially when handling large genomic data. This approach is based on the Chaos Game Representation (CGR), derived from the chaos theory of Jeffrey et al. (1990), and is considered a mapping method for large genome sequences [19]. In this study, we try to identify the genetic origin of SARS-CoV-2 using a combination of different DNA representations and signal processing tools. We compared the full genome sequence of SARS-CoV-2 to relevant viral genomes: sequences of the four Orthocoronavirinae genera, the 15 species included in the Betacoronavirus genus, and also yak and pangolin coronaviruses.

II [19]. This value is chosen according to previous research [16] [17] [19]. The proposed CGR steps are shown in Algorithm 1.

The numerical representation of genomes using coding methods is an important step to visualize and characterize the hidden information they may contain, especially in this case, where the nucleotide sequences do not share any continuity or homology. Different coding techniques exist: binary [20], structural bending trinucleotide (PNUC) [21], electron-ion interaction pseudo-potential (EIIP) mapping [22], FCGS [23] [24] [25], and so on. In addition, several signal processing techniques have been applied successfully to detect relationships between sequences and to detect biological repetitive sequences. In this paper we use the Frequency Chaos Game Signal (FCGS) as the coding technique (first step), and the Smoothed Discrete Fourier Transform (SDFT) and the wavelet transform as analysis techniques (second step). Nucleotide sequences from our database, extracted from the NCBI platform, are converted into numerical sequences (1-D signals) before processing. The 1-D signals are generated by applying the FCGS of order 2 (FCGS2).
This type of coding technique is based on the probability of occurrence of two successive nucleotides in an input sequence [23] [24] [25]. The probability ($P_{2\_nuc}$) of a given pair of successive nucleotides in a sequence of $L$ nucleotides is as follows:

$$P_{2\_nuc} = \frac{N_{2\_nuc}}{L} \tag{1}$$

where $N_{2\_nuc}$ represents the number of occurrences of the two successive nucleotides in the sequence and $L$ represents the length of the sequence in base pairs. After that, at position $k$, the oligomer $i$, which consists of 2 nucleotides, is replaced by the corresponding occurrence probability:

$$FCGS_2[k] = P_{2\_nuc}(i)$$

We apply the SDFT to the nucleotide signals. It is based on:
- Dividing the nucleotide signal into portions with an overlap.
- Dividing each portion into frames by multiplication with a sliding window.

To obtain the most accurate smoothed spectrum, the Blackman window was chosen as the window type. In addition, we can follow the evolution of the instantaneous frequencies by considering the 2-D spectrogram representation. The resulting matrix contains the time-frequency information corresponding to the studied sequence. This is an efficient representation to visualize the evolution of periodicities along a nucleotide sequence.

The time-frequency nucleotide image with three color channels (Red, Green, Blue) is the best way to visualize the different patterns that can differentiate the sequences. If the pixel luminance changes between two nucleotide images, we can determine the modified patterns. To reach this goal, we chose the wavelet transform as the analysis technique that turns the nucleotide signal into a nucleotide image. Our choice is based on the performance of the wavelet transform, which has been used in particular on data from the biological domain [22] [24] [26] [27].
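The FCGS2 coding step described above can be illustrated with a minimal sketch. It assumes overlapping dinucleotides and uses the sequence length L as the denominator, as in Eq. (1); the function name `fcgs2_signal` is hypothetical and is not taken from the authors' code.

```python
from collections import Counter


def fcgs2_signal(seq):
    """Map a nucleotide sequence to an FCGS order-2 signal (illustrative sketch).

    Each position k is replaced by the occurrence probability of the
    dinucleotide starting at k. Assumption: the probability is the count
    of that dinucleotide divided by the sequence length L.
    """
    seq = seq.upper()
    # Overlapping dinucleotides: positions 0 .. L-2
    dinucs = [seq[k:k + 2] for k in range(len(seq) - 1)]
    counts = Counter(dinucs)
    length = len(seq)
    probs = {d: n / length for d, n in counts.items()}
    return [probs[d] for d in dinucs]


if __name__ == "__main__":
    # Dinucleotides of "ACGTAC": AC, CG, GT, TA, AC
    print(fcgs2_signal("ACGTAC"))
```

Ambiguous bases such as N are simply treated as additional symbols here; a real pipeline would need an explicit policy for them.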
In this paper, the complex Morlet wavelet was used for the continuous wavelet transform (CWT), as it is the best wavelet type in terms of localization in the time and frequency domains. The principle of the method consists of decomposing a nucleotide signal into a sum of basic functions called wavelets. These wavelets are derived from the mother wavelet by dilation and translation operations. Applied to a given signal, this wavelet analysis takes into account both time and frequency variations. Unlike the mother wavelet ψ(t), which only has a time variation parameter, the daughter wavelet depends on time (a) and scale (b) parameters. In the corresponding expressions, * indicates the complex conjugate, and the number of oscillations must be greater than 5 (admissibility condition) [28] [29] [30].

The CWT of a DNA signal is a matrix which contains the continuous wavelet coefficients. The DNA scalogram (2D) is a representation of the modulus of these coefficients.

After obtaining the matrix of each sequence, for example SARS-CoV-2 and RaTG13, we apply a new algorithm that detects the nucleotide variations between these sequences. The algorithm is based on computing the correlation value between two matrices corresponding to two different genomes with a variable window. Fig. 2 shows a flowchart of the methodology used to extract similar nucleotide sequences in two genomes, for example SARS-CoV-2 and betacoronavirus RaTG13. The aim is to find the similar sequences and the modified nucleotides in two genomes by computing the correlation values and, when the correlation is not equal to 1, shifting by one position to the next base pair until similar matrices are obtained.

Figure 2: Flowchart diagram of the adopted localization methodology to extract similar nucleotide sequences between two genomes

To detect recombination events, complete genomic sequences of SARS-CoV-2 and other coronaviruses were investigated. Sequences were first aligned using the Clustal X program and then analyzed with the Simplot program [31]. The default settings were used: window size = 200, step size = 20, replicates used = 100, gap stripping = "on", distance model = "Kimura", tree model = "Neighbor Joining".

III. RESULTS

The application of the SDFT spectral analysis to the genomic signal gives us the opportunity to detect any latent or hidden periodic signal in the original sequences. Here, the idea is to characterize each genome, independently of its length, with a specific spectrum (1D signature) and spectrogram (2D signature), which indicate the regions of variation, if any, between two or more genomes. Exploring the latent periodicities of whole genomes using the SDFT method can play a key role in homology detection between these classes of viruses. For further investigation, we visualize the sequences in 1D spectrum and 2D spectrogram representations by applying the SDFT to the FCGS2 signals. These representations reflect the time-frequency signature of each sequence, which may differentiate each genome or indicate the similarities between genomes by highlighting their periodicities. In this work, the "Blackman" window was chosen as the window type, and a frame length, a shift index and a sub-frame length were set for the analysis. The major effect of windowing is to convert discontinuities in the frequency response into transition bands between the values on either side of the discontinuity. The spectral representation (1D) of the different genomes and the 2-D spectrum representation are presented in Figure 6 and supplementary figure 2.
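As a rough illustration of the windowed spectral analysis described above, the sketch below computes a power spectrogram by splitting a 1-D signal into overlapping frames, multiplying each frame by a Blackman window and taking its DFT. It is a generic short-time Fourier analysis, not the authors' SDFT implementation: the function name `blackman_spectrogram` and the values `frame_len=256` and `hop=64` are assumptions chosen only for the example.

```python
import numpy as np


def blackman_spectrogram(signal, frame_len=256, hop=64):
    """Power spectrogram of a 1-D signal using Blackman-windowed frames.

    frame_len and hop are illustrative values, not those used in the paper.
    Returns an array of shape (n_frames, frame_len // 2 + 1).
    """
    x = np.asarray(signal, dtype=float)
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.blackman(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Each row is one windowed frame of the signal
    frames = np.stack(
        [x[i * hop:i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One-sided DFT of every frame; squared modulus gives the power
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


if __name__ == "__main__":
    n = np.arange(1024)
    # A pure period-4 component should peak at bin frame_len / 4
    spec = blackman_spectrogram(np.cos(2 * np.pi * n / 4))
    print(spec.shape, spec.argmax(axis=1))
```

Averaging the rows of the returned matrix gives a smoothed 1-D spectrum, while the full matrix plays the role of the 2-D time-frequency signature.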
The superposed spectra are presented in Figure 6.

For further interpretation, we can use Neighbor Joining (NJ) clustering, which is an alternative method for hierarchical cluster analysis [32]. Here, we draw phylogenetic trees of SARS-CoV-2 and the other investigated viruses using the spectra correlation values calculated between SARS-CoV-2 and the other viruses (B to T), given in supplementary figure 3. The dendrogram in Figure 7 was produced using the PAST (PAleontological STatistics) program, version 3.23 [33]. It shows the degree of homology of the viruses to SARS-CoV-2, generated from phylogenetic relationships using the spectra from the SDFT method.

The combination of bioinformatics and signal processing tools applied to genetic sequences shows that SARS-CoV-2 is highly related to the betacoronavirus RaTG13 and to the pangolin coronavirus isolate MP789, with 96% and 86% identity, respectively, along the whole genome. This global result is similar to the result obtained with the BLAST platform. To refine these results, we analyzed the nucleotide modification positions of both viruses in comparison with the SARS-CoV-2 genome, using a new algorithm based on the scalogram images resulting from the wavelet transform applied to the genomic signal (Table 2). We can clearly see the large number of modifications in the S gene: 271 (7.09%) for the RaTG13 genome and 323 (8.45%) for the pangolin coronavirus genome. Comparing the scalogram images of the S gene of SARS-CoV-2 with those of the RaTG13 and pangolin coronavirus genomes, we find that the homology changes depending on the selected region of this gene. For this reason, we applied our analysis algorithm to the regions where the similarities in the alignment are strong (Table 3). Between nucleotide positions [1251-1600], the pangolin sequence showed fewer nucleotide modifications than RaTG13. These results suggest a possible recombination event in the S gene.
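The scalogram comparison behind these counts can be sketched as follows, under loudly stated assumptions: a complex Morlet wavelet with a central frequency `omega0 = 6`, a small set of scales, and fixed, non-overlapping windows of 20 positions. The helper names `morlet_scalogram` and `window_correlations` are hypothetical, and the authors' procedure (variable window, one-base shifts when the correlation differs from 1) is not reproduced here.

```python
import numpy as np


def morlet_scalogram(signal, scales, omega0=6.0):
    """Modulus of a complex Morlet CWT, computed by direct convolution.

    Assumes the signal is longer than the longest wavelet kernel.
    Returns an array of shape (len(scales), len(signal)).
    """
    x = np.asarray(signal, dtype=float)
    rows = []
    for a in scales:
        half = int(np.ceil(4 * a))  # truncate the Gaussian envelope at 4 sigma
        t = np.arange(-half, half + 1) / a
        psi = np.pi ** -0.25 * np.exp(1j * omega0 * t) * np.exp(-t ** 2 / 2)
        # Correlation with conj(psi) equals convolution with its time reversal
        kernel = np.conj(psi)[::-1] / np.sqrt(a)
        rows.append(np.abs(np.convolve(x, kernel, mode="same")))
    return np.array(rows)


def window_correlations(scal_a, scal_b, width=20):
    """Pearson correlation of two scalograms over consecutive windows."""
    n = min(scal_a.shape[1], scal_b.shape[1])
    out = []
    for start in range(0, n - width + 1, width):
        a = scal_a[:, start:start + width].ravel()
        b = scal_b[:, start:start + width].ravel()
        out.append((start, float(np.corrcoef(a, b)[0, 1])))
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random(400)          # stand-in for a coded genome signal
    y = x.copy()
    y[300:320] = rng.random(20)  # simulate a locally modified region
    scales = [2.0, 4.0, 8.0]
    pairs = window_correlations(
        morlet_scalogram(x, scales), morlet_scalogram(y, scales)
    )
    for start, r in pairs:
        print(start, round(r, 4))
```

Windows far from the modified region keep a correlation of essentially 1, while windows overlapping it drop, which is the kind of signal a localization step can use to flag candidate modification sites.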
Our comparison between SARS-CoV-2 and CoV-RaTG13 along the genome showed 1,173 nucleotide modifications, distributed as shown in Figure 9. The minimum mutation ratio is for the nucleotide G that becomes C with 1.22% ratio, corresponding to numbers of 6 mutations. The minimum mutation ratio is for the nucleotide T that becomes C with 1.22% ratio. The global results are presented in the "supplementary results" file.

Figure 9: Distribution of the mutation ratios between the SARS-CoV-2 and betacoronavirus RaTG13 genomes found using our methods

To confirm the occurrence of recombination between the RaTG13 and pangolin MP789 sequences, we performed a Simplot analysis (Figure 10).

The methods used are alignment-free, requiring only a sequence database and knowledge of bioinformatics and signal processing [16] [17] [18]. These methods transform every nucleotide sequence into numerical form. The CGR technique was used because it has a remarkable capacity to differentiate between genetic sequences belonging to different species [17].

This study shows that, based on genomic signature representations, we can easily perform automatic homology identification between genomes. These representations thus allow rapid genomic characterization when time is short, as in the critical period during the outbreak of this novel virus. This makes it possible to rapidly mobilize appropriate methods for diagnosis and medical care specific to the identified viruses. It is also crucial for the choice of appropriate preventive methods and contingency plans. In the future, alignment-free methods can also be advanced and adapted to specific biological tasks and to the needs of biologists in different fields.

Overall, the signal processing tools highlight the higher homology of SARS-CoV-2 with other betacoronaviruses than with other families, and more precisely with the CoV-RaTG13 and pangolin coronavirus isolate MP789 genomes.
However, the yak coronavirus strain YAK/HY24/CH/2017 seems to be different [13]. These results show evolutionary evidence that SARS-CoV-2 derives from bat and pangolin coronaviruses as a consequence of the accumulation of mutations and of the acquisition of new genomic regions through recombination events. Previous studies investigated the evolutionary history of the COVID-19 virus [11] [12] [13] [14] [15] [34] [35]. They found that three bat SARS-like coronaviruses, betacoronavirus RaTG13, bat-SL-CoVZC45 and bat-SL-CoVZXC21, were closely related to SARS-CoV-2 [35]. Nevertheless, it was not clear until now whether or not the COVID-19 virus arose from a recombination event between those viruses [12]. Other scientists suggested the possible genesis of the COVID-19 virus through recombination between bat [10] [11] and pangolin coronaviruses [14] [15].

A pneumonia outbreak associated with a new coronavirus of probable bat origin
Molecular conservation and Differential mutation on ORF3a gene in Indian SARS-CoV2 genomes
Molecular epidemiology, evolution and phylogeny of SARS coronavirus
Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia
High Contagiousness and Rapid Spread of Severe Acute Respiratory Syndrome Coronavirus 2. Emerging infectious diseases
Ninth report of the International Committee on Taxonomy of Viruses
Evolutionary insights into the ecology of coronaviruses
Coronavirus diversity, phylogeny and interspecies jumping
Bats and coronaviruses
A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster
Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event. Infection
SARS-CoV-2: Structural diversity, phylogeny, and potential animal host identification of spike glycoprotein
Emergence of SARS-CoV-2 through recombination and strong purifying selection
Identifying SARS-CoV-2-related coronaviruses in Malayan pangolins
Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison
New Intraclass Helitrons Classification Using DNA-Image Sequences and Machine Learning Approaches
MicroRNA categorization using sequence motifs and k-mers
Chaos game representation of gene structure
3D spectrum analysis of DNA sequence: application to Caenorhabditis elegans genome
Spectral analysis of global behaviour of C. elegans chromosomes
Dwt based cancer identification using EIIP
The Helitron family classification using SVM based on Fourier transform features applied on an unbalanced dataset
Distinguishing between intragenomic helitron families using time-frequency features and random forest approaches
Classification of intra-genomic helitrons based on features extracted from different orders of FCGS
A new numerical approach for DNA representation using modified Gabor wavelet transform for the identification of protein coding regions
Visualization of DNA methylation results through a GPU-based parallelization of the wavelet transform
Decomposition of Hardy functions into square integrable wavelets of constant shape
Wavelet theory and applications: a literature study. DCT rapporten
The continuous wavelet transform and variable resolution time-frequency analysis
Full-length human immunodeficiency virus type 1 genomes from subtype C-infected seroconverters in India, with evidence of intersubtype recombination
The neighbor-joining method: a new method for reconstructing phylogenetic trees
PAST: paleontological statistics software package for education and data analysis
Genomic and protein structure modelling analysis depicts the origin and infectivity of 2019-nCoV
Discovery of a rich gene pool of bat SARS-related coronaviruses provides new insights into the origin of SARS coronavirus

Touati: Conceptualization, Investigation, Methodology, Writing - original draft, Writing - review & editing
Sondes Haddad-Boubaker: Conceptualization, Investigation, Methodology, Writing - original draft, Writing - review & editing
Imen Ferchichi: Review & editing, biological evaluation of the results
Imen Messaoudi: Data curation
Afef Elloumi: Validation, Formal analysis
Triki: Biological evaluation of the results, review and editing
Zied Lachiri: Validation of this article
Kharrat: Conceptualization, Supervision, biological evaluation of the results, review and editing