key: cord-291523-4dtk1kyh authors: Nguyen, Thanh Thi; Abdelrazek, Mohamed; Nguyen, Dung Tien; Aryal, Sunil; Nguyen, Duc Thanh; Khatami, Amin title: Origin of Novel Coronavirus (COVID-19): A Computational Biology Study using Artificial Intelligence date: 2020-07-01 journal: bioRxiv DOI: 10.1101/2020.05.12.091397 sha: doc_id: 291523 cord_uid: 4dtk1kyh

The origin of the COVID-19 virus has been intensely debated in the scientific community since the first infected cases were detected in December 2019. The disease has caused a global pandemic, leading to the deaths of thousands of people across the world, and thus finding the origin of this novel coronavirus is important in responding to and controlling the pandemic. Recent research results suggest that bats or pangolins might be the original hosts for the virus based on comparative studies using its genomic sequences. This paper investigates the COVID-19 virus origin by using artificial intelligence (AI) and raw genomic sequences of the virus. More than 300 genome sequences of COVID-19 infected cases collected from different countries are explored and analysed using unsupervised clustering methods. The results obtained from various AI-enabled experiments using clustering algorithms demonstrate that all examined COVID-19 virus genomes belong to a cluster that also contains bat and pangolin coronavirus genomes. This provides evidence strongly supporting scientific hypotheses that bats and pangolins are probable hosts for the COVID-19 virus. At the whole genome analysis level, our findings also indicate that bats are more likely the hosts for the COVID-19 virus than pangolins.

The COVID-19 pandemic has rapidly spread across many countries and disrupted the lives of millions of people around the globe. There have been approximately 10 million confirmed cases of COVID-19 globally, including nearly 500,000 deaths, reported to the World Health Organization at the end of June 2020 [1].
Studies on understanding the virus, which was named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), are important to propose appropriate intervention strategies and contribute to the development of therapeutics and vaccines. Finding origin of the COVID-19 virus is crucial as it helps to understand where the virus comes from via its evolutionary relationships with other biological organisms and species. This will facilitate the process of identifying and isolating the source and preventing further transmissions of the pathogen to the human population. This will also help to understand the outbreak dynamics, leading to the creation of informed plans for public health responses [2] . A study by Wu et al. [3] using a complete genome obtained from a patient who was a worker at a seafood market in Wuhan city, Hubei province, China shows that the virus is closely related to a group of SARS-like CoVs that were previously found present in bats in China. It is believed that bats are the most likely reservoir hosts for the COVID-19 virus as it is very similar to a bat coronavirus. These results are supported by a separate study by Lu et al. [4] using genome sequences acquired from nine COVID-19 patients who were among early cases in Wuhan, China. Outcomes of a phylogenetic analysis suggest that the virus belongs to the genus Betacoronavirus, sub-genus Sarbecovirus, which includes many bat SARS-like CoVs and SARS CoVs. Another study in [5] confirms this finding by analysing genomes obtained from three adult patients admitted to a hospital in Wuhan on December 27, 2019. Likewise, Zhou et al. [6] advocate a probable bat origin of SARS-CoV-2 by using complete genome sequences of five patients at the beginning of the outbreak in Wuhan, China. One of these sequences shows 96.2% similarity to a genome sequence of a coronavirus, denoted RaTG13, which was previously obtained from a Rhinolophus affinis bat found in Yunnan province of China. 
Zhang and Holmes [7] also highlight a similarity of approximately 85% between SARS-CoV-2 and RaTG13 in their receptor binding domain, which is an important region of the viral genomes for binding the viruses to the human angiotensin-converting enzyme 2 receptor. In another study, Lam et al. [8] found two related lineages of CoVs in pangolin genome sequences sampled in Guangxi and Guangdong provinces in China, which have similar genomic organizations to SARS-CoV-2. That study suggests that pangolins could be possible hosts for SARS-CoV-2 although they are solitary animals in an endangered status with relatively small population sizes. These findings are corroborated by Zhang et al. [9] who assembled a pangolin CoV draft genome using a reference-guided scaffolding approach based on contigs taxonomically annotated to SARS-CoV-2, SARS-CoV, and bat SARS-like CoV. Xiao et al. [10] furthermore suggest that SARS-CoV-2 may have been formed by a recombination of a pangolin CoV-like virus with one similar to RaTG13, and pangolins are potentially the intermediate hosts for SARS-CoV-2. On the other hand, by analysing genomic features of SARS-CoV-2, i.e. mutations in the receptor binding domain portion of the spike protein and distinct backbone of the virus, Andersen et al. [11] determined that this novel coronavirus originated through natural processes rather than through a laboratory manipulation. This study presents a step further to suggesting the likely origin of the COVID-19 virus by using artificial intelligence (AI) methods to explore genome sequences obtained from more than 300 COVID-19 patients across the world. We use AI-enabled unsupervised clustering methods to demonstrate and emphasize the relationships between COVID-19 virus, bat CoVs and pangolin CoVs. 
Through analysing the results of clustering methods, we are able to suggest the sub-genus Sarbecovirus of the genus Betacoronavirus of SARS-CoV-2 and the more likely bat origin of the virus rather than a pangolin origin. We downloaded 334 complete genome sequences of SARS-CoV-2 available from the GenBank database, which is maintained by the National Center for Biotechnology Information (NCBI), in early April 2020. Among these sequences, 258 were reported from USA, 49 were from China and the rest were distributed through various countries from Asia to Europe and South America. Accession numbers and detailed distribution of these genome sequences across different countries are presented in Tables 4 and 5 in Appendix 1. Most of reference sequences, e.g. ones within the Alphacoronavirus and Betacoronavirus genera, are also downloaded from the NCBI GenBank and Virus-Host DB (https://www.genome.jp/virushostdb/) that covers NCBI Reference Sequences (RefSeq, release 99, March 2, 2020). Genome sequences of Guangxi pangolin CoVs [8] are downloaded from the GISAID database (https://www.gisaid.org) with accession numbers EPI_ISL_410538 -EPI_ISL_410543. A Guangdong pangolin CoV genome [10] is also downloaded from GISAID with accession number EPI_ISL_410721. We employ three sets of reference sequences in this study with details presented in Tables 1-3. The selection of reference genomes at different taxonomic levels is based on a study in [12] that uses the AI-based supervised decision tree method to classify novel pathogens, which include SARS-CoV-2 sequences. We aim to traverse from high to low taxonomic levels to search for the COVID-19 virus origin through discovering its genus and sub-genus taxonomy and its closest genome sequences. Unsupervised clustering methods are employed to cluster datasets comprising both query sequences (SARS-CoV-2) and reference sequences into clusters. 
In this paper, we propose the use of the hierarchical clustering algorithm [13] and the density-based spatial clustering of applications with noise (DBSCAN) method [14] for this purpose. With these two methods, we perform two steps to observe the clustering results that lead to interpretations about the taxonomy and origin of SARS-CoV-2. In the first step, we apply the clustering algorithms to the set of reference sequences only, and then use the same settings (i.e. parameter values) to cluster a dataset that merges reference sequences and SARS-CoV-2 sequences. Through this step, we can identify the reference sequences with which the SARS-CoV-2 sequences form a group. In the second step, we vary the settings of the clustering algorithms and observe changes in the clustering outcomes. With the second step, we are able to discover the closest reference sequences to the SARS-CoV-2 sequences and compare the similarities between genomes. In the hierarchical clustering method, the cut-off parameter C acts as a threshold in defining clusters and thus C is allowed to change during our experiments. With regard to the DBSCAN method, the neighbourhood search radius parameter ε and the minimum number of neighbours parameter, which is required to identify a core point, are crucial in partitioning observations into clusters. In our experiments, we set the minimum number of neighbours to 3 and allow only the search radius parameter ε to vary. Outputs of the DBSCAN method may also include outliers, which are normally labelled as cluster "-1". To facilitate the execution of the clustering methods, we propose the use of pairwise distances between sequences based on the Jukes-Cantor method [15] and the maximum composite likelihood method [16]. The Jukes-Cantor method estimates evolutionary distances by the maximum likelihood approach using the number of substitutions between two sequences.
With nucleotide sequences, the distance is defined as d = −(3/4) ln(1 − (4/3) p), where p is the proportion of differing positions, i.e. the number of positions at which the two sequences carry different nucleotides divided by the number of positions in the sequences. On the other hand, the maximum composite likelihood method considers the sum of log-likelihoods of all pairwise distances in a distance matrix as a composite likelihood because these distances are correlated owing to their phylogenetic relationships. Tamura et al. [16] showed that estimates of pairwise distances and their related substitution parameters such as those of the Tamura-Nei model [17] can be obtained accurately and efficiently by maximizing this composite likelihood. The unweighted pair group method with arithmetic mean (UPGMA) is applied to create hierarchical cluster trees, which are used to construct dendrogram plots for the hierarchical clustering method. The UPGMA method is also employed to generate phylogenetic trees in order to show results of the DBSCAN algorithm. We start the experiments to search for the taxonomy and origin of SARS-CoV-2 with the first set of reference genome sequences (Set 1 in Table 1). This set consists of far more diverse viruses than the other two sets (Sets 2 and 3 in Tables 2 and 3) as it includes representatives from major virus classes at the highest available virus taxonomic level. With its broad coverage of virus types, this reference set minimizes the probability of missing any known virus type. Outcomes of the hierarchical clustering and DBSCAN methods are presented in Figs. 1 and 2, respectively. In these experiments, we use 16 SARS-CoV-2 sequences representing the 16 countries in Table 5 (Appendix 1) for demonstration purposes. The first released SARS-CoV-2 genome of each country is selected for these experiments. Clustering outcomes on all 334 sequences are presented in Fig. 9 in Appendix 2, which shows results similar to those reported here.
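As an illustration of the Jukes-Cantor distance defined above, the following is a minimal Python sketch rather than the implementation used in this study. The function names `p_distance` and `jukes_cantor` are hypothetical, the input sequences are assumed to be already aligned to equal length, and skipping gaps and ambiguous bases is an assumption of the sketch.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption of this sketch: gaps and ambiguous bases are skipped.
        if a not in valid or b not in valid:
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


def jukes_cantor(p):
    """Jukes-Cantor distance d = -(3/4) * ln(1 - (4/3) * p)."""
    if p >= 0.75:
        # The logarithm is undefined at saturation; report an infinite distance.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # Toy aligned fragments (hypothetical data, not real genomes).
    p = p_distance("ACGTACGT", "ACGTACGA")
    print(p, jukes_cantor(p))
```

For small p the correction is slight: a proportion of differing sites of 0.038 (corresponding to 96.2% identity) gives a distance of roughly 0.039. The correction grows as p approaches the saturation value of 0.75.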
Both clustering methods consistently demonstrate that SARS-CoV-2 sequences form a cluster with a representative virus of Riboviria among 12 major virus classes (Adenoviridae, Anelloviridae, Caudovirales, Geminiviridae, Genomoviridae, Microviridae, Ortervirales, Papillomaviridae, Parvoviridae, Polydnaviridae, Polyomaviridae, and Riboviria). The Middle East respiratory syndrome (MERS) CoV, which caused the MERS outbreak in 2012, is chosen as a representative of the Riboviria realm. In hierarchical clustering (Fig. 1), when combined with reference genomes, SARS-CoV-2 genomes do not create a new cluster on their own but form a cluster with the MERS CoV, i.e. cluster "8". With the DBSCAN method (Fig. 2), SARS-CoV-2 genomes also do not create their own cluster but form the cluster "1" with the MERS CoV. These clustering results suggest that SARS-CoV-2 belongs to the Riboviria realm.

Fig. 1. [...] (Table 1) with the cut-off parameter C equal to 5 × 10^−4 (left), and using a set that merges 16 representative SARS-CoV-2 sequences and reference sequences with C also set to 5 × 10^−4 (right). A number at the beginning of each virus name indicates the cluster that the virus belongs to after clustering.

Fig. 2. [...] and using a set that merges SARS-CoV-2 sequences and reference sequences with ε also set to 0.7 (right). As Set 1 includes representatives of major virus classes and the minimum number of neighbours is set to 3 while ε is set to 0.7, DBSCAN considers individual viruses as outliers (left). When the dataset is expanded to include SARS-CoV-2 sequences, DBSCAN forms cluster "1" that includes all SARS-CoV-2 sequences and the MERS CoV, which represents the Riboviria realm (right).

Having identified SARS-CoV-2 as belonging to the Riboviria realm, we move to the next lower taxonomic level, which consists of 12 virus families within Riboviria. These families are presented in Set 2 (Table 2).
Fig. 3. [...] (Table 2) with the cut-off parameter C equal to 0.001 (left), and using a set that merges SARS-CoV-2 sequences and reference sequences with C also set to 0.001 (right).

Fig. 4. [...] (Table 2) with the search radius parameter ε equal to 0.6 (left), and using a set that merges SARS-CoV-2 sequences and reference sequences with ε also set to 0.6 (right).

[...] CoVs and bat SARS-like CoVs. Notably, we also include in this set 6 sequences of Guangxi pangolin CoVs deposited to the GISAID database by Lam et al. [8] and a sequence of Guangdong pangolin CoV by Xiao et al. [10]. Evolutionary distances between each of the reference genomes in Set 3 (Table 3) and the 334 SARS-CoV-2 genomes based on the Jukes-Cantor method are presented in Fig. 5. We can observe that these distances are almost constant across the 334 SARS-CoV-2 sequences, which were collected in 16 countries (Table 5) [...] group contains genomes of AlphaCoV viruses (refer to the taxonomy in Table 3) that are evolutionarily highly divergent from SARS-CoV-2 sequences. The middle group of lines comprises most of the BetaCoV viruses, especially those in the Sarbecovirus sub-genus. The bottom lines identify reference viruses that are closest to SARS-CoV-2, which include bat CoV RaTG13, Guangdong pangolin CoV, bat SARS CoV ZC45 and bat SARS CoV ZXC21. The bat CoV RaTG13 line at the bottom is notably distinguished from other lines while the Guangdong pangolin CoV line is the second closest to SARS-CoV-2. The similarities of bat CoV RaTG13, Guangdong pangolin CoV and Guangxi pangolin CoV GX/P4L to SARS-CoV-2/Australia/VIC01/2020, produced by the SimPlot software [18], are displayed in Fig. 6. Consistent with the results presented in Fig. 5, bat CoV RaTG13 is shown to be closer to SARS-CoV-2 than the pangolin CoVs. Fig. 7 shows outcomes of the hierarchical clustering method using Set 3 of reference sequences in Table 3.
With the cut-off parameter C set equal to 0.7, the hierarchical clustering algorithm separates the reference sequences into 6 clusters in which cluster "5" comprises all examined viruses of the Sarbecovirus sub-genus, including many SARS CoVs, bat SARS-like CoVs and pangolin CoVs (Fig. 7A). The algorithm groups viruses into clusters reasonably: for example, the genus AlphaCoV is represented by cluster "4" while the sub-genera Embecovirus, Nobecovirus, Merbecovirus, and Hibecovirus are labelled as clusters "3", "6", "2", and "1", respectively. Using the same cut-off value of 0.7, we next perform clustering on a dataset that merges reference sequences and 16 representative SARS-CoV-2 sequences (see Fig. 7B). Results on all 334 SARS-CoV-2 sequences, which are similar to those on the 16 representative sequences, are provided in Fig. 10 in Appendix 2. The outcome presented in Fig. 7B shows that the 16 representative SARS-CoV-2 sequences fall into cluster "5", which comprises the Sarbecovirus sub-genus. The number of clusters is still 6 and the membership structure of the clusters is the same as in the case of clustering reference sequences only (Fig. 7A), except that the Sarbecovirus cluster has now been expanded to also contain SARS-CoV-2 sequences. By comparing Figs. 7A and 7B, we believe that SARS-CoV-2 is naturally part of the Sarbecovirus sub-genus. This finding is substantiated by Fig. 7C, which shows a clustering outcome when the cut-off parameter C is decreased to 0.1. In Fig. 7C, while 3 members of the Merbecovirus sub-genus (i.e. Pipistrellus bat CoV HKU5, Tylonycteris bat CoV HKU4 and MERS CoV) are divided into 3 clusters ("12", "2" and "4") and members of the Sarbecovirus cluster are themselves separated into 2 clusters, "1" and "11", sequences of SARS-CoV-2 still join the cluster "11" with other members of Sarbecovirus such as 3 bat viruses (bat SARS CoV ZC45, bat SARS CoV ZXC21, bat CoV RaTG13) and 7 pangolin CoVs.
As the cut-off parameter C decreases, the number of clusters increases. This is an expected outcome because the cut-off threshold line moves closer to the leaves of the dendrogram. When the cut-off C is reduced to 0.03 (Fig. 7D), there are only 2 viruses (bat CoV RaTG13 and Guangdong pangolin CoV) that can form a cluster with SARS-CoV-2 (labelled as cluster "15"). These are the 2 viruses closest to SARS-CoV-2 based on the whole genome analysis. Results in Figs. 7C and 7D therefore provide evidence that bats or pangolins could be possible hosts for SARS-CoV-2. We next reduce the cut-off C to 0.01 as in Fig. 7E. At this stage, only bat CoV RaTG13 is within the same cluster as SARS-CoV-2 (cluster "17"). We thus believe that bats are more probable hosts for SARS-CoV-2 than pangolins. The inference of our AI-enabled analysis is in line with a result in [19] that investigates the polyprotein 1ab of SARS-CoV-2 and suggests that this novel coronavirus more likely arose from viruses infecting bats rather than pangolins. When the cut-off C is reduced to 0.001 as in Fig. 7F, we observe that the total number of clusters increases to 29 and, more importantly, SARS-CoV-2 sequences do not combine with any reference viruses but form their own cluster "19". Could we use this clustering result (Fig. 7F) to infer that SARS-CoV-2 might not originate in bats or pangolins? This is a debatable question because the answer depends on the level of detail we use to differentiate between species or organisms. The cut-off parameter in hierarchical clustering can be considered as this level of detail. With the results obtained in Fig. 7D (and also in the experiments with the DBSCAN method presented next), we support a hypothesis that bats or pangolins are the probable origin of SARS-CoV-2.
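The effect of lowering the cut-off described above can be sketched with a small average-linkage (UPGMA) routine in Python. This is a hedged illustration only. It assumes the cut-off C is a threshold on the average inter-cluster distance, which may differ from the criterion used to produce Fig. 7. The function `upgma_clusters`, the item names and every distance value below are hypothetical, and the cluster labels are arbitrary rather than those shown in the figures.

```python
from itertools import combinations


def upgma_clusters(dist, cutoff):
    """Average-linkage agglomeration that stops once the closest pair of
    clusters is farther apart than `cutoff`; returns one label per item."""
    clusters = [[i] for i in range(len(dist))]

    def average(a, b):
        # Mean of all pairwise distances between members of a and b (UPGMA).
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        pair, best = min(
            (((u, v), average(clusters[u], clusters[v]))
             for u, v in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > cutoff:
            break  # every remaining merge lies above the cut-off line
        merged = clusters[pair[0]] + clusters[pair[1]]
        clusters = [c for k, c in enumerate(clusters) if k not in pair]
        clusters.append(merged)

    labels = [0] * len(dist)
    for label, members in enumerate(clusters, start=1):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical toy distances: one query genome and three reference genomes.
NAMES = ["query", "ref_close", "ref_mid", "ref_far"]
D = [
    [0.0,   0.005, 0.02,  0.5],
    [0.005, 0.0,   0.022, 0.5],
    [0.02,  0.022, 0.0,   0.5],
    [0.5,   0.5,   0.5,   0.0],
]

if __name__ == "__main__":
    for cutoff in (0.7, 0.03, 0.01, 0.001):
        print(cutoff, dict(zip(NAMES, upgma_clusters(D, cutoff))))
```

On this toy matrix the number of clusters grows from 1 to 4 as the cut-off drops. The last reference still sharing a cluster with the query before the query separates is its nearest neighbour, which mirrors the reasoning applied to Figs. 7D to 7F.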
This is because we observe that the similarity between SARS-CoV-2 and bat CoV RaTG13 (or Guangdong pangolin CoV) is considerably large compared to the similarity between viruses that originated in the same host. For example, bat SARS-like CoVs such as bat SARS CoV Rf1, bat SARS CoV Longquan-140, bat SARS CoV HKU3-1, bat SARS CoV Rp3, bat SARS CoV Rs672/2006, bat SARS CoV RsSHC014, bat SARS CoV WIV1, bat SARS CoV ZC45 and bat SARS CoV ZXC21 had the same bat origin. In Fig. 7D , these viruses however are separated into 2 different clusters ("3" and "2") while all 16 SARS-CoV-2 representatives are grouped together with bat CoV RaTG13 and Guangdong pangolin CoV in cluster "15". This demonstrates that the difference between the same origin viruses (e.g. bat SARS CoV WIV1 and bat SARS CoV ZC45) is larger than the difference between SARS-CoV-2 and bat CoV RaTG13 (or Guangdong pangolin CoV). Therefore, SARS-CoV-2 is deemed very likely originated in the same host with bat CoV RaTG13 or Guangdong pangolin CoV, which is bat or pangolin, respectively. Clustering outcomes of the DBSCAN method via phylogenetic trees using Set 3 of reference sequences (Table 3) are presented in Fig. 8 . We first apply DBSCAN to reference sequences only, which results in 3 clusters and several outliers (Fig. 8A) . The search radius parameter ε is set equal to 0.55. As we set the minimum number of neighbours parameter to 3, it is expected that viruses of the sub-genera Embecovirus, Nobecovirus and Hibecovirus are detected as outliers "-1" because there are only 1 or 2 viruses in these sub-genera. Three viruses of the Merbecovirus sub-genus (i.e. Tylonycteris bat CoV HKU4, Pipistrellus bat CoV HKU5 and MERS CoV) are grouped into the cluster "2". All examined viruses of the Sarbecovirus sub-genus are joined in cluster "1" while the AlphaCoV viruses are combined into cluster "3". Fig. 
8B shows an outcome of DBSCAN with the same ε value of 0.55 after the dataset has been expanded to include 16 representative SARS-CoV-2 sequences. We observe that genomes of SARS-CoV-2 fall into the cluster "1", which includes all the examined Sarbecovirus viruses. When ε is decreased to 0.3 in Fig. 8C, all members of the Merbecovirus cluster and the AlphaCoV cluster become outliers while the 16 SARS-CoV-2 genomes still remain in the Sarbecovirus cluster. In line with the results obtained by using hierarchical clustering in Fig. 7, those obtained in Figs. 8B and 8C using the DBSCAN method give us the confidence to confirm that SARS-CoV-2 is part of the Sarbecovirus sub-genus. Fig. 8D shows that bat CoV RaTG13 and Guangdong pangolin CoV are closest to SARS-CoV-2 as they join the 16 SARS-CoV-2 representatives in cluster "2". This again substantiates the probable bat or pangolin origin of SARS-CoV-2. By reducing ε to 0.1 as in Fig. 8E, the Guangdong pangolin CoV becomes an outlier whilst SARS-CoV-2 sequences form a cluster ("3") with only bat CoV RaTG13. This further confirms our findings when using the hierarchical clustering in Fig. 7 that bats are more likely the reservoir hosts for SARS-CoV-2 than pangolins. When ε is decreased to 0.01 as in Fig. 8F, SARS-CoV-2 genomes form their own cluster "3", which is separated from all bat and pangolin genomes. As with the result in Fig. 7F by the hierarchical clustering, this result also raises the question of whether SARS-CoV-2 really originated in bats or pangolins. In Fig. 8D, it is again observed that the similarity between SARS-CoV-2 and bat CoV RaTG13 (or Guangdong pangolin CoV) is larger than the similarity between bat SARS CoVs, which have the same bat origin. Specifically, SARS-CoV-2, bat CoV RaTG13 and Guangdong pangolin CoV are grouped together in cluster "2" while bat SARS CoVs are divided into 2 clusters, i.e. bat SARS CoV ZXC21 and bat SARS CoV ZC45 are in cluster "2" whereas other bat SARS CoVs are in cluster "3".
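The way DBSCAN sheds references as ε shrinks can be sketched on a precomputed distance matrix. The Python code below is a minimal illustration, not the implementation used for Fig. 8. It assumes that the minimum number of neighbours (3) counts the point itself, and the function `dbscan`, the item names and all distance values are hypothetical. Outliers are labelled -1, as in the text.

```python
def dbscan(dist, eps, min_pts=3):
    """DBSCAN on a precomputed distance matrix; returns one label per item.
    Clusters are numbered from 1 and outliers are labelled -1."""
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"
    cluster_id = 0

    def neighbours(i):
        # Assumption: a point counts as its own neighbour (dist[i][i] == 0).
        return [j for j in range(n) if dist[i][j] <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional outlier; may become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # border point: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


# Hypothetical toy distances: three query genomes and three references.
NAMES = ["query_1", "query_2", "query_3", "ref_close", "ref_mid", "ref_far"]
D = [
    [0.0,   0.001, 0.001, 0.04, 0.12, 0.6],
    [0.001, 0.0,   0.001, 0.04, 0.12, 0.6],
    [0.001, 0.001, 0.0,   0.04, 0.12, 0.6],
    [0.04,  0.04,  0.04,  0.0,  0.13, 0.6],
    [0.12,  0.12,  0.12,  0.13, 0.0,  0.6],
    [0.6,   0.6,   0.6,   0.6,  0.6,  0.0],
]

if __name__ == "__main__":
    for eps in (0.2, 0.1, 0.01):
        print(eps, dict(zip(NAMES, dbscan(D, eps))))
```

With these toy values, the most distant reference is an outlier at every ε shown, the mid-range reference drops out at ε = 0.1, and at ε = 0.01 the query genomes form a cluster of their own. This is the same qualitative pattern reported for Figs. 8D to 8F.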
We thus suggest that SARS-CoV-2 probably has the same origin as bat CoV RaTG13 or Guangdong pangolin CoV. In other words, bats or pangolins are the probable origin of SARS-CoV-2. All results presented above are obtained using the pairwise distances estimated by the Jukes-Cantor method. Results based on distances calculated by the maximum composite likelihood method are reported in Appendix 3 and are similar to those obtained by using the Jukes-Cantor method. These AI-based quantitative results using the unsupervised hierarchical clustering and DBSCAN methods provide further evidence to suggest that 1) SARS-CoV-2 belongs to the Sarbecovirus sub-genus of the Betacoronavirus genus, 2) bats and pangolins may have served as the hosts for SARS-CoV-2, and 3) bats are a more probable origin of SARS-CoV-2 than pangolins.

The severity of the COVID-19 pandemic has initiated a race to find the origin of the COVID-19 virus. Studies on genome sequences obtained from early patients in Wuhan city in China suggest the probable bat origin of the virus based on similarities between these sequences and those obtained from bat CoVs previously reported in China. Other studies afterwards found that SARS-CoV-2 genome sequences are also similar to pangolin CoV sequences and accordingly raised a hypothesis on the pangolin origin of the COVID-19 virus. This paper has investigated the origin of the COVID-19 virus using unsupervised clustering methods and more than 300 raw genome sequences of SARS-CoV-2 collected from various countries around the world. Outcomes of these AI-enabled methods are analysed, leading to a confirmation that the COVID-19 virus belongs to the Coronaviridae family. More specifically, SARS-CoV-2 belongs to the sub-genus Sarbecovirus within the genus Betacoronavirus that includes SARS-CoV, which caused the global SARS pandemic in 2002-2003 [20; 21].
The results of various clustering experiments show that SARS-CoV-2 genomes are more likely to form a cluster with the bat CoV RaTG13 genome than pangolin CoV genomes, which were constructed from samples collected in Guangxi and Guangdong provinces in China. This indicates that bats are more likely the reservoir host for the COVID-19 virus than pangolins. This study among many AI studies in the fight against the COVID-19 pandemic [22] has shown the power and capabilities of AI in this challenging battle, especially from the computational biology and medicine perspective. The findings of this research on the large dataset of 334 SARS-CoV-2 genomic sequences provide more insights about the COVID-19 virus and thus facilitate the progress on discovering medicines and vaccines to mitigate its impacts and prevent a similar pandemic in future. The race to produce treatment drugs and vaccines is still ongoing and no effective results have been reported yet. A further research in this direction is strongly encouraged by a recent success of AI in identifying powerful new kinds of antibiotic from a pool of more than 100 million molecules as published in [23] . As AI is capable of analysing large datasets and discovering knowledge from them in an intelligent and efficient manner, finding a COVID-19 vaccine using AI is a realistic hope [24] . In this Appendix, we first present results of the hierarchical clustering method applied to the dataset that combines Set 1 of reference sequences (Table 1 ) with all 334 SARS-CoV-2 sequences (see Fig. 9 ). We then show results of the hierarchical clustering (Fig. 10) and DBSCAN (Fig. 11 ) on a dataset that combines all 334 SARS-CoV-2 sequences and reference sequences in Set 3 (Table 3) . Fig. 9 . Results shown via a dendrogram plot (left) of the hierarchical clustering method applied to the dataset that combines reference sequences in Set 1 (Table 1 ) and all 334 SARS-CoV-2 sequences. 
The middle figure shows in detail (zoomed in) the top part of the dendrogram plot while the right figure shows the bottom part of the plot. All 334 SARS-CoV-2 sequences are grouped in cluster "8", which also includes the Middle East respiratory syndrome CoV of the Riboviria realm. This means that SARS-CoV-2 belongs to the Riboviria realm. These results are consistent with those shown in Fig. 1B that, for demonstration purposes, employed only 16 SARS-CoV-2 genomes, which are representatives of the 16 countries in Table 5.

This appendix presents results of two clustering methods, i.e. hierarchical clustering and DBSCAN, using the sequence distances computed by the maximum composite likelihood method [16], which was conducted in the MEGA X software [25]. These results are very similar to those obtained by using the Jukes-Cantor distance method shown throughout the paper. In these experiments, the clustering methods are applied to a dataset that combines reference sequences in Set 3 (Table 3) and 16 representative genomes of the 16 countries in Table 5. When a country has more than one collected genome, the first released genome of that country is selected for this experiment. Fig. 12 demonstrates the distances estimated by the maximum composite likelihood method between each of the reference sequences and the 16 representative SARS-CoV-2 genomes. The lines are almost parallel, indicating that the SARS-CoV-2 genome does not vary much across countries, which is in line with the results obtained using the Jukes-Cantor distance estimates in Fig. 5. The bat CoV RaTG13 is again shown to be much closer to SARS-CoV-2 than pangolin CoVs and other reference viruses although the distance range in Fig. 12 is larger than that in Fig. 5. In Fig. 13A, when the hierarchical clustering cut-off parameter is set equal to 0.1, all 16 representative SARS-CoV-2 genomes are grouped into cluster "12", which also includes other viruses of the Sarbecovirus sub-genus of the BetaCoV genus.
When moving from Fig. 13A to Fig. 13B, even though members of the Sarbecovirus cluster ("12" in Fig. 13A) are split into 2 clusters "1" and "2" in Fig. 13B, the SARS-CoV-2 sequences are still grouped into cluster "14" with other members of the Sarbecovirus sub-genus such as bat CoV RaTG13, Guangdong pangolin CoV, bat SARS CoV ZXC21 and bat SARS CoV ZC45. These results give us confidence in confirming that SARS-CoV-2 belongs to the Sarbecovirus sub-genus. This is consistent with the result based on the Jukes-Cantor distances shown in Fig. 7. Fig. 13C shows that SARS-CoV-2 genomes are combined only with that of bat CoV RaTG13 when the cut-off parameter is decreased to 0.001. This again indicates that bats are a more likely origin of SARS-CoV-2 than pangolins. When we reduce the cut-off parameter to 0.0001, the SARS-CoV-2 sequences create their own cluster "22" and this questions the probable bat or pangolin origin of SARS-CoV-2. However, in Fig. 13B, we also find that the similarity between SARS-CoV-2 and bat CoV RaTG13 (or Guangdong pangolin CoV) is larger than the similarity between viruses having the same origin. For example, bat SARS CoV WIV1 and bat SARS CoV ZC45 have the same bat origin but they are divided into 2 clusters ("1" and "14") while all 16 SARS-CoV-2 representatives are grouped into cluster "14" with bat CoV RaTG13 and Guangdong pangolin CoV. This implies that SARS-CoV-2 may have originated in bats or pangolins. Results of DBSCAN using distances estimated by the maximum composite likelihood method are presented in Fig. 14 and are also consistent with those obtained by the Jukes-Cantor distance method in Fig. 8, leading to the same suggestions on the sub-genus Sarbecovirus membership of SARS-CoV-2, its likely bat or pangolin origin, and the more probable bat origin than the pangolin origin of the virus at the whole genome analysis level.

[1] WHO Coronavirus Disease (COVID-19) Dashboard. Available at
[2] Origin of SARS-CoV-2
[3] A new coronavirus associated with human respiratory disease in China
[4] Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
[5] A novel coronavirus from patients with pneumonia in China
[6] A pneumonia outbreak associated with a new coronavirus of probable bat origin
[7] A genomic perspective on the origin and emergence of SARS-CoV-2
[8] Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins
[9] Probable pangolin origin of SARS-CoV-2 associated with the COVID-19 outbreak
[10] Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins
[11] The proximal origin of SARS-CoV-2
[12] Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study
[13] Clustering methods
[14] A density-based algorithm for discovering clusters in large spatial databases with noise
[15] Evolution of protein molecules
[16] Prospects for inferring very large phylogenies by using the neighbor-joining method
[17] Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees
[18] Full-length human immunodeficiency virus type 1 genomes from subtype C-infected seroconverters in India, with evidence of intersubtype recombination
[19] An exclusive 42 amino acid signature in pp1ab protein provides insights into the evolutive history of the 2019 novel human-pathogenic coronavirus (SARS-CoV-2)
[20] Identification of a novel coronavirus in patients with severe acute respiratory syndrome
[21] Origins of major human infectious diseases
[22] Artificial intelligence in the battle against coronavirus (COVID-19): a survey and future research directions
[23] A deep learning approach to antibiotic discovery
[24] AI can help scientists find a Covid-19 vaccine
[25] MEGA X: molecular evolutionary genetics analysis across computing platforms

Table 4.
Accession numbers of 334 SARS-CoV-2 genome sequences obtained from NCBI GenBank in early April 2020, sorted by date released:
MN908947, MN985325, MN975262, MN938384, MN988713, MN997409, MN994468, MN994467, MN988669, MN988668, MN996531, MN996530, MN996529, MN996528, MN996527, MT007544, MT019533, MT019532, MT019531, MT019530, MT019529, MT020881, MT020880, MT027064, MT027063, MT027062, MT039890, MT039888, MT039887, MT039873, MT049951, MT044258, MT044257, MT066176, MT066175, MT072688, MT093631, MT093571, MT106054, MT106053, MT106052, MT118835, MT123293, MT123292, MT123291

Tables 4 and 5 in this appendix present accession numbers and detailed distribution of 334 SARS-CoV-2 complete genomes across different countries obtained from the NCBI GenBank database.