id: cord-352943-ztonp62x
author: Nagpal, Sunil
title: What if we perceive SARS-CoV-2 genomes as documents? Topic modelling using Latent Dirichlet Allocation to identify mutation signatures and classify SARS-CoV-2 genomes
date: 2020-08-20
pages:
extension: .txt
mime: text/plain
words: 2776
sentences: 174
flesch: 47
summary: Here we describe how SARS-CoV-2 genomic mutation profiles can be structured into a 'Bag of Words' to enable identification of signatures (topics) and their probabilistic distribution across various genomes using LDA. Use of the non-phylogenetic albeit classical approaches like topic modeling and other data centric pattern mining algorithms is therefore proposed for supplementing the efforts towards understanding the genomic diversity of the evolving SARS-CoV-2 genomes (and other pathogens/microbes). In fact, Latent Dirichlet Allocation (LDA), an unsupervised machine learning approach, is particularly known for identifying latent topics in large document collections and deciphering the words that define the inferred topics using a generative statistical model. Classical LDA was employed to generate topic models leading to identification of 16 amino acid mutation signatures and 18 nucleotide mutation signatures (equivalent to topics) in the corpus of chosen genomes through rigorous hyper-parameter tuning for coherence optimization (Figure 2).
cache: ./cache/cord-352943-ztonp62x.txt
txt: ./txt/cord-352943-ztonp62x.txt
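
The 'Bag of Words' structuring mentioned in the summary can be illustrated with a minimal sketch. This is not the authors' pipeline: the genome IDs, the mutation labels, and the helper name `build_bag_of_words` are all hypothetical, chosen only to show how each genome could be treated as a document whose words are its mutations.

```python
from collections import Counter

# Hypothetical mutation profiles: genome ID -> list of mutation "words".
# IDs and labels are illustrative only, not data taken from the paper.
genomes = {
    "genome_A": ["S:D614G", "ORF1b:P314L", "N:R203K", "N:G204R"],
    "genome_B": ["S:D614G", "ORF1b:P314L"],
    "genome_C": ["ORF8:L84S", "ORF1a:T265I"],
}


def build_bag_of_words(profiles):
    """Return (vocabulary, matrix), where matrix maps each genome ID
    to a list of mutation counts aligned with the vocabulary order."""
    # The vocabulary is the sorted set of all mutations seen in any genome.
    vocabulary = sorted({m for muts in profiles.values() for m in muts})
    index = {m: i for i, m in enumerate(vocabulary)}

    matrix = {}
    for genome_id, muts in profiles.items():
        row = [0] * len(vocabulary)
        # Count how often each mutation "word" occurs in this "document".
        for mutation, count in Counter(muts).items():
            row[index[mutation]] = count
        matrix[genome_id] = row
    return vocabulary, matrix


vocabulary, matrix = build_bag_of_words(genomes)
for genome_id, row in matrix.items():
    print(genome_id, row)
```

A genome-by-mutation count matrix of this shape is the usual input to an LDA implementation, which would then infer mutation signatures (topics) and their per-genome distribution. The number of topics and other hyper-parameters would need tuning, as the summary notes.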