key: cord-0754718-24et9foe
authors: Barbosa, Raquel de M.; Fernandes, Marcelo A.C.
title: Chaos game representation dataset of SARS-CoV-2 genome
date: 2020-04-25
journal: Data Brief
DOI: 10.1016/j.dib.2020.105618
sha: b94949b26f0f9386f3fa83049db8678a299754e7
doc_id: 754718
cord_uid: 24et9foe

Abstract

As of April 16, 2020, the novel coronavirus disease (called COVID-19) had spread to more than 185 countries/regions, with more than 142,000 deaths and more than 2,000,000 confirmed cases. In bioinformatics, one of the crucial tasks is the analysis of the virus nucleotide sequences using approaches such as data stream, digital signal processing, and machine learning techniques and algorithms. However, to make these approaches feasible, the nucleotide sequence strings must first be transformed into a numerical representation. This dataset therefore provides a chaos game representation (CGR) of SARS-CoV-2 virus nucleotide sequences. It contains the CGR of 100 instances of the SARS-CoV-2 virus, 11540 instances of other viruses from the Virus-Host DB dataset, and three instances of Riboviria viruses from NCBI (Betacoronavirus RaTG13, bat-SL-CoVZC45, and bat-SL-CoVZXC21).

Subject: Biochemistry, Genetics and Molecular Biology (General)
Specific subject area: Bioinformatics

• These data are useful because they provide a numeric representation of the virus behind the COVID-19 epidemic (SARS-CoV-2). In this form, the data can be processed with data stream, digital signal processing, and machine learning algorithms.
• Researchers in bioinformatics, computing science, and computing engineering can benefit from these data because the numeric representation lets them apply techniques such as machine learning and digital signal processing to genomic information.
• Experiments that apply clustering and classification techniques to SARS-CoV-2 genomic information can be run on this dataset.
• These data provide an easy way to evaluate the SARS-CoV-2 virus genome.

Each sub-directory "Matlab" contains three files called "RawDataTable.mat", "RawData.mat" and "CGRData.mat". The "RawDataTable.mat" and "RawData.mat" files store the raw data information from the viruses database.

Each sub-directory "Excel and txt" is composed of a file called "RawData.xlsx" and another sub-directory called "CGRData". Each "RawData.xlsx" file has the raw data information from the viruses database.

Let $N$ be the length of a sequence and $s_n$ its $n$-th nucleotide. Each $n$-th nucleotide, $s_n$, is mapped to a two-dimensional symbol $(s_x(n), s_y(n))$, which can be expressed as [...]

After the mapping, each $n$-th symbol $(s_x(n), s_y(n))$ is transformed into CGR values by the equations

$$p_x(n) = \frac{1}{2} s_x(n) + \frac{1}{2} p_x(n-1), \quad \text{for } n = 1, \ldots, N$$

and

$$p_y(n) = \frac{1}{2} s_y(n) + \frac{1}{2} p_y(n-1), \quad \text{for } n = 1, \ldots, N$$

where the initial condition, at $n = 0$, is $p_x(0) = \alpha_x$ and $p_y(0) = \alpha_y$ [7, 8]. The dataset was generated with $\alpha_x = 0$ and $\alpha_y = 0$. Figures 1a, 1b, 1c and 1d show an example of CGR points $(p_x(n), p_y(n))$ from the dataset presented in this work.

Acknowledgments

The authors wish to acknowledge the financial support of the Coorde[...]

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Linking virus genomes with host taxonomy
Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study
Machine learning-based analysis of genomes suggests associations between Wuhan 2019-nCoV and bat betacoronaviruses
Chaos game representation of gene structure
Encoding DNA sequences by integer chaos game representation
Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison
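
As an illustration of the CGR recursion given in this article, p(n) = (1/2) s(n) + (1/2) p(n − 1) with p(0) = (α_x, α_y), the following Python sketch computes the CGR points of a short sequence. It is a minimal sketch and not the authors' code: the function name `cgr_points`, the `CORNERS` table, and the choice to skip symbols other than A, C, G and T are assumptions made here. In particular, the nucleotide-to-symbol mapping shown is hypothetical and may differ from the one used to build the dataset.

```python
from typing import Dict, List, Tuple

# Hypothetical nucleotide-to-symbol mapping (s_x, s_y). This table is an
# assumption for illustration only; substitute the mapping actually used
# for the dataset before comparing values.
CORNERS: Dict[str, Tuple[float, float]] = {
    "A": (1.0, 1.0),
    "T": (-1.0, 1.0),
    "C": (-1.0, -1.0),
    "G": (1.0, -1.0),
}


def cgr_points(
    sequence: str,
    corners: Dict[str, Tuple[float, float]] = CORNERS,
    alpha_x: float = 0.0,
    alpha_y: float = 0.0,
) -> List[Tuple[float, float]]:
    """Return the CGR points (p_x(n), p_y(n)) for n = 1, ..., N."""
    # Initial condition p(0) = (alpha_x, alpha_y); the article uses (0, 0).
    p_x, p_y = alpha_x, alpha_y
    points: List[Tuple[float, float]] = []
    for nucleotide in sequence.upper():
        if nucleotide not in corners:
            # Assumption: symbols outside the table (e.g. N) are skipped.
            continue
        s_x, s_y = corners[nucleotide]
        # Each new point is the midpoint between the previous point
        # and the symbol of the current nucleotide.
        p_x = 0.5 * s_x + 0.5 * p_x
        p_y = 0.5 * s_y + 0.5 * p_y
        points.append((p_x, p_y))
    return points


if __name__ == "__main__":
    for point in cgr_points("ATGC"):
        print(point)
```

Passing a different `corners` dictionary changes the mapping without touching the recursion, which keeps the assumed part of the sketch isolated from the part taken from the equations.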