key: cord-0686882-xgoel4eb
authors: Gong, Zheng; Zhu, Jun-Wei; Li, Cui-Ping; Jiang, Shuai; Ma, Li-Na; Tang, Bi-Xia; Zou, Dong; Chen, Mei-Li; Sun, Yu-Bin; Song, Shu-Hui; Zhang, Zhang; Xiao, Jing-Fa; Xue, Yong-Biao; Bao, Yi-Ming; Du, Zheng-Lin; Zhao, Wen-Ming
title: An online coronavirus analysis platform from the National Genomics Data Center
date: 2020-11-18
journal: Zool Res
DOI: 10.24272/j.issn.2095-8137.2020.065
sha: 7f4e2e58eb1db07add5bd589fa0a39a3f526e7af
doc_id: 686882
cord_uid: xgoel4eb

Since the first reported severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection in December 2019, coronavirus disease 2019 (COVID-19) has become a global pandemic, spreading to more than 200 countries and regions worldwide. With continued research progress and virus detection, SARS-CoV-2 genomes and sequencing data have been reported and accumulated at an unprecedented rate. To meet the need for fast analysis of these genome sequences, the National Genomics Data Center (NGDC) of the China National Center for Bioinformation (CNCB) has established an online coronavirus analysis platform, which includes de novo assembly, BLAST alignment, genome annotation, variant identification, and variant annotation modules. The online analysis platform can be freely accessed at the 2019 Novel Coronavirus Resource (2019nCoVR) (https://bigd.big.ac.cn/ncov/online/tools).

established in several online platforms worldwide. For example, NCBI has provided the BLAST alignment tool (Altschul et al., 1990) in SARS-CoV-2 Resources (https://www.ncbi.nlm.nih.gov/sars-cov-2/). The University of California, Santa Cruz (UCSC) SARS-CoV-2 Genome Browser has integrated the visualization browser with BLAT alignment and variant annotation tools (https://genome.ucsc.edu/covid19.html) (Fernandes et al., 2020). The National Microbiology Data Center (NMDC) has provided various analysis tools, such as BLAST alignment and phylogenetic analysis, in the Global Coronavirus Data Sharing and Analysis System (http://nmdc.cn/coronavirus/). The Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences (CAS), has established the Virus Identification Cloud (VIC, https://www.biosino.org/vic/), offering online analysis services for viral sequence identification and genome assembly. The Genome Detective webserver has also provided a virus identification workflow for high-throughput sequencing data (https://www.genomedetective.com/) (Cleemput et al., 2020). Although the above SARS-CoV-2 analysis tools provide online services, their functions are relatively limited and do not cover all aspects of SARS-CoV-2 research (Table 1).

Thus, to provide a unified and convenient approach for processing SARS-CoV-2 sequencing data, the National Genomics Data Center (NGDC) of the China National Center for Bioinformation (CNCB) established an online coronavirus analysis platform based on viral genomes collected in 2019nCoVR (https://bigd.big.ac.cn/ncov/online/tools), offering free analysis services for researchers. The platform includes five functional modules (Figure 1), which cover various SARS-CoV-2 genomic data analyses.

Received: 17 September 2020; Accepted: 12 October 2020; Online: 12 October 2020

1. De novo assembly module

This module can be used for de novo assembly of next-generation sequencing (NGS) data. First, raw reads are trimmed for quality using Trimmomatic (Bolger et al., 2014). MEGAHIT (Li et al., 2015) is then used for sequence assembly with default parameters. The assembled sequences are compared with the SARS-CoV-2 reference genome (NC_045512.2) using BLASTN (Altschul et al., 1990) to identify target sequence(s), and assembly quality is evaluated using QUAST (Gurevich et al., 2013). The assembly results depend on the quality of the samples and sequencing data and may consist of a complete genome or several contigs. In the future, we plan to assemble those contigs into a single sequence by alignment with the reference genome, and to support genome assembly for third-generation sequencing data.
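The step of identifying target sequence(s) among assembled contigs can be illustrated with a minimal Python sketch. It assumes the BLASTN results are available in the standard 12-column tabular format (`-outfmt 6`); the identity and length thresholds, the contig names, and the helper names `parse_outfmt6` and `target_contigs` are hypothetical choices for illustration, not settings reported for the platform.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    contig: str      # query sequence ID (assembled contig)
    subject: str     # subject sequence ID (reference accession)
    identity: float  # percent identity of the alignment
    length: int      # alignment length in bases


def parse_outfmt6(text: str) -> List[Hit]:
    """Parse BLAST tabular output (-outfmt 6, default 12 columns)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        hits.append(Hit(cols[0], cols[1], float(cols[2]), int(cols[3])))
    return hits


def target_contigs(hits: List[Hit],
                   reference: str = "NC_045512.2",
                   min_identity: float = 90.0,   # assumed threshold
                   min_length: int = 500) -> List[str]:  # assumed threshold
    """Return contigs with at least one qualifying hit, in first-seen order."""
    selected: List[str] = []
    for h in hits:
        ok = (h.subject == reference
              and h.identity >= min_identity
              and h.length >= min_length)
        if ok and h.contig not in selected:
            selected.append(h.contig)
    return selected


# Made-up example rows: one near-complete genome, one short weak hit,
# and one contig matching an unrelated subject.
example = "\n".join([
    "k141_1\tNC_045512.2\t99.97\t29850\t8\t0\t1\t29850\t20\t29869\t0.0\t55000",
    "k141_2\tNC_045512.2\t85.00\t300\t45\t0\t1\t300\t100\t399\t1e-80\t300",
    "k141_3\tNC_000001.1\t99.00\t1200\t12\t0\t1\t1200\t5\t1204\t0.0\t2100",
])
print(target_contigs(parse_outfmt6(example)))
```

In this toy input only the first contig passes both assumed thresholds against the reference accession; in practice the cut-offs would need to be tuned to the sample and sequencing quality.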

2. BLAST module

To compare sequences among virus strains, the analysis platform includes a BLAST alignment module with three algorithms (BLASTN, Mega BLAST, and discontinuous Mega BLAST) (Altschul et al., 1990). Users can select the SARS-CoV-2 reference genome, the 2019nCoVR genome database, or the coronavirus genome database (including the alpha/beta/delta/gamma genera) for online BLAST.

3. Genome annotation module

To perform sequence comparison and evolutionary analysis on specific viral genes, gene annotations are required. However, most viral genomes in the above SARS-CoV-2 databases are not annotated. Therefore, we built a genome annotation module based on VAPiD (Shean et al., 2019), which can identify coding sequences (CDS) or protein sequences and generate a GenBank annotation file.

4. Variant identification module

The variant identification function consists of the Genome-to-Variants and Fastq-to-Variants modules. Both modules use the genome NC_045512.2 as the default reference, but users can customize the reference by uploading a genome file. Genome-to-Variants can detect mutation sites from complete or partial genomes, using Muscle (Edgar, 2004) for sequence alignment. Fastq-to-Variants can identify genome variants from NGS raw data and connects seamlessly to the GSA system to load massive raw sequencing data to the server automatically. Sequencing reads are aligned to the SARS-CoV-2 reference genome (NC_045512.2) using BWA (Li & Durbin, 2009), after which Picard is used to remove duplicate reads and calculate aligned read number, error rate, sequencing depth, and genome coverage (http://broadinstitute.github.io/picard/). Single nucleotide polymorphisms (SNPs) and insertions and deletions (indels) are identified using GATK (McKenna et al., 2010).
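The idea behind detecting mutation sites from an aligned genome can be sketched as follows. This is a simplified illustration, assuming the reference and query have already been aligned (for example by Muscle) into two gapped rows of equal length; the function name `call_variants`, the tuple layout, and the choice to skip ambiguous `N` bases are assumptions, not the platform's actual implementation.

```python
from typing import List, Tuple

Variant = Tuple[int, str, str]  # (1-based reference position, ref, alt)


def call_variants(ref_aln: str, qry_aln: str) -> List[Variant]:
    """Compare two aligned rows column by column.

    SNPs are reported as (pos, ref_base, alt_base), deletions as
    (pos, ref_base, '-'), and insertions as (pos, '-', inserted_base),
    where pos is the last reference base before the insertion.
    Adjacent indel columns are not merged in this sketch.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned rows must have the same length")
    variants: List[Variant] = []
    pos = 0  # position on the ungapped reference
    for r, q in zip(ref_aln.upper(), qry_aln.upper()):
        if r != "-":
            pos += 1
        if r == q or q == "N":  # identical, or ambiguous base (assumed skip)
            continue
        variants.append((pos, r, q))
    return variants


# Toy alignment with one SNP, one insertion, and one deletion.
print(call_variants("ACGT-ACGT", "ACTTGAC-T"))
```

Because positions are counted on the ungapped reference, the reported coordinates stay comparable across queries regardless of how many gaps the alignment introduces.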

5. Variant annotation module

To clarify the influence of mutations on gene function, the variant annotation module integrates the Ensembl Variant Effect Predictor (VEP) (McLaren et al., 2016) to show codon and amino acid changes, and then calculates the degree of functional impact.
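The kind of codon and amino acid change reported for a substitution can be illustrated with a small Python sketch. It is a simplified stand-in and not VEP itself: it assumes a single-base substitution inside a forward-strand coding sequence, uses the standard genetic code, and labels the result with Sequence Ontology style terms; the function name `annotate_snp` and the toy sequence are hypothetical.

```python
from typing import Dict

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def annotate_snp(cds: str, pos: int, alt: str) -> Dict[str, object]:
    """Annotate a substitution at 1-based position `pos` of a CDS."""
    cds, alt = cds.upper(), alt.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    if not 1 <= pos <= usable:
        raise ValueError("position outside complete codons of the CDS")
    codon_index, offset = divmod(pos - 1, 3)
    ref_codon = cds[3 * codon_index: 3 * codon_index + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        effect = "synonymous_variant"
    elif alt_aa == "*":
        effect = "stop_gained"
    elif ref_aa == "*":
        effect = "stop_lost"
    else:
        effect = "missense_variant"
    return {
        "codon_number": codon_index + 1,
        "codon_change": f"{ref_codon}>{alt_codon}",
        "aa_change": f"{ref_aa}{codon_index + 1}{alt_aa}",
        "effect": effect,
    }


# Toy CDS: ATG GAT TAA; substituting A>G at position 5 changes GAT to GGT.
print(annotate_snp("ATGGATTAA", 5, "G"))
```

The sketch only classifies the change; quantifying its functional impact would require additional predictors beyond what is shown here.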

Notably, the parameters of the data analysis modules have been optimized to improve efficiency and reduce computing time. For example, when the Fastq-to-Variants module was tested on one 24-core server, it took ~1 min to process 1 Gb of NGS data and less than 4 min to process 8 Gb of NGS data (Table 2). We established five servers to provide public service for this online platform; at ~1 min per dataset, the platform therefore has the capacity to analyze about 7 200 NGS datasets in one day (5 servers × 1 440 min) if each dataset is smaller than 1 Gb. In general, a notification email is automatically sent to users when computing jobs are finished.
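The daily capacity figure follows from simple arithmetic, shown here as a short Python sketch. It assumes each server processes one dataset at a time and runs continuously; the function name `daily_capacity` is illustrative.

```python
def daily_capacity(servers: int, minutes_per_dataset: float) -> int:
    """Datasets processed per day, assuming one job at a time per server."""
    minutes_per_day = 24 * 60
    return int(servers * minutes_per_day / minutes_per_dataset)


# Five servers at ~1 min per dataset (datasets under 1 Gb).
print(daily_capacity(5, 1.0))
# Five servers at the 4 min upper bound reported for 8 Gb datasets.
print(daily_capacity(5, 4.0))
```

With five servers and ~1 min per dataset this gives 7 200 datasets per day; at the 4 min upper bound for 8 Gb datasets, the same assumptions give at least 1 800 per day.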

For future applications, we will continue to improve this specialized online platform by integrating more tools, software, and pipelines for SARS-CoV-2 data analysis, and will provide one-click, public data analysis services for coronavirus researchers.

Basic local alignment search tool

Trimmomatic: a flexible trimmer for Illumina sequence data

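Genome Detective Coronavirus Typing Tool for rapid identification and characterization of novel coronavirus genomes

MUSCLE: a multiple sequence alignment method with reduced time and space complexity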

The UCSC SARS-CoV-2 genome browser

QUAST: quality assessment tool for genome assemblies

The sequence read archive

MEGAHIT: an ultrafast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

Fast and accurate short read alignment with Burrows-Wheeler transform

The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data

The Ensembl Variant Effect Predictor

VAPiD: a lightweight cross-platform viral annotation pipeline and identification tool to facilitate virus genome submissions to NCBI GenBank

GISAID: Global initiative on sharing all influenza data - from vision to reality

The global landscape of SARS-CoV-2 genomes, variants, and haplotypes in 2019nCoVR. bioRxiv

GSA: genome sequence archive

The 2019 novel coronavirus resource. Hereditas (Beijing)

The authors declare that they have no competing interests.