key: cord-266794-oyppubq5
authors: Zhang, Dachuan; Zhang, Tong; Liu, Sheng; Sun, Dandan; Ding, Shaozhen; Cheng, Xingxiang; Cai, Pengli; Ren, Ailin; Han, Mengying; Liu, Dongliang; Jia, Cancan; Gong, Linlin; Zhang, Rui; Xing, Huadong; Tu, Weizhong; Chen, Junni; Hu, Qian-Nan
title: SARS2020: An integrated platform for identification of novel coronavirus by a consensus sequence-function model
date: 2020-09-01
journal: Bioinformatics
DOI: 10.1093/bioinformatics/btaa767
sha:
doc_id: 266794
cord_uid: oyppubq5

MOTIVATION: The 2019 novel coronavirus outbreak has significantly affected global health and society. Predicting biological function from pathogen sequence is therefore crucial and urgently needed. However, little work has been performed to identify viruses by the enzymes that they encode, which are key to pathogen propagation.

RESULTS: We built a comprehensive scientific resource, SARS2020, that integrates coronavirus-related research, genomic sequences, and results of anti-viral drug trials. In addition, we built a consensus sequence-catalytic function model, with which we identified the novel coronavirus as encoding the same proteinase as the Severe Acute Respiratory Syndrome virus. This data-driven, sequence-based strategy will enable rapid identification of agents responsible for future epidemics.

AVAILABILITY: SARS2020 is available at http://design.rxnfinder.org/sars2020/.

SUPPLEMENTARY INFORMATION:

The 2019 novel coronavirus (2019-nCoV) outbreak is an ongoing pandemic. As of 20 May 2020, 4,900,647 cases had been confirmed, and 320,107 deaths had been attributed to the virus. On 30 January 2020, the World Health Organization declared the outbreak a Public Health Emergency of International Concern. Identification of the virus is crucial for public health authorities to contain the spread of the disease and for researchers to find methods to cure the disease (Wang, et al., 2020). The genome sequence of the 2019-nCoV became available on 10 January 2020 (Wu, et al., 2020).
However, sequence alone is insufficient for accurate identification because pathogens are not defined by "taxonomy". To circumvent this limitation, we built an integrated 2019-nCoV scientific resource platform and a consensus sequence-catalytic function model, with which we developed a novel methodology to analyze pathogen sequences for catalytic functions. This model predicted that the 2019-nCoV has an enzymatic activity unique to SARS viruses.

We systematically collected reports of coronavirus-related research, genomic sequences, biochemical reactions, government policies, public opinion in the media, and anti-viral drugs in clinical trials (Table S1; Hu, et al., 2011; Khan, et al., 2020; Shu and McCauley, 2017). This information was used to build SARS2020, an integrated scientific resource about 2019-nCoV, to provide foundational data for researchers in various fields. To ensure data quality, we imposed strict evaluation and validation criteria. All 2019-nCoV-related data were checked one by one to ensure authenticity. In addition, we integrated a consensus sequence-function model (Zhang, et al., 2020), a genome browser (Ham, et al., 2012), and a catalytic function annotation tool (Dawson, et al., 2017) into the platform to assist in the research of novel viruses.

Sequence-function model: We adopted a consensus strategy to annotate enzymatic functions of biological sequences. For sequence function annotation, the family classification method captures common properties from the samples and extracts their feature vectors using machine learning algorithms, then merges the sequences into clusters or families. This consensus strategy enables efficient integration of these computational resources to maximize the accuracy and comprehensiveness of enzyme function prediction.

Web server: SARS2020 runs on a Linux server under an Nginx environment. The backend program and algorithm were written in Python using the Django framework in combination with MySQL to manage the data.
Bootstrap, CSS, and JavaScript were used to implement the frontend data presentation and interactions.

Identification of 2019-nCoV: We obtained the coding sequences of 2019-nCoV from NCBI (NC_045512) and constructed a gene model from the sequence based on an interpolated Markov model. We identified coding regions with the long-orfs tool from Glimmer3 (Delcher, et al., 2007), which is designed for bacterial, archaeal, and viral genomes. Protein translation of coding regions was performed with Biopython (Cock, et al., 2009). We then used the consensus sequence-catalytic function model provided by SARS2020 to analyze the pathogen sequence for likely catalytic functions.

The SARS2020 system is an integrated scientific resource platform about 2019-nCoV. At present, the system includes ~60,000 units of information. It helps scientists follow the progress of 2019-nCoV research and share data. SARS2020 is also a platform to assist in the identification of new viruses.

We analyzed the 2019-nCoV genome by the method described above. All predicted catalytic functions were derived from orf1ab (GeneID: 43740578), which seems to encode multiple proteins (Fig. 1). The most likely predicted catalytic activity was SARS coronavirus main proteinase, whose Enzyme Commission (EC) number is 3.4.22.69. This prediction suggested that 2019-nCoV was most likely a SARS virus, and this result was consistent with the conclusion of the International Committee on Taxonomy of Viruses. We also predicted other possible catalytic functions in the 2019-nCoV genome, including RNA-directed RNA polymerase (EC: 2.7.7.48), dolichyl-phosphate-mannose-protein mannosyltransferase (EC: 2.4.1.109), NAD+ ADP-ribosyltransferase (EC: 2.4.2.30), and ubiquitinyl hydrolase 1 (EC: 3.4.19.12). These predicted functions will provide a valuable reference for further study of the biological activity and pathogenesis of the 2019-nCoV.
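
The text does not spell out how the consensus strategy combines the outputs of the individual annotation resources. As a minimal sketch only, assuming a simple vote in which each resource reports candidate EC numbers with a score between 0 and 1, the ranking of predicted catalytic functions could look like the Python below. The function name `consensus_ec`, the tool names `tool_a` to `tool_c`, the `min_support` threshold, and all scores are hypothetical and chosen for illustration; they are not the SARS2020 implementation or its results. Only the EC numbers themselves are taken from the predictions reported above.

```python
from collections import defaultdict


def consensus_ec(predictions, min_support=2):
    """Rank EC numbers by how many tools predict them.

    predictions: dict mapping a tool name to a list of (ec_number, score)
                 pairs, with score assumed to lie in [0, 1].
    min_support: minimum number of tools that must agree on an EC number.

    Returns a list of (ec_number, support, mean_score) tuples, sorted by
    support (descending), then mean score (descending), then EC number.
    """
    scores_by_ec = defaultdict(list)
    for hits in predictions.values():
        # Keep only the best score per EC number within one tool, so a tool
        # that lists the same EC number twice still counts as one vote.
        best = {}
        for ec, score in hits:
            best[ec] = max(score, best.get(ec, 0.0))
        for ec, score in best.items():
            scores_by_ec[ec].append(score)

    ranked = [
        (ec, len(scores), sum(scores) / len(scores))
        for ec, scores in scores_by_ec.items()
        if len(scores) >= min_support
    ]
    ranked.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return ranked


# Hypothetical tool outputs: the scores are invented for illustration and
# are NOT values reported in the paper.
EXAMPLE_PREDICTIONS = {
    "tool_a": [("3.4.22.69", 0.9), ("2.7.7.48", 0.6)],
    "tool_b": [("3.4.22.69", 0.8), ("3.4.19.12", 0.5)],
    "tool_c": [("3.4.22.69", 0.7), ("2.7.7.48", 0.8)],
}

if __name__ == "__main__":
    for ec, support, mean_score in consensus_ec(EXAMPLE_PREDICTIONS):
        print(f"EC {ec}: supported by {support} tools, mean score {mean_score:.2f}")
```

Ranking by the number of agreeing tools before the mean score favors functions that several independent resources support over a single high-confidence hit, which is one plausible reading of "consensus"; a weighted scheme would be an equally reasonable alternative.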
We built an integrated platform to assist 2019-nCoV research, and we proposed a novel consensus sequence-function model that uses genome sequence data to identify unknown species. Our data-driven, sequence-based strategy will enable rapid identification of constantly emerging pathogens.

Cock, et al., 2009. Biopython: freely available Python tools for computational molecular biology and bioinformatics
Dawson, et al., 2017. CATH: an expanded resource to predict protein function through structure and sequence
Delcher, et al., 2007. Identifying bacterial genes and endosymbiont DNA with Glimmer
Ham, et al., 2012. Design, implementation and practice of JBEI-ICE: an open source biological part registry platform and tools
Hu, et al., 2011. RxnFinder: biochemical reaction search engines using molecular structures, molecular fragments and reaction similarity
Khan, et al., 2020. Phylogenetic Analysis and Structural Perspectives of RNA-Dependent RNA-Polymerase Inhibition from SARs-CoV-2 with Natural Products
Shu and McCauley, 2017. Global initiative on sharing all influenza data - from vision to reality
Wang, et al., 2020. Human Intestinal Defensin 5 Inhibits SARS-CoV-2 Invasion by Cloaking ACE2
Wu, et al., 2020. Author Correction: A new coronavirus associated with human respiratory disease in China
Zhang, et al., 2020. Bio2Rxn: sequence-based enzymatic reaction predictions by a consensus strategy