key: cord-0025176-eklbia7m authors: Zhang, Yuansheng; Zou, Dong; Zhu, Tongtong; Xu, Tianyi; Chen, Ming; Niu, Guangyi; Zong, Wenting; Pan, Rong; Jing, Wei; Sang, Jian; Liu, Chang; Xiong, Yujia; Sun, Yubin; Zhai, Shuang; Chen, Huanxin; Zhao, Wenming; Xiao, Jingfa; Bao, Yiming; Hao, Lili; Zhang, Zhang title: Gene Expression Nebulas (GEN): a comprehensive data portal integrating transcriptomic profiles across multiple species at both bulk and single-cell levels date: 2021-09-30 journal: Nucleic Acids Res DOI: 10.1093/nar/gkab878 sha: c9319f2d790fb6f2b7ba4df2f0f2555d465383cd doc_id: 25176 cord_uid: eklbia7m Transcriptomic profiling is critical to uncovering functional elements from transcriptional and post-transcriptional aspects. Here, we present Gene Expression Nebulas (GEN, https://ngdc.cncb.ac.cn/gen/), an open-access data portal integrating transcriptomic profiles under various biological contexts. GEN features a curated collection of high-quality bulk and single-cell RNA sequencing datasets by using standardized data processing pipelines and a structured curation model. Currently, GEN houses a large number of gene expression profiles from 323 datasets (157 bulk and 166 single-cell), covering 50 500 samples and 15 540 169 cells across 30 species, which are further categorized into six biological contexts. Moreover, GEN integrates a full range of transcriptomic profiles on expression, RNA editing and alternative splicing for 10 bulk datasets, providing opportunities for users to conduct integrative analysis at both transcriptional and post-transcriptional levels. In addition, GEN provides abundant gene annotations based on value-added curation of transcriptomic profiles and delivers online services for data analysis and visualization. 
Collectively, GEN presents a comprehensive collection of transcriptomic profiles across multiple species, thus serving as a fundamental resource for better understanding genetic regulatory architecture and functional mechanisms from tissues to cells.

Transcriptomic profiling, involving both transcriptional and post-transcriptional modifications or events at the whole-genome level, is of great importance for uncovering functional elements across the three domains of life, 'Bacteria', 'Archaea' and 'Eukarya' (1-3). High-throughput RNA sequencing (RNA-seq) (4), which can qualitatively and quantitatively capture any type of RNA, promises to help researchers characterize the transcriptome comprehensively owing to its capacity for whole-genome expression profiling (5-7), detection of novel RNA forms and variants (8-12) and genome reannotation (13, 14). With its continuous development, RNA-seq has become a routine and indispensable approach for systematically characterizing the transcriptome across diverse developmental stages and physiological conditions (1, 10, 15-17). Of note, over the past years, transcriptomic studies have made the leap from bulk RNA-seq to single-cell RNA-seq (scRNA-seq), unveiling new insights into cell type classification and cellular heterogeneity (18, 19). As RNA-seq has been widely used in a broad diversity of species worldwide, a huge amount of transcriptomic data has been generated at unprecedented, exponential rates, posing great challenges for large-scale data aggregation and standardized processing. To facilitate more effective reuse, integration, and mining of those data, valuable efforts have been made to construct comprehensive or specialized database resources, such as Gene Expression Omnibus (GEO) (20), Expression Atlas (21), Human Cell Atlas (HCA) (22) and Genotype-Tissue Expression (GTEx) (23).
Specifically, GEO, a widely used resource developed by NCBI (24), is devoted to archiving worldwide transcriptomic data (as well as other omics data), yet it does not provide standardized data processing or structured metadata management. Expression Atlas at EBI (25) contains both bulk and single-cell expression profiles with unified processing, but lacks co/post-transcriptional events (e.g. RNA editing and splicing). HCA is specialized in human single-cell expression profiling, whereas GTEx focuses on human gene expression and regulation across tissues. To sum up, existing resources have two major shortcomings. First, none of them covers the full range of transcriptomic profiles (e.g. expression, RNA editing and splicing). Second, they do not thoroughly curate and categorize experimental metadata under the framework of biological contexts. Given the large data volumes and the heterogeneous types of data and metadata, it is challenging to build a comprehensive database that integrates transcriptomic profiles at both bulk and single-cell levels, accompanied by standardized data processing, metadata curation, and online tools. To address these challenges, here we present Gene Expression Nebulas (GEN, https://ngdc.cncb.ac.cn/gen/), an open-access data portal integrating transcriptomic profiles under various conditions across multiple species. It was originally established in 2016, along with the foundation of the National Genomics Data Center (NGDC; previously named the BIG Data Center) (26, 27), part of the China National Center for Bioinformation (CNCB). Since its inception, GEN, as one of the core resources in CNCB-NGDC, has been frequently updated by importing and processing datasets obtained from a variety of raw sequencing data archives.
Unlike existing resources, GEN provides a curated collection of high-quality bulk and single-cell RNA-seq datasets with uniform data processing and adopts a structured curation model to categorize diverse experimental conditions into different biological contexts. Accordingly, GEN features large-scale integration of diverse transcriptomic profiles and provides online tools for analysis and visualization of both bulk and single-cell RNA-seq data.

A number of high-throughput RNA-seq projects and their associated datasets were collected from several public raw sequencing databases, including Genome Sequence Archive (GSA, https://ngdc.cncb.ac.cn/gsa/) (28, 29), Sequence Read Archive (SRA, https://www.ncbi.nlm.nih.gov/sra/) (30), European Nucleotide Archive (ENA, https://www.ebi.ac.uk/ena) (31) and DDBJ Sequence Read Archive (DRA, https://ddbj.nig.ac.jp/DRASearch/) (32). Only datasets with median mapping rates ≥70% for bulk RNA-seq and ≥40% for scRNA-seq were kept for further processing. As a result, a total of 296 RNA-seq projects and 323 high-quality datasets were obtained.

For bulk RNA-seq datasets, fastp v0.20.0 (33) was used for trimming and filtering raw reads. HISAT2 v2.0.5 (34) was used for quick alignment to evaluate data quality, and RSeQC v2.6.4 (35) was used to infer the strand specificity of the sequencing library. High-quality RNA-seq reads were then aligned to the reference genome by STAR v2.7.1a (36), and gene/isoform quantification was performed by RSEM v1.3.1 (37) with default parameters. 'Raw counts', 'FPKM' (Fragments Per Kilobase of transcript per Million mapped fragments) and 'TPM' (Transcripts Per Million) values of each gene/isoform were calculated. For circular RNA (circRNA) expression analysis, the cleaned RNA-seq reads were mapped to the reference genome by BWA-MEM (38).
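As a brief aside, the dataset-level quality filter and the two normalized expression units named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the GEN pipeline itself: the function names and inputs are hypothetical, mapping rates are taken as fractions, and plain transcript length is used where RSEM works with an effective length.

```python
from statistics import median


def passes_mapping_filter(mapping_rates, single_cell=False):
    """Return True when the median mapping rate meets the cutoff from the text.

    Rates are fractions in [0, 1]; the cutoffs are 0.70 (bulk) and 0.40 (scRNA-seq).
    """
    cutoff = 0.40 if single_cell else 0.70
    return median(mapping_rates) >= cutoff


def fpkm(counts, lengths_bp):
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    Simplification: uses the plain transcript length in bp, not an effective length.
    """
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts Per Million: length-normalize first, then rescale to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r * 1e6 / total_rate for r in rates]
```

Because TPM is rescaled after length normalization, the values of one sample always sum to one million, which is not true of FPKM.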
Next, CIRCexplorer2 (39) and CIRI2 v2.0.6 (40) were used with default parameters to identify circRNA candidates by recognizing back-splicing junction (BSJ) reads (≥2).

Moreover, RNA editing sites were identified with the genome from GENCODE v33 (41) as reference. All known RNA editing sites were retrieved from REDIportal v2.0 (42) (http://srv00.recas.ba.infn.it/atlas/). Novel human RNA editing sites were detected by the parallel strategy of REDItools 2.0 (43). To obtain more accurate novel editing sites, a filtration step with additional criteria was added for non-Alu regions, as these regions usually have only sporadic editing sites. Meanwhile, pblat v1.0 (44) was used to discover mismatched and multi-mapping RNA-seq reads, and duplicate reads were removed by using SAMtools v1.9 (45). Editing sites of both A-to-I and C-to-U were retained for further analysis. RepeatMasker (http://www.repeatmasker.org) and SNP files used for annotating high-confidence novel RNA editing sites were both downloaded from UCSC (https://hgdownload.soe.ucsc.edu/downloads.html).

In addition, for alternative splicing analysis, high-quality RNA-seq reads were mapped to the reference genome by STAR. Detection of differentially spliced events was then executed on the BAM files by rMATS v3.1.0 (46). Each 'case' group was compared to the 'control' group to identify differentially spliced events, and the parameter '-cstat 0.0001' was used for a 0.01% difference, so that P-values and FDRs of splicing events were computed against a cutoff of 0.01% on the absolute difference in exon inclusion level (|Δψ| > 0.01%).

For scRNA-seq datasets, the alignment approach was consistent with that for bulk RNA-seq datasets, while gene quantification tools varied with the platform/strategy that generated the data, in order to deal with cell barcodes and unique molecular identifiers (UMIs).
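Returning briefly to the splicing cutoff described above, the sketch below shows how differentially spliced events might be selected once per-group inclusion levels and FDRs are available. It is an illustration only and does not reproduce rMATS: the `SplicingEvent` record and the `max_fdr` threshold are assumptions introduced here (the text does not state an FDR threshold), while the 0.0001 cutoff on the inclusion-level difference mirrors the 0.01% value in the text.

```python
from dataclasses import dataclass


@dataclass
class SplicingEvent:
    """Hypothetical record for one splicing event compared between two groups."""
    event_id: str
    inc_case: float      # mean exon inclusion level in the 'case' group (0..1)
    inc_control: float   # mean exon inclusion level in the 'control' group (0..1)
    fdr: float           # false discovery rate reported for the event


def delta_inclusion(event):
    """Difference in exon inclusion level, case minus control."""
    return event.inc_case - event.inc_control


def select_events(events, min_abs_delta=0.0001, max_fdr=0.05):
    """Keep events whose absolute inclusion difference exceeds the cutoff.

    min_abs_delta=0.0001 corresponds to the 0.01% difference in the text;
    max_fdr=0.05 is an assumed, illustrative threshold.
    """
    return [
        e for e in events
        if abs(delta_inclusion(e)) > min_abs_delta and e.fdr <= max_fdr
    ]
```

With such a small difference cutoff, nearly any non-zero change passes the first condition, so in practice the FDR threshold does most of the filtering.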
Currently, pipelines for the three most commonly adopted scRNA-seq technologies were as follows (47-49): (i) for data generated by plate- or Fluidigm-based protocols, such as Smart-seq2 (50) and SMARTer (Fluidigm C1) strategies, STAR v2.7.1a and RSEM v1.3.1 were used to align reads and calculate 'raw counts', 'FPKM' and 'TPM' values of each gene/isoform with the parameter '--single-cell-prior'; (ii) for data from droplet-based protocols including Drop-seq (51) and inDrop (52), dropEst v0.8.6 (53) was used to provide more accurate estimates of molecular counts in individual cells by barcode corrections, classification of cell quality, and diagnostic information about the droplet libraries; and (iii) specifically for data from the 10× Genomics platform (54), CellRanger v3.1.0 (https://support.10xgenomics.com/single-cell-gene-expression/software/overview/welcome) was implemented as a one-stop analysis pipeline for quality control, sample de-multiplexing, barcode processing and generation of feature-barcode matrices.

For all collected species, a wide range of gene functional annotations were extracted from Ensembl (55) and NCBI (24), roughly falling into basic information, including genomic location and functional description, and associated terms or ontologies such as Gene Ontology (GO) (56). Particularly, for Homo sapiens, housekeeping and tissue-specific genes were derived from GTEx (57), genes were annotated based on Disease Ontology (DO) (58) along with GO, and a gene structure visualization based on a genome browser (59) was provided. Furthermore, annotation information of editome-disease associations from Editome-Disease Knowledgebase (EDK, https://ngdc.cncb.ac.cn/edk) (60) and RT-qPCR reference genes from Internal Control Genes (ICG, http://icg.big.ac.cn) (61) were also included for corresponding genes, while external links to GTEx (https://www.gtexportal.org/home/) (23), REDIportal (http://srv00.recas.ba.infn.it/atlas/) (62) and GeneCards (https://www.genecards.org) (63) were added to each gene (if available).

A series of popular downstream analysis tools were implemented in GEN. For bulk RNA-seq data, four tools were included for different analysis purposes, namely, differential expression analysis with limma (64), weighted gene co-expression network analysis with WGCNA (65), functional enrichment analysis with clusterProfiler (66), and gene regulatory network inference with GENIE3 (67). For scRNA-seq data, Seurat (68) was integrated for the selection and filtration of cells based on quality-control metrics, data normalization and scaling, detection of high-variance genes, linear dimensional reduction (i.e. principal component analysis), graph-based clustering, visualization of cluster assignment and identification of cluster markers. Marker gene enrichment analysis was performed with Enrichr (69), and trajectory inference was performed with Monocle (70). Furthermore, SingleR (71) was employed to infer the cell type identity of each cell independently by leveraging reference transcriptomic datasets of pure cell types. Here, reference datasets from Human Primary Cell Atlas (72), BLUEPRINT (73), Human Immune Cell RNA-seq Data (74), Human Hematopoietic Cell RNA-seq Data (75) and the DICE (Database of Immune Cell Expression, Expression quantitative trait loci (eQTLs) and Epigenomics) Project (76) were used for human cell type annotation, while those from Mouse RNA-seq Data (77) and Immunological Genome Project (ImmGen) (78) were used for mouse cell type annotation.

GEN was implemented using Spring Boot (https://spring.io/projects/spring-boot; a framework that makes it easy to create stand-alone Java applications) as the back-end framework. All data were stored and managed by using MySQL (https://dev.mysql.com; a free and popular relational database management system). To provide user-friendly and highly interactive web applications, web pages were constructed using HTML5 and rendered using JSP (https://jakarta.ee/specifications/pages/3.0/; Jakarta Server Pages, a template engine for web applications). Front-end interfaces were built using Semantic UI (https://semantic-ui.com; a development framework that helps create beautiful, responsive HTML layouts) and jQuery (https://jquery.com; a fast, small, and feature-rich JavaScript library). Furthermore, data visualization was built with Highcharts (https://www.highcharts.com; a JavaScript plug-in to create interactive charts), Plotly.js (https://plotly.com/javascript/; a high-level, declarative charting library) and DataTables (https://datatables.net; a plug-in for the jQuery JavaScript library to render HTML tables). Interactive visualization of scRNA-seq data was powered by Cerebro (79). Online tools were developed with Shiny (https://shiny.rstudio.com/; an R package to build interactive web applications).

GEN features comprehensive integration, manual curation and standardized analysis of high-quality transcriptomic datasets at bulk and single-cell levels based on a structured curation model and uniform data processing pipelines. More importantly, diverse experimental conditions of all incorporated datasets are categorized into more informative biological contexts. In the current version, GEN houses a collection of transcriptomic profiles of 323 datasets covering 50 500 samples and 15 540 169 cells across 30 species. For each dataset, a full range of transcriptomic profiles including gene expression, alternative RNA splicing and RNA editing (if applicable) are provided in GEN.
Moreover, GEN accommodates value-added gene annotations based on differential expression analysis across diverse experimental conditions and cell clusters. Accordingly, GEN provides user-friendly web functionalities and applications for large-scale data query, retrieval, analysis and visualization (Figure 1).

GEN adopts a structured curation model, incorporating manually curated items at the levels of dataset, profile (expression/splicing/editing), and sample: (i) datasets are annotated and categorized into six biological contexts of general interest, namely baseline, genetic (e.g. mutation, natural variation), phenotypic (e.g. disease, aging), environmental (e.g. abiotic stress, biotic stress), spatial (e.g. organism, tissue, cell type) and temporal (e.g. development, circadian, time series); (ii) expression/splicing/editing profiles include the main steps and parameters of data processing together with reference genome and annotation details; and (iii) samples contain a wealth of descriptive information, including basic information, sample characteristic, biological condition, experimental variable, experimental protocol, sequencing strategy and platform, quality assessment and data analysis procedure (reference genome, annotation file, software and parameter setting). All descriptive terms with controlled vocabularies are extracted and abstracted by manual curation of 293 published articles. In particular, diseases, tissues, and cell types are further linked to controlled terms from Disease Ontology (DO, https://diseaseontology.org) and BRENDA Tissue Ontology (BTO, http://www.ontobee.org/ontology/bto). More details about the curation model are publicly available at https://ngdc.cncb.ac.cn/gen/documentation.
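To make the three-level curation model above more concrete, the following sketch represents datasets, samples and the six biological contexts as simple Python types. It is a hypothetical illustration rather than GEN's actual schema: the class and field names, the identifier strings, and the assumption that one dataset may carry several contexts are all introduced here for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum


class BiologicalContext(Enum):
    """The six biological contexts named in the curation model."""
    BASELINE = "baseline"
    GENETIC = "genetic"
    PHENOTYPIC = "phenotypic"
    ENVIRONMENTAL = "environmental"
    SPATIAL = "spatial"
    TEMPORAL = "temporal"


@dataclass
class Sample:
    """Hypothetical subset of the sample-level descriptive fields."""
    sample_id: str
    tissue: str
    biological_condition: str


@dataclass
class Dataset:
    """Hypothetical dataset record; `contexts` may hold more than one context."""
    dataset_id: str
    species: str
    level: str  # "bulk" or "single-cell"
    contexts: set = field(default_factory=set)
    samples: list = field(default_factory=list)


def index_by_context(datasets):
    """Map each biological context to the identifiers of datasets tagged with it."""
    index = defaultdict(list)
    for ds in datasets:
        for ctx in ds.contexts:
            index[ctx].append(ds.dataset_id)
    return dict(index)
```

Indexing datasets by context in this way is one plausible route to the kind of context-based browsing described in the text; ontology links (DO, BTO) could be added as further fields on `Sample`.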
Specifically, for each dataset, GEN provides a curated summary of metadata, covering species, tissue, health condition, RNA type, sample number, sequencing strategy, sequencing quality & quantity and experimental condition (https://ngdc.cncb.ac.cn/gen/browse/datasets, Figure 2A). To manage all collected datasets, GEN assigns an accession number prefixed with 'GEND' to each dataset. Moreover, since each dataset is associated with specific sample(s) (prefixed with 'GENS'), manual curation is conducted for all datasets by linking to controlled terms from DO and BTO via sample meta-information. As a result, all datasets incorporated in GEN cover 128 tissues and 46 cell types (originally curated from metadata provided by submitters). Based on these curated metadata, users can conveniently find the dataset(s) of interest. Structured metadata for all collected datasets is provided in tabular form and is freely downloadable (https://ngdc.cncb.ac.cn/gen/download).

Overall, bulk RNA-seq and scRNA-seq datasets involve 30 and 22 species, 89 and 64 tissues, respectively (Table 1). Regarding the specific biological contexts, GEN incorporates 153 baseline (Table 1). Gene expression profiles can be visualized in heatmap/boxplot charts (Figure 2B). Moreover, GEN incorporates circRNA expression profiles of 456 samples from 10 human datasets. Based on the expression profiles, differentially expressed genes (DEGs) are identified between biological condition groups, which can be accessed in tabular form and visualized in heatmap charts (Figure 2C). In addition, GEN integrates a valuable collection of RNA editing events and alternative RNA splicing isoforms in 10 datasets with 574 human samples (involving 18 tissues and 16 diseases) as value-added profiles on co/post-transcriptional levels.
At the single-cell level, GEN provides high-quality expression profiles of 15 540 169 cells from 166 datasets covering 22 species (17 animals, 2 plants, 2 protists and 1 fungus), 64 tissues and 42 human diseases (Table 1). To reveal biological functions underlying expression profiles, further downstream analyses including cell clustering, identification of marker genes for each cluster and functional enrichment are performed. To facilitate easy access to cell clustering results for each dataset/sample, GEN is capable of visualizing the clustered cells using t-SNE and UMAP plots, which can be color-coded according to metadata information, cell clusters and inferred cell types (Figure 2D). Cell types are inferred only for human and mouse datasets, since sufficient cell type annotation references exist only for them (see details in Materials and Methods). In addition, marker genes for each cluster and gene enrichment analysis results can be browsed and downloaded.

GEN provides an abundance of gene annotations for a total of 1 191 846 genes across 30 species. In addition to basic annotation (such as genomic location, biotype, functional description), GEN integrates value-added annotations derived from transcriptomic profiles, including quantitative (expression levels across different conditions) and qualitative (differential expression patterns between condition groups) annotations. For any specific gene(s), expression levels in a given dataset can be visualized by interactive heatmap and boxplot charts, and expression patterns from differential expression analysis (also applicable to the identification of marker genes for specific cell types) are annotated and incorporated in GEN. Moreover, GEN incorporates additional annotations for each gene, including editome-disease associations, internal control genes, and ontology terms (from GO, DO; see details in Materials and Methods). Consequently, GEN allows users to retrieve single or multiple genes by gene name/ID/symbol (https://ngdc.cncb.ac.cn/gen/browse/genes).
Based on all collected annotations in GEN, users can conveniently find the genes of interest with specific annotations/profiles and investigate expression patterns across diverse biological conditions.

GEN is equipped with a series of online tools to aid further downstream data analysis and visualization (see details in Materials and Methods). For bulk RNA-seq data, GEN offers online services for differential expression analysis, weighted gene co-expression network analysis (WGCNA), functional enrichment analysis and gene regulatory network inference. For scRNA-seq data, users can perform multiple analyses including quality control, data normalization, scaling and regression, dimensional reduction, graph-based clustering, and identification of marker genes for cell clusters (68). Furthermore, GEN is able to help users conduct gene enrichment analysis for cell markers, cell trajectory inference, and cell type annotation. Meanwhile, single-cell analysis results can be visualized by Cerebro (79), which allows interactive investigation and inspection of single-cell transcriptomic profiles incorporated in GEN. All these results can be downloaded in CSV and Excel formats, and visualized images can be exported to PNG or PDF.

GEN features systematic integration, manual curation and standardized data processing of 323 high-quality bulk and single-cell RNA-seq datasets across 30 species. It enables easy access to a comprehensive range of transcriptomic profiles, which are critical for unravelling both transcriptional and post-transcriptional regulatory mechanisms. Moreover, GEN provides abundant gene annotations based on value-added curation of transcriptomic profiles and delivers online services for bulk and single-cell data analysis and visualization. Future directions of GEN include continuous integration and analysis of high-quality RNA-seq datasets with diverse sequencing strategies (e.g.
miRNA-seq, single-cell spatial RNA-seq, nanopore long-read RNA-seq) across more species. Also, GEN will be frequently updated by enriching gene annotations based on manual curation of the ever-increasing transcriptomic profiles (13). Particularly, since the field of single-cell genomics is developing rapidly, we will follow cutting-edge scRNA-seq analysis methods and update the GEN data processing pipelines accordingly. GEN will also provide online services to accept user-submitted expression profiles with quality control and manual curation. Furthermore, interconnections with external and internal database resources at multi-omics levels (e.g. variome (80), methylome (81) and interactome (82)) will be added and enhanced. Web tools for RNA editing profiling, alternative splicing detection and batch-effect correction across different technologies and conditions will be developed and/or implemented in GEN.

GEN is freely available online at https://ngdc.cncb.ac.cn/gen/ and does not require users to register.
Single-cell transcriptomics to explore the immune system in health and disease
Spatially resolved transcriptome profiling in model plant species
RNA-Seq: a revolutionary tool for transcriptomics
The single-cell transcriptional landscape of mammalian organogenesis
Global gene expression profiling of circulating tumor cells
Global expression profiling applied to plant development
RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays
Methods to study splicing from high-throughput RNA sequencing data
RNA sequencing: the teenage years
Genetic variation and microRNA targeting of A-to-I RNA editing fine tune human tissue transcriptomes
Deep-learning augmented RNA-seq analysis of transcript splicing
IC4R 2.0: rice genome reannotation using massive RNA-Seq Data
Araport11: a complete reannotation of the Arabidopsis thaliana reference genome
Next generation sequencing technologies for next generation plant breeding
Genomic and transcriptomic profiling expands precision cancer medicine: the WINTHER trial
High-throughput transcriptome profiling in drug and biomarker discovery
A single-cell RNA-seq survey of the developmental landscape of the human prefrontal cortex
COVID-19 immune features revealed by a large-scale single-cell transcriptome atlas
NCBI GEO: archive for functional genomics data sets-update
Expression Atlas update: from tissues to single cells
The Genotype-Tissue Expression (GTEx) project
Database resources of the national center for biotechnology information
The EMBL-EBI search and sequence analysis tools APIs in 2019
Database resources of the national genomics data center, china national center for bioinformation in 2021
Database resources of the big data center in 2019
The genome sequence archive family: Toward explosive data growth and diverse data types
GSA: genome sequence archive
The sequence read archive: explosive growth of sequencing data
The european nucleotide archive in 2020
The DNA Data Bank of Japan launches a new resource, the DDBJ Omics Archive of functional genomics experiments
fastp: an ultra-fast all-in-one FASTQ preprocessor
Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and ballgown
RSeQC: quality control of RNA-seq experiments
STAR: ultrafast universal RNA-seq aligner
RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome
Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM
Diverse alternative back-splicing and alternative splicing landscape of circular RNAs
Circular RNA identification based on multiple seed matching
GENCODE reference annotation for the human and mouse genomes
REDIportal: a comprehensive database of A-to-I RNA editing events in humans
REDItools: high-throughput RNA editing detection made easy
pblat: a multithread blat algorithm speeding up aligning sequences to genomes
The Sequence Alignment/Map format and SAMtools
rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data
Single-cell RNA-Seq technologies and related computational data analysis
Comparative analysis of droplet-based ultra-high-throughput single-cell RNA-Seq systems
Current best practices in single-cell RNA-seq analysis: a tutorial
Smart-seq2 for sensitive full-length transcriptome profiling in single cells
Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets
Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells
dropEst: pipeline for accurate estimation of molecular counts in droplet-based single-cell RNA-seq experiments
Massively parallel digital transcriptional profiling of single cells
The Gene Ontology Resource: 20 years and still GOing strong
Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification
Disease ontology: a backbone for disease semantic integration
JBrowse: a dynamic web platform for genome visualization and analysis
Editome Disease Knowledgebase (EDK): a curated knowledgebase of editome-disease associations in human
ICG: a wiki-driven knowledgebase of internal control genes for RT-qPCR normalization
REDIportal: a comprehensive database of A-to-I RNA editing events in humans
GeneCards Version 3: the human gene integrator
limma powers differential expression analyses for RNA-sequencing and microarray studies
WGCNA: an R package for weighted correlation network analysis
Inferring regulatory networks from expression data using tree-based methods
Comprehensive integration of Single-Cell data
Enrichr: a comprehensive gene set enrichment analysis web server 2016 update
The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells
Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage
An expression atlas of human primary cells: inference of gene function from coexpression networks
BLUEPRINT: mapping human blood cell epigenomes
RNA-Seq signatures normalized by mRNA abundance allow absolute deconvolution of human immune cell types
Densely interconnected transcriptional circuits control cell states in human hematopoiesis
Impact of genetic polymorphisms on human immune cell gene expression
Remodeling of epigenome and transcriptome landscapes with aging in mice reveals widespread induction of inflammatory responses
The Immunological Genome Project: networks of gene expression in immune cells
Cerebro: interactive visualization of scRNA-seq data
Genome Variation Map: a worldwide collection of genome variations across multiple species
Novel cell types and altered cell states in asthma revealed by single-cell RNA sequencing of airway wall biopsies
NucMap: a database of genome-wide nucleosome positioning map across species

We would like to thank Zhuojing Fan for her help on web interface design and a number of users for providing support and reporting bugs.
We also appreciate the anonymous reviewers for their valuable comments on this work.