key: cord-312228-wbyqmvhs authors: Xiao, Minfeng; Liu, Xiaoqing; Ji, Jingkai; Li, Min; Li, Jiandong; Yang, Lin; Sun, Wanying; Ren, Peidi; Yang, Guifang; Zhao, Jincun; Liang, Tianzhu; Ren, Huahui; Chen, Tian; Zhong, Huanzi; Song, Wenchen; Wang, Yanqun; Deng, Ziqing; Zhao, Yanping; Ou, Zhihua; Wang, Daxi; Cai, Jielun; Cheng, Xinyi; Feng, Taiqing; Wu, Honglong; Gong, Yanping; Yang, Huanming; Wang, Jian; Xu, Xun; Zhu, Shida; Chen, Fang; Zhang, Yanyan; Chen, Weijun; Li, Yimin; Li, Junhua title: Multiple approaches for massively parallel sequencing of HCoV-19 (SARS-CoV-2) genomes directly from clinical samples date: 2020-03-19 journal: bioRxiv DOI: 10.1101/2020.03.16.993584 sha: doc_id: 312228 cord_uid: wbyqmvhs COVID-19 has caused a major epidemic worldwide, however, much is yet to be known about the epidemiology and evolution of the virus. One reason is that the challenges underneath sequencing HCoV-19 directly from clinical samples have not been completely tackled. Here we illustrate the application of amplicon and hybrid capture (capture)-based sequencing, as well as ultra-high-throughput metatranscriptomic (meta) sequencing in retrieving complete genomes, inter-individual and intra-individual variations of HCoV-19 from clinical samples covering a range of sample types and viral load. We also examine and compare the bias, sensitivity, accuracy, and other characteristics of these approaches in a comprehensive manner. This is, to date, the first work systematically implements amplicon and capture approaches in sequencing HCoV-19, as well as the first comparative study across methods. Our work offers practical solutions for genome sequencing and analyses of HCoV-19 and other emerging viruses. As of 14 March 2020, human coronavirus 2019 (HCoV-19) has surpassed severe acute 54 respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coro-55 navirus (MERS-CoV) in every aspect, infecting over 140,000 people in more than 110 coun-56 tries, with a mortality of over 5,000 1,2 . So far, coronaviruses have caused three major 57 epidemics in the past two decades, posing a great challenge to global health and economy. single nucleotide variations (iSNVs) and its allele frequency have also contributed to anti-86 viral therapy and drug resistance, e.g., to reveal highly conserved genes during the outbreak 87 that potentially serve as ideal therapeutic targets 19, 21 . However, it is a challenge to accurately 88 detect iSNVs from clinical samples, especially when the samples are subjected to extra 89 steps of enrichment and amplification. 90 Therefore, we aim to comprehensively compare the sensitivity, inter-individual (variant) and 91 intra-individual (iSNV) accuracy, and other general features of different approaches by sys-92 tematically utilizing ultra-high-throughput metatranscriptomic, hybrid capture-based, and 93 amplicon-based sequencing approaches to obtain genomic information of HCoV-19 from 94 serial dilutions of a cultured isolate and directly from clinical samples. We present a reason-95 able sequencing strategy that fits into different scenarios, and estimate the minimal amount 96 of sequencing data necessary for downstream HCoV-19 genome analyses. Our study, to-97 gether with our tailor-made experimental workflows and bioinformatics pipelines, offers very the situation, but this requires high-standard laboratory settings and expertise apart from 114 being time-consuming. Also, unwanted mutations that are not concordant with original clin-115 ical samples may be introduced during the culturing process. 116 To enrich adequate viral content for whole-genome sequencing in a convenient manner, we 117 pursued two other methods: multiplex PCR amplification (amplicon) and hybrid capture 118 (capture) (Fig. 1) . We designed a systematic study to comprehensively validate the bias, 119 sensitivity, inter-individual (variant) and intra-individual (iSNV) accuracy of multiple ap-120 proaches by sequencing serial dilutions of a cultured isolate (unpublished), as well as the 121 eight clinical samples (Fig. 2) . We performed qRT-PCR of 10-fold serial dilutions (D1-D7) of 122 the cultured isolate, and the Ct was 17.3, 20.8, 24.5 for, 28.7, 31.8, 35, and 39.9, respec-123 tively, indicating the undiluted RNA (D0) of the cultured isolate contained ~1E+08 genome 124 copies per mL. For amplicon sequencing, we utilized two kits comprising of two set of pri-125 mers generating PCR products of 300-400 bp and 100-200 bp, respectively. The ~400 bp 126 amplicon-based sequencing was implemented in all samples and analyzed throughout the 127 study, while the ~200 bp amplicon-based sequencing was only applied in the cultured isolate 128 for coverage analysis. to inadequate depth in amplicon sequencing (Fig. 3a) . Another pitfall is that amplification 142 across the genome can hardly be unbiased, causing difficulties in complete genome assem-143 bly. Indeed, amplicon sequencing exhibited a higher level of bias compared with meta-144 transcriptomic sequencing, in terms of coverage across the viral genomes from the cultural 145 isolate and the clinical samples tested in our study (Fig. 3b, d, Supplementary Fig. 1 ). To 146 our surprise, however, capture sequencing was almost as unbiased as meta sequencing, 147 demonstrating better performance than the previous capture method used to enrich ZIKV 148 despite the HCoV-19 genome is ~3 fold larger than ZIKV 22 (Fig. 3b, c) . Two reasons 149 amongst others were likely to be accountable to this improvement, 1) we utilized 506 pieces 150 of 120 ssDNA probes covering 2x of the HCoV-19 genome to capture the libraries, 2) we 151 employed the DNBSEQ sequencing technology that features PCR-free rolling circle replica-152 tion (RCR) of DNA Nanoballs (DNBs) 23,24 . 153 The sequencing results of amplicon and capture approaches revealed dramatic increases 154 in the ratio of HCoV-19 reads out of the total reads compared with meta sequencing, sug-155 gesting the enrichment was highly efficient -5596-fold in capture method and 5710-fold in 156 amplicon method for each sample on average (Supplementary Table 1 -2). To further com-157 pare the sensitivity of different methods, we plotted the number of HCoV-19 reads per million 158 (HCoV-19-RPM) of total sequencing reads against the viral concentration for each sample. 159 The productivity was similar between the two methods when the input RNA of the cultured 160 isolate contained 1E+05 genome copies per mL and above (Fig. 3e) . However, amplicon 161 sequencing produced 10-100 fold more HCoV-19 reads than capture sequencing when the input RNA concentration of the cultured isolate was 1E+04 genome copies per mL and lower, suggesting amplicon-based enrichment was more efficient than capture for more 164 challenging samples (conc. ≤ 1E+04 genome copies per mL, or Ct ≥ 28.7) (Fig. 3e) . Meta 165 sequencing -as expected -produced dramatically lower HCoV-19-RPM than the other two 166 methods among clinical samples tested with a wide range of Ct values, whereas amplicon 167 and capture were generally comparable to each other (Fig. 3e) . Considering the costs for 168 sequencing, storage, and analysis increase substantially with larger datasets, we tried to 169 estimate how much sequencing data must be produced for each approach in order to 170 achieve 10X depth across 95% of the HCoV-19 genome, and the results can be found in 171 Supplementary Table 3 . As a practical, cost-effective guidance for future sequencing, we 172 also assessed the minimum amount of data required to pass the stringent filters (≥ 95% 173 coverage and method-specific depth, see Methods) in our pipelines corresponding to differ-1E+06 copies of HCoV-19 genome per mL, while capture sequencing requires 24,474 to 9 177 Mb for the same situation (Fig. 3g, Supplementary Table 4) . 178 To determine the accuracy of dif-179 ferent approaches in discovering inter-individual genetic diversity, we tested each method 180 in calling the single nucleotide variations (SNVs) and verified some of the SNVs with Sanger 181 sequencing ( Supplementary Fig. 2 ). Two to five SNVs were identified within each clinical 182 sample, and in all the seven samples, SNVs identified by the three methods were concord-183 ant except that capture missed one SNV at position 16535 in GZMU0014 (Fig. 4a) . We then 184 investigated the allele frequencies of these sites across methods, and found that alleles 185 identified by capture sequencing displayed lower frequencies than the other two methods, 186 especially for GZMU0014, GZMU0030, and GZMU0042 where the viral load was lower (Ct 187 ≥ 29), which explained why capture sequencing neglected an SNV in our pipeline when the 188 cutoff of SNV calling was set as 80% allele frequency (Fig. 4b) . These data indicate that 189 amplicon sequencing is more accurate than capture sequencing in identifying SNVs, espe-190 cially for challenging samples. sequencing has also allowed us to investigate RNA expression patterns of the overall mi-210 crobiome and host content and thus suitable for discovering new viruses, distinguishing co-211 infections, and dissecting virus-host interactions. To explore the microbiota, we performed 212 further metatranscriptomics analysis of the clinical samples. We were able to identify host 213 nucleic acids in all of the samples, and over 95% of total reads were from the host in sputum, 214 nasal, and throat samples ( Supplementary Fig. 4a ). Virus contributed to less than 5% of 215 reads in anal swab and throat swab while more than 50% of reads in nasal swab (Supple-216 mentary Fig. 4b) . These results suggest nasal swab could be the most ideal sample type for 217 viral detection among the four sample types, which agrees with recent clinical evidence 25 . 218 Among the viral reads, over 90% were Coronaviridae, which is consistent with clinical diag-219 nostics ( Supplementary Fig. 4c ). Reads from other viruses were also identified, indicating 220 further measurements could be taken to confirm if co-infection exists ( Supplementary Fig. 221 4). Bacterial composition was also shown, providing support for scientific research, as well bioinformatics pipelines tailored in this study, e.g., 1) the bias of amplicon sequencing can 252 be improved by reducing the amount of cycles in the 1st PCR or optimize the molar ratios 253 of primers (Fig. 1a) , 2) the amplicon sequencing is particularly convenient compared with 254 previous counterparts since the fragmentation and library construction steps are omitted 255 cons using single-end 400 nt reads (Fig. 1a) , 3) using anything less than 506 pieces of 120 257 ssDNA probes in hybrid capture may attenuate the sequencing coverage while increase the 258 bias, 4) metatranscriptomic sequencing was conducted with an ultra-high-throughput se-259 quencing platform so that the successful rate was substantially higher than usual. 5) the 260 minimal amount of data necessary for analyzing the HCoV-19 genome from clinical samples 261 across methods is higher than that predicted by data from the cultured isolate was probably 262 due to the high nucleic acids background from the host and other microbes (Supplementary 263 Table 3 -4, Supplementary Fig. 4) . Also, we do not take into account the time spent in se- 100 nt strategy using the same protocol described above, generating 37 Gb sequencing 305 data for each sample on average. 306 Total RNA was reverse transcribed to synthesize the first-strand cDNA with random 308 hexamers and Superscript II reverse transcriptase kit (Invitrogen, Carlsbad, USA). Table 5 . 20 µl of first-strand instructions, including lambda phage gDNA unless specified. 2 ng of Human gDNA was 317 added to each PCR reaction of the cultured isolate. The PCR was performed as follows: 5 318 min at 37°C, 10 min at 95°C, 15 cycles of (10 s at 95°C, 1min at 64°C, 1min at 60°C to 10 s reaction were unique dual indexed. After the 2nd PCR, products were purified following the 324 same procedures as the 1st PCR and quantified using the Qubit dsDNA High Sensitivity 325 assay on Qubit 3.0 (Life Technologies). PCR products of samples yielding sufficient material 326 (> 5 ng/µl) were pooled at equimolar to a total DNA amount of 300 ng before converting to 327 single-stranded circular DNA. DNBs-based libraries were generated from 20 μl of single-328 stranded circular DNA pools and sequenced on the MGISEQ-2000 platform with single-end 329 400 nt strategy, generating 1.8 Gb sequencing data for each sample on average. 330 For metatranscriptomic and hybrid capture sequencing data, total reads were first processed 332 by Kraken v0.10.5 26 (default parameters) with a self-build database of Coronaviridae ge-333 nomes (including SARS, MERS and HCoV-19 genome sequences downloaded from 334 GISAID, NCBI and CNGB) to efficiently identify candidate viral reads with a loose manner. 335 These candidate reads were further qualified with fastp v0.19.5 27 (parameters: -q 20 -u 20 -336 n 1 -l 50) and SOAPnuke v1.5.6 28 (parameters: -l 20 -q 0.2 -E 50 -n 0.02 -5 0 -Q 2 -G -d) to 337 remove low-quality reads, duplications and adaptor contaminations. Low-complexity reads 338 were next filtered by PRINSEQ v0.20.4 29 (parameters: -lc_method dust -lc_threshold 7). 339 After the above process, HCoV-19-like reads generated from metatranscriptomics and 340 hybrid capture method were obtained. 341 For amplicon sequencing data, SE 400 reads were first processed with fastp v0.19.5 27 (pa-342 rameters: -q 20 -u 20 -n 1 -l 50) to remove low quality-reads and adaptor sequences. Primer Amplicon sequencing data were processed as described above, except that duplications 371 were not removed. A heatmap was generated to visualize the viral genome coverage for all 372 samples sequenced by the amplicon method with the pheatmap 37 package in R (v3.6.1) 36 . 373 The depth at each nucleotide position was binarized, and was shown in pink if the depth 374 was 100x and above. 375 HCoV-19 reads of metatranscriptomic and hybrid capture sequencing data were identified 379 by aligning the HcoV-19-like reads to the HCoV-19 reference genome (GISAID accession: 380 EPI_ISL_402119) with BWA in a strict manner of coverage ≥ 95% and identity ≥ 90%. For 381 comparisons of the coverage and depth of the viral genome across samples and methods, 382 we normalized the viral reads to total sequencing reads with HCoV-19 Reads Per Million 383 (HCoV-19-RPM). HCoV-19-RPM for amplicon sequencing data was calculated by the same 384 pipelines we applied for metatranscriptomic and hybrid capture sequencing data. 385 To estimate the minimum data requirements for genome assembling and intra-individual 386 variation analysis, we applied gradient-based sampling to the HCoV-19 genome align-387 ments (referred to BAM files) to each dataset using Samtools (v1.9) 34 . The effective genome 388 coverage was set as 95% for all three MPS methods. Considering the distinct 389 technologies used in different methods, we set method-dependent thresholds of effective 390 depth as follows: 1) ≥ 10X for metatranscriptomic sequencing; 2) ≥ 20X for hybrid capture 391 sequencing; and 3) ≥ 100X for amplicon sequencing. We next calculated the coverage and 392 depth within each subsampled BAM file per sample to determine the minimal BAM file that 393 could meet the above thresholds of both coverage and sequencing depth. The method-394 dependent minimum amount of sequencing data of each sample were estimated accord-395 ingly. We assessed the correlations between the HCoV-19 genome copies per mL in diluted 396 samples of cultured isolates and the minimum amount of sequencing data for amplicon-397 and capture-based methods using Pearson correlation coefficient (R) with the function 398 scatter from the R package ggpubr (v3.6.1) 38 . 399 Except for amplicon sequencing samples, variants calling in metatranscriptomic and hybrid 401 For amplicon sequencing data, only base positions covered by ≥100X reads were used for 417 iSNVs calling. For metatranscriptomic and hybrid capture-based sequencing data, the 418 thresholds of depth were set as 10X and 20X, respectively. The candidate iSNVs were fur-419 ther filtered for quality as follows: 1) frequency filtering, only minor alleles (frequency ≥ 5% 420 and <50%) and major alleles (frequency ≥ 50% and ≤ 95%) were remained; 2) depth filter-421 ing, iSNVs with fewer than five forward or reverse reads were removed; and 3) strand bias 422 filtering (not applicable to single-end reads of amplicon sequencing), iSNVs were removed 423 if there were more than a 10-fold strand bias or a 5-fold difference between the strand bias 424 of the variant call and that of the reference call. 425 For metatranscriptomic sequencing of clinical samples, raw sequencing data of a single se-427 quence lane (approximately 60-75 Gb per sample) was used to simultaneously assess the 428 RNA expression patterns of human, bacteria and viruses in clinical samples from COVID-429 19 patients. We first used software fastp (v0.19.5) 27 to filter low-quality reads and remove 430 adapter with parameters: -5 -3 -q 20 -c -l 30. After QC, we mapped high-quality reads to 431 hg19 and removed human ribosomal RNA (rRNA) reads by SOAP2 v2.21 42 (parameters: -432 m 0 -x 1000 -s 28 -l 32 -v 5 -r 1), and the remaining RNA reads were then aligned to hg19 433 by HISAT2 43 with default settings to identify non-rRNA human transcripts as previously de-434 scribed. Next, we applied Kraken 2 44 (version 2.0.8-beta, parameters: --threads 24 --confi-435 dence 0) to assign microbial taxonomic ranks to non-human RNA reads against the large A pneumonia outbreak associated with a new coronavirus of probable bat 508 origin RNA based mNGS approach identifies a novel human coronavirus from two 510 individual pneumonia cases in 2019 Wuhan outbreak A new coronavirus associated with human respiratory disease in China Evaluation of advanced reverse transcription-PCR assays and an 515 alternative PCR target region for detection of severe acute respiratory syndrome-516 associated coronavirus Coronaviruses and the human airway: a universal system 519 for virus-host interaction studies A familial cluster of pneumonia associated with the 2019 novel 521 coronavirus indicating person-to-person transmission: a study of a family cluster First Case of 2019 Novel Coronavirus in the United States Stochastic processes constrain the within and between host 526 evolution of influenza virus Intra-host dynamics of Ebola virus during Ebola Virus Epidemiology, Transmission, and Evolution during Seven 530 Viral quasispecies evolution Zika virus evolution and spread in the Americas Reliable multiplex sequencing with rare index mis-assignment on DNB-based 536 NGS platform Advanced Whole Genome Sequencing Using a Complete R: A language and environment for statistical computing ggplot2" based publication ready plots. R package version 0 Haplotype-based variant detection from short-read sequencing Snippy: fast bacterial variant calling from NGS reads A program for annotating and predicting the effects of single nucleotide 573 polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain SOAP2: an improved ultrafast tool for short read alignment Graph-based genome 578 alignment and genotyping with HISAT2 and HISAT-genotype Improved metagenomic analysis with Kraken 2 Bracken: estimating species 583 abundance in metagenomics data Double indexing overcomes inaccuracies in multiplex 619 Sequencing coverage and depth of the cultured isolate and eight clinical 621 samples. a, Amplicon sequencing coverage by sample (row) across the HCoV-19 Three 628 independent experiments were performed for amplicon sequencing. Pink, ~400 bp 629 amplicon-based sequencing including human and lambda phage nucleic acids 630 background; red, ~200 bp amplicon-based sequencing; orange, ~400 bp amplicon-based 631 sequencing excluding human and lambda phage nucleic acids background (NAB); light 632 blue, capture sequencing. f, HCoV-19-RPM (Reads Per Million) sequenced plotted 633 against qRT-PCR Ct value for the clinical samples Alleles with ≥ 80% frequencies were called. *, SNVs verified 641 by Sanger sequencing. b, Allele frequencies of the identified SNVs. Pink, amplicon; light 642 blue, capture; blue, meta. Minor allele frequencies detected in serial dilutions of the cultured 643 isolate (c) and clinical samples (d) across methods. Pink, amplicon vs meta; light blue, 644 capture vs meta. Minor alleles are defined with ≥ 5% and < 50% frequencies. Besides 645 general quality filter and human nucleic acids. PE100, paired-end 100-nt reads; SE400, single-end 100-nt 617 reads. 618