key: cord-1036775-h8ncmox8
authors: Kubik, Slawomir; Marques, Ana Claudia; Xing, Xiaobin; Silvery, Janine; Bertelli, Claire; De Maio, Flavio; Pournaras, Spyros; Burr, Tom; Duffourd, Yannis; Siemens, Helena; Alloui, Chakib; Song, Lin; Wenger, Yvan; Saitta, Alexandra; Macheret, Morgane; Smith, Ewan W.; Menu, Philippe; Brayer, Marion; Steinmetz, Lars M.; Si-Mohammed, Ali; Chuisseu, Josiane; Stevens, Richard; Constantoulakis, Pantelis; Sali, Michela; Greub, Gilbert; Tiemann, Carsten; Pelechano, Vicent; Willig, Adrian; Xu, Zhenyu
title: Guidelines for accurate genotyping of SARS-CoV-2 using amplicon-based sequencing of clinical samples
date: 2020-12-03
journal: bioRxiv
DOI: 10.1101/2020.12.01.405738
sha: 9c013518aa3211f28bfc40f3e950627b79412e0d
doc_id: 1036775
cord_uid: h8ncmox8

Background SARS-CoV-2 genotyping has been instrumental to monitor virus evolution and transmission during the pandemic. The reliability of the information extracted from the genotyping efforts depends on a number of aspects, including the quality of the input material, applied technology and potential laboratory-specific biases. These variables must be monitored to ensure genotype reliability. The current lack of guidelines for SARS-CoV-2 genotyping leads to inclusion of error-containing genome sequences in studies of viral spread and evolution.

Results We used clinical samples and synthetic viral genomes to evaluate the impact of experimental factors, including viral load and sequencing depth, on correct sequence determination using an amplicon-based approach. We found that at least 1000 viral genomes are necessary to confidently detect variants in the genome at frequencies of 10% or higher. The broad applicability of our recommendations was validated in >200 clinical samples from six independent laboratories. The genotypes of clinical isolates with viral load above the recommended threshold cluster by sampling location and period.
Our analysis also supports the rise in frequency of 20A.EU1 and 20A.EU2, two recently reported European strains whose dissemination was favoured by travelling during the summer 2020.

Conclusions We present much-needed recommendations for reliable determination of SARS-CoV-2 genome sequence and demonstrate their broad applicability in a large cohort of clinical samples.

[...] the specificity of the assay and alleviates the legal, ethical and technical concerns resulting from the presence of patient sequence information seen in metagenomic or capture-based methods (3, 28, 29). We systematically detected and removed sequencing artifacts generated by primer dime[...]

[...] receiver operating characteristic) curves plotted for samples with different expected variant fractions (Figure 3C) showed that at 0.1 expected VAF the separation between the true and false positive calls was within an acceptable range for this type [...]

[...] processed by multiple clinical laboratories to dissect the factors which need to be taken into account to obtain reliable SARS-CoV-2 genome information using an amplicon-based approach (Figure 5E).
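The ROC analysis referred to above (Figure 3C) compares sensitivity and specificity as the VAF cutoff for calling a variant is moved. A minimal sketch of that computation follows. The function name `roc_points` and the example values are hypothetical and invented purely for illustration; they are not data from this study, and the authors' actual implementation is not described in the recovered text.

```python
from typing import List, Tuple


def roc_points(
    true_vafs: List[float], background_vafs: List[float]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, sensitivity, specificity) for each candidate cutoff.

    true_vafs: observed VAFs at positions where a variant is known to exist.
    background_vafs: observed VAFs at positions where no variant is expected.
    A position is called a variant when its observed VAF >= threshold.
    """
    thresholds = sorted(set(true_vafs) | set(background_vafs))
    points = []
    for t in thresholds:
        true_pos = sum(v >= t for v in true_vafs)        # known variants called
        true_neg = sum(v < t for v in background_vafs)   # background rejected
        sensitivity = true_pos / len(true_vafs)
        specificity = true_neg / len(background_vafs)
        points.append((t, sensitivity, specificity))
    return points


if __name__ == "__main__":
    # Invented illustrative values, NOT measurements from the paper.
    known = [0.09, 0.10, 0.12, 0.11]
    background = [0.01, 0.02, 0.03, 0.10]
    for threshold, sens, spec in roc_points(known, background):
        print(f"VAF >= {threshold:.2f}: sensitivity={sens:.2f} specificity={spec:.2f}")
```

Plotting sensitivity against specificity for each threshold gives one ROC curve; repeating this per expected VAF and per input copy number would give a family of curves comparable in spirit to those in the figure.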
Variant calling accuracy strongly depends on the quality and quantity of starting material because t[...]

[...] virtually all clinical samples have the D614G variant characteristic of the SARS-CoV-2 strain at the origin of the European outbreak. We also detected source-specific polymorphisms and the presence of representative samples from the main European clades. These include 2 strains that are thought t[...]

Wang Y, Wang D, Zhang L, Sun W, Zhang Z, Chen W, et al. Intra-host Variation and Evolutionary Dynamics of SARS-CoV-2 Population in COVID-19 Patients [Internet].

[...] point represents the results for a sample colour coded according to the source lab. The dashed line indicates 750K mapped reads/M. The percentage of samples with at least 750K reads per million mapped reads (y-axis) below a given Ct (x-axis) is represented in the inset. (D) Fraction of viral genome covere[...]

[...] Length of the branches reflects the number of mutations (x-axis). The tree visualization was generated using the Nextstrain platform. (E) Schematic representation of the recommendations for reliable genotyping with amplicon-based approach.
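The thresholds stated in this text can be combined into a simple sample-level check, sketched below. The 1000-genome and 10% VAF values come from the abstract. Treating the 750K mapped reads per million line from the figure as a pass/fail cutoff is an assumption made here for illustration. All names (`Sample`, `sample_passes`, `reportable_variants`) and the variant labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict

MIN_GENOME_COPIES = 1000           # abstract: at least 1000 viral genomes
MIN_VAF = 0.10                     # abstract: variants at frequencies of 10% or higher
MIN_MAPPED_PER_MILLION = 750_000   # dashed line in the figure; use as a cutoff is an assumption


@dataclass
class Sample:
    genome_copies: float   # estimated viral genome copies in the input
    mapped_reads: int      # reads mapped to the SARS-CoV-2 reference
    total_reads: int       # all sequenced reads for the sample


def mapped_per_million(sample: Sample) -> float:
    """Mapped reads normalised to one million sequenced reads."""
    if sample.total_reads == 0:
        return 0.0
    return 1_000_000 * sample.mapped_reads / sample.total_reads


def sample_passes(sample: Sample) -> bool:
    """True when both the viral load and the mapping-rate criteria are met."""
    return (
        sample.genome_copies >= MIN_GENOME_COPIES
        and mapped_per_million(sample) >= MIN_MAPPED_PER_MILLION
    )


def reportable_variants(sample: Sample, vafs: Dict[str, float]) -> Dict[str, float]:
    """Keep variants with VAF >= 10%, and only for samples that pass the checks."""
    if not sample_passes(sample):
        return {}
    return {name: vaf for name, vaf in vafs.items() if vaf >= MIN_VAF}


if __name__ == "__main__":
    sample = Sample(genome_copies=5000, mapped_reads=900_000, total_reads=1_000_000)
    print(reportable_variants(sample, {"variant_a": 0.98, "variant_b": 0.04}))
```

Returning no variants for a failing sample is one possible design; a pipeline could equally flag such calls as low confidence rather than discard them.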
We used synthetic viral genomes to determine the minima[...]

[...] Fragment size based on control DNA ladder is indicated on the left and the expected size of SARS-CoV-2 amplicons (red arrow) or artifacts (blue arrow) is indicated on the right. (D) Fraction of DNA fragments with sizes corresponding to SARS-CoV-2 amplicons or artifacts, quantified from the electrophoretic library profiles obtained for samples with varying amounts of synthetic SARS-CoV-2 genomes. Bars represent the average of at least 3 replicates for each amount and whiskers the standard deviation. (E) Genome browser view examples of artifact-containing reads and their impact on the inferred genotype before and after their removal.

[...] Varying amounts of SARS-CoV-2 Control 1 or 4 (blue) were mixed with SARS-CoV-2 synthetic genome reference (Control 2, orange) to obtain desired VAFs and varying viral genome copy mixes (100, 1000 and 10 000 g.c.p.r.) spiked into human RNA. Variant calling was performed for samples sequenced at varying depth. (B) Distribution of the variant fraction for known variants (y-axis), expected fraction 0.1, 0.125 and 0.18 (yellow, blue and purple) or background call (red), as a function of the viral genome copies in the input material. The horizontal line in the boxplot indicates the median and the whiskers the 5% and 95% quantiles. (C) Sensitivity (y-axis) as a function of the specificity (x-axis) for variants at different VAF. The ROC curves are colour coded depending on the expected VAF of the variants analysed. Data for samples with 100, 1000 and 10 000 viral copies are represented as dotted[...]

[...] Relationship between the Ct value (y-axis) and the viral genome copies based either on synthetic SARS-CoV-2 RNA or plasmids encoding viral genes, performed by two independent laboratories (SOPHiA Genetics (SG) or Source B).
(B) Distribution of Ct values for all clinical samples where the value was provided.

[...] Summary of the variant calling analysis, taking into account all VAF values, for all unique clinical samples (rows) sorted by the Ct value (left); the position of each variant (blue) is displayed along the genome (central panel).
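The relationship between Ct value and viral genome copies mentioned in the captions is commonly modelled as a straight line in log10(copies), which allows a copy-number threshold to be translated into a Ct cutoff. The sketch below fits such a standard curve by ordinary least squares. The log-linear model, the function names and the dilution values are assumptions for illustration; the numbers are invented and are not the calibration data from this study.

```python
import math
from typing import Sequence, Tuple


def fit_standard_curve(copies: Sequence[float], cts: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def copies_from_ct(ct: float, slope: float, intercept: float) -> float:
    """Invert the fitted curve to estimate genome copies from a Ct value."""
    return 10 ** ((ct - intercept) / slope)


def ct_for_copies(copies: float, slope: float, intercept: float) -> float:
    """Ct value predicted by the fitted curve for a given copy number."""
    return slope * math.log10(copies) + intercept


if __name__ == "__main__":
    # Invented dilution series, NOT measurements from the paper.
    dilution_copies = [100, 1000, 10000]
    dilution_cts = [33.0, 29.7, 26.4]
    slope, intercept = fit_standard_curve(dilution_copies, dilution_cts)
    print(f"slope={slope:.2f} intercept={intercept:.2f}")
    print(f"Ct at 1000 copies: {ct_for_copies(1000, slope, intercept):.1f}")
```

Because Ct values depend on the assay, target gene and instrument, a curve of this kind would need to be fitted separately for each laboratory, which is consistent with the captions reporting calibrations from two independent laboratories.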