key: cord-0310440-6qra2ega
authors: Seoane, Pedro; Díaz-Martínez, Luis; Viguera, Enrique; Claros, M. Gonzalo; Grande-Pérez, Ana
title: QuasiFlow: a bioinformatic tool for genetic variability analysis from next generation sequencing data
date: 2022-04-30
journal: bioRxiv
DOI: 10.1101/2022.04.05.487169
sha: 9a6ad65e8706d70df8708d83a3795dfbae23d787
doc_id: 310440
cord_uid: 6qra2ega

Populations of RNA and ssDNA viruses within their hosts contain a heterogeneous collection of variant genomes known as quasispecies. Large variability in mitochondrial DNA has also been found within the same organism, drawing an interesting parallel between the two situations. The advent of next-generation sequencing technologies facilitated studying genetic variation, but many open-source bioinformatic tools have to be combined in a non-trivial approach. Here it is presented QuasiFlow, a workflow based on well-stablished software that extracts reliable mutations and recombinations, even at low frequencies (~10-4), provided that at least 250 million nucleotides are analysed. Accurate prediction of mutations and recombinations has been demonstrated with synthetic reads and with in vitro rolling-circle amplification of a plant geminivirus. An in-depth analysis of viral quasispecies was performed and QuasiFlow revealed the coexistence in the plant of three virus genomes and distinct recombinations between some of them. Human mitochondrial variants were also investigated and high level of heteroplasmy (75%) was confirmed, and the relation between low-frequency heteroplasmy (0.1- 0.2%) and some human diseases, regardless of sex, was established. Hence, we propose that QuasiFlow may find use with known and emerging viruses to reveal evolutionary jumps and co-infections, with mitochondrial DNA to detect relevant heteroplasmy would otherwise be elusive, or even in other population studies such as those considering single cell sequencing.

where the relevant bioinformatic tools are indicated in italics. Magenta boxes refer to QuasiFlow input files. White 115 rounded boxes correspond to input/output data recoverable when QuasiFlow finishes; their title containing the task 116 number will facilitate its citation throughout the text. Solid the four synthetic datasets. "RejectedReads" are reads removed due to quality, complexity or length issues. 122 "ReferenceReads" are reads devoid of artefacts (useful reads) and therefore suitable for genetic variance analyses. 123 "MappedReads" are useful reads mapped onto reference sequence of TYLCV-Mld. "TrueRECs" represent 124 recombination events that were predicted at the expected position. "False RECs" gather those recombination events 125 that were not detected at the expected position. "True MUTs" correspond to mutation events predicted at the expected 126 position. "Haplotypes" are those retained after task 7 cross-validation (T1_Supp_Ht in Fig. 1 The whole variation profile was then investigated (Fig. 2B-D) . As expected, the control group 143 of mut_rec presented no mutations and no recombination events, being consistent with a 144 very low frequency of genetic variability (Fig. 2B ). For high variation profiles such as 145 MUT_REC (Fig 2C) , 261 new mutations with frequencies ranging from 0.31 to 1.72 × 10 -4 , 146 and 13 recombination events were detected. This illustrates that QuasiFlow can find both 147 low-and high-frequent mutation and recombination events. A test using a real-world dataset 148 from a RCA-amplified infectious clone of the begomovirus TYLCV-Mld (tomato yellow leaf 149 curl virus, mild strain), where low mutation frequency is expected as well as absence of 150 recombination, 72 mutations with frequencies ranging from 4.2 × 10 -4 to 2.2 × 10 -4 were 151 detected and no recombination events were identified (Fig. 2D ). The apparent hotspot 152 detected from the nucleotide 1110 to 1128 for the ORF so called 315-258-dir ( Fig 2D, track  153 B) is due to the quasi-dimer (1,7-mer) nature of infectious clone compared to the reference 154 genome, being the region where digestion and fusion was produced and is not identical in all 155 virus haplotypes. Since the RCA-amplified TYLCV-Mld ( Fig 2D) presents values closer to 156 mut_rec ( Fig 2B) than to MUT_REC (Fig 2C) , it can be suggested QuasiFlow was able to 157 detect the low amount of variation introduced by bacterial cloning and RCA manipulation, 158 and that this low level of variability would have a negligible impact on genetic variation. This 159 also supports that mutation and recombination events and the resulting haplotypes provided heteroplasmies, a remarkable 80% had very low frequency, that is, between 0.1% and 0.2%, 237 and the rest (20%) had a frequency no higher than 2% (Fig 4A) . These results suggest that 238 mtDNA populations in these samples maintain a very low level (below 0.3%) of 239 heteroplasmies while homoplasmies were very frequent (greater than 98%) with a stricking 240 absence of intermediate cases. When the relationship between mtDNA variants and disease 241 was inspected, 451 variants were associated with different diseases (Supplementary File 2,  242   Table S3 ). Interestingly, more than 75% of these mtDNA variants were heteroplasmies. 243

Using the data retrieved from MITOMAP a network was constructed for each sample 244 illustrating the relationship between mtDNA variants per ORF and disease. An example is 245 shown in Figure 4B 

The impact of sex and tissue factors were inspected using QuasiComparer. No significant 265 differences (P < 0.05; Table S4 in Supplementary File 2) were detected among the 47 266 samples for any of the factors, but differences become evident by PCA for tissue factor (Fig.  267 4C), that explained 48% of the total variance. In fact, hierarchical clustering grouped 268 samples in four clusters (Fig. 4D) two of them by tissue, while no association between sex 269 and mtDNA variants was found. The blue-and cyan-boxed clusters in Fig. 4D contained 270 mainly lymphoblastoid samples and the red-boxed cluster was enriched with blood samples. 271

Differences between these clusters were found significant (t-Student, P < 0.05). using synthetic sets of data (Fig. 2) and applied to a real-world data from plant viral (Fig. 3 ) 310 and human mitochondrial (Fig. 4) The reliable analyses of the genetic heterogeneity of human mtDNA (Fig. 4) Raw reads were quality controlled using FastQC, and pre-processed using SeqTrimBB to 368 remove low quality bases, vectors and adapters. SeqTrimBB default parameters were used, 369 except for the following: 1) minimum phred base mean quality was set to 26 to avoid 370 sequencing errors to be confounded with low frequency mutations and reduce the false 371 positive ratio on ulterior analyses (Morelli et al. 2013; Welkers et al. 2014) ; 2) the minimum 372 read length was increased to 100 nt to remove shorter reads confounding haplotype 373 reconstruction (Zagordi et al. 2011 ). Viral reads were extracted from host and contaminant 374 reads using the reference genome. This more reliable but reduced number reads (Fig 1,  375 T1_U_rd) guaranteed soundness and robustness in further steps (Daly et al. 2015) . 376

The useful reads (T1_U_rd) were then treated as single reads (even though they come from 378 paired-end reads) with ViReMa (Routh and Johnson 2014) using default parameters, excepting 379 the aligner, which was configured to BWA (Li and Durbin 2009) instead of the default Bowtie 380 since BWA is widely used for recombination studies. To avoid the identification of indels or 381 misalignments as recombination events due to confounding reads, the seed minimum length 382 was set to 55, the allowed mismatches was set to 1, and the microindels size was set to 55. 383

Spurious recombination events with a frequency lower than 1 ´ 10 -3 were removed to 384 increase reliability in predicted recombination events. Additionally, recombination events 385 located between the start and end positions of a circular reference sequence were 386 discarded. Recombination events are classified by ViReMa as HomoRecs (Fig 1, T2_sR) for 387 those resulting in a sequence deletion, or NonHomoRec (Fig 1, T2_dR) Useful reads T1_U_rd (Fig. 1) were mapped on the genome reference to prepare data for ERROR: reference with no printed form.]). Then, the pileup for one BAM file is saved (Fig 1,  413 T4_pile). 414

19/04/2022 20

The haplotype reconstruction relied on the recent CliqueSNV (Knyazev et al. 2021 ) that 416 seems to outperform other haplotyping methods, even with low coverage samples where 417 SNVs are at low frequencies. CliqueSNV opened each BAM file in T4_pile (Fig. 1) to identify 418 groups of linked SNVs and remove sequencing errors. It resulted in a graph where SNVs are 419 nodes and edges connect linked SNVs. Merging cliques in that graph identified haplotypes 420 that were then saved (Fig 1, T5_Ht) together with their frequency. 421 allows detection of real SNVs with frequencies as low as 7 ´ 10 -6 . The resulting files of 431 mutations (Fig 1, T6_mut) and indels (Fig 1, T6_ind) were processed with the command vcf-432 stats from the VCFtools suite (Danecek et al. 2011 ) to obtain parameters such as variant 433 counts, transition/transversion frequencies, as well as computing the frequencies of each 434 type of base substitution that will be used below in QuasiComparer. 435 and mixed with 1 volume of phenol:chloroform (1:1). Following centrifugation at 12 000 × g 525 at room temperature for 5 min, the aqueous phase was transferred to a new tube and mixed 526 with 1 volume of isopropanol. After 2 min at room temperature, nucleic acids were recovered 527 by centrifugation at 12 000 × g at room temperature for 5 min. The pellet was washed with 528 polymerase has an associated 3′-5′ exonuclease "proofreading" activity and has a reported 556 error rate of 3-5 ´ 10 -6 . The contribution of the Phi29 polymerase error during RCA is about 557 100-fold lower than the observed mutation frequency of begomoviruses (around 10 -4 558 mutations per nucleotide or higher). 

FastQC: A quality control tool for high throughput sequence data

Novel insights into breast cancer copy number genetic heterogeneity revealed 642 by single-cell genome sequencing

A method for near full-length amplification and sequencing for six hepatitis 645 C virus genotypes

A beginner's guide for FMDV 647 quasispecies analysis: sub-consensus variant detection and haplotype reconstruction using next-648 generation sequencing

Sanger Sequencing, and Melting Curve Analysis for Detection of Low-Frequency 651

Macrolide-Resistant Mycoplasma pneumoniae Quasispecies in Respiratory Specimens

Tumour heterogeneity and resistance to cancer therapies

Host Subtraction, Filtering and Assembly Validations for Novel Viral Discovery 657

Using Next Generation Sequencing Data

The variant call format and VCF tools

Historical Perspective on the Discovery of the 662

Viral quasispecies

Quasispecies: from theory to experimental systems

Phylogenetic evidence for rapid rates of molecular evolution in the single-666 stranded DNA begomovirus tomato yellow leaf curl virus

Rates of evolutionary change in viruses: patterns and 668 determinants

A simple and rapid method for the preparation of plant 670 genomic DNA for PCR analysis

Genetic structure and population variability of tomato yellow leaf 672 curl China virus

Prevalence of nuclear and mitochondrial DNA mutations related to 675 adult mitochondrial disease

Viral quasispecies 677 complexity measures

Application and prospects of single cell sequencing in 679 tumors

ART: A next-generation sequencing read simulator

FactoMineR, An R package dedicated to exploratory multivariate

Quasispecies nature of three maize 685 streak virus isolates obtained through different modes of selection from a population used to assess 686 response to infection of maize cultivars

Accurate assembly of minority viral haplotypes from next-689 generation sequencing through efficient noise reduction

VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome 692 sequencing

Quasispecies theory and the behavior of RNA viruses

Fast and accurate short read alignment with Burrows-Wheeler transform

The Sequence Alignment/Map format and SAMtools

mtDNA Variation and Analysis Using Mitomap and Mitomaster

Transfer of a prion strain to 703 different hosts leads to emergence of strain variants

Complex Recombination Patterns Arising during Geminivirus Coinfections Preserve and 706

RDP4: Detection and analysis of 709 recombination patterns in virus genomes

Measurements of Intrahost Viral Diversity Are Extremely Sensitive to 711

Systematic Errors in Variant Calling

Evolution of foot-713 and-mouth disease virus intra-sample sequence diversity during serial transmission in bovine hosts

The challenges of tumor genetic diversity

Qualimap 2: Advanced multi-sample quality 717 control for high-throughput sequencing data

Recent advances in inferring viral diversity 719 from high-throughput sequencing data

Discovery of functional genomic motifs in viruses with ViReMa-a virus 721 recombination mapper-for analysis of next-generation sequencing data

Multiple infection, recombination and genome relationships among begomovirus isolates found in 725 cotton and other plants in Pakistan

Benchmarking of viral haplotype reconstruction programmes: 727 An overview of the capacities and limitations of currently available programmes

TransFlow: a modular framework for assembling and 731 assessing accurate de novo transcriptomes in non-model organisms

Versatile Workflow Engine Illustrated by Assembling an Optimised de novo Transcriptome for a Non-734

Model Species, such as Faba Bean (Vicia faba)

Extreme heterogeneity of human mitochondrial DNA from organelles 738 to populations

The dynamics of mitochondrial DNA heteroplasmy: implications for 740 human health and disease

Viral quasispecies 742 assembly via maximal clique enumeration

VirVarSeq: A low-frequency virus variant detection pipeline for Illumina sequencing 745 using adaptive base-calling accuracy filtering

Improved detection of 747 artifactual viral minority variants in high-throughput sequencing data

ShoRAH: estimating the genetic 749 diversity of a mixed sample from next-generation sequencing data

Estimation of 751 genetic diversity in viral populations from next generation sequencing data with extremely deep 752 coverage

detected SNVs. To do so, haplotypes in T5_Ht were aligned to the genome reference using 440Clustal Ω, and the variable positions were cross-validated with previously obtained SNVs 441 (T6_mut and T6_ind). Variability not supported by a SNV was edited, and haplotypes whose 442 variations were all edited were then rejected as due to sequencing errors or untrusted 443 variations. The resulting SNV-supported haplotypes (Fig. 1, T7_Supp_Ht) were realigned 444 again and haplotype frequency was used to score diversity indices described by Gregori et 445 al (Gregori et al. 2016 ), such as normalized Shannon, Simpson, Gini-Simpson, Hill numbers, 446 sample nucleotide diversity, etc (Fig. 1 The variability analysis usually implies comparison of several samples or conditions. Since 457QuasiFlow can only manage one sequencing library (sample) at a time, inter-sample 458 comparisons required the development of a dedicated comparative system called 459QuasiComparer. This analysis can be accomplished automatically by the same daemon 460 script mentioned above provided that a description file with the following data is supplied: 1) 461 the sample name, 2) a replicate identifier that allows to group the replicates that belongs to 462 the same sample, and 3) the experimental group to which the sample is assigned, 463 depending on the user experimental design (e.g., an experimental factor such as days post 464 infection with three sample replicates taken at 7 days and other three replicates at 28 days). 465

19/04/2022 22QuasiComparer uses the parameters indicated in Supplementary Table S5 to 

19/04/2022 23 Four (mut_rec, mut_REC, MUT_rec and MUT_REC) synthetic sample datasets were 492 generated by triplicate. "mut_rec" corresponded to reads where no mutation and no 493 recombination were introduced and was considered as control group. "mut_REC" cotains 494 reads where only recombinations were introduced and were considered as control for 495 recombination. "MUT_rec" contains reads where only mutations were introduced, which 496 makes it suitable as a control for mutation. Finally, the dataset "MUT_REC" contains reads 497where both mutations and recombinations were introduced and was considered the testing 498 group. 499As real-world dataset, an infectious clone of TYLCV-Mld was sequenced as a control before 500 its passage through a plant to test the ability to detect mutations or recombination events 501 introduced during high-throughput sequencing. This dataset was sequenced with a paired-502 end layout 2 × 300 nt resulting in 1 106 300 reads. After pre-processing and filtering to keep 503 only TYLCV-Mld useful reads, this number decreased to 218 884. 504 Figure S1 : Plot of the number of SNPs detected depending on the number of 630 nucleotides analysed. Nine MiSeq datasets (yellow lines; ranging from 115 to 335 631 million reads) and three HiSeq dataset (blue lines, ranging from 500 to 501.5 million 632 reads) were used. Every dataset was randomly subset in bulks of 50 000 read folds 633 and plotted against the number of SNPs detected in it by Varscan2. 634

•