key: cord-0064617-zyo3w7vy authors: Donato, Luigi; Scimone, Concetta; Rinaldi, Carmela; D’Angelo, Rosalia; Sidoti, Antonina title: New evaluation methods of read mapping by 17 aligners on simulated and empirical NGS data: an updated comparison of DNA- and RNA-Seq data from Illumina and Ion Torrent technologies date: 2021-06-16 journal: Neural Comput Appl DOI: 10.1007/s00521-021-06188-z sha: 996f1ee7f7f4d619e331f4f3b18d131b81760978 doc_id: 64617 cord_uid: zyo3w7vy During the last 15 years, improved omics sequencing technologies have expanded the scale and resolution of various biological applications, generating high-throughput datasets that require carefully chosen software tools to be processed. Following this sequencing development, bioinformatics researchers have been challenged to implement alignment algorithms for next-generation sequencing reads. However, the selection of aligners based on genome characteristics is still poorly studied, so our benchmarking study extended the state of the art by comparing 17 different aligners. The chosen tools were assessed on empirical human DNA- and RNA-Seq data, as well as on simulated human and mouse datasets, evaluating a set of parameters not considered in previous benchmarks of this kind. As expected, we found that each tool was the best under specific conditions. For Ion Torrent single-end RNA-Seq samples, the most suitable aligners were CLC and BWA-MEM, which reached the best results in terms of efficiency, accuracy, duplication rate, saturation profile and running time. For Illumina paired-end osteomyelitis transcriptomics data, the best performing algorithms were the already cited CLC and Novoalign, the latter excelling in the accuracy and saturation analyses. Segemehl and DNASTAR performed best on both DNA-Seq datasets, with Segemehl particularly suitable for exome data.
In conclusion, our study could guide users in the selection of a suitable aligner based on genome and transcriptome characteristics. However, several other aspects that emerged from our work should be considered as the alignment research area evolves, such as the involvement of artificial intelligence to support cloud computing and mapping to multiple genomes. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s00521-021-06188-z. Starting from Sanger sequencing 40 years ago, increasingly precise and rapid sequencing technologies have expanded the scale and resolution of various biological applications, including the detection of genome-wide single nucleotide polymorphisms (SNPs) and structural variants [1], quantitative analysis of the transcriptome (RNA-Seq) [2], identification of protein binding sites (ChIP-Seq) [3], understanding of DNA methylation patterns [4], assembly of new genomes or transcriptomes [5], and determination of species composition using metagenomic workflows. However, the huge amount of generated data reveals almost nothing about the DNA without appropriate analysis tools and algorithms. Therefore, bioinformatics researchers started to think about new ways to efficiently manage and analyze such enormous amounts of data. The first crucial step in the analysis of next-generation sequencing (NGS) data, after the quality control and filtering steps, is the alignment (mapping) of the generated sequencing reads to the respective reference [6].
However, this step is biased by many errors due to the following reasons [7]: (1) a reference genome is generally long (on the order of billions of bases) and presents complex regions such as repetitive elements (repetitive regions are usually masked because there is no consensus yet about how to deal with them); (2) reads are short (typically 50–150 bp), causing issues with efficiency and accuracy, as they are more likely to align to multiple locations rather than to unique positions in the reference genome; (3) the subject genome could inherently differ from the reference genome due to alterations acquired over time. To face these challenges, many new alignment tools have been developed in recent years. Such tools exploit the specific advantages of each sequencing technology, such as the short sequence length of Helicos (read length range = 25–1000 bp), Illumina (read length range = 36–300 bp) and SOLiD reads (read length range = 35–75 bp), the high base quality toward the 5'-end of Illumina and 454 reads [8, 9], the di-base encoding of SOLiD reads, the low InDel error rate of Illumina reads (InDel rate = 0.000005%) [10], the low substitution error rate of Helicos reads (substitution rate = 0.2%) [11] and the Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT) third-generation sequencing technologies, superior in long-read assembly (PacBio maximum read length = up to 40 kb; ONT maximum read length = 10 kb) with respect to accuracy and completeness [12]. The common alignment tools are based on (1) spaced-seed indexing or (2) the Burrows–Wheeler transform [13, 14]. The first are slower and use more memory but correctly map long gaps [15]. The second, based on heuristic approaches, are fast, consume less memory and can be used to map short reads [16]. Spliced aligners such as GEM [17] and TopHat [18] are most frequently used for aligning transcripts.
BWA [19], BLAT [20] and Bowtie2 [21] are frequently used for aligning DNA sequences. BWA and Bowtie2 are index-based aligners exploiting the Burrows–Wheeler indexing algorithm. NovoAlign [22] uses dynamic programming, taking advantage of the Needleman–Wunsch algorithm with affine gap penalties to score the alignment. Many other algorithms have been developed [23]. Selecting a mapping tool based on the characteristics of the organism should be a fundamental focus of bioinformatics, because the choice may affect downstream analysis. Several benchmarking analyses have guided users in choosing aligners, but although several studies evaluating sequence mapping tools have been published, the problem is still open and further perspectives have not been faced [24–27]. Limitations of such works include a focus on tool group classifications rather than evaluations of individual performance on selected settings [28], the use of small and unrealistic data sets (e.g., 500,000 reads) jointly with small reference genomes (e.g., 500 Mbp) [29], the use of simulated data only [30], as well as the misuse of aligner options and algorithmic features. Therefore, the selection of aligners based on genome characteristics is poorly studied, and a quantitative evaluation to systematically compare mapping tools in multiple aspects is still needed. We extend the state-of-the-art evaluations of mapping algorithms, comparing 17 different aligners (BBMap (sourceforge.net/projects/bbmap/) [31], Bowtie2, BWA, BWA-MEM [19, 32], Qiagen CLC Genomics Workbench (https://www.qiagenbioinformatics.com) [33–36], DNASTAR Lasergene Suite [37, 38], GEM [39], Hisat2 [40], Magic-BLAST [41], Minimap2 [42], Novoalign [43], YARA [44], RUM [45], Segemehl [46], STAR [47], Subread [48], TopHat2 [18, 49]) applied on empirical and simulated DNA- and RNA-Seq human and mouse datasets.
Our study provides guidelines for the selection of a suitable aligner based on genome and transcriptome characteristics. In order to realize a complete evaluation of alignment tools, minimizing the possible bias coming from real sequenced data, we first produced simulated NGS reads, mapped them to the human and murine reference genomes and assessed read alignment accuracy using a tool that evaluates whether each individual read has been aligned correctly. In order to take the genomic features into account in the subsequent mapping analyses, we decided to simulate data from two organisms with two different read lengths (50 bp and 150 bp) using five different read simulators. Each read simulator introduced mutations within the Homo sapiens GRCh38.p13 (RefSeq assembly accession: GCF_000001405.39, downloaded on January 4, 2021) or the Mus musculus Genome Reference Consortium Mouse Build 39 (GRCm39) (RefSeq assembly accession: GCF_000001635.27, downloaded on January 4, 2021) FASTA reference genomes and produced reads as genomic substrings with randomly added sequencing errors. Such simulations of the artifacts and errors observed in real data were generated by different statistical models, mainly based on coverage, read sequencing error and genomic mutation distributions, as well as GC content. Finally, the origin of each read (generated by a given simulator) is encoded in a read group specific to each algorithm, and the reads were saved into FASTQ files. To reinforce the usefulness of the simulations, we used 5 different simulators, each with its own features: (1) ART [50]; (2) DWGSIM (http://github.com/nh13/dwgsim); (3) WGSIM (http://github.com/lh3/wgsim); (4) MASON [51]; (5) CURESIM [52]. The average Phred score over all simulated reads was 29. Details of each algorithm, such as the type of simulated data (Illumina or Ion Torrent), are available in Table 1.
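The simulators above share a common core: sampling substrings from the reference and injecting errors under a statistical model, with the true origin encoded in the read name. A minimal sketch of this idea (the function below is hypothetical and only models uniform substitution errors; the real simulators additionally model quality profiles, GC bias, InDels and platform-specific artifacts):

```python
import random

def simulate_reads(reference, n_reads, read_len=50, sub_rate=0.01, seed=42):
    """Sample substrings of the reference and add random substitution errors."""
    rng = random.Random(seed)
    bases = "ACGT"
    reads = []
    for i in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        read = list(reference[start:start + read_len])
        for j in range(read_len):
            if rng.random() < sub_rate:  # simulated sequencing error
                read[j] = rng.choice([b for b in bases if b != read[j]])
        # encode the true origin in the read name, as the simulators do
        reads.append((f"read_{i}_pos_{start}", "".join(read)))
    return reads
```

Encoding the true position in the read name is what later makes a gold-standard accuracy evaluation possible.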
The 17 benchmarked aligners were selected to represent different algorithms: many of them are indexing-based (e.g., Bowtie2, BWA), hashing-based (e.g., NovoAlign) or exploit suffix array approaches (e.g., STAR). The main elements we considered when choosing aligners were the number of citations (both the most highly cited, well-exploited aligners and the least cited, not yet fully explored ones) and the continuous updating of the algorithm code. All alignments were realized using the GRCh38.p13 and GRCm39 reference genomes, for human and mouse, respectively. The list of all chosen mappers, along with their salient features, is highlighted in Table 3. Aligner parameters used during the evaluations are listed in Table 4. Regarding real data, YARA alignment was only completed on the single-end RPE cell transcriptome, probably due to computational hardware limitations. Additionally, RUM mapping on WGS and simulated data produced errors, probably due to conflicts between algorithm-specific features and qualities of the real sample data. Mapping is a string-matching problem that must integrate the known properties of DNA sequences with those of the sequencing technologies, which adds complexity to the mapping procedure. To speed up the alignment process, most tools (and especially all the benchmarked tools we assess in this paper) pre-build an index on the reference genome to query the origin of each read faster. Several algorithms build an index on the reads, but the most recent ones create an index on the reference genome [53]. The latter method relies on the fact that the same index, once built on a reference genome, can be used repeatedly for aligning different read sets. After completing the alignments, we used the Picard command line tool suite (https://broadinstitute.github.io/picard/) for processing and analyzing the obtained data, especially focusing on internal control metrics, alignment summary metrics, GC bias metrics, quality by cycle, quality distribution, duplication metrics, and insert size and deletion metrics. Additionally, clipped reads were analyzed with the specific ExtractSVReads module of the Genome Rearrangement IDentification Software Suite (GRIDSS) [54]. As regards the simulated datasets, where we have a gold standard for the origin of each read in the reference genome, we used the evaluation tool Alfred [55] to assess whether each read has been assigned to the correct location in the reference genome (and, possibly, with the appropriate edit operations). Final statistics strongly depend on the definition of a correctly mapped read, mapping qualities and multi-mapped reads. The mapping task can represent a bottleneck in the NGS analysis pipeline due to the ever-increasing volume of sequencing data, forcing the above approaches to adopt a trade-off between accuracy and speed, e.g., based on the gaps allowed. Consequently, it is important to assess the performance of the aligners on both accuracy and computational efficiency of read alignment, particularly because mapping accuracy directly influences the results of many downstream tasks and the running time can be a computational burden. Sensitivity (S) is determined as the ratio of reads mapped correctly to reads mapped incorrectly at a particular threshold (S = number of reads mapped correctly/number of reads mapped incorrectly) [27]. We take advantage of having simulated reads with known biases (considered as the gold standard) to assess the performance of the 17 aligners and then test them on the empirical datasets.
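With simulated reads, evaluating correctness reduces to comparing each reported position against the origin recorded by the simulator. A sketch of the sensitivity computation under the paper's definition (the tolerance window is an assumption for illustration, not Alfred's exact correctness rule):

```python
def sensitivity(alignments, truth, tolerance=5):
    """S = reads mapped correctly / reads mapped incorrectly.

    alignments: dict read_name -> (chrom, pos) reported by the aligner
    truth:      dict read_name -> (chrom, pos) from the simulator
    A read is counted correct if it maps to the true chromosome within
    `tolerance` bases of its true origin.
    """
    correct = incorrect = 0
    for name, (chrom, pos) in alignments.items():
        true_chrom, true_pos = truth[name]
        if chrom == true_chrom and abs(pos - true_pos) <= tolerance:
            correct += 1
        else:
            incorrect += 1
    return correct / incorrect if incorrect else float("inf")
```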
We decided to compare alignment performance first by stratifying against all reported MAPQ and alignment scores (AS), found in the SAM format [56], then by analyzing mapping efficiency (E), as done by other authors [57], bearing in mind that MAPQ assumes specific ranges for each alignment algorithm. AS is a score describing sequence similarity between a query and a reference. AS increases with the number of matches and decreases with the number of mismatches and gaps (rewards and penalties for matches and mismatches depend on the scoring matrix used). MAPQ is a metric tied to position: it equals −10 log₁₀ of the probability that the mapping position is wrong, rounded to the nearest integer. Based on quantiles from the AS and MAPQ distributions (Table S1), we considered AS high and MAPQ low if reads aligned perfectly at multiple positions ('non-uniquely mapped reads'), and AS low and MAPQ high if reads aligned with mismatches but the reported position is still much more probable than any other ('uniquely mapped reads'). In order to make MAPQ and AS comparable across aligners, we evaluated the percentages of reads beyond the two parameter thresholds, mainly focusing on MAPQ. The mapping efficiency, corresponding to the total number of reads that align, generally depends on various factors, such as (1) the read length, (2) the quality of the reads, (3) the absence of contaminants (such as other species or adapter contamination), (4) the mapping software used and (5) the quality of the reference genome assembly. Additionally, the type of library may impact aligner efficiency: for instance, some library protocols [58] tend to enrich for repeated sequences, possibly due to a high number of PCR cycles and/or starting from a small amount of material, and would display lower mapping efficiency than others.
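The MAPQ definition above is the standard Phred scaling of the error probability and can be sketched directly (the cap of 60 is an assumption for illustration; as noted in the text, the actual range differs across aligners):

```python
import math

def mapq(p_wrong, cap=60):
    """MAPQ = round(-10 * log10(P(reported mapping position is wrong))).

    Aligners cap the value in practice; 60 is a common ceiling but
    tool-specific ranges vary, as discussed in the text.
    """
    if p_wrong <= 0:
        return cap  # probability of error effectively zero
    return min(cap, round(-10 * math.log10(p_wrong)))
```

For example, a 1% chance of a wrong position yields MAPQ 20, while a 50% chance (a read equally likely to belong elsewhere) yields MAPQ 3.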
Additionally, regardless of the limitations related to both the reads and the quality of the reference, several aligners are able to reach high mapping efficiencies by performing 'clipping' of the reads. With this procedure, the portions of the read that do not align to the reference on either side are ignored, though labeled in the CIGAR string. This process usually comes with a small penalty for each clipped base within the read, but it aims to maximize the alignment score and thus amounts to a significantly smaller alignment penalty than mismatched bases. Mapped read clipping profiles, representing the distribution of clipped nucleotides across reads, were compared for all algorithms. The aligned data coming from the different aligners were analyzed to calculate how mapped reads were distributed over genome features (such as CDS exons, 5'UTR exons, 3'UTR exons, introns and intergenic regions), using the human GENCODE annotation v.36 (www.gencodegenes.org), accounting for cases in which genome features overlap (e.g., a region could belong to more than one feature class). [Table 3 legend: all tools exploited for the alignment comparison, with their specific features. Prog. Lang. = programming language. Clip = clipping. Bib. = bibliography. SW = Smith–Waterman. BWT = Burrows–Wheeler transform. NW = Needleman–Wunsch. Max InDels = maximum number of allowed InDels per read (default). Gaps = gapped alignment. MAPQ = mapping quality score (its main value and range differ across mappers). AS = alignment score. N° Cit. = number of citations by January 2021. *This score is calculated as 'Local Alignment = 20 + 8.0 × ln(L)', where 'L' is the read length.] Sequencing reads covering InDels typically map with more difficulty, since their correct alignment involves complex gapped alignment. Software tools used in high-throughput sequencing focus their improvements on InDel detection analysis. Several studies have described the effects of different alignment tools on detection efficiency [59].
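Clipping is visible directly in the CIGAR string ('S' for soft-clipped, 'H' for hard-clipped bases in the SAM specification). A per-read clipping fraction, of the kind aggregated in the clipping profiles above, can be sketched as follows:

```python
import re

_CIGAR = re.compile(r"(\d+)([MIDNSHP=X])")

def clipped_fraction(cigar):
    """Fraction of the original read that was clipped rather than aligned."""
    clipped = read_len = 0
    for num, op in _CIGAR.findall(cigar):
        n = int(num)
        if op in "SH":       # soft (S) or hard (H) clip
            clipped += n
        if op in "MIS=XH":   # operations spanning bases of the original read
            read_len += n
    return clipped / read_len if read_len else 0.0
```

A read with CIGAR "5S90M5S" thus has 10% of its bases clipped, a small score penalty compared with forcing those 10 bases to align as mismatches.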
These works recommend the use of gap-aware aligners. Nevertheless, further knowledge on the effects of these tools is still required. Distributions of deleted and inserted nucleotides across reads were calculated for all considered algorithms, distinguishing the effects of gapped alignments. The GC content distribution was evaluated by FastQC [60]. Among the challenges, GC bias in NGS data is known to aggravate genome assembly/alignment [61–66]. However, it is not clear to what extent GC bias affects genome assembly or mapping in general. The average GC% along the reads indicates, for example, whether the considered reads are properly trimmed, in order to avoid residual adapter sequences. The whole distribution of GC content is shown as a function of the position in each read. In an unbiased library, the mean GC-per-read distribution should be Gaussian and centered on the expected GC% (for humans, the GC% is known to be 41%), if no contamination or technical bias is present. An unusually shaped distribution could indicate a contaminated library or some other kind of biased subset, possibly emerging from incorrect alignment. A warning is raised if the sum of the deviations from the normal distribution represents more than 15% of the reads. Duplicate reads are defined as reads originating from a single fragment of DNA. Duplicates can arise during sample preparation (e.g., library construction using PCR) or result from a single amplification cluster incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument (optical duplicates). They were detected with the Picard MarkDuplicates tool [67], which compares the anchors of mapped reads from a SAM/BAM file. After duplicate reads are collected, the tool differentiates the primary and duplicate reads using an algorithm that ranks reads by the sums of their base-quality scores.
It is still unclear whether removing read duplicates computationally improves accuracy and precision by reducing PCR bias and noise, or whether it decreases accuracy and precision by removing genuine information. Generally, duplicates are only marked (i.e., flagged rather than removed) for data from WGS/exome experiments or from analyses where amplification artifacts might be a problem (ChIP-Seq, for example), while they should never be removed in quantitative experiments such as RNA-Seq, because they may be part of small, highly expressed transcripts [68, 69]. Two strategies were used to determine the read duplication rate: (1) sequence-based: reads with identical sequences are considered duplicated reads; (2) mapping-based: reads mapped to exactly the same genomic location are evaluated as duplicated reads. For spliced reads, reads mapped to the same starting position and spliced the same way are regarded as duplicated reads. All transcripts were then sorted in ascending order according to expression level (RPKM) and finally divided into four quartile groups. We performed all analyses on a high-end MacBook Pro with an Intel® Core™ i7-8750H CPU @ 2.2 GHz (Turbo Boost to 4.1 GHz, six cores), 32 GB of RAM and AMD Radeon Pro 555X graphics, running MacOS Big Sur 11.2.0/Ubuntu 20.04 LTS for the alignment jobs. The scalability of the mapping tools may differ under different parallel settings. Many tools support multithreading, which is expected to yield a linearly increasing speedup with an increasing number of CPU cores. However, multiprocessing is more general and may improve throughput even for tools that do not support multithreading, where multiprocessing refers to using more than one process in a distributed-memory fashion while communicating through a message passing interface. All the tools we used support multithreading/multiprocessing, except YARA and RUM.
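The mapping-based strategy can be sketched as follows. This is a simplified stand-in for Picard MarkDuplicates, which additionally considers strand orientation, mate positions and clipping when grouping reads; as in Picard, the read with the highest base-quality sum in each group is kept as primary:

```python
from collections import defaultdict

def mark_duplicates(reads):
    """Mapping-based duplicate marking.

    reads: list of dicts with keys 'name', 'chrom', 'pos', 'strand', 'qual_sum'
    Reads sharing (chrom, pos, strand) are duplicates of one another; the one
    with the highest base-quality sum is primary. Returns duplicate names.
    """
    groups = defaultdict(list)
    for r in reads:
        groups[(r["chrom"], r["pos"], r["strand"])].append(r)
    dups = set()
    for group in groups.values():
        group.sort(key=lambda r: r["qual_sum"], reverse=True)
        dups.update(r["name"] for r in group[1:])  # all but the best-quality read
    return dups
```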
Alfred (https://www.gear-genomics.com/alfred) allowed complete integration of high-throughput settings, enabling the monitoring of sequence data quality and characteristics across samples. It parsed the aligned BAM files only once and pre-allocated data structures for counting primary, secondary, supplementary and spliced alignments. Sequencing error rates were elaborated separately for mismatch, insertion and deletion errors. The InDel size distribution was estimated, and potential homopolymer sequence regions and a fragment-based GC bias curve were estimated from the reference context. The evaluated parameters of the Alfred alignment metrics are available in Table 5.

Results

Alignment of the simulated data by all 17 considered algorithms showed that the best mapping was reached by BBMAP, BWA-MEM, Novoalign, DNASTAR, YARA, Segemehl and TopHat2 for all paired-end simulated data, while the worst was Subread. Only Mason data showed a high-quality alignment for BWA. A different situation resulted from the mapping of CuReSim data, due to the ability of this tool to generate single-end reads only. In this case, the best outputs emerged from BBMAP, YARA and DNASTAR, while non-optimal alignments were performed by Bowtie2, BWA, Hisat2, Minimap2, CLC and STAR. Distributions of mapped and unmapped reads by the selected aligners on simulated data are shown in Fig. 1, while detailed statistics of each simulated data alignment are available in Table S2. We evaluated the tools on four types of data sets: two DNA-Seq samples (paired-end WES and single-end WGS) and two RNA-Seq samples (the single-end Ion Torrent RPE cell transcriptome and the paired-end Illumina osteomyelitis transcriptome). During an evaluation procedure, it is essential to choose the right data set type to improve the applicability of the tools. We concatenated our 4 datasets into a single file and added a read group to trace back the origin of each read.
The uniquely mapped reads were separated by Alfred from multiply mapped reads (the percentage of reads mapped to more than one location with the same number of mismatches, highlighting that these reads could fall in repetitive regions) and unmapped ones. Furthermore, we reported the distribution of mapped reads over genomic features, as the number of mapped tags per kb (tags/kb) of reference genome. Indexing and matching times for the selected tools were considered separately (see further). Figure 2 reports the percentage of mapped (uniquely and multiply) and unmapped reads of various sizes aligned using all 17 chosen aligners. [Figure legend: the absence of a tick indicates that the specific metric is not available for that aligner; see Table 3 for the MAPQ and AS thresholds of each aligner.] In the clipping analyses, some tools exhibited a very particular behavior, with many irregular peaks across the reads. The results of the insertion and deletion analyses are plotted in Figs. S5–S12. In this work, we also conducted a systematic analysis of the effects of GC bias on genome mapping. In the represented plots, the presence of double peaks in the GC distribution could reflect some noise at the start of the per-base nucleotide distribution due to not-so-random hexamers. In the case of RNA samples, the second peak could also reflect an incomplete rRNA depletion during library preparation. Finally, the theoretical GC curve should be centered close to the expected mean GC% value of the studied organism. If the read GC distribution is not significantly different from the GC distribution of the reference genome, that implies no bias, assuming all of the reads originated from that reference genome. High and low GC tails, relative to the expected GC% of the organism, are more likely to represent repetitive regions and tandem repeats. Thus, with a poor mapper or a highly repetitive organism, the seemingly higher coverage of extreme-GC areas may actually be due to collapsed repeats.
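The tags/kb normalization above simply divides the number of mapped tags assigned to each feature class by that class's total length in kilobases, so that large feature classes (e.g., introns) do not dominate by size alone. A minimal sketch (feature names and sizes are illustrative):

```python
def tags_per_kb(feature_counts, feature_sizes_bp):
    """Normalize mapped-tag counts by feature size: tags per kilobase.

    feature_counts:   dict feature -> number of mapped tags
    feature_sizes_bp: dict feature -> total feature length in bp
    """
    return {f: feature_counts[f] / (feature_sizes_bp[f] / 1000.0)
            for f in feature_counts}
```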
Mapping with high error rates is not an issue, but there could be edge effects when mapping to short contigs. In the RNA-Seq samples, BWA-MEM, CLC and Novoalign showed a normal distribution very close to the theoretical one. The same result was obtained for Magic-BLAST, Minimap2 and Subread on the RPE cell transcriptome data, and for BBMap and Hisat2 on the osteomyelitis one. In the latter sample, interestingly, several tools (Bowtie2, BWA, GEM, Minimap2, STAR and Subread) highlighted a double peak, and two algorithms (Magic-BLAST and Segemehl) even presented irregular trends. Regarding the DNA-Seq data, Magic-BLAST and Segemehl showed the most normal GC distribution in the WES sample, where all the other tools revealed a double peak in their trends, with the second peak higher than the first. In the WGS data, instead, no algorithm could reach an approximately normal distribution, probably due to sequencing biases: all aligners showed a left-skewed distribution, with Magic-BLAST and Segemehl also characterized by a lower density of reads. The GC distributions emerging from all alignments are represented in Figs. S13–S16. [Figure legend: see Table 3 for the MAPQ and AS thresholds of each aligner; YARA results are only available for the RPE RNA-Seq sample, probably due to computational hardware limitations.] As expected, read duplication rates were very high in the RNA-Seq samples and almost negligible in the DNA-Seq ones, especially in the WGS data. Read duplicates are not normally removed from RNA-Seq data, because PCR duplicates are not distinguishable from identical fragments of highly expressed transcripts. In the RPE transcriptome, BBMap, BWA-MEM, Magic-BLAST and Segemehl detected the highest number of total duplications in the highest number of reads. Sequence-based duplicate detection diverged most from mapping-based duplicate detection in the BBMap, CLC and Magic-BLAST alignments.
The same deviating trend was shown by Magic-BLAST and Segemehl in the other three samples, with the latter aligner able to detect the most consistent number of duplicated mapped reads in the WGS data. Figures S17–S20 highlight the duplication rates of the selected tools. Saturation analyses of both RNA-Seq samples highlighted relevant differences between aligners for transcripts with expression levels ranked below the 25th percentile (Q1). In detail, the RPE transcriptome data showed different RPKM saturations distributed across five tool groups: (1) Bowtie2, BWA and TopHat2, in which the median percent relative error decreases from 100 to 75% and, soon after, under 50%, between 30 and 35% resampling; (2) BWA-MEM, Magic-BLAST, Minimap2 and Segemehl, characterized by the median percent relative error reaching 50% around 10–15% resampling; (3) GEM and Novoalign, whose median percent relative error dropped below 50% only after a 20% resampling rate; (4) Hisat2, YARA and Subread, whose saturation plots reached a 50% median percent relative error at 25% resampling; and (5) STAR, which highlighted a trend very similar to the group 4 aligners, but with the first box-plots showing shorter quartiles than their counterparts in group 4. NGS technology has grown very rapidly during the last years, leading to huge data outputs of millions of sequences in a single run. However, such an enormous quantity of generated data gives no really useful information about the DNA [70] without the availability and development of specific analysis algorithms and tools. For this reason, bioinformatics research strives to find new ways to efficiently manage and analyze such big data [71]. One of the most promising areas, involving many NGS bioinformaticians, deals with the mapping of the generated sequences [72]. Thus, the selection of the correct aligner is fundamental.
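The saturation analysis above resamples increasing fractions of the mapped reads, recomputes each transcript's RPKM at every fraction, and tracks the percent relative error against the full-depth value. The resampling loop can be sketched as follows (simplified: the actual analysis stratifies transcripts into expression quartiles and repeats the resampling across many fractions):

```python
import random

def rpkm(read_count, transcript_len_bp, total_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count / (transcript_len_bp / 1000.0) / (total_reads / 1e6)

def saturation_error(read_assignments, transcript_lens, fraction, seed=0):
    """Median percent relative error of RPKM using only `fraction` of the reads.

    read_assignments: list of transcript ids, one entry per mapped read
    transcript_lens:  dict transcript id -> length in bp
    """
    total = len(read_assignments)
    full = {t: read_assignments.count(t) for t in set(read_assignments)}
    rng = random.Random(seed)
    sample = [r for r in read_assignments if rng.random() < fraction]
    sub = {t: sample.count(t) for t in full}
    errors = []
    for t, n_full in full.items():
        full_rpkm = rpkm(n_full, transcript_lens[t], total)
        sub_rpkm = rpkm(sub[t], transcript_lens[t], max(len(sample), 1))
        errors.append(abs(sub_rpkm - full_rpkm) / full_rpkm * 100.0)
    errors.sort()
    return errors[len(errors) // 2]  # median
```

At a resampling fraction of 1.0 the error is zero by construction; the fraction at which the median error drops below 50% is what separates the five tool groups above.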
Mapper performance and specificity are determined by genome characteristics, so their evaluation depends on various criteria, such as read distribution, properly paired reads, InDel and saturation profiles, duplications, and incorrectly mapped reads. We benchmarked 17 different aligners, starting from artificial reads obtained by 5 read simulators, and then focused on different types of real data. Simulated data were obtained from both Homo sapiens and Mus musculus, with different read lengths, in order to widen the involvement of genomic features in the alignment analysis. In this way, we aimed to provide guidelines for the choice of the most suitable aligner [73–75]. The most distinctive feature of our study is the use of all mappers on both RNA- and DNA-Seq data, even though several aligners were specifically developed for only one of them [76]. The point of using an RNA-Seq mapping tool is its ability to align across intron-exon junctions. Whole genome/exome data consist of exon sequences (perhaps plus a little intronic sequence in WES, depending on the probes) that are not supposed to be spliced together. Since it is not necessary to align across junctions, an RNA-designed aligner will not have an advantage over a DNA-designed aligner, but there is no reason why it could not be used either. First, alignment performance, stratified against all reported MAPQ scores, and efficiency were evaluated. It is well known that both are affected by genome size, read size and the distribution of repeats [77]. Single-end long reads (150 bp and 200 bp) increased the number of unmapped reads but decreased the number of multiple alignments, increasing the alignment performance of all mappers. Paired-end short reads, instead, showed the highest level of total mapped reads, increasing efficiency, and also reached the highest percentage of uniquely mapped reads in the WES data, probably the best compromise between mapping quality and efficiency.
Furthermore, the trade-off between mapping quality and efficiency also depends on the clipping process, which in our analyses evidenced the lowest number of clipped reads in the DNA-Seq data, probably shifting the balance in favor of alignment quality. Among the selected aligners, Magic-BLAST was the one able to map the highest number of reads in almost all samples, but at the expense of accuracy. Novoalign and DNASTAR, instead, were found to have the highest mapping quality with short and long reads, in both RNA- and DNA-Seq data. Furthermore, CLC showed a similar pattern of alignment quality. The highest distribution of reads assigned to specific genome features was instead reached by Segemehl and Novoalign, which were also able to map the largest number of exon sequences. The other aspect we focused on was the insertion/deletion profile, which usefully describes how correctly downstream tools can infer InDels [78]. Gapped-alignment-based InDel detection algorithms require interpretation of the alignment results from a gapped aligner [79] such as BWA. The major obstacle of these methods is the requirement that InDels be entirely contained within a read and correctly detected during the initial read mapping step (identified, in the CIGAR string, as 'I' for insertion and 'D' for deletion [80]). Furthermore, even if this is sufficient for detecting small InDels, it becomes very difficult for InDels longer than 15% of the read length. In this case, supporting reads will frequently present too few bases able to match the reference genome, or may contain only one end able to map correctly to the reference genome while the rest of the bases following the InDel get trimmed or soft-clipped by the NGS aligner [81]. Therefore, the preceding clipping analysis was needed before interpreting the data coming from the insertion and deletion profiles.
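InDels contained within a read surface directly as 'I' and 'D' operations in the CIGAR string, so a downstream tool's first step is simply to extract them. A sketch:

```python
import re

def indels_from_cigar(cigar):
    """Return (insertion_lengths, deletion_lengths) parsed from a CIGAR string."""
    insertions, deletions = [], []
    for num, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op == "I":
            insertions.append(int(num))
        elif op == "D":
            deletions.append(int(num))
    return insertions, deletions
```

A read spanning a 2 bp insertion and a 5 bp deletion might carry the CIGAR "40M2I30M5D28M"; once an InDel exceeds roughly 15% of the read length, the aligner tends to soft-clip the flanking bases instead, and the event never appears as an 'I'/'D' operation at all.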
The highest insertion percentage was shown by GEM in the RPE cell RNA-Seq sample, while the same parameter was very low in all other samples and aligners. The deletion percentage, instead, reached its highest peak in both Illumina samples with the CLC algorithm, suggesting that this parameter is basically independent of the type of NGS experiment. Another interesting side of the alignment comparison deals with the duplication rate. High coverage due to repeats is a known phenomenon [82], and duplication in the genome plays a critical role in determining the quality of the aligners. Moreover, duplicates should not be removed from RNA-Seq data, because they might be present due to high expression levels [83]. BBMap, CLC, Magic-BLAST and, above all, Segemehl were able to detect the highest number of mapping-based duplications across all four samples, especially in the WGS one. Furthermore, several aligners were more sensitive to particular sequencing biases, such as altered GC content and unbalanced nucleotide composition of the reads [84]. Among them, BWA-MEM, CLC and Novoalign suffered least from the effect of these errors in the RNA-Seq samples, while Magic-BLAST and Segemehl did the same in the WES data. A different situation appeared in the WGS data, with abnormal results for all aligners, probably due to the low quality of the sample. Another interesting feature, this time only evaluable on RNA-Seq data, is sequencing saturation, a measure of the fraction of library complexity [85]. The inverse of the sequencing saturation can be interpreted as the number of additional reads required to detect a new transcript. Sequencing saturation depends on the library complexity and the sequencing depth. Novoalign showed the best performance in the evaluation of this parameter, highlighting its ability to detect new transcripts more easily than the other tools. Relevant results in the same analysis were also reached by BWA-MEM, Magic-BLAST, Minimap2 and Segemehl.
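One common way to quantify sequencing saturation, consistent with the interpretation above, is saturation = 1 − (unique reads / total reads); the inverse of the remaining novelty rate then estimates how many additional reads are needed per newly observed transcript. This exact formula is an assumption for illustration, as the text does not specify one:

```python
def sequencing_saturation(total_reads, unique_reads):
    """saturation = 1 - (unique reads / total reads).

    Values near 1 mean each extra read mostly re-observes known fragments.
    """
    return 1.0 - unique_reads / total_reads

def reads_per_new_transcript(total_reads, unique_reads):
    """Estimated additional reads needed to observe one new unique read
    (i.e., a candidate new transcript): the inverse of the novelty rate."""
    novelty_rate = unique_reads / total_reads
    return 1.0 / novelty_rate
```

For a library where 250 of 1000 reads are unique, saturation is 0.75 and roughly 4 additional reads are needed per new transcript; a deeper, more saturated library requires many more.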
Finally, the comparative analysis of computational performance showed that Subread was significantly faster in all samples, followed by Minimap2. Interestingly, two of the commercial tools, CLC Genomics Workbench and DNASTAR Lasergene, ranked within the top three positions in RNA-Seq and DNA-Seq alignment, respectively. Segemehl, TopHat2 and Novoalign, instead, showed the longest read-mapping runtimes, probably due to their predisposition toward other parameters (e.g., accuracy). As already noted in the Results, simulated data alignment corroborated the mapping outputs of real data alignment. Table 6 summarizes which aligner excels in relation to the input data (read length and DNA- vs RNA-Seq), listing all tools exploited for the alignment comparison along with their specific features. For each empirical dataset, each aligner is scored on the following criteria: efficiency, accuracy, read distribution, deletion profile, GC count, clipped reads, runtime, duplication profile and saturation profile. The reported number indicates the number of criteria in which an aligner performs best. One limitation of the study concerns the comparison of Illumina and Ion Torrent data, which could complicate the benchmark evaluation. A second problem was the need for sufficient local computing resources; in our case this halted the analysis of several resource-intensive algorithms (e.g., YARA). Another relevant issue concerns the generalization of the results to other datasets, an element that should always be kept in mind. Additionally, the choice of aligners may not be fully exhaustive, for practical reasons. Furthermore, we did not benchmark machine learning algorithms, such as DAVI [86], DeepFam [87], RLAlign [88] and DeepSF [89] — namely neural networks and support vector machines — nor the emerging developments in artificial intelligence, which represent the fastest-growing analytic field.
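A runtime comparison of this kind can be sketched as a simple wall-clock harness; the aligner command shown in the docstring is a hypothetical placeholder, not the study's actual invocation:

```python
import subprocess
import time

def time_aligner(cmd):
    """Measure the wall-clock runtime (seconds) of one aligner invocation.

    `cmd` is an argument list, e.g. a hypothetical
    ["bwa", "mem", "ref.fa", "reads.fq"]; output is captured so the
    measurement reflects only the child process's execution time.
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start
```

Running each tool several times on the same sample and comparing medians would reduce the influence of caching and system load, which is the usual precaution in benchmarks like this one.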
Nevertheless, the major challenge is to deal with and interpret the distinct outputs produced by each algorithm's parameters.

Perspectives: improving SARS-CoV-2 genome sequencing and variant discovery

Sequencing of genes and whole genomes has been recognized as a powerful technique to investigate viral pathogen genomes, understand outbreak transmission dynamics and spill-over events, and screen for mutations that potentially play a pivotal role in pathogenicity, transmissibility and/or countermeasures (e.g., diagnostics, antiviral drugs and vaccines). A standardized pipeline to characterize, name and report SARS-CoV-2 sequences has not yet been established, even though several methods are available for sequencing SARS-CoV-2 from clinical samples [90]. Regarding data analysis, methods based on reference mapping are the most suitable for routine analysis, while minority variant determination or structural genomic variant detection is technically challenging and often cannot be fully automated. Alignment of on-target reads is typically performed against a canonical reference genome, such as the NCBI reference genome [91-94]. Regardless of the pipeline, nucleotide variants should not be called if the number of unique supporting reads at the site is lower than the required depth for confidence. Thus, in practice, such alignment-based methods are prone to biases ranging from false positives [95] to false negatives [96, 97]. In these scenarios, alignment-based methods may not be specific or sensitive enough. They are also computationally intensive and therefore not particularly fast or efficient. Thus, our study could shed new light on more suitable alignment algorithms for SARS-CoV-2 analysis, allowing bioinformaticians to evaluate the introduction of new mapping procedures into their in-house pipelines, improving sequencing output and reducing the types of bias just cited.
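The depth-based filtering rule described above can be sketched as follows; the tuple layout and the default `min_depth` value are illustrative assumptions, not parameters from any specific pipeline:

```python
def filter_variants(candidates, min_depth=10):
    """Drop candidate variant calls whose unique supporting-read count is
    below the required depth, reducing false positives at low-coverage sites.

    candidates: list of (position, alt_allele, supporting_reads) tuples,
    where supporting_reads counts unique reads backing the call
    (hypothetical representation of a caller's output).
    """
    return [v for v in candidates if v[2] >= min_depth]
```

In a real workflow this check is usually one of several hard filters (alongside base quality and strand-bias tests) applied before a variant is reported.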
In conclusion, there is no single best aligner among all those analyzed; each tool was the best in specific conditions. For Ion Torrent single-end RNA-Seq samples, the most suitable aligners were CLC and DNASTAR, while both DNA-Seq samples showed Segemehl and DNASTAR as "best performers", with the former performing particularly well on WES data. Even though many studies have deeply evaluated and then improved current algorithms, alignment still remains an active challenge. However, for the first time, we tried to analyze different NGS data (RNA-Seq and DNA-Seq) with 17 different mapping algorithms, each created for a specific kind of experiment, showing the possibilities and limitations of such an attempt. We believe that, also with the support of machine learning algorithms, NGS techniques will help scientists and clinicians solve complex biological challenges, thus improving clinical diagnostics and opening new avenues for the development of novel therapies.

Authors' contributions LD designed the study, performed analytical work, paper writing and drafting; CS participated in analytical work and paper drafting; CR participated in manuscript preparation; RD participated in paper drafting and assisted in study design; AS supervised all work and assisted in manuscript preparation. All authors read and approved the final manuscript.

Funding None.

Availability of data and material The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflict of interest The authors declare that they have no conflict of interest/competing interests.

Ethical approval This study was approved by the Ethics Committee of "Azienda Policlinico Universitario of Messina" and conformed to the tenets of the Declaration of Helsinki. All subjects had given written informed consent prior to participation in the study.

Consent for publication All authors give their consent for publication.
References (entry titles; numbering lost in extraction)

- A high-throughput SNP discovery strategy for RNA-seq data
- RNA-Seq-based comparative transcriptomics: RNA preparation and bioinformatics
- Recent advances in ChIP-seq analysis: from quality management to whole-genome annotation
- DNA methylation-based forensic age prediction using artificial neural networks and next generation sequencing
- The present and future of de novo whole-genome assembly
- A survey of software and hardware approaches to performing read alignment in next generation sequencing
- Repetitive DNA and next-generation sequencing: computational challenges and solutions
- Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing
- Long fragments achieve lower base quality in Illumina paired-end sequencing
- Illumina error profiles: resolving fine-scale variation in metagenomic sequencing data
- Single molecule sequencing with a HeliScope genetic analysis system
- Comparison of long-read sequencing technologies in the hybrid assembly of complex bacterial genomes
- A comprehensive evaluation of alignment algorithms in the context of RNA-seq
- Mapping reads on a genomic sequence: an algorithmic overview and a practical comparative analysis
- Efficient computation of spaced seed hashing with block indexing
- Computational complexity of algorithms for sequence comparison, short-read assembly and genome alignment
- Efficient alignment of Illumina-like high-throughput sequencing reads with the GEnomic Multitool (GEM) Mapper
- TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions
- Fast and accurate short read alignment with Burrows-Wheeler transform
- Using BLAT to find sequence similarity in closely related genomes
- Fast gapped-read alignment with Bowtie 2
- Intersect-then-combine approach: improving the performance of somatic variant calling in whole exome sequencing data using multiple aligners and callers
- Fast and memory efficient approach for mapping NGS reads to a reference genome
- Weighted minimizer sampling improves long read mapping
- Assessing graph-based read mappers against a baseline approach highlights strengths and weaknesses of current methods
- Comparison of read mapping and variant calling tools for the analysis of plant NGS data. Plants (Basel)
- Evaluation and assessment of read-mapping by multiple next-generation sequencing aligners based on genome-wide characteristics
- A survey of sequence alignment algorithms for next-generation sequencing
- A secure alignment algorithm for mapping short reads to human genome
- Next-generation forward genetic screens: using simulated data to improve the design of mapping-by-sequencing experiments in Arabidopsis
- Evaluating alignment and variant-calling software for mutation identification in C. elegans by whole-genome sequencing
- Hardware acceleration of BWA-MEM genomic short read mapping for longer read lengths
- Effects of A2E-induced oxidative stress on retinal epithelial cells: new insights on differential gene response and retinal dystrophies. Antioxidants (Basel)
- Discovery of GLO1 new related genes and pathways by RNA-Seq on A2E-stressed retinal epithelial cells could improve knowledge on retinitis pigmentosa
- Transcriptome analyses of lncRNAs in A2E-stressed retinal epithelial cells unveil advanced links between metabolic impairments related to oxidative stress and retinitis pigmentosa
- New omics-derived perspectives on retinal dystrophies: could ion channels-encoding or related genes act as modifier of pathological phenotype?
- Molecular characterization and phylogenetic analysis of a dengue virus serotype 3 isolated from a Chinese traveler returned from Laos
- Possible A2E mutagenic effects on RPE mitochondrial DNA from innovative RNA-Seq bioinformatics pipeline
- The GEM mapper: fast, accurate and versatile alignment by filtration
- HISAT: a fast spliced aligner with low memory requirements
- Magic-BLAST, an accurate DNA and RNA-seq aligner for long and short reads
- Minimap2: pairwise alignment for nucleotide sequences
- Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism-calling pipelines
- Approximate string matching for high-throughput sequencing
- Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM)
- Lacking alignments? The next-generation sequencing mapper segemehl revisited
- STAR: ultrafast universal RNA-seq aligner
- The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads
- Expression of pro-angiogenic markers is enhanced by blue light in human RPE cells
- ART: a next-generation sequencing read simulator
- Mason: a read simulator for second generation sequencing data
- Comparison of mapping algorithms used in high-throughput sequencing: application to Ion Torrent data
- Benchmarking short sequence mapping tools
- GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly
- Alfred: interactive multi-sample BAM alignment statistics, feature counting and feature annotation for long- and short-read sequencing
- Mapping short DNA sequencing reads and calling variants using mapping quality scores
- BatAlign: an incremental method for accurate alignment of sequencing reads
- Best practices for Illumina library preparation
- Evaluating the accuracy and efficiency of multiple sequence alignment methods
- FQC Dashboard: integrates FastQC results into a web-based, interactive, and extensible FASTQ quality control tool
- Accurate whole human genome sequencing using reversible terminator chemistry
- GC bias affects genomic and metagenomic reconstructions underrepresenting GC-poor organisms
- Substantial biases in ultra-short read data sets from high-throughput DNA sequencing
- Whole-genome sequencing and variant discovery in C. elegans
- Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes
- A large genome center's improvements to the Illumina sequencing system
- Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers
- The impact of amplification on differential expression analyses by RNA-seq
- A comprehensive quality control workflow for paired tumor-normal NGS experiments
- Using "Big Data" in the cost-effectiveness analysis of next-generation sequencing technologies: challenges and potential solutions
- Short read mapping: an algorithmic tour
- Evaluation of tools for long read RNA-seq splice-aware alignment
- Specificity control for read alignments using an artificial reference genome-guided false discovery rate
- A novel and well-defined benchmarking method for second generation read mapping
- Simulation-based comprehensive benchmarking of RNA-seq aligners
- Optimal seed solver: optimizing seed selection in read mapping
- The challenge of detecting indels in bacterial genomes from short-read sequencing data
- Fast and accurate mapping of Complete Genomics reads
- The sequence alignment/map format and SAMtools
- SHEAR: sample heterogeneity estimation and assembly by reference
- High sensitivity multiplex short tandem repeat loci analyses with massively parallel sequencing
- A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments
- Summarizing and correcting the GC content bias in high-throughput sequencing
- Discovery and saturation analysis of cancer genes across 21 tumour types
- DAVI: deep learning-based tool for alignment and single nucleotide variant identification
- DeepFam: deep learning based alignment-free method for protein family modeling and prediction
- RLALIGN: a reinforcement learning approach for multiple sequence alignment
- DeepSF: deep convolutional neural network for mapping protein sequences to folds
- Next generation sequencing and bioinformatics methodologies for infectious disease research and public health: approaches, applications, and considerations for development of laboratory capacity
- Evaluation of NGS-based approaches for SARS-CoV-2 whole genome characterisation
- A computational toolset for rapid identification of SARS-CoV-2, other viruses and microorganisms from sequencing data
- The establishment of reference sequence for SARS-CoV-2 and variation analysis
- Sequencing of SARS-CoV-2: first update
- A genomic perspective on the origin and emergence of SARS-CoV-2
- Hybrid capture and next-generation sequencing identify viral integration sites from formalin-fixed, paraffin-embedded tissue
- Practical innovations for high-throughput amplicon sequencing