key: cord-0922385-sefip2l0
authors: Maceda, Iago; Lao, Oscar
title: Analysis of the Batch Effect Due to Sequencing Center in Population Statistics Quantifying Rare Events in the 1000 Genomes Project
date: 2021-12-24
journal: Genes (Basel)
DOI: 10.3390/genes13010044
sha: 7ba3168ad2befd4dddc2ca64aa07330e47d40bee
doc_id: 922385
cord_uid: sefip2l0

The 1000 Genomes Project (1000G) is one of the most popular whole genome sequencing datasets used in different genomics fields and has boosting our knowledge in medical and population genomics, among other fields. Recent studies have reported the presence of ghost mutation signals in the 1000G. Furthermore, studies have shown that these mutations can influence the outcomes of follow-up studies based on the genetic variation of 1000G, such as single nucleotide variants (SNV) imputation. While the overall effect of these ghost mutations can be considered negligible for common genetic variants in many populations, the potential bias remains unclear when studying low frequency genetic variants in the population. In this study, we analyze the effect of the sequencing center in predicted loss of function (LoF) alleles, the number of singletons, and the patterns of archaic introgression in the 1000G. Our results support previous studies showing that the sequencing center is associated with LoF and singletons independent of the population that is considered. Furthermore, we observed that patterns of archaic introgression were distorted for some populations depending on the sequencing center. When analyzing the frequency of SNPs showing extreme patterns of genotype differentiation among centers for CEU, YRI, CHB, and JPT, we observed that the magnitude of the sequencing batch effect was stronger at MAF < 0.2 and showed different profiles between CHB and the other populations. All these results suggest that data from 1000G must be interpreted with caution when considering statistics using variants at low frequency.

The 1000 Genomes Project (1000G) [1] corresponds to the first attempt to characterize the worldwide genetic variation in humans. The project was created to generate accurate haplotype information across different human populations. To do so, the project aimed to characterize over 95% of variants in genomic regions that have >1% allele frequency in each of the major population groups (populations in or with ancestry from Europe, East Asia, South Asia, West Africa, and the Americas). The project started with 15 populations in the pilot phase, and had a total of 26 populations by the end of Phase 3, when the project was concluded. In the pilot phase, the project characterized a total of 1092 samples (not evenly distributed across the different populations) and, by the end of the Phase 3, it included a total of 2504 samples (with a close to even distribution across all populations). The final genomic dataset is heterogeneous in nature, comprising individuals with low coverage and exome sequencing data, produced in nine sequencing centers using five sequencing technologies and bioinformatic pipelines. An additional problem is that populations were not divided evenly across all sequencing centers. For example, for the GWD (Gambian in Western Division, Mandinka) population, the BGI sequenced 25 individuals, the Broad Institute sequenced 86 individuals, and Washington University sequenced two individuals. standing the history of archaic introgression, identifying archaic species from which ancient DNA is not yet available [4, 25] , assessing their role in phenotypes of medical interest.

Furthermore, we wanted to assess to what extent sequencing errors in the 1000G could affect the interpretation of results in newly generated data, especially for populations that due to their rural condition could be potential isolates. As a proof of concept, we used the rural population of the Spanish Eastern Pyrenees (SEP), presented and analyzed in [26] .

To generate the dataset that we used across this work we downloaded the ready-to-use variant calling format (VCF) files from the FTP site of the 1000 Genomes Project (link). These VCF files are divided by chromosome. We concatenated all the files into a single file using bcftools [27] . After this step, we selected those variants that correspond to SNV, and are biallelic and polymorphic across the whole dataset.

We obtained the sequencing center information from the spreadsheet available in the 1000 Genomes Project site (https://www.internationalgenome.org/data/ accessed on 26 April 2019). From this spreadsheet, we selected the Exome sequencing center as our sequencing center reference for all the samples. This decision was made because, for the low coverage whole genome sequencing, some of the samples seem to be sequenced in two different centers according to the spreadsheet (see Supplementary Materials of the 1000 Genomes Phase 3 paper [1] for details). Based on this parameter, we also had to exclude one of the samples from the ACB (African Caribbean in Barbados) population (HG02537).

A summary of the acronyms for sequencing centers and populations used through this article can be found in Supplementary Materials Table S1.

The SEP dataset consists of 29 samples of unrelated individuals from five different regions of the Spanish Eastern Pyrenees. The characteristics of these samples are described in [26] , which studied the basic demographic and spatial aspects of the considered region. Briefly speaking, for each sample it was ascertained that the parents and grandparents were from the same geographic location; samples were healthy with an average age of 76 years and equal proportions of both sexes.

Each sample was sequenced using standard Illumina paired-ends with a read length of 150 bp for an average sequencing coverage of~40×. The dataset used contains a total of 9,309,056 SNVs.

The bam files from the FIN and IBS samples were downloaded from the 1000 Genomes Project Data Portal site. Samples were SNP-called using GATK HaplotypeCaller v3.6 [28] , using the default settings according to the GATK Handbook v3.6 [29] with hs37d5 as the reference assembly. To produce the final VCF file, all the samples (IBS, FIN and SEP) were called jointly.

This step was done to avoid a possible batch effect regarding the different SNP calling methodologies between 1000G and SEP datasets.

To predict loss of function variants, we annotated them using the already annotated table of variants from dbNSFP v2.9.2a [30] . We used SnpSift v4.3e [31] to merge the contents of the table with the VCF file and SnpEff v4.3e [32] to add the functional prediction. From this annotated and functionally predicted VCF file, we decided to use three different algorithms to categorize a variant as LoF: Polyphen2 [33] , MutationAssessor [34] , and SIFT [31] . We used a conservative approach to call a variant as LoF by only considering LoF variants that were classified as LoF by the three algorithms.

Furthermore, we restricted the analyses to LoF variants that do not appear in homozygosis among samples, as they would reflect redundant or advantageous effects of dispensable genes [20] .

From the general VCF file we extracted singleton SNVs, using the flag AC (number of alternative alleles for that variant across samples) from the INFO field. We used the ancestral allele already present in the dataset (flag AA from the INFO field), which uses the six-way Enredo-Pecan-Ortheus (EPO) primate [35] available in Ensembl v71 [36] .

We compared the reference allele to the ancestral allele. If they were the same allele, we marked the singleton as derived. We excluded those SNVs that either had no alignment for the ancestral allele (AA = .), that were considered as a lineage-specific insertion (AA = -), or those in which the allele was not present (AA = N).

To count the number of introgressed alleles per population, we downloaded the output from the Sprime algorithm used in Browning et al. [4] in 1000G samples. The output from Sprime is divided on population and chromosome basis. These files provide the introgressed alleles for SNVs and population according to the algorithm. We used this information to count the number of introgressed alleles of each individual.

To understand whether there was a bias in the portion of samples that each center analyzed from each population, we generated a contingency table (Supplementary Materials Table S2 ) between the populations present in the 1000G and the sequencing center where the exome was generated. With this table, we ran a correspondence analysis using the function CA from the R package FactoMineR [37].

In order to quantify the batch effect of the sequencing center in the studied statistics of human population genetics for the populations present in 1000G, we used the R package lme4 [38] to generate hierarchical mixed models controlling for the random effects of the continent and population of assignation of each individual of the type:

where S is the variable of interest. The contrast of hypotheses with the mixed null model lmer( log(S) ∼ (1|Continent/Pop ) was conducted with the analysis of variance (ANOVA) command of R [39] .

For this analysis, we only considered sequencing centers that sequenced samples across all populations.

We also analyzed the variants showing an excess of genetic differentiation due to the sequencing centers in YRI, CEU, JPT, and CHB populations. These populations were the first considered in the Phase 1 of the 1000G project [40] and they have been widely used in population genomics.

For each population and SNV, we assigned each individual from that population to each sequencing center and estimated Weir and Cockerham's Fst [41] among the different sequencing centers. In order to avoid sampling biases, we selected the same number of individuals (n = 5) by center, corresponding to the minimum number of sequenced Genes 2022, 13, 44 5 of 18 individuals in CHB at the WUSGC sequencing center. We estimated the MAF P at the population P over all the K sequencing centers, where a c is the frequency of allele a at center c:

Furthermore, since it has been shown that the magnitude of Fst is dependent on the MAF [42] , for each population we estimated the maximum Fst differentiation that could be obtained between centers with the observed MAF P for each SNV. The maximum genetic differentiation is obtained when the first K sub-populations (i.e., sequencing centers) share the K * n * MAF P alleles that defines the MAF P , and the remaining 4-K sub-populations carry the other allele. We used this maximum observable Fst value given an MAF to scale the observed Fst value for each SNV. This allowed us to compare Fst values with different MAFs.

Next, for all the SNVs with the same MAF, we computed the average Fst and estimated the correlation between Fst and MAF by means of Spearman's correlation. In order to compute the expected Fst in the absence of sequencing center batch effects, for each population we randomly distributed the individuals among centers, computed the Fst values for each SNV, and computed the mean of each MAF category. We fitted a linear model between the observed and the mean of Monte Carlo-generated Fst values to quantify the departure from the expected Fst under no sequencing center bias.

We first checked whether each center had sequenced the same proportion of samples from each population. We observed a statistically significant association (χ 2 p-value = 7.381 × 10 −146 ) between certain populations and sequencing centers, as shown by the correspondence analysis on Figure 1 . Some populations tend to be overrepresented at certain sequencing centers, even at the continental level. In principle, this implies that any batch effect differentially affecting a sequencing center is going to be reflected by spurious increase of population differentiation and higher than expected heterogeneity within each continent.

Next, to understand the putative effect of batch effects on statistics focusing on rare events, we studied the number of LoF alleles in each of the 1000G individuals. Assessing the biological impact of in silico-predicted LoF is usually complex [43] . Therefore, we adopted a conservative approach for predicting recessive damaging variants that severely affect the function of protein-coding genes. We used variants consistently predicted as highly deleterious by Polyphen2 [32] , MutationAssessor [34] , and SIFT [31] algorithms. We further restricted our analyses to LoF variants that do not appear in homozygosis [20] . By doing such filtering, we created a putative bias among populations (i.e., populations that are more genetically diverse will tend to have more chances to have homozygote pseudo-LoF genotypes and therefore be removed from the analyses). However, this should not influence the comparisons of centers within each population (i.e., see Figure 2 ). We ran a hierarchical mixed model in which the dependent variable was the log(LoF) per individual, the fixed variable was the sequencing center (BGI, BI and WUGSC), and the random effects were nested by continent and population. We observed statistically significant differences between a mixed model that includes the sequencing center as a variable (ANOVA p-value = 4.87 × 10 −20 ) (Table 1) . Next, we wondered if we would observe such batch effect bias in mutations in coding regions classified by the three LoF predictors as benign ( Figure 3 ). In this case, the hierarchical mixed model also supports the role of the sequencing center (ANOVA p-value =8.315 × 10 −8 ) ( Table 2 ). As we are consider LoF, and set the threshold to absence of homozygotes, any sequencing error occurring in a gene will likely tend to produce a false positive that will be recovered by the three algorithms [9] . Therefore, it is not surprising that the statistical significance was larger in the case of LoF compared to benign, as well as in the magnitude of the estimated slopes (Tables 1 and 2) . Table S1 .

Next, to understand the putative effect of batch effects on statistics focusing on rare events, we studied the number of LoF alleles in each of the 1000G individuals. Assessing the biological impact of in silico-predicted LoF is usually complex [43] . Therefore, we adopted a conservative approach for predicting recessive damaging variants that severely affect the function of protein-coding genes. We used variants consistently predicted as highly deleterious by Polyphen2 [32] , MutationAssessor [34] , and SIFT [31] algorithms. We further restricted our analyses to LoF variants that do not appear in homozygosis [20] . By doing such filtering, we created a putative bias among populations (i.e., populations that are more genetically diverse will tend to have more chances to have homozygote pseudo-LoF genotypes and therefore be removed from the analyses). However, this should not influence the comparisons of centers within each population (i.e., see Figure  2 ). We ran a hierarchical mixed model in which the dependent variable was the log(LoF) per individual, the fixed variable was the sequencing center (BGI, BI and WUGSC), and Table S1 . the sequencing center (ANOVA p-value =8.315 × 10 −8 ) ( Table 2 ). As we are consider LoF, and set the threshold to absence of homozygotes, any sequencing error occurring in a gene will likely tend to produce a false positive that will be recovered by the three algorithms [9] . Therefore, it is not surprising that the statistical significance was larger in the case of LoF compared to benign, as well as in the magnitude of the estimated slopes (Tables 1 and  2 ). Table S1 . Table S1 . Table S1 .

Taken together, the hierarchical mixed models of the LoF and benign variables suggest that the sequencing center plays an important role as a batch effect. This result agrees with [16] , which reported an enrichment of mutation artifacts in genes. Table S1 .

Taken together, the hierarchical mixed models of the LoF and benign variables suggest that the sequencing center plays an important role as a batch effect. This result agrees with [16] , which reported an enrichment of mutation artifacts in genes.

We wondered whether such bias could also be found in the presence of derived singletons in each individual. Mixed models with the log(derived singletons by individual) and log(singletons by individual) also support a role of the sequencing center (ANOVA pvalue for derived singletons = 9.81 × 10 −10 ; ANOVA p-value for singletons = 1.28 × 10 −9 ). However, in this case not all the sequencing centers equally contribute to the bias (see Tables 3 and 4) , suggesting some heterogeneity between the centers (see Figures 4 and 5) . For example, whereas WUGSC tends to decrease the number of derived singletons observed in the individuals sequenced at that center, BI increases the number of derived singletons and BGI does not significantly affect this variable. 

We wondered whether such bias could also be found in the presence of derived singletons in each individual. Mixed models with the log(derived singletons by individual) and log(singletons by individual) also support a role of the sequencing center (ANOVA p-value for derived singletons = 9.81 × 10 −10 ; ANOVA p-value for singletons = 1.28 × 10 −9 ). However, in this case not all the sequencing centers equally contribute to the bias (see Tables 3 and  4 ), suggesting some heterogeneity between the centers (see Figures 4 and 5) . For example, whereas WUGSC tends to decrease the number of derived singletons observed in the individuals sequenced at that center, BI increases the number of derived singletons and BGI does not significantly affect this variable. Table S1 . Table 3 . Coefficient, standard error, and p value of the coefficient from the hierarchical mixed model using the number of derived singletons as the dependent variable. For this analysis, we only considered sequencing centers that sequenced samples across all populations. Table S1 .

Given these results, we studied if the batch effect due to the sequencing center could affect the estimation of variables of population genetics that use the derived alleles to make inferences. One of these variables is the identification of chunks of DNA that are enriched for derived alleles, which under certain demographic models are indicative of the presence of archaic introgression [4] . Furthermore, it has been shown that the presence of archaic introgression depends on the function of the genomic region, as archaic introgression is depleted in genomic regions that contain genes [44, 45] . We used the archaic regions from [46] that were identified in the 1000G samples, and computed the number of archaic alleles that are found in each individual. No statistically significant differences were observed between a mixed model using the log(number of introgressed alleles) and the sequencing center and one without the sequencing center (ANOVA p-value = 0.88; Table  5 ; see Figure 6 ), thus suggesting that these regions are not enriched by batch effect artifacts due to sequencing center. Nevertheless, since we are considering the estimated amount of Table S1 .

Given these results, we studied if the batch effect due to the sequencing center could affect the estimation of variables of population genetics that use the derived alleles to make inferences. One of these variables is the identification of chunks of DNA that are enriched for derived alleles, which under certain demographic models are indicative of the presence of archaic introgression [4] . Furthermore, it has been shown that the presence of archaic introgression depends on the function of the genomic region, as archaic introgression is depleted in genomic regions that contain genes [44, 45] . We used the archaic regions from [46] that were identified in the 1000G samples, and computed the number of archaic alleles that are found in each individual. No statistically significant differences were observed between a mixed model using the log(number of introgressed alleles) and the sequencing center and one without the sequencing center (ANOVA p-value = 0.88; Table 5 ; see Figure 6 ), thus suggesting that these regions are not enriched by batch effect artifacts due to sequencing center. Nevertheless, since we are considering the estimated amount of archaic introgression per individual given the described introgressed alleles rather than the length of the introgressed chunks of each individual, our results must be considered as quite conservative. archaic introgression per individual given the described introgressed alleles rather than the length of the introgressed chunks of each individual, our results must be considered as quite conservative. In the x-axis the name of the population is displayed; in the y-axis the number of introgressed alleles as defined by [4] is displayed. Acronyms are showed in Supplementary Materials Table S1 .

Next, we studied the properties of genetic markers showing strong deviations between sequencing centers for CEU, CHB, JPT, and YRI. For each population and SNV, we In the x-axis the name of the population is displayed; in the y-axis the number of introgressed alleles as defined by [4] is displayed. Acronyms are showed in Supplementary Materials Table S1.

Next, we studied the properties of genetic markers showing strong deviations between sequencing centers for CEU, CHB, JPT, and YRI. For each population and SNV, we computed Weir and Cockerham's Fst [41] using the sequencing centers as groups, scaling each Fst value by the maximum Fst value that could be achieved given such MAF, and assigning each SNV to a MAF bin. Our results show that the mean scaled differentiation between sequencing centers of each MAF category depends on the MAF of the SNV independent of which population is considered (Figure 7) . In particular, for MAFs close to 0, the amount of genetic differentiation that can be observed between centers is close to the maximum value that can be observed. In contrast, for SNVs with MAFs close to 0.5, the genetic differentiation that is observed between centers is close to 0. For all the populations, a statistically significant Spearman correlation was observed between MAF and the mean of the scaled Fst for the SNVs showing that particular MAF (JPT: rho = −0.914, p-value < 10 × 10 −16 ; CEU: rho = −0.914, p-value < 10 × 10 −16 ; CHB: rho = −0.915, p-value < 10 × 10 −16 ; YRI: rho = −0.914, p-value < 10 × 10 −16 ).

Genes 2022, 13, x FOR PEER REVIEW 12 of 18 computed Weir and Cockerham's Fst [41] using the sequencing centers as groups, scaling each Fst value by the maximum Fst value that could be achieved given such MAF, and assigning each SNV to a MAF bin. Our results show that the mean scaled differentiation between sequencing centers of each MAF category depends on the MAF of the SNV independent of which population is considered (Figure 7 ). In particular, for MAFs close to 0, the amount of genetic differentiation that can be observed between centers is close to the maximum value that can be observed. In contrast, for SNVs with MAFs close to 0.5, the genetic differentiation that is observed between centers is close to 0. For all the populations, a statistically significant Spearman correlation was observed between MAF and the mean To test the presence of differences in the sequencing error rate, we generated the null distribution of the scaled Fsts given a MAF under the hypothesis that differences among centers were due to randomness. We observed that the Monte Carlo-scaled Fst showed the same trend as the observed in the real data. Nevertheless, as expected by the lack of differential spurious mutations due to the sequencing center, the simulated Fsts were on average much smaller than the observed ones for each population (slope in a linear model for JPT: 3.932, p-value < 2 × 10 −16 ; slope CEU: 3.896, p-value < 2 × 10 −16 ; slope CHB: 3.2, pvalue < 2 × 10 −16 ; slope YRI: 3.52, p-value < 2 × 10 −16 ).

We decided to test if these batch effects can have a real effect when using the 1000 Genomes Project as a general population dataset when analyzing newly described populations. To test this scenario, we used the dataset presented in [26] , which consists of 29 individuals from the Southern Eastern Pyrenees, effectively a rural population. To test the presence of differences in the sequencing error rate, we generated the null distribution of the scaled Fsts given a MAF under the hypothesis that differences among centers were due to randomness. We observed that the Monte Carlo-scaled Fst showed the same trend as the observed in the real data. Nevertheless, as expected by the lack of differential spurious mutations due to the sequencing center, the simulated Fsts were on average much smaller than the observed ones for each population (slope in a linear model for JPT: 3.932, p-value < 2 × 10 −16 ; slope CEU: 3.896, p-value < 2 × 10 −16 ; slope CHB: 3.2, p-value < 2 × 10 −16 ; slope YRI: 3.52, p-value < 2 × 10 −16 ).

We decided to test if these batch effects can have a real effect when using the 1000 Genomes Project as a general population dataset when analyzing newly described populations. To test this scenario, we used the dataset presented in [26] , which consists of 29 individuals from the Southern Eastern Pyrenees, effectively a rural population.

From the 1000 Genomes Project, we selected the IBS and FIN populations. IBS was selected as it is the most similar population to the SEP dataset found in the 1000 Genomes Project dataset. The FIN population was selected as they are considered a genetic isolate within Europe [47] and to serve as a comparison to rural populations. To reduce the possible batch effect due to the different SNP calling algorithms used in both projects, we performed the same SNP calling in both sets of samples (see Section 2) .

Using the combined dataset, we predicted the number of damaging alleles using the same filtering procedure shown above. As shown in Figure 8 , there are notable differences between the sequencing centers in both the FIN and IBS populations. These differences were confirmed through a Wilcox test between the different sequencing centers within the FIN and IBS populations (see Tables 6 and 7 for p-values).

From the 1000 Genomes Project, we selected the IBS and FIN populations. IBS was selected as it is the most similar population to the SEP dataset found in the 1000 Genomes Project dataset. The FIN population was selected as they are considered a genetic isolate within Europe [47] and to serve as a comparison to rural populations. To reduce the possible batch effect due to the different SNP calling algorithms used in both projects, we performed the same SNP calling in both sets of samples (see Section 2) .

Using the combined dataset, we predicted the number of damaging alleles using the same filtering procedure shown above. As shown in Figure 8 , there are notable differences between the sequencing centers in both the FIN and IBS populations. These differences were confirmed through a Wilcox test between the different sequencing centers within the FIN and IBS populations (see Tables 6 and 7 for p-values). 

The finalization of the 1000G project represented a milestone for the human population genetics community [1] . This dataset is normally used as basis for imputation in microarrays [48] , to phase genomes [7] in evolutionary studies [2, 4, 5, 49] , in multiple medical studies [50] , or as a basis to identify potential genetic isolates [3] .

However, two studies [13, 15] have recently raised concerns about the presence of batch effects at variants at low frequency. The study [15] in particular pointed to the se- 

The finalization of the 1000G project represented a milestone for the human population genetics community [1] . This dataset is normally used as basis for imputation in microarrays [48] , to phase genomes [7] in evolutionary studies [2, 4, 5, 49] , in multiple medical studies [50] , or as a basis to identify potential genetic isolates [3] .

However, two studies [13, 15] have recently raised concerns about the presence of batch effects at variants at low frequency. The study [15] in particular pointed to the sequencing center as one of the main factors affecting the presence of rare spurious mutations in 1000G individuals. Given these results, in this study we looked at to what extent the sequencing center, as reported by the spreadsheet of the 1000G (https://www.internationalgenome. org/data/ accessed on 26 April 2019), could influence statistics of population genomics that quantify variants at low frequency in the human genome. Across this paper we have presented proof that some variation present in the 1000 Genomes Project dataset seems to have a noticeable batch effect regarding the sequencing center of the sample, although this batch effect is not noticeable for common variants (MAF > 5%).

LoF variants are usually present at low frequency in the population due to their deleterious effects in carriers and their consequent erasure from the population by purifying selection [51] . Given the fact that 1000G individuals are healthy [1] , the presence of LoF variants is expected in heterozygosis and to be mostly recessively inherited. Therefore, the presence of homozygote LoF variants in an individual should either be explained by a relaxation of the purifying selection [52] or by the presence of sequencing errors. In fact, given the evolutionary constraints of LoF, it could be expected that the power for identifying NGS sequencing errors should be enhanced at LoF variants. Sequencing errors typically occur in 0.1-1% of the sequenced nucleotides [53] ; as they are rare events, it can be expected that they will tend to produce rare mutations. Several factors shape the probability of a nucleotide to be erroneously sequenced; in particular, poor-quality bases due to low deep sequencing can be misinterpreted by the sequencers [17] . It has been shown that even in the whole exome sequencing of the 1000G, there exists variation in depth of sequencing in the different sequencing centers [54] . Furthermore, there is heterogeneity in the error rates across sequencing platforms [55] . Given that the different populations were nonhomogeneously sequenced at the different centers ( Figure 1 ) and that each center used a different technologies or methodologies, differences in the LoF fraction within populations and among genetically related populations due to the sequencing center can be expected. In fact, we have observed that for some populations, the number of LoF variants in a sample is directly dependent on the sequencing center used for the different samples. These results suggest a substantial bias in LoF identification due to sequencing center, even when using stringent filters for defining LoF.

Considering the use of the 1000G for calibrating isolated populations and for comparing the presence of LoF (i.e., see [3] ), we wondered to which extent the sequencing center could affect the conclusions regarding the condition of an isolated population. We used the SEP population [26] , which has a similar genetic ancestry to the IBS population and which has been sequenced at~40× coverage. Compared to IBS and FIN, classically considered isolated populations in Europe showing an excess of private Mendelian diseases [47] , the SEP population has a lower number of LoF damaging alleles compared to both populations. From a biological point of view, this result could be explained by the presence of inbreeding due to isolation and strong purifying selection acting for a long period in the population (as shown with gorilla populations in [56] ). However, when considering the whole genetic variation present in SEP individuals, they genetically resemble IBS individuals and there is no evidence of the presence of long runs of homozygosity or other statistics suggesting long ongoing isolation and inbreeding (see [26] ). Moreover, the observed differences between SEP, FIN, and IBS depend on which sequencing center produced the sequences (Tables 6  and 7) . Whereas for BCM the differences in LoF among populations are negligible, for BGI sequenced samples IBS would have an enrichment of LoF variants compared to FIN, which would invalidate previous findings about the population genetics of Finish people [47] . Overall, these results suggest that the observed differences in LoF are more likely due to biases due to the sequencing center. The fact that we observed similar results in the 1000G as those for LoF variants when considering benign mutations suggests that the sequencing bias due to sequencing center could be extended to other mutations occurring at low frequency in these samples. In agreement with this hypothesis, we observed that, for singleton mutations, the sequencing center plays a role, although this signal is not as strong as in the case of LoF variants.

We wondered the limit to which biases in the sequencing center could affect the profile of mutations in a population. Using the originally sampled YRI, CEU, CHB, and JPT populations in Phase 1 of the 1000G, we observed that the bias among centers decreases for all the populations as the frequency of the MAF increases in the considered population. This result could be explained by the variance from a given MAF becoming much higher than the variance due to specific sequencing center error rates, effectively masking their effect. To test this hypothesis, we generated the null distribution of scaled Fsts given a MAF under the hypothesis of no differential sequencing center artifacts. We observed that the Monte Carlo Fst showed the same trend as that observed in the real data.

Overall, these results suggest that the effect we observed in singletons is also observed at higher frequencies. However, when analyzing the detected effect of the sequencing errors on a biological variable that depends on the genetic variation to be detected, such as the level of introgression, we found no effect of the sequencing center on the number of introgressed alleles as defined by [4] . However, it is interesting to note that strong discrepancies were observed in the amount of introgressed alleles for some populations (CHB and CHS; see Figure 6 ). This is particularly relevant because several studies pointed to the presence of heterogeneous patterns of ghost archaic populations in these populations [4, 25] .

In this study, we have shown that statistics that use SNVs that occur at low frequencies in the 1000G are influenced by the sequencing center that generated the samples. Since the proportion of samples generated at each center per population is not the same, conclusions about the biological processes that generated these differences can be jeopardized.

The enrichment of putative biases due to batch effects occur mostly with low frequency variants; in particular, in variants where the sequencing error occurs within a gene, it is more likely that the error will generate a ghost functional error in the protein. Nevertheless, the presence of sequencing errors extends to other MAF categories.

Therefore, when considering such particular kind of statistics and genetic variants, caution must be taken in interpreting the results, particularly when merging with another dataset that may have its own batch effects.

Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/genes13010044/s1, Table S1 : Acronyms for the sequencing centers and populations, Table S2 : Distribution of population samples across sequencing centers. 

Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing

Estimation of allele-specific fitness effects across human protein-coding sequences and implications for disease

Whole-Exome Sequencing in the Isolated Populations of Cilento from South Italy

Analysis of Human Sequence Data Reveals Two Pulses of Archaic Denisovan Admixture

Genomes Selection Browser 1.0: A genome browser dedicated to signatures of natural selection in modern humans

The online catalog of human genome adaptation

Haplotype Estimation Using Sequencing Reads

Genetic studies of body mass index yield new insights for obesity biology

A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes

Integrative Annotation of Variants from 1092 Humans: Application to Cancer Genomics

The 1000 Genomes Project Consortium

The landscape of human STR variation

The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes

Rapid evolution of the human mutation spectrum

Evaluating the quality of the 1000 genomes project data

Turning Vice into Virtue: Using Batch-Effects to Detect Errors in Large Genomic Data Sets

Genotype and SNP calling from next-generation sequencing data

Allele balance bias identifies systematic genotyping errors and false disease associations

Analysis of protein-coding genetic variation in 60,706 humans

Common homozygosity for predicted loss-of-function variants reveals both redundant and advantageous effects of dispensable human genes

Recent Explosive Human Population Growth Has Resulted in an Excess of Rare Genetic Variants

The origins, determinants, and consequences of human mutations

Selection against archaic hominin genetic variation in regulatory regions

The major genetic risk factor for severe COVID-19 is inherited from Neanderthals

Approximate Bayesian computation with deep learning supports a third archaic introgression in Asia and Oceania

Fine-scale population structure in five rural populations from the Spanish Eastern Pyrenees using high-coverage whole-genome sequence data

A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data

The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data

A framework for variation discovery and genotyping using next-generation DNA sequencing data

0: A Database of Human Non-synonymous SNVs and Their Functional Predictions and Annotations

Using Drosophila melanogaster as a Model for Genotoxic Chemical Mutational Studies with a New Program

A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3

A method and server for predicting damaging missense mutations

Predicting the functional impact of protein mutations: Application to cancer genomics

AnRPackage for Multivariate Analysis

Fitting Linear Mixed-Effects Models Using lme4

The R Project for Statistical Computing

The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing

Estimating F-Statistics for the Analysis of Population Structure

The Relationship Between FST and the Frequency of the Most Frequent Allele

Human Knockout Carriers: Dead, Diseased, Healthy, or Improved?

Impacts of Neanderthal-Introgressed Sequences on the Landscape of Human Gene Expression

The Strength of Selection against Neanderthal Introgression

Sprime Results for 1000 Genomes Non-African Populations and SGDP Papuans

Genetics in an isolated population like Finland: A different basis for genomic medicine?

Imputation of Variants from the 1000 Genomes Project Modestly Improves Known Associations and Can Identify Lowfrequency Variant-Phenotype Associations Undetected by HapMap Based Imputation

The 1000 Genomes Project Consortium. Human genomic regions with exceptionally high levels of population differentiation identified from 911 whole-genome sequences

Predicting disease-causing variant combinations

The effect of deleterious mutations on neutral molecular variation

Enrichment of low-frequency functional variants revealed by whole-genome sequencing of multiple isolated European populations

Benchmarking of computational error-correction methods for next-generation sequencing data

Lopes-Cendes, I. Methodological differences can affect sequencing depth with a possible impact on the accuracy of genetic diagnosis

The role of replicates for error mitigation in next-generation sequencing

Mountain gorilla genomes reveal the impact of long-term population decline and inbreeding