BioMed CentralBMC Genomics ss Open AcceMethodology article Identification of disease causing loci using an array-based genotyping approach on pooled DNA David W Craig†1, Matthew J Huentelman†1, Diane Hu-Lince1, Victoria L Zismann1, Michael C Kruer1, Anne M Lee1, Erik G Puffenberger2, John M Pearson1 and Dietrich A Stephan*1 Address: 1Neurogenomics Division, Translational Genomics Research Institute (TGen) Phoenix, Arizona 85004, USA and 2Clinic for Special Children, Strasburg, PA 17579, USA Email: David W Craig - dcraig@tgen.org; Matthew J Huentelman - mhuentelman@tgen.org; Diane Hu-Lince - dhlince@tgen.org; Victoria L Zismann - vzismann@tgen.org; Michael C Kruer - mkruer@tgen.org; Anne M Lee - alee@tgen.org; Erik G Puffenberger - epuffenberger@clinicforspecialchildren.org; John M Pearson - jpearson@tgen.org; Dietrich A Stephan* - dstephan@tgen.org * Corresponding author †Equal contributors Abstract Background: Pooling genomic DNA samples within clinical classes of disease followed by genotyping on whole-genome SNP microarrays, allows for rapid and inexpensive genome-wide association studies. Key to the success of these studies is the accuracy of the allelic frequency calculations, the ability to identify false-positives arising from assay variability and the ability to better resolve association signals through analysis of neighbouring SNPs. Results: We report the accuracy of allelic frequency measurements on pooled genomic DNA samples by comparing these measurements to the known allelic frequencies as determined by individual genotyping. We describe modifications to the calculation of k-correction factors from relative allele signal (RAS) values that remove biases and result in more accurate allelic frequency predictions. Our results show that the least accurate SNPs, those most likely to give false-positives in an association study, are identifiable by comparing their frequencies to both those from a known database of individual genotypes and those of the pooled replicates. In a disease with a previously identified genetic mutation, we demonstrate that one can identify the disease locus through the comparison of the predicted allelic frequencies in case and control pools. Furthermore, we demonstrate improved resolution of association signals using the mean of individual test-statistics for consecutive SNPs windowed across the genome. A database of k-correction factors for predicting allelic frequencies for each SNP, derived from several thousand individually genotyped samples, is provided. Lastly, a Perl script for calculating RAS values for the Affymetrix platform is provided. Conclusion: Our results illustrate that pooling of DNA samples is an effective initial strategy to identify a genetic locus. However, it is important to eliminate inaccurate SNPs prior to analysis by comparing them to a database of individually genotyped samples as well as by comparing them to replicates of the pool. Lastly, detection of association signals can be improved by incorporating data from neighbouring SNPs. Published: 30 September 2005 BMC Genomics 2005, 6:138 doi:10.1186/1471-2164-6-138 Received: 13 May 2005 Accepted: 30 September 2005 This article is available from: http://www.biomedcentral.com/1471-2164/6/138 © 2005 Craig et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Page 1 of 9 (page number not for citation purposes) http://www.biomedcentral.com/1471-2164/6/138 http://creativecommons.org/licenses/by/2.0 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16197552 http://www.biomedcentral.com/ http://www.biomedcentral.com/info/about/charter/ BMC Genomics 2005, 6:138 http://www.biomedcentral.com/1471-2164/6/138 Introduction The ability to genotype hundreds of thousands of single nucleotide polymorphisms (SNPs) across the genome and to perform association analysis between cases and controls provides, for the first time, a discovery-based approach for determining the underpinnings of complex human genetic disorders. Technologies from Affymetrix (microarray-based GeneChip® Mapping arrays), Illumina (BeadArray™), and Sequenom (MassARRAY™) are now available with sufficient density to detect linkage disequi- librium between informative SNPs and nearby disease- causing nucleotide variants through non-hypothesis based whole-genome association scans for certain popula- tions [1-3]. Several practical issues make whole-genome association studies by utilizing individual genotyping difficult to implement [4]. Power estimates predict that somewhere on the order of a thousand cases and control subjects must be genotyped to detect allelic differences of <5% between the cohorts, as well as to detect rare alleles which may be causative in only a subset of the cohort [5]. Addi- tionally, population stratification and allelic imbalance may identify SNPs that have statistically significant allelic frequency differences yet have no relation to the disease [4,6,7]. Whole-genome association studies are now tech- nologically possible, though the cost is several million dollars if samples are individually genotyped. Here we describe the validation of pooling genomic DNA samples as a rapid pre-screening to detect disease-causing loci for a few thousand dollars on SNP genotyping microarrays. It is possible to identify SNPs that have significant differ- ences in allelic frequencies between two populations while saving a significant amount in resources by pooling genomic DNA and then SNP genotyping on a single microarray, or preferably on a series of replicated arrays. Indeed, a limited number of studies have been conducted that demonstrate the possibility to predict accurately the allelic frequencies of a SNP from a pooled sample on a microarray, and, in fact identify quantitative trait loci [8- 12]. Typically, these studies have validated the pooled allelic frequencies by later individually genotyping between ten to a few hundred SNPs. The most elegant val- idations of pooling have used indirect approaches, such as spiking a single individual of known genotype into a pooled group with unkown genotypes [9]. One cannot realistically expect all probes on a microarray to function equally, especially considering that the objec- tive of these platforms is to identify allelic differences of 0%, 50%, and 100%. Indeed, as platforms move to 100,000+ SNPs, the ability to select preferentially the best performing SNPs, such as was done in the design of the Affymetrix 10K GeneChip®, will likely be compromised. As a result, our prediction is that many SNPs will be unre- liable for pooling, and thus may be more likely to lead to false positives. In a pooling study, limiting false positives that are a result of the assay, rather than the underlying population, will be a major factor in being able to realis- tically identify SNPs that can predict disease status. In this study, we investigated the reliability of SNP allelic fre- quency measurements as determined from pooling genomic DNA samples on SNP mapping arrays. We fur- ther demonstrate our ability to identify poorly predictive SNPs prior to analysis. Results We compared the predicted allelic frequencies from pools of genomic DNA to the known allelic frequencies deter- mined by individual genotyping in order to establish the accuracy of pooling. The goal was to compare allelic fre- quencies for all the SNPs on a microarray, since not all SNPs will be equally accurate for the prediction of fre- quencies. Inaccurate SNPs are expected to be problematic as microarrays progress to probe hundreds of thousands of SNPs, whereby SNPs are chosen primarily for their physical position in the genome and not for their repro- ducibility. Indeed, in order to identify 11,500 SNPs for the Affymetrix 10K GeneChip® Mapping Array nearly 500,000 SNPs were screened by Affymetrix for reliability in the assay [13]. Individual genotyping of SNPs for 107 samples Allelic frequencies for 10,205 SNPs on 107 samples were determined by individually genotyping on the 10K Gene- Chip®. These samples were genotyped over a one-year period; therefore, some samples were genotyped on ver- sion 1.0 of this platform and others on version 2.0. Only SNPs genotyped on both platforms were utilized for this study. Accuracy of SNP calls was approximately 99.8%, as determined by inheritance errors in family pedigrees, in line with the accuracy reported by Affymetrix (99.9%) [13]. We found no significant decrease in accuracy between the two versions of the 10K GeneChip®. The aver- age percentage of SNPs called across all 107 samples was 90.9%. Only individuals with a call rate above 80% were included in the present study. For example, 2,783 SNPs were called for all individuals and 3,525 SNPs were called for >98% of individuals. In our experience using this plat- form on over 4,000 samples we determined that the call rate is highly dependent on DNA quality and that high quality genomic DNA yields a call rate of 95–98%. The samples used in this study have been collected over sev- eral years with variable DNA quality. It is to be expected expect that large-scale whole-genome association studies will also be forced to utilize DNA of less than optimal quality since hundreds to thousands of individuals are needed. Thus the genomic DNA used in this study will likely be representative of what could be expected in a Page 2 of 9 (page number not for citation purposes) BMC Genomics 2005, 6:138 http://www.biomedcentral.com/1471-2164/6/138 whole-genome association study on a disease where sev- eral-thousand individuals are needed. Construction of pools Pools were created in triplicate from the individually gen- otyped samples. The individuals were from the Old Order Amish and Old Order Mennonite populations of south- eastern Pennsylvania [14]. Pool 1 consisted of 52 individ- uals, Pool 2 consisted of a different 52 individuals, and Pool 3 consisted of 3 patients who died of a form of sud- den infant death syndrome known as SIDDT and had a known region of identity-by-descent (shared a pre- defined allele on all six chromosomes across the case cohort) [14]. This region was defined on the 10K microar- ray by 13 consecutive autozygous SNPs, 6 of which were informative. All DNA was quantitated using PicoGreen reagent (Molecular Probes, Eugene, Oregon) to ensure equal amounts were contributed to the pool from each individual. These three pools were then genotyped in rep- licates of three on the 10K GeneChip®. In all, 9 microar- rays were used for the pooled genotyping compared to 107 microarrays for the individual genotyping. Calculation of allelic frequencies from pooled samples The predicted allelic frequencies from pooled genotype samples were calculated for each SNP using a k-correction factor based on their derivation from over 3,000 individ- uals genotyped on the 10K GeneChip®. The training set consisted of 3,000 individuals that were genotyped in our lab within the past year. All had call rates above 80% with an average call rate of 95%. None of the 3,000 individuals used for calculation of k-correction factors were included in the pooled genomic DNA. The purpose of the k-correction factor is to allow for cal- culation of a predicted allelic frequency from peak heights, or in this case fluorescence signal, whereby p = A/ (A+kB) [15]. K-correction factors have recently become well established and have been used successfully in primer extension assays whereby measurements in SNP allelic frequencies on pooled genomic DNA have been taken by HPLC, mass spec, and by fluorescence in TAQ- MAN assays [15-19]. For the 10K Mapping array assay, p is the predicted allelic frequency of the A allele, A is the flu- orescent signal intensity measure of the A allele, and B is the fluorescent signal intensity measure of the B allele. The k-correction factor can be calculated for a given SNP using a heterozygote who is AB, effectively 50% A and 50% B. Conveniently, output of the Affymetrix GeneChip software for the Affymetrix 10K Mapping Array includes Relative Allele Signal (RAS) values which have been previ- ously used to determine k-correction values (see Figure 1) [11]. Generally, RAS = A/(A+B). Here, A refers to the median match/mismatched differences of the major allele and B for the minor allele (Affymetrix Technical Manual). Example of RAS statistics for three SNPs based on genotyp-ing of 100 individuals with an average call rate of all SNPs greater than 98%Figure 1 Example of RAS statistics for three SNPs based on genotyp- ing of 100 individuals with an average call rate of all SNPs greater than 98%. These example SNPs illustrate how SNP call reliability can vary both between SNPs and within the same SNP, as measured by RAS1 and RAS2 values. Blue spheres are BB individuals, orange triangles are AA individu- als, and green squares are AB individuals, grey stars are "Not Called". Page 3 of 9 (page number not for citation purposes) BMC Genomics 2005, 6:138 http://www.biomedcentral.com/1471-2164/6/138 There are two RAS values, RAS1 (sense) and RAS2 (anti- sense) since both sense and antisense directions are probed. Whereas k-correction factors based on the Affymetrix 10K GeneChip® Mapping Array have been previously calcu- lated directly using only heterozygous RAS values [11], we suggest that this can be improved upon since the RAS val- ues are generally not 0 or 1 for homozygotes (See Figure 1). Indeed, we observed significant deviation for many SNPs, which could potentially add significant bias (see discussion) [11]. Thus for each SNP, we normalized RAS values, referred to as nRAS, using the individuals from the training set that were AA (normalized to 1) and BB (nor- malized to zero). Without this normalization, predicted frequencies will be systematically biased as the pooled samples approach homozygosity. Thus nRASx = (RASx- AAave)/(BBave) where AAave is the average RASx score for individuals AA in the training set, and BBave is the RASx score for individuals BB in the training set. The value of X refers to whether the calculation is for RAS1 or RAS2, and nRAS values are calculated for both RAS1 and RAS2. Thus, two predictions of allelic frequency are obtained: one from RAS1 and one from RAS2. Each RAS variable has dis- tinct variability, and as shown in Figure 1(b), RAS1 may be very precise with low variance, while RAS2 may exhibit high variance, and vice versa. Averaging the two RAS val- ues will mask the RAS value with lower variance. Because of this independent variability, we do not recommend averaging RAS1 and RAS2 for all SNPs as was suggested in other pooling studies [8,10,12]. Rather, we recommend treating the two RAS values as separate experiments, and preferably removing RAS values with the greatest variance prior to analysis. Values making up each of the RAS1 and RAS2 mean values are provided for homozygous AA, homozygous BB, and heterozygous individuals on a website based on our 3,000 person database which is being made available to the pub- lic as part of this publication. These k-correction factors derived from RAS1 and RAS2 values using this training dataset are available at [20] and as supplementary material. Comparison of allelic frequencies: Pooling vs. individual genotyping For the 10,205 SNPs on the Affymetrix 10K GeneChip® we found a median difference in allelic frequency between individually genotyped samples and pooled samples of 5.1%, a mean difference of 6.3%, and a standard devia- tion of 4.9%. Figure 2 shows a histogram for all the SNPs and their difference between the predicted frequency from the pools and the individual genotypes data. Other stud- ies have reported slightly lower differences between pooled and individually genotyped methods for determining allelic frequencies (3–5%) [8,9,12]. Many reasons are likely for this difference: We used DNA that was collected over ten years and was of varying quality; we also compared all the SNPs on the microarray rather than selecting only a few SNPs for comparison. Realistically, the greater difference seen in this study may be more rep- resentative of large association studies, in which thou- sands of genomic samples of varying quality are pooled. (A) Allele frequency differences between individual and pooled genotypesFigure 2 (A) Allele frequency differences between individual and pooled genotypes. Histogram representing the total number of SNPs at each allele frequency difference between individ- ual and pooled samples. (B) Accuracy of predicted SNP fre- quencies increases for those SNPs that perform well on Mapping 10K individual assays and decreases for poorly per- forming SNPs. The mean and median absolute difference between the predicted allelic frequency and individually gen- otyped allelic frequencies are shown vs. the binned perform- ance of SNPs on individual assays. Performance is ranked by the frequency of calls in a set of 3,000 individually genotyped samples. Page 4 of 9 (page number not for citation purposes) BMC Genomics 2005, 6:138 http://www.biomedcentral.com/1471-2164/6/138 Identification of assay false positives While the differences in frequencies between pooled and individual genotyped samples show that calculating fre- quencies from pooled samples is highly accurate, it is per- haps of greater importance that we are able to predict those SNPs that are unreliable and largely inaccurate. Assays genotyping 500,000 SNPs will likely not have the ability to be as selective and thus are likely to provide a large number of SNPs that do not reliably quantitate allelic frequencies from pooled genomic samples. As shown in Table 1, we found that the 100 SNPs most likely to give a "NoCall" in individual genotyped samples more often gave unreliable predictions of allelic frequencies in pooled samples. Furthermore, as shown in Figure 2b, those SNPs that are the worst 10% (in terms of % called for individual genotypes) also gave rise to higher allelic frequency differences. We found that rarely called SNPs are also likely to be called inaccurately (Table 1 and Figure 3c). In this case, constructing k-correction factors and predicting allelic fre- quencies will be unreliable for these SNPs, even if pooled replicates show low variability. SNPs with the highest var- iance in pool replicates were also unreliable. As a practical measure, we found that applying both filters for too many "NoCalls" in a training set and having a high variance in pooled replicates was more effective than either measure alone. We could identify 1/3rd of the worst performing SNPs (greater than 12% difference), by removing the worst performing 5% of SNPs based on variance in pool replicates and removing the worst performing 5% of SNPs based on excessive "No Calls" when individually geno- typed. Consequentially, the removal of those SNPs that are either poorly called in a training set of individually genotyped samples or highly variable across pooled repli- cates significantly decreases the number of false positives. The number of SNPs removed should maintain a balance between retaining dense SNP coverage and excluding those SNPs more likely to give false positives. Ultimately, removing potential false positives will be a compromise between the coverage of the SNP microarray and the genetic diversity of the population. It is of interest to note that allelic frequencies calculations were more accurate as SNPs approached homozygosity. For example, for those SNPs with allelic frequencies from 0% to 20% and from 80% to 100% the mean difference was 5.2% vs. a mean difference of 7.1% for SNPs with allelic frequencies between 20% and 80%. This finding may be due to inaccuracies in the assays as SNPs approach 50%, since the variance for heterozygotes is higher than the variance measured for homozygotes. Identification of a disease locus from genotyping of pooled samples In order to assess whether it is feasible to use pooled gen- otyping to identify the genetic locus for a disease, we created case and control pools for the disease sudden infant death with dysgenesis of the testes (SIDDT) and a pool of Amish control individuals. A test for proportions was employed to detect statistical differences between cases and controls. This test-statistic is more often used in pooled studies since frequency data are generally not whole-integers [19]. Shown in equation 1 is the calculation for the test statistic (T) where fcase is the allele frequency for the case group and fcontrol is the allele frequency for the control group. Table 1: Inaccurate SNPs with the largest difference between SNP allele frequencies when genotyped individually vs. calculated from pooled DNA can be partially predicted. Nearly 40% of the SNPs found to be the 100 most inaccurate SNPs were also either (a) one the 500 worst performing SNPs in individual genotyping or (b) had the largest variability between replicates in the pool. All SNPs (a) 500 worst performing SNPs Criteria: NoCalls on 3000 person database (b) 500 worst performing SNPs Criteria: Pool variability from 3 replicates SNPs found in (a) or (b) Inaccuracy (predicted vs. genotyped) 100 most inaccurate SNPs (individual genotyped vs. pooled) 24.2% 27.3% 38.6% 27.2% 500 most inaccurate SNPs (individual genotyped vs. pooled) 12.5% 14.5% 22.1% 20.2% Remaining 9605 SNPs 4.4% 5.5% 8.6% 5.0% T f f f f f f case control case Control case case = − + = var( ) var( ) , var( ) (11 2 1 − ( )f N case case ) Page 5 of 9 (page number not for citation purposes) BMC Genomics 2005, 6:138 http://www.biomedcentral.com/1471-2164/6/138 The distribution follows an approximate χ2 distribution with one degree of freedom. The SNP with the highest sig- nificance, rs949748, had a p-value of 0.00016 and was in the SIDDT locus at chromosome 6q21. However, it is expected that the SNP with the lowest p-value will not always be at the correct disease locus. Even strong single SNP association signals will likely be obscured in the noise when 500,000+ SNPs are probed. Thus we employed a moving window whereby the mean test-statis- tic of several consecutive SNPs was calculated at each SNP position across the genome. The objective of the moving window was to leverage the fact that neighbouring SNPs will likely be in linkage disequilibrium, whereby one SNP is at least partially predictive of the neighbouring SNP. The number of SNPs contained in the moving window was varied between 1 and 25. Shown in table 2 is the rank of the 6q21 region for varying window sizes. It is of inter- est to assess sensitivity of this windowing approach to SNPs within the region. Thus, we consecutively removed the top three SNPs contributing to the overall association signal. Removing the first two SNPs has little effect on detecting the association signal. The 6q21 region remains the most significant for window sizes of four and greater even when these top two SNPs are removed. In compari- son, all three top SNPs has a marginal effect, lowering the rank of the region from highest to within the top ten. To compute the statistical significance of averaged test-sta- tistics, we used a permutation test. With this approach the consecutive order of SNPs was randomized in four hun- dred separate bootstrapped datasets. P-value statistics were calculated from the distribution of these datasets. Shown in Table 2, the SIDDT locus (6q22.1-q22.31) was generally revealed as the most significant region of associ- ation for window sizes between 4 and 20 SNPs. It will not always be the case that SNPs are in linkage dis- equilibrium and a windowing-based approach will be effective. The permutation statistics can be used to test this scenario in order to see if the frequency of a given mean window test-statistic is indeed significant. The Old Order Amish and Mennonite populations used in this study Identification of the SIDDT locus from pooled genomic DNA by calculating the mean test-statistic for a rolling window of con-secutive SNPsFigure 3 Identification of the SIDDT locus from pooled genomic DNA by calculating the mean test-statistic for a rolling window of con- secutive SNPs. The moving window was determined across the genome and the p-value was calculated from a distribution of 400 bootstraps of the original dataset. Mean window sizes of 1, 3, 5, 10, 15, and 20 are shown and the SIDDT locus is high- lighted in yellow. The SIDDT disease locus is the top region for window sizes of 1, 5, 10, 15, and 20. Page 6 of 9 (page number not for citation purposes) BMC Genomics 2005, 6:138 http://www.biomedcentral.com/1471-2164/6/138 arise from a population founded in approximately the six- teenth century with expectedly larger regions of identity by descent. The Amish and Mennonites are not one large isolated population. It is more accurate to say that both these populations derive from the Swiss Anabaptists (circa 1525). These groups are socially and genetically unique even though both came from the same geographical region. Thus undoubtedly some stratification exists between our two cohorts and it is encouraging that the correct region was easily identifiable despite any stratifica- tion [21]. Based on previous research in this population, the 10K Mapping Array was anticipated to be of sufficient density whereby many of the SNPs would be in relative linkage disequilibrium throughout this regional population. Indeed, the permutation statistics of moving windows support this notion as the 6q21 region shows a p-value of <1e-6 for window sizes of 10 SNPs and greater, far lower than would be expected with ~10,000 SNP measure- ments. Other methods have been developed that reduce noise using haplotype data from SNPs in linkage disequi- librium [22]. In the absence of this haplotype data, which may often be the case, it is encouraging that the very straightforward statistical approach described here is effective at identifying the correct locus. Discussion Our results show that (1) pooling genomic samples is highly accurate; (2) unreliable SNPs most likely to give false-positives can be largely identified and removed prior to association analysis; and (3) a moving window of aver- aged test-statistics can be used to detect association sig- nals. Additionally, we have described modifications as to how allelic frequencies are calculated from RAS values of pooled samples that remove systematic biases. Pioneering work on pooling studies by other research groups has shown that the average Relative Allele Signal (RASave) can be effectively used to derive k-correction fac- tors by k = RASAVE/(1-(RASAVE), and as such, can be used to accurately predict allelic frequencies [8,11,12]. Pooling studies are intended to be screening approaches. RAS val- ues are highly convenient since they are generated by the Affymetrix GDAS software on the 10K platform and fairly intuitive to understand. We suggest significant improve- ments to this innovative approach that will remove biases; allow for continued use of RAS values; and result in more accurate predictions. These improvements focus on lower- ing the number of false positives due to added variance or systematic biases, since the utility of pooling-based approaches will be based on how one can detect associa- tion signals given a high number of false positives. First, RAS1 and RAS2 should not be averaged since they are separate probe sets with distinct variances. One may unnecessarily propagate unwanted variance by averaging. For example in Figure 1b, it is clearly visible that RAS2 is highly predictable of the particular SNP allele whereas RAS1 is highly inaccurate. In this case, averaging RAS2 and RAS1 will produce a RASAVE value that is less accurate than RAS2 alone. We suggest instead that these values be treated as separate measures, each with their distinct vari- ance. In the case of RAS values with a large variance, these values should not be used due to the increased chance of a false positive. Table 2: Identification of disease locus using a moving window. SNPs were ranked by test statistics and sorted by physical position. The average was calculated for a moving window of consecutive SNPs across the genome. The region 6q22.1 was already known to contain the mutation leading to the SIDDT. The rank of region 6q22.1 for a various window sizes in shown in the second column. In the 3rd, 4th, and 5th columns, the top 1, 2, and 3 SNPs were removed from the 6q22.1 regions to probe sensitivity of window size. # SNPs Averaged in Moving Window 6q22.1 Rank Region (All SNPs) 6q22.1 Region Rank (Exclude Top 1 SNP in Region) 6q22.1 Region Rank (Exclude Top 2 SNPs in Region) 6q22.1 Region Rank (Exclude Top 3 SNPs in Region) 1 1 22 24 60 2 11 11 19 11 3 6 6 6 14 4 1 1 1 2 5 1 1 1 8 6 2 2 2 3 7 1 1 1 13 8 1 1 1 3 9 1 1 1 9 10 1 1 1 3 Page 7 of 9 (page number not for citation purposes) BMC Genomics 2005, 6:138 http://www.biomedcentral.com/1471-2164/6/138 Second, we highly recommend that RAS values for each SNP be normalized prior to calculation of allelic frequencies. When these values are not normalized prior to calculating a predicted allelic frequency a significant bias is introduced since the RAS values, as produced by the Affymetrix GDAS software, generally are not 0.0 or 1.0 for homozygous BB and AA respective alleles. Indeed, on a training set of 1000 individuals we found that 34% of SNPs who were called AA had a RAS value less than 0.9 and 35% of SNPs called BB had a RAS value greater than 0.1. This bias can be seen in an example calculation using k-correction factors derived from a typical RAS value directly obtained from the GDAS software. For example, the average RAS1 for a given SNP of an AA individual may be 0.9, the average RAS1 for a heterozygous individual may be 0.5, and the average RAS2 for a BB individual may be 0.1. When one uses the approach outlined by Butcher, et al, the k-correction factor is 1.0, whereby the RAS value of the average heterozygote is divided by one minus this value [10]. In a pooled sample, the same SNP is expected to have a RAS value of 0.9 if it is completely homozygous for AA. However, using the k-correction approach on non- normalized RAS values, one would predict an allelic fre- quency of 90%, whereas the actual frequency is 100%, a bias of 10%. These biases would be most pronounced as pools approach dominance by one allele type, as would often be the case for a SNP highly associated to a disease. While RAS values are readily obtainable from the Affyme- trix software for the 10K GeneChip® arrays, they are not provided for the 100K or 500K. This is partly due to the fact that RAS values are no longer used to make a SNP call. We have developed a simple Perl script which generates RAS values, still useful in pooling, for the 100K and 500K Affymetrix GeneChip® platform from CHP files. This tool is available on our website [23]. While one may use these RAS values to find obvious differences in cases and con- trols, for many SNPs allelic frequencies are not linearly dependent on the RAS values; thus, one should calculate allelic frequencies when possible to reduce uneven biases between different SNPs. Additionally, we are making public on the same site both normalized and non-normalized k-correction factors derived from over 3,000 genotyped individuals for the 10K version 2.0 SNP genotyping platform. Other research groups have created central repositories for k-corrections using non-normalized RAS values and we will work with these teams to contribute these values to this valuable cen- tralized resource [11]. Conclusion Prior to the investment of large resources into individual genotyping thousands of individuals, one may first con- sider pooling samples at a low cost to rapidly ascertain gross population stratification concerns and potentially identify the regions of the genome with the strongest asso- ciation to the trait. The sheer number of SNPs interrogated will lead to a high number of false positives, due to both actual variation in genotype frequencies of the underlying groups and to technical variance. We demonstrate that technical variance can be detected a priori for each SNP using training sets from large numbers of individual microarrays or by replicates of pooled samples. We further show that despite the issues of population stratification, admixture, and subgroups that are difficult to detect when pooling, the cost savings make pooling a first step that we suggest should logically precede the investment of mil- lions of dollars. We describe here a method by which 100K and 500K Affymetrix SNP array data can be parsed into RAS scores and pooled inbalances accurately assessed in an outbred population. Methods 10K GeneChip® Mapping Array Genotyping 10K SNP genotyping was performed as detailed by Affymetrix on the 10K GeneChip® Mapping 1.0 and 2.0 Arrays [5]. In short, 250 ng of genomic DNA was digested with 10 units of Xba I (New England Biolabs, Beverly, MA) for 2 hours at 37°C. Adaptor Xba (P/N 900410, Affymetrix, Santa Clara, CA) was then ligated onto the digested ends with T4 DNA Ligase for 2 hours at 16°C. After dilution with water, samples were subjected to PCR using primers specific to the adaptor sequence (P/N 900409, Affymetrix) with the following amplification parameters: 95°C for 3 minutes initial denaturation, 95°C 20 seconds, 59°C 15 seconds, 72°C 15 seconds for a total of 35 cycles, followed by 72°C for 7 minutes final extension. PCR products were then purified and frag- mented using 0.24 units of DNase I at 37°C for 30 min- utes. The fragmented DNA was then end-labeled with biotin using 100 units of terminal deoxynucleotidyl trans- ferase at 37°C for 2 hours. Labeled DNA was then hybrid- ized onto the 10K Mapping Array at 48°C for 16–18 hours at 60 rpm. The hybridized array was washed, stained, and scanned according to the manufacturer's instructions. The chp_2_ras.pl script processes one or more CHP text files from Affymetrix 10K and 100K SNP chips, calculates RAS1 and RAS2 scores and outputs them in an Excel spreadsheet. Testing shows that for 10K chips, chp_2_ras.pl produces the same scores as those produced by Affymetrix' GDAS software. chp_2_ras.pl is distributed as part of TGen-Array, a collection of Perl scripts and mod- ules that provide parsing and object-oriented interfaces to common microarray files. The script can be downloaded at the TGen bioinformatics website [23]. Page 8 of 9 (page number not for citation purposes) BMC Genomics 2005, 6:138 http://www.biomedcentral.com/1471-2164/6/138 Authors' contributions DWC and MJH performed SNP genotyping, participated in the concept of the paper, and drafted the manuscript. DHL, VLZ, MJH, and AML conducted pooling and SNP genotyping. DWC and JMP performed statistical analysis of the SNP data. DAS participated in study design, coordi- nation, and manuscript drafting. All of the authors have read and approved the final manuscript. Additional material Acknowledgements We thank the Old Order Amish families who participated in the research and the Old Order Amish community for their willingness to participate in research studies. References 1. Matsuzaki H, Dong S, Loi H, Di X, Liu G, Hubbell E, Law J, Berntsen T, Chadha M, Hui H, Yang G, Kennedy GC, Webster TA, Cawley S, Walsh E, Jones KW, Fodor SP, Mei R: Genotyping over 100,000 SNPs on a pair of olignucleotide arrays. Nature Methods 2004, 1:109-111. 2. Fan JB, Oliphant A, Shen R, Kermani BG, Garcia F, Gunderson KL, Hansen M, Steemers F, Butler SL, Deloukas P, Galver L, Hunt S, McBride C, Bibikova M, Rubano T, Chen J, Wickham E, Doucet D, Chang W, Campbell D, Zhang B, Kruglyak S, Bentley D, Haas J, Rigault P, Zhou L, Stuelpnagel J, Chee MS: Highly parallel SNP genotyping. Cold Spring Harb Symp Quant Biol 2003, 68:69-78. 3. Marnellos G: High-throughput SNP analysis for genetic associ- ation studies. Curr Opin Drug Discov Devel 2003, 6:317-321. 4. Cardon LR, Bell JI: Association study designs for complex diseases. Nat Rev Genet 2001, 2:91-99. 5. Risch N, Teng J: The relative power of family-based and case- control designs for linkage disequilibrium studies of complex human diseases I. DNA pooling. Genome Res 1998, 8:1273-1288. 6. Hirschhorn JN, Daly MJ: Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 2005, 6:95-108. 7. Carlson CS, Eberle MA, Kruglyak L, Nickerson DA: Mapping com- plex disease loci in whole-genome association studies. Nature 2004, 429:446-452. 8. Butcher LM, Meaburn E, Knight J, Sham PC, Schalkwyk LC, Craig IW, Plomin R: SNPs, microarrays and pooled DNA: identification of four loci associated with mild mental impairment in a sample of 6000 children. Hum Mol Genet 2005, 14:1315-1325. 9. Meaburn E, Butcher LM, Liu L, Fernandes C, Hansen V, Al-Chalabi A, Plomin R, Craig I, Schalkwyk LC: Genotyping DNA pools on microarrays: tackling the QTL problem of large samples and large numbers of SNPs. BMC Genomics 2005, 6:52. 10. Butcher LM, Meaburn E, Liu L, Fernandes C, Hill L, Al-Chalabi A, Plo- min R, Schalkwyk L, Craig IW: Genotyping pooled DNA on microarrays: a systematic genome screen of thousands of SNPs in large samples to detect QTLs for complex traits. Behav Genet 2004, 34:549-555. 11. Simpson CL, Knight J, Butcher LM, Hansen VK, Meaburn E, Schalkwyk LC, Craig IW, Powell JF, Sham PC, Al-Chalabi A: A central resource for accurate allele frequency estimation from pooled DNA genotyped on DNA microarrays. Nucleic Acids Res 2005, 33:e25. 12. Butcher LM, Meaburn E, Dale PS, Sham P, Schalkwyk LC, Craig IW, Plomin R: Association analysis of mild mental impairment using DNA pooling to screen 432 brain-expressed single- nucleotide polymorphisms. Mol Psychiatry 2005, 10:384-92. 13. Matsuzaki H, Loi H, Dong S, Tsai YY, Fang J, Law J, Di X, Liu WM, Yang G, Liu G, Huang J, Kennedy GC, Ryder TB, Marcus GA, Walsh PS, Shriver MD, Puck JM, Jones KW, Mei R: Parallel genotyping of over 10,000 SNPs using a one-primer assay on a high-density oligonucleotide array. Genome Res 2004, 14:414-425. 14. Puffenberger EG, Hu-Lince D, Parod JM, Craig DW, Dobrin SE, Con- way AR, Donarum EA, Strauss KA, Dunckley T, Cardenas JF, Melmed KR, Wright CA, Liang W, Stafford P, Flynn CR, Morton DH, Stephan DA: Mapping of sudden infant death with dysgenesis of the testes syndrome (SIDDT) by a SNP genome scan and identi- fication of TSPYL loss of function. Proc Natl Acad Sci U S A 2004, 101:11689-11694. 15. Hoogendoorn B, Norton N, Kirov G, Williams N, Hamshere ML, Spurlock G, Austin J, Stephens MK, Buckland PR, Owen MJ, O'Dono- van MC: Cheap, accurate and rapid allele frequency estima- tion of single nucleotide polymorphisms by primer extension and DHPLC in DNA pools. Hum Genet 2000, 107:488-493. 16. Le Hellard S, Ballereau SJ, Visscher PM, Torrance HS, Pinson J, Morris SW, Thomson ML, Semple CA, Muir WJ, Blackwood DH, Porteous DJ, Evans KL: SNP genotyping on pooled DNAs: comparison of genotyping technologies and a semi automated method for data storage and analysis. Nucleic Acids Res 2002, 30:e74. 17. Norton N, Williams NM, Williams HJ, Spurlock G, Kirov G, Morris DW, Hoogendoorn B, Owen MJ, O'Donovan MC: Universal, robust, highly quantitative SNP allele frequency measure- ment in DNA pools. Hum Genet 2002, 110:471-478. 18. Giordano M, Mellai M, Hoogendoorn B, Momigliano-Richiardi P: Determination of SNP allele frequencies in pooled DNAs by primer extension genotyping and denaturing high-perform- ance liquid chromatography. J Biochem Biophys Methods 2001, 47:101-110. 19. Moskvina V, Norton N, Williams N, Holmans P, Owen M, O'Donovan M: Streamlined analysis of pooled genotype data in SNP- based association studies. Genet Epidemiol 2005, 28:273-282. 20. Website: . [http://www.tgen.org/neurogenomics/data]. 21. Puffenberger EG: Genetic heritage of the Old Order Mennon- ites of southeastern Pennsylvania. Am J Med Genet C Semin Med Genet 2003, 121:18-31. 22. Hinds DA, Seymour AB, Durham LK, Banerjee P, Ballinger DG, Milos PM, Cox DR, Thompson JF, Frazer KA: Application of pooled gen- otyping to scan candidate regions for association with HDL cholesterol levels. Hum Genomics 2004, 1:421-434. 23. Website2: . [http://bioinformatics.tgen.org/software/tgen-array/]. Additional file 1 Calculated k-correction factors for pooling on Affymetrix 10K GeneChip Mapping Array based on 3,000 person database. Click here for file [http://www.biomedcentral.com/content/supplementary/1471- 2164-6-138-S1.zip] Additional file 2 The chp_2_ras.pl script processes one or more CHP text files from Affyme- trix 10K, 100K, and 500K EA SNP chips, calculates RAS1 and RAS2 scores and outputs them in an Excel spreadsheet. Testing shows that for 10K chips, chp_2_ras.pl produces the same scores as those produced by Affymetrix' GDAS software. GDAS does not calculate RAS values for 100K chips. It should be noted that SNPs on 100K chips do not necessarily contain even numbers of sense and antisense probes and in fact only about 40% have 5 sense and 5 antisense probes. The remaining SNPs have a 6– 4 or 7–3 probe bias towards either sense or antisense. This is important because part of the RAS calculation involves taking the median of the "successful" probes and median may not be the best approach if only 3 probes exist in one direction and some may have failed and been dis- carded. chp_2_ras.pl is distributed as part of TGen-Array, a collection of Perl scripts and modules that provide parsing and object-oriented inter- faces to common microarray files. The TGen-Array site contains online documentation for all modules and scripts in the distribution including pages that show the source code so the code and algorithms may be inspected. Click here for file [http://www.biomedcentral.com/content/supplementary/1471- 2164-6-138-S2.xls] Page 9 of 9 (page number not for citation purposes) http://www.biomedcentral.com/content/supplementary/1471-2164-6-138-S1.zip http://www.biomedcentral.com/content/supplementary/1471-2164-6-138-S2.xls http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15782172 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15782172 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15338605 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15338605 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=12833663 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=12833663 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11253062 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11253062 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=9872982 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=9872982 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=9872982 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15716906 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15716906 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15164069 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15164069 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15800012 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15800012 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15800012 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15811185 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15811185 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15811185 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15319578 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15319578 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15701753 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15701753 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15701753 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15452586 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15452586 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15452586 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=14993208 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=14993208 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=14993208 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15273283 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15273283 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15273283 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11140947 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11140947 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11140947 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=12140336 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=12140336 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=12140336 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=12073018 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=12073018 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=12073018 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11179766 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11179766 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11179766 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15700279 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15700279 http://www.tgen.org/neurogenomics/data http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=12888983 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=12888983 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15606997 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15606997 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=15606997 http://bioinformatics.tgen.org/software/tgen-array/ Abstract Background Results Conclusion Introduction Results Individual genotyping of SNPs for 107 samples Construction of pools Calculation of allelic frequencies from pooled samples Comparison of allelic frequencies: Pooling vs. individual genotyping Table 1 Identification of assay false positives Identification of a disease locus from genotyping of pooled samples Table 2 Discussion Conclusion Methods 10K GeneChip® Mapping Array Genotyping Authors' contributions Additional material Acknowledgements References