key: cord-0000390-1rhy8td0 authors: Lee, Charlie Wah Heng; Koh, Chee Wee; Chan, Yang Sun; Aw, Pauline Poh Kim; Loh, Kuan Hon; Han, Bing Ling; Thien, Pei Ling; Nai, Geraldine Yi Wen; Hibberd, Martin L.; Wong, Christopher W.; Sung, Wing-Kin title: Large-scale evolutionary surveillance of the 2009 H1N1 influenza A virus using resequencing arrays date: 2010-02-25 journal: Nucleic Acids Res DOI: 10.1093/nar/gkq089 sha: 8a6809df45d5f80a822d68d3c305f7640e10234a doc_id: 390 cord_uid: 1rhy8td0 In April 2009, a new influenza A (H1N1 2009) virus emerged that rapidly spread around the world. While current variants of this virus have caused widespread disease, particularly in vulnerable groups, there remains the possibility that future variants may cause increased virulence, drug resistance or vaccine escape. Early detection of these virus variants may offer the chance for increased containment and potentially prevention of the virus spread. We have developed and field-tested a resequencing kit that is capable of interrogating all eight segments of the 2009 influenza A(H1N1) virus genome and its variants, with added focus on critical regions such as drug-binding sites, structural components and mutation hotspots. The accompanying base-calling software (EvolSTAR) introduces novel methods that utilize neighbourhood hybridization intensity profiles and substitution bias of probes on the microarray for mutation confirmation and recovery of ambiguous base queries. Our results demonstrate that EvolSTAR is highly accurate and has a much improved call rate. The high throughput and short turn-around time from sample to sequence and analysis results (30 h for 24 samples) makes this kit an efficient large-scale evolutionary biosurveillance tool. groups, there remains the possibility that future variants may cause increased virulence, drug resistance or vaccine escape. Early detection of these virus variants may offer the chance for increased containment and potentially prevention of the virus spread. We have developed and field-tested a resequencing kit that is capable of interrogating all eight segments of the 2009 influenza A(H1N1) virus genome and its variants, with added focus on critical regions such as drug-binding sites, structural components and mutation hotspots. The accompanying base-calling software (EvolSTAR) introduces novel methods that utilize neighbourhood hybridization intensity profiles and substitution bias of probes on the microarray for mutation confirmation and recovery of ambiguous base queries. Our results demonstrate that EvolSTAR is highly accurate and has a much improved call rate. The high throughput and short turn-around time from sample to sequence and analysis results (30 h for 24 samples) makes this kit an efficient largescale evolutionary biosurveillance tool. While the current influenza A H1N1 2009 virus is known to be sensitive to neuraminidase-inhibitor chemoprophylaxis, it has limited diversity at neutralizing antibody binding sites and overall mortality rates comparable with seasonal influenza (1) ; variations at numerous sites within the influenza A genome are predicted to alter these characteristics, with potentially important consequences for healthcare provision. In order to enable large-scale identification of variations of H1N1(2009) viruses from multiple patient samples, it is necessary to develop a low-cost method for rapidly wholegenome sequencing the H1N1 samples. Historically, sequencing of viral genomes is performed using standard dye termination technologies. These conventional sequencing technologies produce accurate data but are too slow, costly and labour-intensive to be practical for large-scale epidemiologic or evolutionary investigations in viral outbreaks. Oligonucleotide resequencing microarrays that are capable of identifying nucleotide sequence variants may offer an alternative solution (2, 3) and in recent years, have been used for detecting and subtyping influenza viruses (4, 5) . By analysing sequences generated from tiling probes across targeted regions of various strains of the influenza virus [e.g. partial fragments of the haemagglutinin (HA) and neuraminidase (NA) genes], important information such as viral subtypes, lineages and sequence variants can be determined. Apart from influenza, resequencing microarrays have also been used to obtain whole-genome primary sequences for orthopoxviruses (6) , biothreat viruses (7) and SARS (8) . The reported studies mainly use platform accompanying software that employs probabilistic base-calling algorithms such as ABACUS (3) and Nimblescan PBC (8) . Although statistically sound, these methods are susceptible to hybridization noise caused by factors such as poor probe quality, poor amplification or mutations. This results in numerous ambiguous and false positive base calls that may affect the accuracy of downstream evolutionary analysis. Efforts have been made to improve the call rates and accuracies of existing probabilistic base-calling algorithms. For example, Model-P uses probe and sequence features to build intensity-prediction models that compute maximum likelihood scores for base-calling (9) . Another approach filters low-confidence base calls from problematic regions (e.g. regions with high mutation rates or repeats), thereby reducing the number of false-positive base calls (10) . Depending on the stringencies of the filters used, call rates may suffer as a result. To address if these arrays can be used as a practical, large-scale re-sequencing tool, we have developed a system comprising customized sequence amplification primers, a 12-plex DNA resequencing array and an automated base-calling and variant analysis software (EvolSTAR). We demonstrate that the sequences obtained from the array are highly reproducible with !99.99% accuracy and 99.02 ± 0.82% genome coverage. The short turnaround time from sample to sequence and analysis results ($30 h for 24 samples) makes this kit an efficient large-scale evolutionary surveillance tool. This article describes the development of the various genetic analysis components, and their validation using clinical samples. Accession numbers for 84 complete H1N1(2009) genomes generated are listed in Supplementary Data File 1. Viral RNA from the diagnostic swabs or RNA extracted from MDCK cell cultures was extracted using the DNA minikit (Qiagen, Inc, Valencia, CA, USA) according to manufacturer's instructions. RNA was reverse-transcribed to cDNA using customized random primers designed using LOMA (11) and then amplified by PCR using proprietary H1N1(2009) specific primers. The presence of H1N1(2009) in the samples was confirmed using a separate real-time PCR assay based on the published primer sequences from the Centre for Disease Control and Prevention (CDC), USA. We found 36 mutation hotspots in the alignments where mutations occurred near one another (within 20 bp). A perfect match (PM) probe residing in a mutation hotspot may contain mismatches that will have a detrimental effect on its hybridization intensity. To avoid this problem, we designed additional PM probes that contain all possible combinations of mutations found in each mutation hotspot. Thus, if two mutations are found within 20 bp of each other in the alignments, then we need in total four (2 2 ) PM probes to encode them. In general, 2 x PM probes are needed to completely encode a cluster of x mutations that occur within 20 bp of one another in the alignments. We have identified five distinct types of neighbourhood hybridization intensity profile belonging to true nonmutations (wild-type), true mutations, isolated errors/ 'N's, long consecutive errors/'N's, and unknown errors/ 'N's, respectively. For each non-high-confidence query base, we determine the type of its neighbourhood hybridization signal intensity profile (NHIP) by the following criteria: (i) True-non-mutation-The PM probe (of both strands) of the query base must be a high-confidence call [i.e. it has hybridization intensity ! 1.4-fold that of its mismatch (MM) probes]. Neighbourhood PM probes are also high-confidence calls. Let the mean hybridization intensity of the three nearest PM probes to the immediate left of the mutation base (at position À1, À2 and À3), denoted as {À1,À2,À3} , the mean hybridization intensity of the three PM probes to the far left of the mutation base (at position À4, À5 and À6), denoted as {À4,À5,À6} , the mean hybridization intensity of the three nearest PM probes to the immediate right of the mutation base (at position 1, 2 and 3), denoted as {1,2,3} , and the mean hybridization intensity of the three PM probes to the far right of the mutation base (at position 4, 5 and 6), denoted as {4,5,6} . We impose that {À1,À2,À3} % {À4,À5,À6} and {1,2,3} % {4,5,6} (ii) True-mutation-The PM probe (of both strands) of the query base must have hybridization intensity ! 1.4-fold that of its MM probes. To detect the characteristic dip, we check four mean hybridization intensities: the mean hybridization intensity of the three nearest PM probes to the immediate left of the mutation base (at position À1, À2 and À3), denoted as {À1,À2,À3} , the mean hybridization intensity of the three PM probes to the far left of the mutation base (at position À4, À5 and À6), denoted as {À4,À5,À6} , the mean hybridization intensity of the three nearest PM probes to the immediate right of the mutation base (at position 1, 2 and 3), denoted as {1,2,3} , and the mean hybridization intensity of the three PM probes to the far right of the mutation base (at position 4, 5 and 6), denoted as {4,5,6} . If {À1,À2,À3} < {À4,À5,À6} and {1,2,3} < {4,5,6} , we say this is a dip pattern and the query base is likely to be mutated. (iii) Isolated error/'N'-The PM probe (of both strands) of the query base has hybridization intensity < 1.4fold that of its MM probes. Neighbourhood PM probes are high-confidence calls. (iv) Long consecutive errors/'N's-The PM probe (of both strands) of the query base has hybridization intensity < 1.4-fold that of its MM probes. A majority of neighbourhood PM probes are nonhigh-confidence calls. (v) Unknown error/'N'-All other neighbourhood hybridization profile patterns that do not fall under the previous categories. We define that a probe encodes the base b if b is located in the centre-most position of the probe and is the base to be To validate if the observed PM probe encoding b 1 is indeed the true PM probe of the sample sequence, we compute the likelihood ratio of f obs and f rand , where f obs is probability of observing the hybridization intensity reduction order b 1 b 2 b 3 b 4 given that the PM probe encodes b 1 and f rand is the probability of observing the hybridization intensity reduction order b 1 b 2 b 3 b 4 by chance. Precisely, where #(wxyz) is the number of observed hybridization intensity reduction orders from high-confidence base-calls and t is the total number of hybridization intensity reduction orders excluding b 1 b 2 b 3 b 4 obtained from high-confidence base-calls. If the likelihood ratio >2, we expect that the observed PM probe encoding b 1 is indeed the true PM probe of the sample sequence. EvolSTAR employs a two-step process for base-calling ( Figure 1 ). In the first step, each base query is scrutinized for signs of hybridization intensity abnormalities. If the gain-of-signal of the query base is strong and has no mutation, the base is called. In the second step, EvolSTAR then tries to recover base queries that have any hybridization intensity abnormalities with two analysis methods, namely neighbourhood hybridization intensity profile analysis and nucleotide substitution bias analysis. Step 1: Identification of base queries with ambiguity. On our array platform, the hybridization intensity of each probe is given by the mean and standard deviation of the fluorescence intensities of nine individually scanned pixels. Hence, we define the signal-to-noise ratio (SNR) of a probe as the ratio of the mean to the standard deviation of the intensities of the nine pixels associated with the probe. In our experiments, we found that >95% of all probes had SNR less than T SNR (T SNR = m SNR + 2s SNR , where m SNR and s SNR are the mean and standard deviation of SNR of all probes on the array). The remaining 5% of probes with SNR ! T SNR are unreliable. Hence, base queries with one or more probes with SNR ! T SNR are analysed further in step 2. Furthermore, all base queries whose PM probe in the forward strand and PM probe in the reverse strand are non-complementary, or have weak PM/MM hybridization intensity differentiation (<1.4-fold) are also passed to step 2. Lastly, we also pass all putative mutation calls to step 2 for confirmation. Step 2: Mutation confirmation and base query recovery. A high-confidence mutation call may be a result of coincidental non-specific hybridization of the same MM probe in both strands. As such, it may be inadequate to discern true mutations based solely on differences in the hybridization intensities of PM and MM probes. From our analysis of true-mutation calls made by PBC, we have found that true mutations have a signature NHIP type as per described in Figure 2b . Thus, query bases that result in a mutation call must have this signature NHIP. Finally, to confirm the mutation, we perform nucleotide substitution bias analysis on these query bases. For each of the query bases with NHIP of type described in Figure 2b , we compute the likelihood ' that the observed PM probe (representing the mutation) is indeed the true PM probe of the sample sequence given the hybridization intensity-based ordering of its MM probes (see 'Materials and methods' section). If ' > 2, the query base results in a strong mutation call (represented by upper case base calls 'A', 'C', 'G' or 'T'). If ' > 1, the query base results in a mutation call with weak support (represented by lower case base calls 'a', 'c', 'g' or 't'). Otherwise, they are re-assigned an unknown 'N' call. For query bases that results in a mutation call but have NHIP of type described in Figure 2c , they are most likely isolated errors caused by poor PM probe quality. Hence, we correct the base-calls of these query bases to their respective reference bases (but represented by lower case Figure 2 . Summary of characteristics of neighbourhood hybridization intensity profiles for different type of calls. Summary of the characteristics of the NHIP for five types of call: (a) true-non-mutation, (b) true-mutation, (c) isolated error or 'N', (d) long chains of consecutive errors or 'N', (e) unknown error or 'N') based on their respective observed neighbourhood hybridization intensity profiles. The PM probe (red circle) of query base is at position 0 while neighbourhood PM probes (black circles) are numbered according to their distance away from the query base. A PM probe is significantly differentiated from its MM probes if its hybridization intensity is at least t-fold that of all its MM probes. base calls 'a', 'c', 'g' or 't') in the reference sequences. We also perform the same correction to non-high-confidence query bases with NHIP of type described in Figure 2c . We try to recover the remaining query bases that have NHIP of type described in Figure 2d or 2e by analysing the substitution bias from their PM and MM probes in the forward and reverse strands separately. Similar to how a mutation is confirmed, we compute the likelihood ' f that the observed PM probe (representing the unsure base call) is indeed the true PM probe of the sample sequence given the hybridization intensity-based ordering of its MM probes in the forward strand. We also compute a similar likelihood ' r for the PM probe in the reverse strand. If the PM probes in both strands are complementary and ' f , ' r > 2, the query base results in a strong base call (represented by upper case base calls 'A', 'C', 'G' or 'T'). However, in many cases, the PM probes in both strands are not complementary due to non-specific hybridization of MM probes in one or both strands. For such query bases, we make base calls based on ' f and ' r : if ' f > ' r and ' f > 2, a base call with weak support (represented by lower case base calls 'a', 'c', 'g' or 't') is made from the PM probe in the forward strand. Else, if ' r > ' f and ' r > 2, a base call with weak support is made from the PM probe in the reverse strand. Otherwise, they are assigned an unknown 'N' call. Note that since nucleotide substitution biases may vary depending on the experimental conditions, experimental reagents or input samples, for each experiment, we obtain a set of high-confidence base-calls and use them to infer the hybridization intensity reduction orders for each PM probe encoding. This is then used to compute likelihood scores for base-calling non-high-confidence query bases and mutation confirmation. We generated a consensus sequence for each segment of the H1N1(2009) virus by aligning all 1715 complete and partial sequences available from the NCBI H1N1 flu resources database (http://www.ncbi.nlm.nih .gov/genomes/FLU/SwineFlu.html) as of 11 June 2009 using MAFFT (12) with high-accuracy option. Tiling probes spanning the entire genome segments on both the forward and reverse strands were created at one base resolution (8) . Analysis of the sequence alignments revealed that there were no deletions, insertions or recombination. However, we found 36 mutation hotspots in the alignments where mutations occurred near one another (within 20 bp). Thus, we added additional probes that represent all possible combinations of mutations in these mutation hotspots onto the array. To ensure that we have accurate sequence of the drug binding pocket targeted by NA inhibitors (13) Due to the small amount of virus present in samples relative to human or cell-line total RNA, it was necessary to amplify the viral RNA through PCR. We employed a combination of sequence-specific and random PCR approaches using LOMA-optimized primers as previously described (11) . The addition of random primers ensured complete genome amplification, even if mutations were present at the specific-primer binding sites. PCR conditions were optimized by conducting five duplicate hybridizations of the same virus sample cultured from a patient sample under different PCR conditions. The optimized method was then tested on RNA isolated directly from nasal swabs obtained from the same patient and from virus grown in cell culture. Microarray sequences generated from these replicate experiments were compared with capillary sequencing to estimate sequencing accuracy. Following PCR product labelling, hybridization and scanning, signal intensities for each probe was generated using Genepix 4.0 software, and annotated using Nimblescan 2.5 software. Initially, the standard Nimblescan software which employs a gain-of-signal approach [PBC algorithm (8) ], was used to determine the viral sequence. The PBC algorithm assumes that the signal intensity of the PM probe (which matches exactly to the sequence in the sample) will be significantly higher than that of the MM probes. While this approach sufficed for $90% of base queries, we observed that the discrimination between the PM and MM signals was not clear for the remaining probes. These ambiguous signals were caused by the presence of multiple mutations in the probe sequence, homopolymers and hybridization artefacts. We developed a novel algorithm, Evolution Surveillance and Tracking Algorithm for Resequencing arrays (EvolSTAR), to resolve this problem. EvolSTAR improves upon PBC by adding an analysis of the NHIP and nucleotide substitution bias (described below). Due to the use of tiling probes in resequencing arrays, a single nucleotide mutation at a particular query base could cause a dramatic reduction in the hybridization intensities of neighbouring PM probes up to six bases away (14) . This effect can be measured by studying the NHIP of each query base. We defined the NHIP of each query base as the observed pattern of hybridization intensities of its PM and MM probes and neighbouring (±6 bases from query base) PM and MM probes. To study the effects of sequence variation (mutation) and noise on the NHIP of a query base, we sequenced RNA from H1N1(2009) patient 380 by capillary sequencing and on duplicate microarrays. We compared sequence calls generated using by Nimblescan or by capillary sequencing and compiled a list of true (correct) calls, error calls and 'N' (unknown) calls. In total, of the expected 13 588 bases of the H1N1 virus (based on genome described at http:// www.ncbi.nlm.nih.gov/genomes/taxg.cgi?tax=211044) the microarray called 13 449 bases while capillary sequence was able to call 12 832 bases. Figure 3 shows the NHIPs of a representative set of 40 randomly selected query bases that result in true-nonmutation calls (wild-type calls). We observed that in these NHIPs, the PM probe of the query base together with neighbouring PM probes, have hybridization intensities significantly higher (>1.4-fold) than that of their MM probes in general. We also identified 10 mutations using capillary sequencing in the patient sample. The NHIPs of these 10 true-mutation calls (Figure 4 ) are very different from NHIPs of wild-type calls. The presence of a mutation at the query base created a MM in neighbouring PM probes and caused a drop in their hybridization intensities. The closer this mutation is to the centre of a neighbouring PM probe, the bigger the drop in hybridization intensity. This results in a distinctive dip to the immediate left and right of the centre of the NHIP where the mutation is. Unlike the NHIPs of wildtype and true-mutation calls, the NHIPs of most errors and 'N' calls appear haphazard ( Figure 5 ). However, when we traced the locations of these errors and 'N' calls on the genome, we found that some are isolated among good calls while others are conjugated in a small locality of the genome. We investigated the NHIPs of isolated errors and 'N' calls that occurred among good calls and found that in these NHIPs, only the PM probe of the query base that is an error or 'N' call has poor hybridization differentiation with its MM probes while other PM probes have hybridization intensities significantly higher than that of their MM probes in general ( Figure 6 ). This suggests that for such calls, only the PM and MM probes of the query base are noisy while neighbouring PM and MM probes are unaffected. In addition, we also found that long chains of consecutive error and 'N' calls (especially at the 5 0 -and 3 0 -end of the sample sequences) often have NHIPs where the PM probe of the query base together with neighbouring PM probes, have poor hybridization differentiation with their MM probes (Figure 7) . These error and 'N' calls usually occur at the ends of the genome segments. In summary, NHIP analysis showed that all true mutation calls had a characteristic profile (Figure 2b ) that differed from wild-type sequence calls (Figure 2a ). Ambiguous calls arising from different causes, such as homopolymers, isolated errors and hybridization artifacts also have profiles that are distinct from true mutation profiles ( Figure 2 ). See 'Materials and Methods' section for details. The presence of nucleotide substitution bias in Nimblegen resequencing arrays has been previously described (15) . However, this knowledge has so far been used only to improve probe design. In this article, we propose a novel method that makes use of nucleotide substitution bias in the array to improve base-calling accuracy and call rate. The key idea is to build a likelihood model of the substitution bias among the probes of non-ambiguous calls on the array; then use this to call bases with ambiguous signals. To build the likelihood model, we first determined the substitution bias on our platform by comparing the PM and MM probes (of both strands) of 25 028 true calls made by PBC from the two replicate microarray experiments of patient sample 380 mentioned in the previous section. For each true call, we generated a hybridization intensity reduction order by ranking the PM and MM probes of a particular strand in decreasing order of hybridization intensity and recording their respective frequencies (Table 1 ). Table 1 shows that for each PM probe encoding, certain hybridization intensity reduction orders occur much more frequently than others. For example, if the PM probe encoding is 'A' (regardless of strand), then it is most likely that the hybridization intensity reduction order is 'TGC' or 'GTC'. Thus, by matching the hybridization intensity reduction orders of its PM/ MM probes with that in Table 1 , we can compute the likelihood that the putative base call for a query base with ambiguous signals is correct (see 'Materials and Methods' section). In this way, we can recover base calls of ambiguous query bases exceeding a reasonably high likelihood threshold and achieve better accuracy and call rate than PBC. Grading the quality of the sequence calls EvolSTAR employs a two-step process for base-calling (details in 'Materials and methods' section). First, it determines if the gain-of-signal from the PM probe is strong. If strong, then the base sequence is called and annotated as 'high quality'. If the gain-of-signal is ambiguous, or if the base called is different from the expected sequence, then NHIP and nucleotide substitution bias is deployed to verify the sequence call. From empirical experiments and comparison with capillary sequence data, we observed that high quality sequence calls are made when the signal from the PM probe for both forward and reverse strands are at least 40% higher than that of the MM probes. The second-step processes the ambiguous bases. For bases confirmed to be mutations through NHIP analysis and satisfy the substitution bias rule, they are graded as high quality sequence calls and denoted in the FASTA sequence as UPPER CASE characters. For bases confirmed to be an isolated error through NHIP analysis and also satisfy the substitution bias rule, they are graded as low confidence sequence calls and denoted in the FASTA sequence in lower case characters. The rest of the bases are called 'N' in the FASTA sequence. The flowchart of EvolSTAR is shown in Figure 1 . To validate the software, we hybridized 14 patient samples in duplicate onto the microarray. The microarrays were analysed in parallel using Nimblescan (PBC algorithm) and EvolSTAR, and the sequences obtained were compared to Sanger capillary sequencing. We counted the number of true-non-mutation calls, true-mutation calls, error calls and ambiguous ('N') calls for both methods ( Table 2 ; Supplementary Data File 3). We also confirmed that the substitution bias in all 14 duplicate hybridization experiments (Table 3) were consistent with that found in Table 1 . Compared with the available capillary sequences for the 14 samples, EvolSTAR had an average error rate of 0.0029% and 12 ambiguous calls per sample (346 in total). This is far superior than Nimblescan PBC, where we obtained an average error rate of 0.083% and 158 ambiguous calls per sample (4434 in total). Furthermore, EvolSTAR called all true mutations correctly. The genome coverage attained by EvolSTAR (99.02 ± 0.82%) is also much higher than that of Nimblegen PBC (94.3 ± 6.06%). We wondered if, and by how much, incorporating NHIP and substitution biases analysis to the PBC results would improve the performance of the PBC algorithm. We observed that more than 70% of the 65 error calls (false mutation calls) made by PBC did not have the characteristic NHIP of a true-mutation shown in Figure 2b . The remaining 30% of the error calls had a NHIP reminiscent of a true-mutation NHIP but did not satisfy the substitution bias rule. Using NHIP and substitution biases analysis together, we were able to reduce the number of false mutation calls to only two. Most of the 4434 'N' calls made by PBC were due to conflicting base calls from the forward and reverse strand. By analysing the NHIP and hybridization intensity reduction order of the query base in the forward and reverse strand individually, we were able to identify the noisy strand and hence, make the base call only from the non-noisy strand. We were able to recover 92% of the 'N' calls made by PBC using this approach. In addition, we evaluate the robustness and reproducibility of EvolSTAR by employing six pairs of replicate experiments consisting of one pair nasal swab and five pairs of cell culture isolates, belonging to the same patient sample 305 (Supplementary Data File 4) . Of the experiments, two pairs of replicates (305_nasal and 305_cell_cond1) were amplified under the same optimal experimental conditions while each of the other pairs (305_cell_cond2, 305_cell_cond3, 305_cell_cond4, 305_cell_cond5) were amplified under different suboptimal experimental conditions (simulating experimental volatility). Compared with the available capillary sequences for sample 305, EvolSTAR had an average error rate of 0.0012% and 28 ambiguous calls per sample (338 in total). On the other hand, Nimblescan A CGT 547 246 CTG 558 237 GCT 957 367 GTC 2215 1407 TCG 1049 611 TGC 3015 2873 C AGT 2035 2712 ATG 1752 2400 GAT 382 341 GTA 159 134 TAG 360 377 TGA 165 129 G ACT 1474 1043 ATC 976 624 CAT 1639 1534 CTA 868 788 TAC 594 410 TCA 542 454 T ACG 432 529 AGC 562 636 CAG 623 841 CGA 1066 1616 GAC 1421 1878 GCA 1637 2841 Hybridization intensity reduction orders found in 25 028 true calls from two replicated hybridization experiments of patient sample 380. For each true call, for each strand, we rank the PM probe and its MM probes based on their hybridization intensities in decreasing order. We count the frequency of each hybridization intensity reduction order. 1 6002 6 5888 6 0 0 EvolSTAR 2 6002 6 5993 6 0 0 PBC 2 6002 6 5895 6 0 1 507 EvolSTAR 1 3921 8 3913 8 0 0 PBC 1 3921 8 3736 8 0 3 EvolSTAR 2 3921 8 3916 8 0 0 PBC 2 3921 8 3758 8 0 2 581 EvolSTAR 1 8574 10 8567 10 0 0 PBC 1 8574 10 8458 10 0 2 EvolSTAR 2 8574 10 8566 10 0 0 PBC 2 8574 10 8461 10 0 5 582 EvolSTAR 1 3057 4 3051 4 0 0 PBC 1 3057 4 2986 4 0 0 EvolSTAR 2 3057 4 3053 4 0 0 PBC 2 3057 4 3001 4 0 0 593 EvolSTAR 1 3054 3 3053 3 0 0 PBC 1 3054 3 3007 2 1 0 EvolSTAR 2 3054 3 3053 3 0 0 PBC 2 3054 3 2992 2 1 0 9 061 364 EvolSTAR 1 5129 5 5123 5 0 0 PBC 1 5129 5 5064 5 0 0 EvolSTAR 2 5129 5 5122 5 0 0 PBC 2 5129 5 5042 5 0 0 9 061 365 EvolSTAR 1 3000 3 2993 3 0 0 PBC 1 3000 3 2956 3 0 1 EvolSTAR 2 3000 3 2991 3 0 0 PBC 2 3000 3 2941 3 0 0 9 061 366 EvolSTAR 1 1683 3 1683 3 0 0 PBC 1 1683 3 1649 3 0 1 EvolSTAR 2 1683 3 1682 3 0 1 PBC 2 1683 3 1636 3 0 1 923 EvolSTAR 1 4373 5 4365 5 0 0 PBC 1 4373 5 4187 5 0 1 EvolSTAR 2 4373 5 4330 5 0 1 PBC 2 4373 5 3738 5 0 6 Types of calls and their frequencies generated by EvolSTAR and PBC in replicated microarray hybridizations of 14 patient samples. Partial or complete capillary sequences were generated for each sample and used to verify the calls made by EvolSTAR and PBC on each replicate. We then count the frequency of true-non-mutation, true-mutation, error and 'N' calls in each replicate. PBC obtained a relatively higher average error rate of 0.169% and 237 ambiguous calls per sample (2855 in total). Our results showed that EvolSTAR is robust and performs well when samples are prepared under sub-optimal conditions. Even for nasal swab samples that tend to have much less concentration of virus RNA than cell cultures, EvolSTAR suffered only a slight drop in performance compared to Nimblescan PBC. In conclusion, we have shown that EvolSTAR is robust and generates sequence calls of high accuracy and reproducibility in this pilot study consisting of 40 microarray experiments. Meanwhile, efforts will be put in to continually evaluate EvolSTAR with more samples and update it on a regular basis as the H1N1(2009) influenza virus evolves. Besides a FASTA output of the virus sequence, EvolSTAR generates a visualization map of the sequence calls using a heat map based on the percentage identity of the called sequence to the reference sequence measured at 50 bp windows (Figure 8 ). The map template consists of all eight segments of the 2009 influenza A(H1N1) virus and the locations of known drug binding sites (marked with green lines) on the NA gene. Locations of all mutation calls are denoted by red triangles beneath the heat map bar. Sequences that are of low coverage (<90%) are automatically flagged, and the overall PM/ MM discrimination ratio for each segment is displayed. The heat map bar allows the technician to rapidly assess the quality of the sequence data obtained from the microarray and identify regions where PCR did not work well, or presence of potential recombination/ reassortment events. Mutations, especially those in close proximity to drug binding sites, can be quickly visualized. Other details such as coverage, number of base calls successfully made, number of mutations and number of 'N' calls for each sequence call are also shown on the visualization map. Traditional statistical and probabilistic sequence-calling techniques ascertain that a base call is of high confidence if they exceed pre-defined significance or probability thresholds. This approach works well for high-confidence base-calls but is inadequate to extract sufficient information from noisy base-calls. It is also difficult to determine the validity of a mutation call purely based on the distribution of hybridization intensities of its PM and MM probes. In this work, we have described two new hybridization intensity analysis methods that enable us to confidently identify true mutations and recover some noisy base calls. Compared to PBC, EvolSTAR has achieved superior call rates and accuracies, especially in low-concentration samples with high CT values. The robustness of the base calls enables our approach to be a practical large-scale evolutionary surveillance tool. Although we are confident that our resequencing array can successfully generate complete sequences for the H1N1(2009) virus and its variants at the current stage, we cannot rule out the possibility of reassortments between the H1N1(2009) virus and other influenza viruses. Clearly, our resequencing array cannot fully sequence such events and will generate sequences with poor quality and coverage of the reassorted segments. To investigate the effects of a reassortment event on our array, we independently amplified segments 1, 2, 3, 5, 6 and 7 of the 2009 influenza A(H1N1) virus and segment 4 of a H3N2 influenza A virus, and hybridized them onto our array. The visualization map of this experiment is shown in Figure 9 . As expected, the sequence call for segment 4 [based on PM/MM probes from the segment 4 consensus of the 2009 influenza A(H1N1) virus] is poor in quality and coverage. However, we observed that we were able to get good base calls from region 1150-1547. This region turns out to be the only significantly similar (70% matched) region between the segment 4 consensus of the 2009 influenza A(H1N1) virus and segment 4 of a H3N2 virus (CY039087). This shows that identifying regions of high similarity between the 2009 influenza A(H1N1) virus with other influenza viruses and checking if these regions have good sequence calls may be a plausible way of detecting reassortments. The drawback of this approach is that it will fail to detect reassortment of Table 3 . Hybridization intensity reduction orders found in 14 hybridization experiments certain segments where there are no regions of high similarity between the H1N1(2009) virus and the parental influenza virus. It is also difficult to annotate and differentiate every region that the H1N1(2009) virus and all other influenza viruses share similarity with. We propose an alternative approach to detect reassortments. By analysing the PM/MM hybridization intensity foldchange of high confidence calls of all eight segments, we found that the average PM/MM hybridization intensity fold-change of high confidence calls in segments 1, 2, 3, 5, 6 and 7 belonging to the 2009 influenza A(H1N1) virus is $4.5 while the average PM/MM hybridization intensity fold-change of high confidence calls in segment 4 belonging to the H3N2 influenza A virus is only 1.9. The most likely reason for this huge drop in the average PM/MM hybridization intensity fold-change of high confidence calls is that the signal gained by most of the segment 4 PM probes on our array are through cross-hybridization to the segment 4 sequence of the H3N2 influenza A virus, and thus much lower than signal gained from true specific binding. Thus, by computing and comparing the average PM/MM hybridization intensity fold-change of high confidence calls in each segment, we can identify potential reassortments in a given H1N1(2009) virus sample. Virus samples with possible reassortments can then be sequenced using . A heat map bar is used to represent the quality and coverage of its sequence calls. The locations of all mutation calls made by EvolSTAR are represented by red triangles beneath the heat map bar. Sequences with coverage <90% are automatically flagged as 'low coverage'. Other details such as coverage: percentage of base calls successfully made, match: number of base calls that match the reference sequence i.e. non-mutation base calls, strong mismatch: number of high confidence base calls that do not match the reference sequence i.e. mutation base calls, weak mismatch: number of low-confidence base-calls that do not match the reference sequence i.e. mutation base calls and Ns: number of 'N' calls, for each sequence call are also shown on the visualization map. capillary sequencing or customized reassortment resequencing arrays. So far, the sequence diversity of H1N1 2009 influenza virus has been rather limited. From our analysis, it would be possible to resequence all the published isolates using this resequencing approach. However, as antigenic drift is expected to occur, it is likely that the resequencing array would need to be updated at least annually. Updating the array requires only bioinformatics input, and does not require any other additional manufacturing costs. Thus, this combination of sample amplification primers, low-cost multiplex array and robust interpretation software allows sustainable, rapid, large-scale biosurveillance of the influenza H1N1(2009) virus. From a broader perspective, this study has highlighted the feasibility of using resequencing microarrays for high-throughput full genome sequencing of viruses. In our application, resequencing microarrays are relatively low-cost, costing only a 10th that of a 454 run, and equivalent to that of a traditional capillary sequencing run. However, through multiplexing, our system can generate full genomes of 24 different H1N1(2009) samples in 30 h. In comparison, capillary sequencing and next-generation technologies such as 454 may obtain full genomes of only one or two different samples in the same time-frame. In practice, capillary sequencing is labour-intensive and thus impractical for large-scale full genome sequencing of viruses in an outbreak. In our experience with the 454 system, much of the amplified material is still human (as the bulk of the patient sample material is human RNA with very little influenza RNA), requiring very deep sequencing to obtain a complete flu genome sequence, with one compartment of a run not yielding sufficient viral information. Furthermore, assembly of the sequence fragments is required before any analysis can be done. Any abnormalities or gaps in the assembly would then require additional runs of 454, incurring more cost and time. Hence, our approach based on resequencing microarrays presents a cost-effective and efficient solution for high-throughput full genome sequencing of viruses. Supplementary Data are available at NAR Online. WHO issues guidelines for antiviral treatment for H1N1 and other Influenza New developments in high-throughput resequencing and variation detection using high density microarrays High-throughput variation detection and genotyping using microarrays Identifying influenza viruses with resequencing microarrays Universal detection and identification of avian influenza virus by use of resequencing microarrays Smallpox virus resequencing GeneChips can also rapidly ascertain species status for some zoonotic non-variola orthopoxviruses Testing and validation of high density resequencing microarray for broad range biothreat agents detection Tracking the evolution of the SARS coronavirus using high-throughput, high-density resequencing arrays Model-P: a basecalling method for resequencing microarrays of diploid samples A bioinformatic filter for improved base-call accuracy and polymorphism detection using the Affymetrix GeneChip Õ whole-genome resequencing platform LOMA: a fast method to generate efficient tagged-random primers despite amplification bias of random PCR on pathogens Recent developments in the MAFFT multiple sequence alignment program Mapping the sequence mutations of the 2009 H1N1 influenza A virus neuraminidase relative to drug and antibody binding sites High-throughput, high accuracy array-base resequencing Mismatch oligonucleotides in human and yeast: guidelines for probe design on tiling microarrays The authors would like to thank Sebastian Maurer-Stroh of the Bioinformatics Institute of Singapore for his expert advice on 2009 H1N1 Influenza A drug binding pocket targeted by NA inhibitors. Samples were from Ministry of Health of Singapore (MOH) and DSO National Laboratories (Singapore). Funding for this work was provided by the Genome Institute of Singapore and Exploit Technologies Pte Ltd, Agency for Science, Technology and Research (A*STAR).Conflict of interest statement. None declared. Hybridization intensity reduction orders found in 135 830 true calls from 14 hybridization experiments. For each true call, for each strand, we rank the PM probe and its MM probes based on their hybridization intensities in decreasing order. We count the frequency of each hybridization intensity reduction order.