key: cord-223292-ct8xyntw authors: Lemey, Philippe; Rambaut, Andrew; Bedford, Trevor; Faria, Nuno R.; Bielejec, Filip; Baele, Guy; Russell, Colin A.; Smith, Derek J.; Pybus, Oliver G.; Brockmann, Dirk; Suchard, Marc A. title: The seasonal flight of influenza: a unified framework for spatiotemporal hypothesis testing date: 2012-10-22 journal: nan DOI: nan sha: doc_id: 223292 cord_uid: ct8xyntw Global mobility flow data are at the heart of spatial epidemiological models used to predict infectious disease behavior but this wealth of data on human mobility has been largely neglected by reconstructions of pathogen evolutionary dynamics using viral genetic data. Although stochastic models of viral evolution may potentially be informed by such data, a major challenge lies in deciding which mobility processes are critical and to what extent they contribute to shaping contemporaneous distributions of pathogen diversity. Here, we develop a framework to integrate predictors of viral diffusion with phylogeographic inference and estimate human influenza H3N2 migration history while simultaneously testing and quantifying the factors that underly it. We provide evidence for air travel governing the global dynamics of human influenza whereas other processes act at a more local scale. Global mobility flow data are at the heart of spatial epidemiological models used to predict infectious disease behavior but this wealth of data on human mobility has been largely neglected by reconstructions of pathogen evolutionary dynamics using viral genetic data. Although stochastic models of viral evolution may potentially be informed by such data, a major challenge lies in deciding which mobility processes are critical and to what extent they contribute to shaping contemporaneous distributions of pathogen diversity. Here, we develop a framework to integrate predictors of viral diffusion with phylogeographic inference and estimate human influenza H3N2 migration history while simultaneously testing and quantifying the factors that underly it. We provide evidence for air travel governing the global dynamics of human influenza whereas other processes act at a more local scale. Global public health is repeatedly and increasingly challenged by the emergence of highimpact pathogens [1] . Novel influenza strains, severe acute respiratory syndrome (SARS) virus and Methicillin-resistant Staphylococcus aureus represent only a few examples of pathogens that exploited today's complex and voluminous human traffic and mobility to rapidly disseminate in our globalized world. The worldwide air transportation network is by far the most extensively studied mobility system in the context of human infectious disease dynamics [2] . Indeed, air travel represents an obvious driving force for the global circulation of seasonal influenza A (H3N2) viruses, and may explain the absence of locally persistent strains in between epidemic seasons [3] . Retrospective modeling of the 1968 'Hong Kong flu' pandemic spread demonstrated that the H3N2 virus diffused through a network of global cities interconnected by air travel [4] . Numerous modeling studies have subsequently examined the influence of air travel on influenza spread (e.g. [5, 6, 7, 8] ), but far less work has attempted to verify such models against underlying patterns of host movement [9] . Two studies on the timing and rate of seasonal influenza spread across the United States have highlighted the difficulty of resorting to standard epidemiological data to disentangle the contributions of different human transportation systems to influenza spread. Using weekly time series of excess mortalities between 1972 and 2002, [9] demonstrated that the patterns of timing and incidence across the continental United States are significantly associated with euclidean distance and various measures of domestic transportation (including airline travel), but most strongly with rates of movement of people to and from their workplaces. In the same year, Brownstein et al. (2006) [10] demonstrated that the rate of inter-regional spread and timing of influenza in the United States, as measured using weekly influenza and pneumonia mortality statistics from 1996 to 2005, is predicted by domestic airline travel volume in November. Because both studies implicated a different key driver of seasonal influenza spread across the United States, the findings were subject of debate [11] , in particular in the light of a mounting threat of an influenza pandemic [12, 13] and the need for decisions on implementing travel restrictions. Insights at the global scale are also revealed by simulation studies (e.g, [14] ), and empirical analysis using global mortality data may prove even more challenging. As a historical record of epidemic spread, viral genetic data may offer a valuable alternative for empirical verification of epidemic models. The power of fitting statistical models of evolution to observed sequence data has clearly been demonstrated by a number of seminal studies, e.g. by revealing the genetic dynamics of influenza A H3N2 seasonality [15] and spatial patterns of global H3N2 circulation [3, 16] . More generally, viral phylogenetic and epidemiological insights have culminated into a phylodynamic framework that unifies evolutionary and ecological dynamics to explain patterns of viral diversity [17] . Although model-based inference is increasingly used to reconstruct viral diffusion through time and space, these attempts typically fit parameter-rich models to sparse spatial data, and post-hoc interpretation of phylogeographic patterns are then difficult to relate directly to underlying ecological and evolutionary processes [18] . Here, we present a novel model-based approach to simultaneously reconstruct spatiotemporal history and test the contribution of potential diffusion predictors. Our phylogeographic reconstruction considers discrete sampling locations defined by geographical and administrative boundaries as well as air communities identified through direct analysis of the global air transportation network. By parameterizing the discrete diffusion process in terms of the inclusion and contribution of predictors, our approach generally requires considerably fewer parameters compared to standard phylogeographic inference. We demonstrate how this model allows for the integration of viral genetic data and human mobility measures to draw inference about key drivers of global influenza dynamics. We compiled 1441 hemagglutinin sequences with known date and location of sampling previously obtained by [3] . These sequences were sampled globally from 2002 to 2007 and are representative of a larger sampling (13,000 isolates) used for antigenic analysis [3] . We explored different spatial and air travel-assisted subdivisions with subsampling to examine the impact of discrete sampling allocation and sample numbers per locations on our phylogeographic estimates. In an attempt to include all sequence data while keeping the number of samples per location as balanced as possible, we first divided all the sequences into 26 geographic regions (Table 1 ). Since these spatial partitions sometimes required arbitrary subdivisions (e.g. breaking up USA, China, Japan and Australia), we also applied a discrete location scheme that accommodated a single location for these spatially and administratively coherent regions, arriving at 15 geographic regions ( Table 2 ). Within each sampling year, we randomly down-sampled the five locations with the highest number of samples relative to their population size (USA: from 278 to 150; Australia: from 166 to 30; New Zealand: from 59 to 20; Japan: from 341 to 75; South Korea: from 51 to 30) and analyzed three different subsampled data sets. Because passenger flux emerged as the main predictor in our phylogeographic model (see 3.3.1), we also identified discrete air communities in the worldwide air transportation network (see 3.2.1) and applied these as location states to our sequence sample. To increase sequence numbers for under-sampled air communities, we also complemented the hemagglutinin gene sequences with publicly available sequences from Africa (n = 21), USA (Hawaii, n = 4), Central America (n = 13), South America (n = 46) and Canada (n = 10). From this data set, we removed six sequences that appeared to be outliers in a root-to-tip divergence versus sampling time regression analysis, resulting in a total of 1529 sequences. Within each sampling year, we randomly down-sampled the four locations with the highest number of samples relative to their population size (USA: from 318 to 120; Oceania: from 225 to 50; Japan: from 327 to 75; Southeast Asia: from 175 to 100) and analyzed three different subsampled data sets discretized according to the 14 air communities. The worldwide air transportation network is defined by a passenger flux matrix that quantifies the number of passengers traveling between each pair of airports. We use a dataset provided by OAG (Official Airline Guide) Ltd. (http://www.oag.com), containing 4,092 airports and the number of seats on scheduled commercial flights between pairs of airports during the years 2004-2006. The number of seats on scheduled commercial flights from airport i to j is given by Ψ ij , which we take to be proportional to the number of passengers traveling. For the 26 location scheme, we summarized the number of passengers from the full aviation network for each pair of locations based on all the airports in the respective regions. To facilitate the identification of 14 air communities, we focused on a 1227-largest-airport network that represented 95% of the passenger flux of the full aviation network by excluding the lowest contributing airports. By focusing on this subset of largest airports, we exclude a large number of very small community airports; details and plausibility of this reduction are discussed in [19] . To identify air transportation communities, we approximate a maximal-modularity subdivision of the 1227-largest-airport network by employing a recently described stochastic Monte-Carlo approach [20] , a generalization of the method introduced in [21] . Modularity provides a measure of how well the connectivity of a network is described by partitioning its nodes into non-overlapping groups. For any given partition, modularity will be high if connectivity within groups is high and connectivity among groups is low. In large networks, it is generally computationally impossible to find the optimal subdivision. To approximate the optimum a variety of approximative methods have been introduced. The method we employ here generates an ensemble of high modularity subdivisions and computes the consensus in this ensemble by superposition, for details see [20, 22] . For an ensemble of 1000 modularity subdivisions we quantify the uncertainty by an affinity matrix that, for each pair of locations, summarizes the fraction of partitions in which these locations are in the same community. Based on a partition encompassing a reasonably large number of air communities (n = 14), we subsequently obtain the average affinity for each airport to the communities in this partition. We assign each airport to the community for which it shows the highest average affinity, but we take into account its uncertainty by also considering assignments that yield affinities that are > 2/3 of the highest affinity score. This cut-off resulted in 771 ambiguous airport assignments. Finally, we partitioned the sequence data according to the air community assignment and accommodate 368 (24%) ambiguous sequence locations, i.e. those sequences related to airports with ambiguous community assignments, using ambiguity coding in our phylogeographic approach. We integrate genetic, spatial and air transportation data within a single full probabilistic evolutionary model and simultaneously estimate the parameters of phylogeographic diffusion using Markov chain Monte Carlo (MCMC) analysis implemented in BEAST [23] . We introduce novel models and inference procedures in the section below. To model sequence evolution, we partition the hemagglutinin codon positions into first+second and third positions [24] and apply a separate HKY85 [25] CTMC model of nucleotide substitution with discrete gamma-distributed rate variation [26] to both. We assume a flexible Bayesian skyride prior over the unknown phylogeny [27] . Exploratory runs using the data for the 26 locations indicated that a relaxed molecular clock represented an over-parametrization [28] . A strict clock was therefore used in subsequent analyses. Because the exact date of sampling was not known for some additional publicly available sequences, we integrated out their dates over the known sampling time interval [29] . We capitalize on BEAGLE [30] in conjunction with BEAST to improve computational performance on our large data sets. MCMC analyses were run sufficiently long to ensure stationarity as diagnosed using Tracer. We used the TreeAnnotator tool in BEAST to summarize trees in the form of maximum clade credibility (MCC) trees. We develop a novel model-based approach to simultaneously reconstruct spatiotemporal history and test the contribution of potential predictors of spatial diffusion. This approach builds on recently developed Bayesian phylogeographic inference methods to simultaneously reconstruct phylogenetic history and discretized diffusion processes [31] . These processes are modeled as continuous-time Markov chain processes parameterized in terms of a K x K infinitesimal rate matrix of discrete location change ( Λ). Here, we extend this model by adopting a generalized linear model (GLM) approach that considers every rate of movement (Λ ij ) in Λ as a log linear function of an arbitrary number of predictors X, such that: for n predictors, where β represents the effective size for the predictor log p, quantifying its contribution to Λ ij , and δ is an (0,1)-indicator variable that governs the inclusion or exclusion of the predictor in the model. The incorporation of indicator variables allows Bayesian stochastic search variable selection (BSSVS) [32, 33] , which estimates the posterior probabilities of all possible linear models that may or may not include the predictors. When an indicator equals 1, this predictor is included in the model, demonstrating that it helps to explain the diffusion process in the phylogenetic history with high probability. We complete this GLM specification with variable selection by assigning independent Bernoulli prior probability distributions on δ, effectively placing equal probability on each predictors inclusion and exclusion. Lemey [35, 36] to express how much the data change our prior opinion about the inclusion of each predictor. These BFs are calculated by dividing the posterior odds for the inclusion of a predictor with the corresponding prior odds (here, 1:1 odds). where p i is the posterior probability that the predictor is included, in this case the posterior expectation of indicator δ i , and q i is the prior probability that δ i = 1. We specify that a priori the β's are normally distributed with mean 0 and a variance of 4. We implement the GLM-diffusion parametrization in the software package BEAST [37] and approximate the joint posterior and its marginalizations using standard Markov chain Monte Carlo (MCMC) transition kernels. An important, novel extension to the standard MCMC machinery in BEAST lies in generating an efficient Metropolis-Hastings proposal distribution for the GLM coefficients β. Given the potential for high correlation between predictors X, attempting to update one coefficient β j at a time while holding the remaining constant returns high autocorrelations times. Instead, we exploit the fixed correlation structure X X between predictors to generate a multivariate proposal β . In particular, if we assume β are the current realized values, then we draw where α is an auto-tunable variance scalar. Motivation for this proposal stems from imagining that the marginal posterior distribution of β under our phylogenetic GLM should partially approximate a simple linear regression model involving β, whose posterior variance is proportional to X X. We consider a 'bit-flip' operator on the Bernoulli rate indicators; this transition kernel is further discussed in [38] . We extended the phylogeographic inference techniques to take into account ambiguous location states in order to accommodate the uncertainty of the modularity maximization procedure in assigning airports to distinct discrete air communities (see 3.2.2). Depending on the location state partitioning scheme, we considered several potential predictors of global influenza diffusion: • Average and minimum distance. To test whether geographical proximity predicts influenza diffusion we considered two different great-circle distance measures: (i) the average distance between two locations based on the pairwise distances between all pairs of airports from the two locations and (ii) the minimum distance amongst those pairwise distances. • Absolute latitude. Absolute latitudes for each region/community were calculated as the absolute values of the average latitudes of the sequence sampling locations (Sequences from unknown locations within specific countries were assigned to the capital of that country) and are listed in Tables 1, 2 and 3 . • Passenger flux. The total number of passengers traveling between each pair of locations per day (see 3.2.1). • Population size and density. Demographic estimates obtained from www.citypopulation. de or Geographica [39] are listed in Tables 1, 2 and 3. Origin and destination population sizes/densities were included as separate predictors. • Viral surveillance data. To test the predictive power of viral surveillance data, we essentially aimed at capturing the nature and degree of synchronicity of yearly incidence profiles in each region. To this purpose, we extracted the number of influenza viruses A(H3) detected per country from week 1 in 1997 to week 45 in 2010 from FluNet/WHO (www.who.int/flunet) for relevant countries in the discrete partition schemes. Taiwanese surveillance data was obtained from [40] . We focused on the influenza A(H3) incidence counts between 2002 -2007 or as close as possible to this time period when insufficient data was available. Average incidence counts were used when data from multiple countries per region/community was available. We subsequently calculated average incidences per week across multiple years for each region/community, normalized these weekly averages and smoothed them with a Gaussian standard deviation of 2 weeks. Figure 1 depicts the resulting incidence profiles for the 14 air communities. We derived several potential predictors from these incidence profiles, including incidence overlap, origin incidence versus destination growth rate, peak time difference, and incidence in the origin location at fixed times prior to peak incidence in the destination location. The incidence overlap summarizes the overlapping area under the origin-destination incidence curves for each pair of locations. The origin incidence versus destination growth rate sums the product of origin incidence and destination growth rate for each week of the year. Peak time difference quantifies the difference in peak incidence for each origin-destination pair. For the latter, we summarized the donor incidences at 4, 8, 12, 16, 20 and 24 weeks prior to peak incidence in the destination location as potential predictors. • Antigenic evolution. Because antigenic evolution can provide insights into the seeding dynamics of seasonal H3N2 [3] , we sought to include the average antigenic divergence for each region as phylogeographic diffusion predictors. Based on the available antigenic cartography data for the strains in our phylogeographic analyses, we performed a local regression (LOESS) of the principle antigenic component, obtained from a multidimensional scaling analysis of hemagglutination inhibition assay measurements [3] , against time. The resulting scatter plot with strains colored according to air community is presented in Figure 2 . Distances from the spline (residuals) were calculated for each antigenic measurement and average residuals were obtained for each region/community, which reflects whether a location is on average antigenically leading or trailing [3] . These average residuals are listed in Tables 1, 2 and 3. We considered the exponentiated residual and exponentiated negative residual as a measure of efflux and influx respectively for each location and included these as separate origin and destination predictors. • Sample sizes. To test the impact of sampling effects, we considered origin and destination sample sizes (number of H3N2 sequences included per discrete location state in the phylogeographic analysis) as separate predictors. Although sampling sizes are expected to have an impact on the number of location transitions, support for other factors in addition to sampling size predictors may suggest that they are robust to potential sampling biases. All predictors were transformed to log space and standardized prior to their incorporation in the GLM approach. Although we generally desire to simultaneously reconstruct sequence and discrete/continuous trait evolution using our Bayesian statistical framework, integrating over tree-space becomes a computationally daunting task for a large number of taxa. The main limiting factor in Bayesian MCMC analysis of evolutionary history is typically the efficiency with which topology proposals sample phylogenetic tree space [41] . To side-step these limitations and reduce time to convergence, we seek to approximate phylogenetic uncertainty in our phylogeographic estimates in cases where sampling tree space needs to be performed repeatedly (e.g. when comparing different diffusion models). To this purpose, we follow [42] and implement proposal mechanisms to randomly draw from an empirical posterior distribution of trees, which, in our case, were solely inferred from sequence data. Because the likelihood of a tree topology will largely be dominated by an informative sequence alignment compared to a single discrete (location) site, we expect such an empirical distribution to closely approximate the phylogenetic uncertainty in the joint inference approach. To identify key factors in seasonal influenza dispersal, we inferred the phylogeographic history of globally sampled A/H3N2 viruses between 2002 and 2007, while simultaneously evaluating the contribution of several potential diffusion predictors using a novel Bayesian model selection procedure. Our approach draws from recent developments in stochastic phylogenetic diffusion models [31] , and extends these by parameterizing pairwise diffusion rates as a function of a number of potential predictors, distinguishing between the statistical support for a predictor and the magnitude of its effect (see Methods). We discretized the global sampling locations into 26 and 15 geographically defined regions (Table 1 and 2 respectively) as well as into 14 distinct air travel communities (Table 3) . To identify this community structure in global air travel, we determined partitions with high intra-module connectivity and low inter-module connectivity in a passenger flux network of 1227 airports. Although this approach is blind to the geographic locations of the airports, the global air communities are spatially compact with few exceptions (Figure 3) . We compared a panel of possible predictors of phylogeographic diffusion using a generalized linear model (GLM) approach (see Methods). Here, we considered geography, air travel (in the form of the number of passengers traveling between each pair of locations), demography, viral surveillance data, viral phenotypic evolution and sampling sizes as possible explanatory variables (Figure 4 ). Our phylogeographic test does not attribute any importance to geographical proximity, absolute latitude (to model source-sink behavior between the tropics and Northern/Southern hemisphere, [15] ), population size and density, antigenic divergence or H3 incidence for the 14 air communities and the 15 geographic regions. Instead, we provide consistent and overwhelming evidence for passenger flow driving the global H3N2 diffusion dynamics, as reflected Table 3 . in the posterior support and a conditional effect size close to unity on a log scale for both the air community and geographic partitions. Despite efforts to down-sample presumably oversampled regions or communities (see Methods), there is still a role for sample size in both the air community and geographic partitions. Explicitly including sample sizes as diffusion predictors allows us to absorb potential effects of sampling bias, offering more credibility for other predictors that are included in the model. When applied to 26 geographic regions, which further partitions geographically and administratively coherent regions like US, China, Japan and Australia (Table 1) , our model also takes an important negative contribution from geographic distance and, correlated with this but with lower coefficients, population densities. Here, distance most likely represents the role of other human mobility processes such as commuting, which has been shown to play a key role in the spread of influenza in the US [9] . The negative population density effect may suggest that commuting is to some extent less likely to occur out of or into dense subpopulations. Influenza prevention and control critically relies on our understanding of its geographical transmission patterns. Here, we demonstrate the ability to jointly reconstruct phylogeographic history while identifying the factors that contribute epidemic spread from viral genetic data. Our analysis of global influenza transmission provides evidence for the key role of air travel, which is highly intuitive and has long been predicted by modeling studies (e.g. [5] ), but remained difficult to ascertain from empirical data. The predictors of influenza diffusion will undoubtedly be scale-dependent as indicated by the role of geographic distance within more confined geographic areas (Figure 4) , and this may represent other forms of human mobility such as commuting [9] , which can be tested in future applications. More generally, our novel phylogenetic diffusion approach may be applied to different infectious diseases problems and provide entirely new opportunities for testing how host ecology shapes the distribution of pathogen genetic data. Diffusion Fundamentals III (Leipziger Universitätsverlag Handbook of Optimization in Complex Networks Theory of Probability. Oxford Classic Texts in the Physical Sciences Geographica: the complete illustrated atlas of the world We thank Guandi Li for Matlab assistance, Hsin-Fu for providing the Taiwanese H3 incidence data and Jessica Hedge for collecting the coordinates for the H3N2 sequences.