key: cord-0780450-z55jjlct
authors: Polonsky, Jonathan A.; Baidjoe, Amrish; Kamvar, Zhian N.; Cori, Anne; Durski, Kara; Edmunds, W. John; Eggo, Rosalind M.; Funk, Sebastian; Kaiser, Laurent; Keating, Patrick; de Waroux, Olivier le Polain; Marks, Michael; Moraga, Paula; Morgan, Oliver; Nouvellet, Pierre; Ratnayake, Ruwan; Roberts, Chrissy H.; Whitworth, Jimmy; Jombart, Thibaut
title: Outbreak analytics: a developing data science for informing the response to emerging pathogens
date: 2019-07-08
journal: Philos Trans R Soc Lond B Biol Sci
DOI: 10.1098/rstb.2018.0276
sha: 89f41c87c8849ce37e609c1010087291a4679a37
doc_id: 780450
cord_uid: z55jjlct

Despite continued efforts to improve health systems worldwide, emerging pathogen epidemics remain a major public health concern. Effective response to such outbreaks relies on timely intervention, ideally informed by all available sources of data. The collection, visualization and analysis of outbreak data are becoming increasingly complex, owing to the diversity in types of data, questions and available methods to address them. Recent advances have led to the rise of outbreak analytics, an emerging data science focused on the technological and methodological aspects of the outbreak data pipeline, from collection to analysis, modelling and reporting to inform outbreak response. In this article, we assess the current state of the field. After laying out the context of outbreak response, we critically review the most common analytics components, their inter-dependencies, data requirements and the type of information they can provide to inform operations in real time. We discuss some challenges and opportunities and conclude on the potential role of outbreak analytics for improving our understanding of, and response to outbreaks of emerging pathogens. This article is part of the theme issue ‘Modelling infectious disease outbreaks in humans, animals and plants: epidemic forecasting and control’. This theme issue is linked with the earlier issue ‘Modelling infectious disease outbreaks in humans, animals and plants: approaches and important themes’.


the Middle-East Respiratory Syndrome coronavirus (MERS-CoV) [2] [3] [4] , the emergence of Zika [5, 6] and the West African Ebola virus disease (EVD) outbreak [7, 8] , have been potent reminders of the need for robust surveillance systems and timely responses to nascent epidemics [9] . The West African EVD outbreak, by far the largest such epidemic in recorded history, in particular, had a strong impact on global health security and public health policy and practice [7, 8, 10] . It highlighted the difficulties of maintaining situational awareness in the absence of standards for surveillance, data collection and analysis, as well as the challenges of mounting and sustaining a large-scale international response [7, 8, 11, 12] . Despite the lessons learnt [9, 13, 14] , the recent (2018) EVD outbreaks in Democratic Republic of the Congo [15, 16] are a stark reminder that a large number of these challenges remain.

An important feature of the modern response to epidemics is the increasing focus on exploiting all available data to inform the response in real time and allow evidence-based decision making [3, 4, 7, 8, 13, 17] . Using data for improving situational awareness is complex, involving a range of inter-connected tasks and skills from point-of-care data collection to the generation of informative situational reports (sitreps). The science underpinning these data pipelines involves a wide range of approaches, including database design and mobile technology [18, 19] , frequentist statistics and maximum-likelihood estimation [7] , interactive data visualization [20, 21] , geostatistics [22] [23] [24] , graph theory [20, 25, 26] , Bayesian statistics [8, 27, 28] , mathematical modelling [29] [30] [31] , genetic analyses [32] [33] [34] [35] [36] and evidence synthesis approaches [37] . This accretion of heterogeneous disciplines, which may be best summarized as 'outbreak analytics', forms an emerging domain of data science: an 'interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms' [38] , dedicated to informing outbreak response. Outbreak analytics sits at the crossroads of public health planning, field epidemiology, methodological development and information technologies, opening up exciting opportunities for specialists in these fields to work together to meet the needs for an epidemic response.

In this article, we outline this developing research field and review the current state of outbreak analytics. In particular, we focus on how different analysis components interact within functional workflows, and how each component can be used to inform different stages of an outbreak response. We discuss key challenges and opportunities associated with the deployment of efficient, reliable and informative data analysis pipelines and their potential impact.

The focus of the public health response shifts during the course of an epidemic or outbreak, and so do the analytics. We identify four main stages (figure 1). The detection stage starts with the first case and ends with the first intervention activities (e.g. patient isolation, contact tracing, vaccination) and involves surveillance systems and mostly qualitative risk assessments. Next, the early response is the initial part of the intervention during which the first simple analytics can take place, essentially centred around estimating transmissibility. This blends into the intervention stage, where more complex analytics may be involved to inform planning (e.g. vaccination strategies), which ends once the last reported case has recovered or died. The post-intervention stage is for lessons to be learned, for improving preparedness for the next epidemic and for training and capacity building [39] .

During the early response, efforts are dedicated to estimating the likely impact of the outbreak and anticipating the nature, scale and timing of resources needed [7, 13, 15]. Ideally, predictions of disease burden should take into account not only the total number of cases and fatalities but also the morbidity and overall impact on quality of life, as well as the societal and economic impact [40] [41] [42] [43]. In practice, the demographic and morbidity data needed by composite measures of health-adjusted life years [40] are generally lacking in outbreak response contexts, so efforts tend to focus on other proxies of impact: assessing transmissibility, predicting future case incidence and associated mortality and investigating risk factors [1, 3, 7, 15].

Analytical needs diversify as the intervention progresses. While investigations of transmissibility, mortality and risk factors remain key throughout [8], new questions may arise to inform the implementation of control and mitigation measures. These may focus on predicting the impact of potential measures including testing (e.g. 'Could a rapid test help reduce incidence?' [29]), vaccine development (e.g. 'Could a candidate vaccine be evaluated in this outbreak?' [44, 45]), vaccination campaigns (e.g. 'Which is the optimal vaccination strategy?' [46, 47]) or travel restrictions (e.g. 'Should international travel be restricted?' [48]), or on estimating the impact of current measures such as improvements in access to care (e.g. 'Has the delay between symptom onset and hospitalization been reduced?' [14, 15]). As case incidence reduces, statistical modelling can also be useful for assessing or predicting the end of an outbreak [49] [50] [51]. At the field operational level, outbreak response analytics may be best focused on informing and monitoring core surveillance activities and performance indicators, such as contact tracing [11], through the use of tools for contact data visualization [52] and mapping [53, 54], and of analysis pipelines integrating mobile data collection tools [18, 19, 55, 56] with automated reporting systems [57] [58] [59]. Finally, the post-intervention phase lends itself to retrospective studies, which can assess further the impact of interventions [60], tease apart finer processes driving the epidemic dynamics such as contact patterns [12, 61], study risk factors [54, 62], identify avenues for fortifying surveillance [13, 36, 63] and evaluate, improve and develop modelling techniques [28, 64, 65].

royalsocietypublishing.org/journal/rstb Phil. Trans. R. Soc. B 374: 20180276

(c) What are outbreak data?

The term 'outbreak data' encompasses different types of information, of which we first distinguish 'case data' from 'background data'. Case data include the description of reported cases gathered in linelists, i.e. flat files where each row is a case and each column a recorded variable (e.g. dates of onset and admission, gender, age, location), thereby fulfilling the definition of 'tidy data' in the data science community [66]. Case data also include exposure and contact tracing data, either stored within a linelist or in separate files, pathogen whole genome sequencing (WGS) and data pertaining to outbreak investigations (e.g. case-control and cohort study data). Background data document the underlying characteristics of the affected populations. This includes demographic information (e.g. maps of population densities, age stratification, mixing patterns), movement data (e.g. borders, traveller flows, migration), health infrastructure (e.g. healthcare facilities, drug stockpiles) and epidemiological data themselves (e.g. levels of pre-existing immunity). A final type of data we consider here is 'intervention data', which refers to information on decisions made and efforts deployed as part of the intervention, such as vaccination coverage, the extent of active case finding or potential changes in the epidemiological case definition. An in-depth discussion of data needs in outbreaks can be found in Cori et al. [13].
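To make the linelist layout concrete, the following minimal sketch parses a small flat file in which each row is a case and each column a recorded variable. The column names, the three example cases and the helper `read_linelist` are all hypothetical and chosen only for illustration; real linelists vary between outbreaks.

```python
import csv
import io
from datetime import date

# Hypothetical linelist: one row per case, one column per variable.
RAW = """case_id,onset,admission,gender,age,location
c01,2018-05-01,2018-05-04,f,34,A
c02,2018-05-03,,m,51,A
c03,2018-05-03,2018-05-05,f,8,B
"""


def read_linelist(text):
    """Parse a CSV linelist into a list of dicts with typed fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["onset"] = date.fromisoformat(row["onset"])
        # A missing admission date is kept as None rather than guessed.
        row["admission"] = (
            date.fromisoformat(row["admission"]) if row["admission"] else None
        )
        row["age"] = int(row["age"])
        rows.append(row)
    return rows


linelist = read_linelist(RAW)
```

Keeping one case per row means that later steps (incidence curves, delays, stratified summaries) can each be written as a simple pass over the same structure.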

We use the term 'outbreak analytics' to refer to the variety of tools and methods used to collect, curate, visualize, analyse, model and report on outbreak data. These tools and their inter-dependencies are summarized in an exemplary workflow represented in figure 2, derived from analysis pipelines used during recent epidemics of pandemic influenza [1], MERS-CoV [4] and EVD [7, 8, 17]. Note that workflows may vary substantially in other epidemic contexts. For instance, analyses of food-borne outbreaks may focus on traceback data [67] [68] [69], while vector-borne disease analysis may focus heavily on modelling the vector's ecological niche [70, 71].

Tools for data capture have become a focus of much discussion in recent years as those involved in outbreak response seek to make use of important technological advances including mobile data collection, cloud computing and built-in automated data analyses and reporting. In resource-limited settings, in particular, epidemiological data are still often collected with pen and paper, the advantages of which are familiarity, simplicity, low cost and reliability where access to Internet and power sources may be limited. However, paper has downsides as a data management tool, which become increasingly important in larger outbreaks, as any system for the printing, distribution, collection, storage and digitization of forms becomes overwhelmed. Additionally, two-stage processes involving transcription of data from forms typically introduce additional data entry errors [72] [73] [74] [75] and substantial delays from data capture to analysis [72].

Electronic data collection (EDC) is becoming increasingly popular [18, 19, 55, 56]. These tools make use of widely available, low-cost hardware (e.g. smartphones and tablets) [76] that can, when appropriately configured, consume little power and collect data offline, making them suitable for use in resource-poor settings. Some of these tools may be part of existing surveillance systems, while others are deployed specifically for enhanced surveillance and response activities during an outbreak. EDC platforms can also enhance data quality through the use of restriction rules and logical checks, and enforce reporting (even when there are zero cases) and entry of essential variables [72, 76]. EDC can decrease the delay between data collection, centralization and analysis, which is critical for data-driven responses. Time can be saved through 'form logic' (e.g. automatically skipping sections of a survey not relevant to a participant), while real-time, automated centralization, data analysis and reporting can be directly built into the platform. In addition, mobile-based EDC enables the collection of other types of data including GPS coordinates, photographs and barcodes (useful to link case data and clinical specimens), and can even aid diagnostics by interfacing directly with point-of-care diagnostic devices [77] [78] [79].
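The restriction rules and logical checks mentioned above can be sketched as a small validation function run at data entry. The specific rules below (required fields, an age range of 0 to 120, onset not in the future, admission not before onset) are illustrative assumptions rather than rules taken from any particular EDC platform, and `validate_record` is a hypothetical name.

```python
from datetime import date


def validate_record(record, today):
    """Return a list of human-readable problems found in one case record."""
    errors = []
    # Essential variables must be present (assumed: case_id and onset).
    for field in ("case_id", "onset"):
        if record.get(field) in (None, ""):
            errors.append(f"missing {field}")
    # Restriction rule: plausible age range (assumed bounds).
    age = record.get("age")
    if age is not None and not 0 <= age <= 120:
        errors.append("age out of range")
    # Logical checks between dates.
    onset = record.get("onset")
    admission = record.get("admission")
    if onset and onset > today:
        errors.append("onset in the future")
    if onset and admission and admission < onset:
        errors.append("admission before onset")
    return errors
```

A form would typically refuse submission, or flag the record for review, whenever the returned list is not empty.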

Maintaining confidentiality and privacy is a legitimate concern whenever data concerning human subjects are collected. While EDC systems provide opportunities for unauthorized interception and access to such information, many systems support end-to-end encryption during data transfer [80] , although few provide additional security through encryption at the level of data entry.

The first, and arguably one of the most important, steps in data analysis is exploration, where visualization plays a central role, complemented by informative summary statistics [81, 82]. The first type of graphic needed for rapid assessment of ongoing dynamics is the epidemic curve (epicurve), which shows case incidence time series as a histogram of new onset dates for a given time interval [83-85]. Cumulative case counts, sometimes used in the absence of a raw linelist, are best avoided in epicurves, as they tend to obscure ongoing dynamics and create statistical dependencies between data points that result in biases and lead to under-estimating uncertainty in downstream modelling [86].
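The counts behind an epicurve can be computed directly from onset dates. The sketch below is one simple way to do so, assuming bins of fixed width that start at the earliest onset; the function name `incidence` and the example dates are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def incidence(onsets, interval=1):
    """Count new onsets per bin of `interval` days, starting at the first onset.

    Returns a list of (bin_start_date, count) pairs, including empty bins,
    which is the series an epicurve histogram would display.
    """
    first = min(onsets)
    counts = Counter((d - first).days // interval for d in onsets)
    n_bins = max(counts) + 1
    return [
        (first + timedelta(days=i * interval), counts.get(i, 0))
        for i in range(n_bins)
    ]


onsets = [date(2018, 5, 1), date(2018, 5, 1), date(2018, 5, 3), date(2018, 5, 4)]
daily = incidence(onsets)
two_day = incidence(onsets, interval=2)
```

Empty bins are kept on purpose: days without cases are part of the dynamics and would be hidden by a cumulative curve.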

Maps have been at the core of infectious disease epidemiology from a very early stage [87]. Nowadays, they are typically used to visualize the distribution of disease [88], to represent the 'ecological niche' of infectious diseases at large scales [23, 24, 89] and to assess the spatial dynamics of an outbreak and plan interventions [7, 8]. Providers of free and crowd-sourced [90] ) and the Radiant Earth Foundation (Radiant Earth Foundation - Earth imagery for impact; see https://www.radiant.earth (accessed 18 November 2018)) provide layers of spatial data that include information on the location of households and health facilities, among other determinants. Several tools including SaTScan and ClusterSeer are routinely applied to surveillance system data for automated outbreak detection and the evaluation of clustering of disease by time and space [91]. Other examples of freely available mapping tools that can help track the spread of infectious diseases include the Spatial Epidemiology of Viral Haemorrhagic Fevers (VHF) disease visualization (see http://www.healthdata.org/datavisualization/spatial-epidemiology-viralhemorrhagic-fevers; accessed 19 September 2018), which maps risks of emergence and spread of VHF diseases, Nextstrain [92] and Microreact [93], which focus on mapping pathogen evolution and epidemic spread, and HealthMap [94], which provides resources for the rapid detection of outbreaks. Geographical locations of reported cases can also be useful for informing more complex modelling approaches [95].

In epidemics driven by person-to-person transmission, a last essential source of data is contact data [20], which includes data on case exposure [12] as well as contact tracing, where appropriate [11, 63].
Exposure data document transmission pairs, which can yield precious insights into 'paired delays' (figure 2) including the serial interval (time between onsets of a case and their infector) or the generation time (time between the dates of infections of a case and their infector) [7, 8] , which are in turn useful for estimating transmissibility [27, 28, 96, 97] . Exposure data can also be used to investigate the occurrence and determinants of super-spreading events [12] and help identify introduction events in the case of zoonotic diseases [98] . Contact tracing, through the early detection of new cases and their subsequent isolation and treatment, plays a central role in reducing onward transmission and therefore containing outbreaks [11, 63, 99] , while additionally providing potential information on risk factors [7, 11] .
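As a minimal illustration of a 'paired delay', the serial interval can be computed from exposure data once infector and case identifiers are linked to onset dates. The identifiers, dates and the function name `serial_intervals` are hypothetical.

```python
from datetime import date
from statistics import mean


def serial_intervals(onsets, pairs):
    """Days between the onset of each infector and the onset of their case.

    onsets: dict mapping case id to date of symptom onset.
    pairs: list of (infector_id, case_id) transmission pairs.
    """
    return [(onsets[case] - onsets[infector]).days for infector, case in pairs]


onsets = {
    "c01": date(2018, 5, 1),
    "c02": date(2018, 5, 9),
    "c03": date(2018, 5, 12),
    "c04": date(2018, 5, 20),
}
pairs = [("c01", "c02"), ("c01", "c03"), ("c02", "c04")]
intervals = serial_intervals(onsets, pairs)
mean_interval = mean(intervals)
```

The generation time cannot be obtained the same way in most outbreaks, because it requires dates of infection, which are rarely observed.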

Summary statistics are a useful complement to data visualization in the exploratory phase of data analysis. Some metrics, such as transmissibility, require the use of statistical or mathematical models in order to be estimated (see §3d below) and are therefore not readily available as descriptive tools. Other useful statistics can be readily computed from linelists, including different demographic indicators of the reported cases (e.g. gender, age, occupation [7, 100, 101] ), case fatality ratios (the proportion of cases who died of the infection) or case delays such as the times to hospitalization, recovery or death, reported as a whole [1, 7, 8] or stratified by groups [100, 101] . The incubation period (time from infection to symptom onset) is another important delay for informing the intervention (e.g. to define the duration of contact tracing or declare the end of an outbreak), but can be harder to derive as it requires data on case exposure as well. Note that in the case of delays, these are best analysed by characterising the full distribution (e.g. by fitting to an appropriate probability distribution such as discretized Gamma [7] ) rather than reported as a single central value [7, 8, 102, 103] .
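The following sketch shows two of these summaries. Two simplifications are assumed and are not from the text above: the case fatality ratio is computed only among cases with a known final outcome, and the delay distribution is summarized by a continuous Gamma fitted with the method of moments, a cruder stand-in for the discretized Gamma fit cited here. Function names are hypothetical.

```python
from statistics import mean, variance


def case_fatality_ratio(outcomes):
    """Proportion of deaths among cases whose outcome is known.

    Cases still in care (outcome None) are excluded, which is an assumption;
    including them would bias the ratio downwards during an ongoing outbreak.
    """
    known = [o for o in outcomes if o in ("died", "recovered")]
    return sum(o == "died" for o in known) / len(known)


def gamma_moments(delays):
    """Method-of-moments Gamma fit: returns (shape, scale).

    Uses mean = shape * scale and variance = shape * scale ** 2.
    """
    m, v = mean(delays), variance(delays)
    return m * m / v, v / m


cfr = case_fatality_ratio(["died", "recovered", "recovered", "died", None])
shape, scale = gamma_moments([2, 4, 6])  # e.g. days from onset to admission
```

Reporting the fitted shape and scale, rather than the mean alone, lets downstream models reuse the whole distribution.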

The 'transmissibility' of a disease is here used to refer to the rate at which new cases arise in the population, resulting either in epidemic growth or decline [1, 3, 27, 28]. Rather than an intrinsic property of a specific disease, transmissibility thus defined quantifies the propagation of a pathogen in a given epidemic setting and is impacted by multiple factors including population demographics, mixing and levels of pre-existing immunity. Importantly, estimates of transmissibility reported in the literature will typically be biased towards higher values, as subcritical outbreaks are by definition less likely to be detected. Several metrics of transmissibility can be used depending on the type of data available and can be estimated using different approaches.

A first measure of transmissibility is the growth rate (r), which is estimated from a simple model where case incidence is either exponentially growing (r > 0) or declining (r < 0). Typically, r is estimated directly from epicurves (figure 2) using a log-linear model, where r is defined as the slope of a linear regression on log-transformed incidence [104, 105]. Besides its simplicity and its computational efficiency, this approach has the benefits of being embedded in the linear modelling framework, thereby allowing one to measure the uncertainty associated with a given estimate of r, to test for differences in growth rates, e.g. between different locations, and to derive short-term incidence predictions. Moreover, the growth rate can also be used to estimate the doubling and halving times of the epidemic, i.e. the time during which incidence doubles (respectively is halved), as alternative metrics of transmissibility [103]. Unfortunately, the log-linear model can only fit exponentially growing or decaying outbreaks, which may not always be appropriate in the presence of complex spatial or age structure, or owing to changes in reporting, transmissibility or proportion of susceptible individuals over time. Besides, it cannot readily accommodate time periods with no cases, so that its applicability may in practice be restricted.
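The log-linear estimate of r can be sketched with ordinary least squares on log-transformed counts. This minimal version returns only the point estimate (no confidence interval), assumes equally spaced time steps, and uses hypothetical function names.

```python
import math


def growth_rate(incidence):
    """Slope of a least-squares line fitted to log(incidence) against time.

    Raises ValueError if any count is zero, since log(0) is undefined; this
    is the limitation of the log-linear model noted in the text.
    """
    t = list(range(len(incidence)))
    y = [math.log(c) for c in incidence]
    t_bar = sum(t) / len(t)
    y_bar = sum(y) / len(y)
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    return sxy / sxx


def doubling_time(r):
    """Time for incidence to double under exponential growth at rate r > 0."""
    return math.log(2) / r


r = growth_rate([1, 2, 4, 8, 16])  # counts that double at every time step
t_double = doubling_time(r)
```

For a declining epidemic (r < 0), the halving time is obtained the same way as log(2) / |r|.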

While r quantifies the speed at which a disease spreads, it does not contain information on the level of the intervention that is necessary to control a disease [106]. This is better characterized by the reproduction number (here generically noted 'R'), which measures the average number of secondary cases caused by each primary case. Researchers typically distinguish the basic reproduction number (R0 [104]), which applies in a large, fully susceptible population, without any control measures, from the effective reproduction number (Reff), which is the number of secondary cases after accounting for behavioural changes, interventions and declines in susceptibility [96]. The current reproduction number determines the dynamics of the epidemic in the near future, with values greater than 1 predicting an increase in cases, and values less than 1 predicting control [104]. The value of R can also be used to calculate the fraction of the population that needs to be immunized (typically through vaccination) in order to contain an outbreak [104].
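As a worked illustration of the last point, assume homogeneous mixing and an immunization that fully protects every recipient (simplifications not stated in the text). If a fraction p of the population is immunized, each case infects on average R(1 - p) others, and containment requires this quantity to fall below 1:

```latex
R\,(1 - p) < 1 \quad\Longleftrightarrow\quad p > 1 - \frac{1}{R}
```

Under these assumptions, R = 2 requires immunizing more than 50% of the population, and R = 4 more than 75%.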

Different methodological approaches have been developed to estimate the reproduction number. R can be approximated using estimates of the growth rate r combined with knowledge of the generation time distribution [97]. R can also be derived from compartmental models [104, 107]. The formula will depend on the type of model used, but such estimation will usually require that different rates (e.g. rates of infection, recovery, death) are either known or estimated by fitting the model to data [104, 107]. Real-world complexities can be incorporated into this approach; however, fitting such models can be challenging and may require computationally intensive algorithms such as data augmentation, approximate Bayesian computation or particle filters [108]. Compartmental models also require assumptions about the total population size and the proportion of the population at risk, which may be difficult to estimate in an outbreak. As an alternative, branching process models can be used to estimate R directly from incidence data [27, 28, 96, 109]. This requires a pre-specified distribution of the generation time, or of the serial interval, although recent developments suggest that in some cases, the generation time distribution itself can also be simultaneously estimated [4]. Branching process models are usually much simpler to fit to data than their compartmental counterparts, which facilitates their use in real time [27].
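Two of these routes can be sketched briefly. The first converts r into R assuming a Gamma-distributed generation time, for which the relation R = 1/M(-r), with M the moment generating function of the generation time, gives R = (1 + r * scale) ** shape. The second is a deliberately simplified, non-Bayesian point estimate in the spirit of branching process methods: incidence on a day divided by the infectiousness-weighted incidence of previous days. Both function names are hypothetical, and neither reproduces the cited implementations.

```python
def r_to_R_gamma(r, shape, scale):
    """R from growth rate r, assuming a Gamma(shape, scale) generation time."""
    return (1 + r * scale) ** shape


def instantaneous_R(incidence, w):
    """Point estimates of R for time steps 1..len(incidence)-1.

    w[s - 1] is the assumed probability that the serial interval equals s
    time steps. Each estimate is incidence[t] divided by the weighted sum of
    earlier incidence; None is returned where that sum is zero. Early
    estimates are unreliable when less history than len(w) is available.
    """
    estimates = []
    for t in range(1, len(incidence)):
        pressure = sum(
            w[s - 1] * incidence[t - s] for s in range(1, min(t, len(w)) + 1)
        )
        estimates.append(incidence[t] / pressure if pressure > 0 else None)
    return estimates


R_from_r = r_to_R_gamma(r=0.1, shape=1.0, scale=10.0)
R_series = instantaneous_R([1, 2, 4, 8], w=[1.0])
```

In practice, estimates would be smoothed over a time window and reported with credible or confidence intervals rather than as raw daily ratios.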

Beyond the mere estimation of transmissibility, it is often essential to forecast future incidence for advocacy and planning purposes, e.g. to compare different interventions and epidemic scenarios [7, 8, 15, 30]. A variety of mathematical and statistical models, including those reviewed here for estimating transmissibility, can also be used for short-term incidence forecasting [65]. Despite the growing body of research focusing on predicting incidence during epidemics [65, 110], there are currently no gold standards and the relative performances of forecasting methods largely remain to be assessed. Methods that have been developed and applied in other fields to rigorously assess not just the accuracy of forecasts but also how well models quantify the inherent uncertainty in making predictions, are only rarely applied in infectious disease epidemiology [111, 112]. Whether it is to estimate R or predict future incidence, the most appropriate method ultimately depends on the particular epidemiological setting, existing knowledge of the transmission dynamics and data availability. Branching process models, for example, can be used for a quick estimate of the current value of R from the recent trend in case numbers and, by extrapolating this forward, of expected case numbers in the near future [27, 28, 96]. Mechanistic or simulation models, on the other hand, aim to include a more explicit representation of the different factors that might influence transmission. They can be a more natural choice for assessing the expected impact of possible interventions, but they usually require careful parametrization and often intensive computation [29, 30, 45, 113], both of which can be challenging early in an outbreak when data are scarce and rapid turnaround crucial.
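Extrapolating a branching process forward can be sketched as a small stochastic simulation: each future day's incidence is drawn from a Poisson distribution whose mean is R times the infectiousness-weighted recent incidence. The sketch assumes a fixed, known R and serial interval distribution `w`, so it ignores uncertainty in both; the names `poisson` and `project` are hypothetical, and the Poisson sampler is only adequate for small means.

```python
import math
import random


def poisson(lam, rng):
    """Draw from Poisson(lam) by Knuth's multiplication method (small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def project(incidence, R, w, days, rng):
    """Simulate `days` further time steps of incidence.

    w[s - 1] is the assumed probability that the serial interval equals s.
    """
    series = list(incidence)
    for _ in range(days):
        t = len(series)
        pressure = sum(
            w[s - 1] * series[t - s] for s in range(1, min(t, len(w)) + 1)
        )
        series.append(poisson(R * pressure, rng))
    return series[len(incidence):]


rng = random.Random(1)
# Repeating the simulation gives a distribution of trajectories, from which
# prediction intervals (not just a single expected curve) can be read off.
trajectories = [project([4, 6, 5], R=1.5, w=[0.5, 0.5], days=7, rng=rng)
                for _ in range(200)]
```

Summarizing such trajectories by quantiles is one simple way to convey forecast uncertainty alongside the central projection.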

Analytical epidemiological studies use data to better describe outbreaks and populations at risk and inform real-time and subsequent response efforts. Typically, these are conducted during the intervention and post-intervention phases of an outbreak response (figure 1). They include observational designs such as retrospective cohort and case-control studies to identify risk factors and quantify associations between potential causes and their outcomes (typically, the disease in question), and experimental designs, such as randomized-control studies used to estimate the impact of interventions such as vaccination and treatments [114] . These studies reside outside of the normal scope of outbreak response activities, being inserted ad hoc as functions that are not necessarily routine response activities such as strengthening surveillance. In the case of observational epidemiological studies, data on exposures and outcomes are required, permitting estimations of the increased risk of disease among people exposed to risk factors of interest [54, 62, 115, 116] . In the case of experimental epidemiology, data on outcomes of interest are collected to permit estimations of heterogeneity among groups (e.g. in the presence/absence of intervention).
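For observational designs, the association between an exposure and disease is commonly summarized from a 2x2 table. The sketch below computes a risk ratio (as in a cohort study) and an odds ratio with an approximate 95% confidence interval based on the standard error of the log odds ratio (Woolf's method). The counts are invented for illustration and the function names are hypothetical.

```python
import math


def risk_ratio(a, b, c, d):
    """Risk in exposed divided by risk in unexposed.

    a: exposed cases, b: exposed non-cases,
    c: unexposed cases, d: unexposed non-cases.
    """
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Odds of disease in exposed divided by odds in unexposed."""
    return (a * d) / (b * c)


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Approximate confidence interval for the odds ratio (Woolf's method).

    Requires all four cells to be non-zero.
    """
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


rr = risk_ratio(20, 10, 5, 25)
oratio = odds_ratio(20, 10, 5, 25)
low, high = odds_ratio_ci(20, 10, 5, 25)
```

An interval that excludes 1 suggests an association, although confounding and small samples, both common in outbreak settings, call for cautious interpretation.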

The usefulness of such studies in informing outbreak response is highly context-dependent. Observational studies may be undertaken early on in the intervention phase to help identify ongoing infection sources of environmental, food-borne or water-borne nature [117] and to stop the outbreak at its source. In longer-running outbreaks, they can provide insights into opportunities for control [53, 115, 118] and inform global policy decisions that relate to outbreak response [119] . However, the time and expertise needed to prepare and implement these studies may preclude their application in the midst of an ongoing outbreak, so that the cost and benefits of such an undertaking need to be carefully weighed in emergency settings.

Whole genome sequencing of pathogens is increasingly affordable and reliable, and therefore more frequent in outbreak investigations [1, 120-126]. As technological advances are set to make real-time sequencing in the field a standard in the coming years [127, 128], genetic analysis will likely carve out its own space in the outbreak analytics toolkit.

Different methods can be used to extract information from pathogen WGS. In bacterial genomics, molecular epidemiology methods have been used extensively for defining strains of related isolates [32, 129] , which can be used to infer various features of the pathogens sampled such as their origins, antimicrobial resistance profiles, virulence or antigenic characteristics [130] [131] [132] . These methods usually exploit only a fraction of the information contained within pathogens' genomes, as they rely on genetic variation in a limited number of housekeeping genes [32, 129] . While these methods will likely remain useful in years to come, substantially more information can be extracted by using WGS to reconstruct phylogenetic trees, which represent the evolutionary history of the sampled isolates, assuming the absence of selection or horizontal gene transfers [133] . Different types of phylogenetic reconstruction methods can be used, including fast, scalable distance-based methods [134] or more computer-intensive approaches using a maximum-likelihood [135, 136] or the Bayesian framework [33, 137] . Phylogenies can be used to assess the origins of a set of pathogens [138] , patterns of geographical spread [125] , host species jumps [139, 140] , past fluctuations in the pathogen population sizes [141] and even, in some cases, the reproduction number [1] . Importantly, there is a growing tendency to analyse phylogenetic trees in the broader context of other epidemiological data (mainly geographical locations until now), which is facilitated by user-friendly Web applications [92, 93] .
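Distance-based methods start from a matrix of pairwise genetic distances between isolates. The sketch below computes the simplest such distance, the number of differing positions between aligned sequences of equal length; the toy sequences and function names are hypothetical, and building a tree from the matrix is left to dedicated phylogenetics software.

```python
def hamming(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b))


def distance_matrix(sequences):
    """Pairwise distances as a dict keyed by (name_i, name_j)."""
    names = sorted(sequences)
    return {
        (i, j): hamming(sequences[i], sequences[j]) for i in names for j in names
    }


sequences = {"iso1": "ACGT", "iso2": "ACGA", "iso3": "TCGA"}
distances = distance_matrix(sequences)
```

Isolates separated by few differences are candidates for close epidemiological links, which is one reason such matrices are also used as inputs when genetic and epidemiological data are analysed together.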

A further step towards integrating WGS alongside epidemiological data is the reconstruction of transmission trees (who infects whom) using evidence synthesis approaches. This methodological field has been growing fast over the past decade [25, 142-148], but most applications of these methods remain within academia and their usefulness in the field in an outbreak response context needs to be critically assessed. A potential benefit of accurately reconstructing transmission trees lies in the identification of multiple introductions, the quantification of the proportion of unreported cases and the detection of heterogeneities in individual transmissibility [145]. Unfortunately, the reconstruction of transmission trees is a difficult and computationally intensive problem. First, most diseases do not accumulate sufficient genetic diversity during the course of an outbreak to allow the accurate reconstruction of transmission chains, so that multiple data sources need to be used [35], making these methods more data-demanding than most other approaches in outbreak analytics (figure 2). In addition, the complex nature of the problem requires the use of Bayesian methods for model fitting, making these approaches difficult to interpret by non-experts [145, 146, 148].

In this article, we reviewed methodological and technological resources forming the basis of outbreak analytics, an emerging data science for informing outbreak response. Outbreak analytics is embedded within a broader public health information context that starts with disease surveillance systems, followed by risk assessment and management, the epidemiological response itself, and finishes with the production of actionable information for decision making. Part of the challenge that this new field will face in the coming years pertains to the seamless integration of data analytics pipelines within existing workflows. As responders can allocate only limited time to data analysis, analytics resources should produce simple, interpretable results, highlighting the most pressing issues that need addressing and monitoring all relevant indicators to inform the response.

Outbreak analytics and resulting outputs are central to the surveillance pillar of any outbreak response, yet resources and capacities to ensure data availability and quality are often limited owing to operational constraints [16]. Priorities in terms of data needs should be defined by the actionable information that the data can yield through the available analytics pipelines [13]. In this respect, we foresee that typical linelist data such as dates of events (e.g. onset, reporting, hospitalization, discharge), age, gender, disease outcome, geographical locations and exposure data will fulfil most needs, while other data such as WGS may only be useful for specific diseases and contexts [34, 35]. Intervention data are rarely collected but should be given more consideration, as they are key to assessing the impact and effectiveness of control measures, both during and after the operations. Similarly, data on the fraction of cases reported (and its variations through time), as well as behavioural changes (e.g. care-seeking behaviour) in the affected populations, can be very important sources of information for modelling [149].
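As an illustration of how much can be derived from a basic linelist, the sketch below computes reporting delays, weekly incidence by date of onset and a crude case fatality ratio from a toy dataset. The field names, the records and the helper functions are all hypothetical; a real linelist would require cleaning and validation before any such summary.

```python
from collections import Counter
from datetime import date
from statistics import mean

# Hypothetical linelist: field names and values are illustrative only.
linelist = [
    {"id": "c1", "onset": date(2019, 1, 1), "reported": date(2019, 1, 4), "outcome": "recovered"},
    {"id": "c2", "onset": date(2019, 1, 3), "reported": date(2019, 1, 5), "outcome": "died"},
    {"id": "c3", "onset": date(2019, 1, 8), "reported": date(2019, 1, 12), "outcome": "recovered"},
    {"id": "c4", "onset": date(2019, 1, 10), "reported": None, "outcome": None},
]


def reporting_delays(cases):
    """Days from symptom onset to reporting, for cases with both dates recorded."""
    return [(c["reported"] - c["onset"]).days for c in cases if c["onset"] and c["reported"]]


def weekly_incidence(cases):
    """Number of cases per (ISO year, ISO week) of symptom onset."""
    counts = Counter(tuple(c["onset"].isocalendar())[:2] for c in cases if c["onset"])
    return dict(sorted(counts.items()))


def case_fatality_ratio(cases):
    """Crude ratio of deaths among cases with a known final outcome."""
    known = [c for c in cases if c["outcome"] in ("died", "recovered")]
    if not known:
        return None
    return sum(c["outcome"] == "died" for c in known) / len(known)


print("Mean reporting delay (days):", mean(reporting_delays(linelist)))
print("Weekly incidence:", weekly_incidence(linelist))
print("Crude CFR among known outcomes:", case_fatality_ratio(linelist))
```

Restricting the case fatality ratio to cases with known outcomes is one simple design choice among several; it avoids counting cases still in care as survivors but does not correct for other biases.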

Fortunately, what we called 'background data' in this article can be gathered and shared outside of the epidemic context. Besides maps, population censuses, sero-surveys or genetic databanks, data on the natural histories of diseases derived from past epidemics, such as key delay distributions and transmissibility, can form a useful substitute for real-time estimates, especially in the early stages of outbreaks when such data may be lacking. While crowd-sourced initiatives are promising and have been used successfully in low resource settings [90], more efforts are needed to collate and curate open data sources, assess their quality and make them widely available to the community. We argue that international public health agencies and non-governmental agencies should play a central role in orchestrating such background data preparedness.
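To illustrate how background estimates might stand in for real-time ones early in an outbreak, the sketch below projects expected daily incidence with a deterministic renewal equation, in which expected new cases equal the reproduction number multiplied by past incidence weighted by the serial interval distribution. This is a minimal sketch under stated assumptions, not the implementation of any package cited here: the serial interval probabilities, the reproduction number and the case counts are hypothetical placeholders for values that would be taken from past epidemics.

```python
def expected_incidence(past_incidence, serial_interval_pmf, reproduction_number):
    """Expected cases on the next day: R * sum_s w_s * I_{t-s}.

    serial_interval_pmf[s - 1] is the probability that the serial interval
    equals s days (s = 1, 2, ...).
    """
    infectiousness = sum(
        weight * past_incidence[-lag]
        for lag, weight in enumerate(serial_interval_pmf, start=1)
        if lag <= len(past_incidence)
    )
    return reproduction_number * infectiousness


def project(past_incidence, serial_interval_pmf, reproduction_number, n_days):
    """Deterministic projection of expected daily incidence over n_days."""
    series = list(past_incidence)
    for _ in range(n_days):
        series.append(expected_incidence(series, serial_interval_pmf, reproduction_number))
    return series[len(past_incidence):]


# Hypothetical background values (placeholders, not estimates for any disease).
serial_interval_pmf = [0.2, 0.5, 0.3]  # P(serial interval = 1, 2, 3 days)
reproduction_number = 1.5
observed_incidence = [1, 2, 4]         # daily case counts so far

print(project(observed_incidence, serial_interval_pmf, reproduction_number, n_days=5))
```

A projection of this kind ignores stochasticity and uncertainty in the background parameters, both of which matter for real forecasts.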

Outbreak analytics is a developing field, and as such, there remain many gaps in terms of data collection, analysis and reporting tools. Some methodological challenges persist, such as better characterising forecasting methods [28, 64, 65], including spatial information and population flows into existing transmission models [95], and improving the integration of different types of data for reconstructing transmission trees [35]. To ensure that methods are transparent and available to analysts in any setting, they must be implemented as freely available, open-source software. Among other popular programming languages, such as Python, Java, or Julia, the R software [150] arguably offers the largest collection of free tools for data analysis and reporting, and an increasing number of packages for infectious disease epidemiology [20, 21, 27, 84, 145] may form a solid starting point for the development of a comprehensive, robust and transparent toolkit for the analysis of epidemic data [151]. Importantly, the use of a common platform for the development and use of outbreak analytics tools will also likely contribute to standardizing data practices, including collection, sharing and analysis.

A final point relates to the use and dissemination of these new resources: how can outbreak analytics best help improve public health? As noted by Bausch & Clougherty [39] , health science should not be an entity unto itself, but a means to an end. Insofar as it can help field epidemiologists collect, visualize and analyse data, and subsequently provide decision-makers with actionable information, outbreak analytics will likely occupy an increasing space in field epidemiology over the years to come. We foresee that the dissemination of free training resources [152], the modernization of field epidemiology training programmes and the deployment of applied data scientists to the field with a sustained capacity building in resource-poor and vulnerable countries will be instrumental in shaping the future of this emerging field of health science. 

Pandemic potential of a strain of influenza A (H1N1): early findings

Hospital outbreak of Middle East respiratory syndrome coronavirus

Middle East respiratory syndrome coronavirus: quantification of the extent of the epidemic, surveillance biases, and transmissibility

Unraveling the drivers of MERS-CoV transmission

Zika virus epidemic in the Americas: potential association with microcephaly and Guillain-Barré syndrome (first update)

Ebola virus disease in West Africa - the first 9 months of the epidemic and forward projections

West African Ebola epidemic after one year - slowing but not yet under control

Will Ebola change the game? Ten essential reforms before the next pandemic. The report of the Harvard-LSHTM independent panel on the global response to Ebola

A review of epidemiological parameters from Ebola outbreaks to inform early public health decision-making

Contact tracing performance during the Ebola virus disease outbreak in Kenema district

Exposure patterns driving Ebola transmission in West Africa: a retrospective observational study

Key data for outbreak evaluation: building on the Ebola experience

Ebola virus disease: 11 323 deaths later, how far have we come?

Outbreak of Ebola virus disease in the Democratic Republic of the Congo

Lessons learnt from Ebola virus disease surveillance in Équateur Province

Open Data Kit: Tools to build information services for developing regions

Open Data Kit 2.0: Expanding and refining information services for developing regions

epicontacts: Handling, visualisation and analysis of epidemiological contacts

epiflows: an R package for risk assessment of travel-related spread of disease (2018)

Local, national, and regional viral haemorrhagic fever pandemic potential in Africa: a multistage analysis

Mapping global environmental suitability for Zika virus. Elife 5, e15272

Mapping the zoonotic niche of Ebola virus disease in Africa

Reconstructing disease outbreaks from genetic data: a graph approach

Extracting transmission networks from phylogeographic data for epidemic and endemic diseases: Ebola virus in Sierra Leone, 2009 H1N1 pandemic influenza and polio in Nigeria

A new framework and software to estimate time-varying reproduction numbers during epidemics

A simple approach to measure transmissibility and forecast incidence

The role of rapid diagnostics in managing Ebola epidemics

Real-time analysis of the diphtheria outbreak in forcibly displaced Myanmar nationals in Bangladesh

Real-time modeling should be routinely integrated into outbreak response

eBURST: inferring patterns of evolutionary descent among clusters of related bacterial genotypes from multilocus sequence typing data

BEAST 2: a software platform for Bayesian evolutionary analysis

Pandemics: spend on surveillance, not prediction

When are pathogen genome sequences informative of transmission events?

The Global Virome Project

Evidence synthesis for stochastic epidemic models

Data science. Wikipedia, The Free Encyclopedia

Ebola virus: sensationalism, science, and human rights

The impact of infection on population health: results of the Ontario burden of infectious diseases study

Years lived with disability (YLDs) for 1160 sequelae of 289 diseases and injuries 1990-2010: a systematic analysis for the Global Burden of Disease Study

Global Burden of Disease Study

Global, regional, and national incidence, prevalence, and years lived with disability for 301 acute and chronic diseases and injuries in 188 countries, 1990-2013: a systematic analysis for the Global Burden of Disease Study

Introduction and methods: assessing the environmental burden of disease at national and local levels

Estimating the probability of demonstrating vaccine efficacy in the declining Ebola epidemic: a Bayesian modelling approach

Real-time dynamic modelling for the design of a cluster-randomized phase 3 Ebola vaccine trial in Sierra Leone

Yellow fever in Africa: estimating the burden of disease and impact of mass vaccination from outbreak and serological

Spread of yellow fever virus outbreak in Angola and the Democratic Republic of the Congo 2015-16: a modelling study

International risk of yellow fever spread from the ongoing outbreak in Brazil

A hypothesis test for the end of a common source outbreak

Assessing the risk of observing multiple generations of Middle East respiratory syndrome (MERS) cases given an imported case

Objective determination of end of MERS outbreak, South Korea

Surveillance and outbreak response management system (SORMAS) to support the control of the Ebola virus disease outbreak in West Africa

Descriptive epidemiology of typhoid fever during an epidemic in Harare

Geographic distribution and mortality risk factors during the cholera outbreak in a rural region of Haiti

EpiCollect: linking smartphones to web applications for epidemiology, ecology and community data collection

Innovative technological approach to Ebola virus disease outbreak response in Nigeria using the open data kit and form hub technology

R markdown: The definitive guide (2018)

Bookdown: authoring books and technical documents with R markdown

World Health Organization early warning, alert, and response system in the Rohingya Crisis

Estimating the impact of school closure on influenza transmission from Sentinel data

Role of social networks in shaping disease transmission during a community outbreak of 2009 H1N1 pandemic influenza

Risk factors for measles mortality and the importance of decentralized case management during an unusually large measles epidemic in eastern Democratic Republic of Congo in 2013

Role of contact tracing in containing the 2014 Ebola outbreak: a review

Real-time forecasting of infectious disease dynamics with a stochastic semimechanistic model

The RAPIDD Ebola forecasting challenge: synthesis and lessons learnt

Tidy data

Phylogenetic structure of European Salmonella enteritidis outbreak correlates with national and international egg distribution network. Microb Genom 2, e000070

Public health investigation of two outbreaks of shiga toxin-producing Escherichia coli O157 associated with consumption of watercress

A multi-country Salmonella enteritidis phage type 14b outbreak associated with eggs from a German producer: 'near real-time' application of whole genome sequencing and food chain investigations, United Kingdom

The impact of hotspot-targeted interventions on malaria transmission in Rachuonyo South District in the Western Kenyan highlands: a cluster-randomized controlled trial

Factors associated with high heterogeneity of malaria at fine spatial scale in the Western Kenyan highlands

A comparison of smartphone and paper data-collection tools in the Burden of Obstructive Lung Disease (BOLD) study in Gezira state

Quality assurance and quality control in the global trachoma mapping project

A novel electronic data collection system for large-scale surveys of neglected tropical diseases

A comparison of smartphones to paper-based questionnaires for routine influenza sentinel surveillance

Smartphone ownership and internet usage continues to climb in emerging economies

Evaluation of a mobile phone-based microscope for screening of Schistosoma haematobium infection in rural Ghana

Targeted DNA sequencing and in situ mutation analysis using mobile phone microscopy

Mobile phone-based biosensing: an emerging 'diagnostic and communication' technology

Enhancing data security in open data kit as an mHealth application

The R book

Ggplot2: elegant graphics for data analysis

surveillance: An R package for the monitoring of infectious diseases

OutbreakTools: a new platform for disease outbreak analysis using the R software

Incidence: compute, handle, plot and model incidence of dated events

Avoidable errors in the modelling of outbreaks of emerging pathogens, with special reference to Ebola

On the mode of communication of cholera (1855)

Atlas of human infectious diseases

Emergence and potential for spread of Chikungunya virus in Brazil

Radiant Earth Foundation - Earth imagery for impact

Spatial epidemiology of Viral Hemorrhagic Fevers

Nextstrain: real-time tracking of pathogen evolution

Microreact: visualizing and sharing data for genomic epidemiology and phylogeography. Microb Genom 2, e000093

HealthMap: global infectious disease monitoring through automated classification and visualization of Internet media reports

Spatiotemporal analysis of the 2014 Ebola epidemic in West Africa

Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures

How generation intervals shape the relationship between growth rates and reproductive numbers

Transmission scenarios for Middle East Respiratory Syndrome Coronavirus (MERS-CoV) and how to tell them apart

Utility of contact tracing in reducing the magnitude of Ebola disease

Ebola virus disease among children in West Africa

Ebola virus disease among male and female persons in West Africa

Epidemiological determinants of spread of causal agent of severe acute respiratory syndrome in Hong Kong

Epidemiology, transmission dynamics and control of SARS: the 2002-2003 epidemic

Infectious diseases of humans

A statistical algorithm for the early detection of outbreaks of infectious disease

A practical generation interval-based approach to inferring the strength of epidemics from their speed

Modeling infectious diseases in humans and animals

Inference in epidemic models without likelihoods

The R0 package: a toolbox to estimate reproduction numbers for epidemic outbreaks

Perspectives on model forecasts of the 2014-2015 Ebola epidemic in West Africa: lessons and the way forward

Probabilistic forecasting in infectious disease epidemiology: the 13th Armitage lecture

Assessing the performance of real-time epidemic forecasts

Measuring the impact of Ebola control measures in Sierra Leone

Epidemiology in medicine

Risk factors for cholera transmission in Haiti during inter-peak periods: insights to improve current control strategies from two case-control studies

Oswego County revisited

German outbreak of Escherichia coli O104:H4 associated with sprouts

The ring vaccination trial: a novel cluster randomised controlled trial design to evaluate vaccine efficacy and effectiveness during outbreaks, with special reference to Ebola

Time is of the essence: exploring a measles outbreak response vaccination in Niamey, Niger

Whole-genome sequencing for analysis of an outbreak of meticillin-resistant Staphylococcus aureus: a descriptive study

Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak

Transmission and evolution of the Middle East respiratory syndrome coronavirus in Saudi Arabia: a descriptive genomic study

Genomics and outbreak investigation: from sequence to consequence

Declaring a tuberculosis outbreak over with genomic epidemiology

Virus genomes reveal factors that spread and sustained the Ebola epidemic

Establishment and cryptic transmission of Zika virus in Brazil and the Americas

Real-time, portable genome sequencing for Ebola surveillance

High-throughput sequencing and clinical microbiology: progress, opportunities and challenges

Displaying the relatedness among isolates of bacterial species - the eBURST approach

The evolutionary history of methicillin-resistant Staphylococcus aureus (MRSA)

Development of a multilocus sequence typing scheme for the pig pathogen Streptococcus suis: identification of virulent clones and potential capsular serotype

Multi-locus sequence typing: a tool for global epidemiology

Inferring phylogenies

ape 3.0: New tools for distance-based phylogenetics and evolutionary analysis in R (2012)

Evolutionary trees from DNA sequences: a maximum likelihood approach

phangorn: phylogenetic analysis in R

MrBayes 3: Bayesian phylogenetic inference under mixed models

Genomic insights into zika virus emergence and spread

Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic

Unifying the epidemiological and evolutionary dynamics of pathogens

Integrating genetic and epidemiological data to determine transmission pathways of foot-and-mouth disease virus

Unravelling transmission trees of infectious diseases by combining genetic and epidemiological data

Relating phylogenetic trees to transmission trees of infectious disease outbreaks

Bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data

Simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks

Bayesian inference of infectious disease transmission from whole-genome sequence data

SCOTTI: efficient reconstruction of transmission within outbreaks with the structured coalescent

Accounting for behavioral responses during a flu epidemic using home television viewing

R: a language and environment for statistical computing

Carrion-Martin, Epidemiologists at Medecins Sans Frontieres (MSF, Operational Centre Amsterdam) for their additional reflections. The views expressed in this publication are those of the authors and not necessarily those of the National Health Service, the National Institute for Health Research or the Department of Health and Social Care. The authors alone are responsible for the views expressed in this article and they do not necessarily represent the views, decisions or policies of the institutions with which they are affiliated.