key: cord-0007528-qh1barua authors: Fajardo, David; Gardner, Lauren M. title: Inferring Contagion Patterns in Social Contact Networks with Limited Infection Data date: 2013-06-19 journal: Netw Spat Econ DOI: 10.1007/s11067-013-9186-6 sha: 9172b1b4657b544d5960b47fbf340876713ff8d1 doc_id: 7528 cord_uid: qh1barua The spread of infectious disease is an inherently stochastic process. As such, real time control and prediction methods present a significant challenge. For diseases which spread through direct human interaction, (e.g., transferred from infected to susceptible individuals) the contagion process can be modeled on a social-contact network where individuals are represented as nodes, and contacts between individuals are represented as links. The model presented in this paper seeks to identify the infection pattern which depicts the current state of an ongoing outbreak. This is accomplished by inferring the most likely paths of infection through a contact network under the assumption of partially available infection data. The problem is formulated as a bi-linear integer program, and heuristic solution methods are developed based on sub-problems which can be solved much more efficiently. The heuristic performance is presented for a range of randomly generated networks and different levels of information. The model results, which include the most likely set of infection spreading contacts, can be used to provide insight into future epidemic outbreak patterns, and aid in the development of intervention strategies. vaccination), local programs (e.g., health, emergency response), and also critically significant, the interaction patterns among individuals. Today, a large proportion of the population lives in increasingly dense conditions, an ideal environment for rapid disease transmission. 
The stochastic nature of the contagion process (i.e., contact between an infectious and susceptible person may or may not result in a new infection) makes it difficult to identify the path of infection or predict the impact that a new disease might have on a region. Over the last 100 years, significant research efforts have focused on predicting the expected spreading behavior of contact-based infectious diseases, exploiting characteristics of the population and the disease itself. However, there have been limited research efforts focusing on the use of future social network data: while current social network models are abstract constructs where people are anonymously represented as nodes, it is not unreasonable to expect developments in data collection (through Facebook, Twitter, Foursquare, etc.) which will allow accurate mappings between known individuals. Spatial analysis of networks, such as transport and communication networks, is a growing area of research (Gastner and Newman 2006; Schintler et al. 2007; Erath et al. 2009), and has recently been expanded to include social network modeling, specifically the ability to reproduce spatial structure and interaction between individuals for large-scale social networks (Illenberger et al. 2012). Furthermore, the ongoing development of activity-based travel models, which examine why, where and when various activities are engaged in by individuals (Lam and Huang 2003; Roorda et al. 2009; Ramadurai and Ukkusuri 2010), as well as innovations in pedestrian modeling (Hoogendoorn and Bovy 2005), present additional promising alternatives to generate social contact networks in the future. As such, it is critical to develop methods which can exploit this data in aiding the prevention and mitigation of contagion episodes.
The objective of the model proposed in this paper is to infer the spatiotemporal path of infection through a social-contact network for an ongoing outbreak scenario under the assumption that limited infection information is available. This work specifically considers contact-based diseases, which refer to the family of infectious diseases that are transmitted from an infected to a susceptible individual via direct contact. This category includes sexually transmitted diseases, various strains of the flu, SARS and the common cold, among others. In turn, the social contact network is representative of the social interactions (e.g., through school, work or home) which occur among a group of individuals in a given time period (e.g., a day). The problem approached in this paper considers the case in which the structure of the network is deterministically known (set of nodes and links), but time-of-infection data is available only for a fraction of the population. We further assume that no information is known about the infection tree (i.e., the set of social contacts through which the disease spread). We refer to this set of assumptions as the partial information version of the problem, in contrast to the full information case in which time-of-infection information is available for all infected nodes. In previous work by Gardner et al. (2012), an application of the full information problem was addressed, where the objective was to infer the most likely air travel routes responsible for spreading the Swine Flu to unexposed geographic regions. In that paper social contact networks were not considered, and the network structure was defined by the air traffic system. Generalization of the full information case to the partial information case introduces a significant increase in computational complexity. The partial information problem can be modeled as an integer program, and represents a considerably more difficult problem to solve than the full information case.
Heuristic solution methods are therefore developed based on sub-problems which can be solved much more efficiently. The model performance is based on how accurately it predicts the paths of infection for a given contagion episode (which are extracted from simulation outputs). The outcome of the model can provide insight into future epidemic outbreak behavior and aid in the evaluation and recommendation of intervention strategies. In addition, the proposed solution methodology can be extended to alternative contagion processes which occur atop known network structures (e.g., tracking food-borne outbreaks which propagate through a distribution network).

In the following section a literature review of relevant network models is provided. Section 3 defines the problem and section 4 presents the mathematical problem formulation and solution methodology. Section 5 describes the evaluation procedure and numerical results. Section 6 concludes the paper with discussion of the results and future research directions.

Dynamic contagion processes impact many network systems, and are therefore the focus of various studies within the emerging field of network science. In addition to the transmission of infectious disease through communities and biological systems (Murray 2002; Anderson and May 1991), several other phenomena can be modeled as contagion processes: the spread of information, ideas and opinions via social networks (Coleman et al. 1966; Hasan and Ukkusuri 2011); the global spread of computer viruses on the Internet (Newman et al. 2002; Balthrop et al. 2004); power grid failures in electricity markets (Kinney et al. 2005; Sachtjen et al. 2000); and the collapse of financial systems (Sornette 2003). Of interest to this study is the propagation of disease through a social contact network, which is therefore the focus of the remainder of section 2.
The infection rate and pattern of the disease spreading process through a network is dependent on both the parameters of the disease (infectious period, level of contagiousness, etc.) and the fundamental structure of the network. In efforts to predict expected disease spreading behavior and characteristics, epidemiological models span from extremely generalized and simplified analytical models to increasingly in-depth stochastic agent-based simulation tools. Analytical models are used to quantify the statistical properties of epidemic patterns (Colizza et al. 2006; Balcan et al. 2009); however, they are unable to capture certain behavioral aspects of the dynamics of disease spreading, and often lack detailed information about the network structure. In contrast, agent-based simulation models can be used to replicate possible spreading scenarios, predict average spreading behavior, and analyze various intervention strategies for a given network and disease while capturing a greater degree of detail, but in turn require a highly detailed set of input data (see Rvachev and Longini (1985), Epstein and Cummings (2002), Eubank et al. (2004), Hufnagel et al. (2004), Dibble and Feldman (2004), Cahill et al. (2005), Dunham (2005), Meyers et al. (2005), Small and Tse (2005), Carley et al. (2006), Ferguson et al. (2006), Germann et al. (2006), Ekici et al. (2008), Roche et al. (2011), and Haydon et al. (2003)). The most recent and comprehensive models provide a greater degree of realism, but are difficult to implement within the short time frames in which real-time control decisions must be made. Large-scale simulation models can also be computationally taxing because multiple runs are required to accurately predict expected outcomes. There currently exists a gap in the literature which calls for scenario-specific disease prediction models.
Most contagion models predict future potential outbreak scenarios based on system-wide information; however, they are not able to reconstruct the contagion process of an ongoing outbreak to reveal information about the current state of the network. Recent advances in disease modeling have begun addressing this issue. For example, there are models which use genetic sequencing data to analytically infer the geographic history of a given virus's migration (Drummond and Rambaut 2007; Lemey et al. 2009; Wallace et al. 2007; Cottam et al. 2008; Haydon et al. 2003). Often this approach involves first enumerating all possible evolutionary trees, then assigning posterior probabilities based on specifics of the respective virus's mutation rates. Additionally, the infection trees only include locations where samples were available. Jombart et al. (2009) proposed a novel approach to reconstruct the spatiotemporal dynamics of outbreaks from sequence data by inferring ancestries directly between strains of an outbreak using their genotype and collection date. The "infectious" links were selected such that the number of mutations between nodes is minimized. The idea of using infection data to construct the most likely path of transmission is the highlighted goal of this paper.

This study is motivated by the need to track viruses through space and time in order to aid in the implementation of real-time containment strategies. Often the required genetic data and mutation-based statistical properties are unavailable, or impossible to gather within the required time frame. The proposed approach relies instead on available infection reports, contact network structure and disease properties to infer the spatiotemporal path of infection through a contact network, data which can be more realistically gathered during an epidemic. The proposed methodology accounts for missing infection information, enabling previously overlooked infection sources to be included.
Using infection reports, contact network structure and disease properties, the methodology described in this section makes inferences about infection spreading patterns in a population. The problem assumes an underlying contagion process which can be represented on a network by a discrete-time, stochastic process. The following terminology is used for the remainder of this paper:
i. $t_i$, time stamps: the time period at which a node was reportedly infected, or predicted to be infected.
ii. $p_{ij}$, link transmission probability: the probability that an infected node i will infect a susceptible (and adjacent) node j in a single time step.
iii. 1 infectious period: the number of time steps an infected node remains infectious (i.e., is able to infect others) following its own infection. 1 can also represent the amount of time before recovery, hospitalization or some other type of removal from the network.

The problem objective is defined as follows: assume we are given a social contact network which has been exposed to infection, such as that shown in Fig. 1(a), in which a contagion process occurs resulting in a set of infected nodes (and corresponding time stamps for each) such as shown in Fig. 1(b). Assuming we are only given information on a subset of the infected nodes, such as the scenario shown in Fig. 1(c), we seek an infection tree, such as that shown in Fig. 1(d), that branches to all known infected nodes and maximizes the likelihood of the infection event. The social contact network G = (V, A) is formally defined by a set of nodes, V, which represent a population of individuals, and links, A, which represent physical daily contacts between individuals. The set N represents the set of individuals that became infected during the time period when population V was exposed to infection.
The set I represents the set of information nodes: a subset of the infected individuals N, which were identified as infected (i.e., they visited a doctor, hospital, pharmacy, etc.). The problem can be further broken down into two information-based cases:
I. Full information: The complete set of infected nodes and the time stamp, $t_i$, for each infected node is available, i.e., I = N.
II. Partial information: Information on a subset of the infected node set, I ⊆ N, is available. This problem serves as the more general version of the problem and is the focus of this paper.

Relaxing the full information assumption results in a more realistic setting where only a fraction of infected individuals consult a physician, visit a hospital, etc., resulting in partial information. The objective of the partial information case is again to determine the most likely set of infection spreading contacts when only a subset of the infected nodes are identified. One highlight of this study is its exploration of the performance of the proposed model under different levels of available information.

The relationship between the underlying contagion process and the mathematical programming formulation (presented in section 4) is of specific interest with regard to the problem definition. This section introduces the link-based infection process. The network-based contagion process is introduced in section 3.2.

The link-based infection process consists of a set of link trials which are the basic building blocks of the network level contagion process. In other words, a given infection scenario at the network level is the result of many individual link-based trials. Each link trial consists of the following evaluation: at a discrete time step t, assume node i is in an infectious state, node j is in a susceptible state, and the two nodes are connected by link (i,j) with a link transmission probability $p_{ij}$.
A successful link trial is defined as the event in which node i infects node j in time step t, and occurs with probability $p_{ij}$. The probability that a link trial is unsuccessful is therefore $(1-p_{ij})$. A simulation time step t is representative of the latent period, or the amount of time between when an individual contracts the disease and becomes infectious. The timestamp of node i, denoted by $t_i$, represents the time (e.g., day) at which individual i was infected.

We now consider two connected nodes, i and j. In calculating the probability associated with the inclusion or exclusion of link (i,j) in the infection tree, we must account for two events: either no infection trials are successful, or exactly one infection trial is successful. We denote the probability of no successful trials on (i,j) by $\gamma_{ij}$. More explicitly, $\gamma_{ij}$ represents the probability that the correct number of trials were unsuccessful so as to ensure that node i did not infect node j. The number of necessary unsuccessful trials is denoted $\Delta t_{ij}$. Its defining expression accounts for situations where node j was infected after i (corresponding to $t_j - t_i$), infected before i (corresponding to 0), or not infected (corresponding to $T - t_i$), and for the infectious period. The value of $\gamma_{ij}$ is given by expression (1):

$$\gamma_{ij} = (1-p_{ij})^{\Delta t_{ij}\,(1-x_{ij})\sum_{(k,i)\in A} x_{ki}}$$

The decision variable $x_{ij}$ is included so as to account for this term only if the link is excluded from the network (i.e., $x_{ij} = 0$) and node i has been infected (i.e., $\sum_{(k,i)\in A} x_{ki} = 1$). If either condition is not satisfied, the term will evaluate to 1. Similarly, the probability of exactly one successful trial on link (i,j), which we denote by $\alpha_{ij}$, can be calculated for nodes i and j such that $t_j > t_i$ as the probability of $\Delta t_{ij} - 1$ unsuccessful trials, and a single successful trial:

$$\alpha_{ij} = \left[(1-p_{ij})^{\Delta t_{ij}-1}\,p_{ij}\right]^{x_{ij}}$$

The decision variable $x_{ij}$ allows the expression to take on the correct probability expression if the link is included in the tree (i.e., $x_{ij} = 1$), and 1 otherwise.
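The two link-level probabilities described above can be illustrated with a short Python sketch. This is not the authors' code: the function names are hypothetical, and the helper `unsuccessful_trials` encodes one assumed reading of the count of failed trials (the case value described in the text, capped by the infectious period), which should be checked against the paper's own definition.

```python
def unsuccessful_trials(t_i, t_j, horizon, infectious_period):
    """Assumed reading of Delta t_ij: trials on link (i, j) that must fail.

    t_j is None when node j was never infected. The cap by
    infectious_period is an assumption made for this sketch.
    """
    if t_j is None:        # j never infected: i kept failing until the horizon T
        exposed = horizon - t_i
    elif t_j > t_i:        # j infected after i
        exposed = t_j - t_i
    else:                  # j infected before (or together with) i
        exposed = 0
    return max(0, min(exposed, infectious_period))


def prob_no_transmission(p_ij, delta_t):
    """Probability that all delta_t trials on the link fail."""
    return (1.0 - p_ij) ** delta_t


def prob_single_transmission(p_ij, delta_t):
    """Probability of delta_t - 1 failed trials followed by one success."""
    if delta_t < 1:
        raise ValueError("a transmission needs at least one trial")
    return (1.0 - p_ij) ** (delta_t - 1) * p_ij


# Hypothetical values: p_ij = 0.2, node i infected at t=1, node j at t=4.
dt = unsuccessful_trials(t_i=1, t_j=4, horizon=10, infectious_period=5)
excluded = prob_no_transmission(0.2, dt)      # link (i, j) left out of the tree
included = prob_single_transmission(0.2, dt)  # link (i, j) placed in the tree
```

Under this reading, the ratio of the two values depends only on the transmission probability, not on the number of trials, which is one way to see why the timing terms can later be separated from the link-selection terms.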
Combining expressions (2) and (3), we can develop an expression which represents the probability associated with both the inclusion and exclusion of a link. When $x_{ij} = 0$ and $\exists k: x_{ki} = 1$, the link is not included in the infection tree and the probability is equal to $(1-p_{ij})^{\Delta t_{ij}}$. When $x_{ij} = 0$ and node i has no infecting predecessor, the term evaluates to 1. If $x_{ij} = 1$ the link is included in the infection tree and the associated probability is equal to $(1-p_{ij})^{\Delta t_{ij}-1}\,p_{ij}$. In the next section we extend this result to the network level.

This work treats the network-based infection process as an iterative aggregation of individual link trials. We begin the simulation model by initializing all nodes to a susceptible state, and randomly choose a set of nodes to be infected (0 ∈ V). Then we simulate transmission of the disease over multiple time steps, t, for a predetermined simulation period, T. During each time step, we identify all links that connect infectious nodes and susceptible nodes, and perform an infection trial for each such link. If the link infection is successful, then we change the newly infected node status to "infected" in the following time step. The node remains infected for the duration of the infectious period, after which its status is changed to "recovered". Once a node is recovered it can no longer transmit the disease or become infected again (the equivalent of gaining immunity or being removed from the network). This process is representative of a discrete-time network Susceptible-Infectious-Removed (SIR) contagion process. The simulation model described in words above forms the basis for the mathematical formulation and evaluation presented in the remainder of this paper. It follows that the aim of this solution methodology is to replicate the actual infection tree for a specific outbreak scenario by exploiting node level infection information and the network structure.

Multiple simplifying assumptions are necessary to solve the proposed problem. This work assumes: i.
a priori knowledge of the underlying social contact network, G = (V, A)
ii. The contagion process can be approximated as a discrete-time network SIR contagion process with known transmission probabilities, $p_{ij}$
iii. An individual can be infected at most once, and thus only those diseases for which immunity is acquired after recovery are considered.
iv. Known timestamps $t_i$ for the set of information nodes, I

The first assumption is the most debatable of the four. Social networks are difficult to characterize, as they are not directly observable and are also highly unstable. However, the increase of social networking information available online and improvements in activity-based travel modeling both contribute towards the possibility of access to more detailed social contact information in the future. The second assumption is also present in many previous epidemiological models, and ongoing research is focused on accurately quantifying these parameters. The third assumption restricts the set of applications to those diseases for which acquiring immunity prevents an individual from being infected more than once over the entire course of an outbreak. Assumption 4 is based on the premise that some infected individuals report to a medical authority (i.e., hospital, private clinic, pharmacy), and that this information is made available.

In this section the mathematical formulation and proposed solution methodology for the partial information case (as defined in section 3) are presented. A nonlinear integer program formulation for the partial information case is given in section 4.1 and the solution method for the partial information case is described in section 4.2.

The partial information case represents the case where not all infections are reported. In this scenario, the fraction of missing information is unknown, i.e., the nodes that are unreported may or may not have been infected. These nodes are referred to as zero-information nodes, i ∈ N\I.
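The discrete-time SIR process of section 3 and the partial reporting just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' simulator: the function names, the adjacency-dictionary representation, and the `report_fraction` parameter are all hypothetical, and each node is assumed to attempt transmission during the `infectious_period` steps starting at its own timestamp.

```python
import random


def simulate_sir(neighbors, p, sources, horizon, infectious_period, rng):
    """Simulate one outbreak; return (timestamps, parent) for infected nodes.

    neighbors: dict node -> list of adjacent nodes
    p: dict (i, j) -> link transmission probability
    """
    timestamps = {s: 0 for s in sources}   # infection time of each infected node
    parent = {s: None for s in sources}    # infector of each node (the true tree)
    for t in range(horizon):
        newly = {}
        for i, t_i in timestamps.items():
            # Node i is infectious only for infectious_period steps after t_i.
            if not (t_i <= t < t_i + infectious_period):
                continue
            for j in neighbors[i]:
                if j in timestamps or j in newly:
                    continue               # j is no longer susceptible
                if rng.random() < p[(i, j)]:
                    newly[j] = i           # successful link trial on (i, j)
        for j, i in newly.items():
            timestamps[j] = t + 1          # status changes in the following step
            parent[j] = i
    return timestamps, parent


def sample_information_nodes(timestamps, report_fraction, rng):
    """Keep timestamps for a random fraction of infected nodes (the set I)."""
    infected = sorted(timestamps)
    k = int(report_fraction * len(infected))
    return {i: timestamps[i] for i in rng.sample(infected, k)}


# Hypothetical 4-node path network with uniform transmission probability.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
p = {(i, j): 0.5 for i in neighbors for j in neighbors[i]}
rng = random.Random(42)
timestamps, parent = simulate_sir(neighbors, p, [0], horizon=10,
                                  infectious_period=3, rng=rng)
information = sample_information_nodes(timestamps, 0.5, rng)
```

The `parent` dictionary plays the role of the actual infection tree against which an inferred tree could be compared, while `information` is what the inference problem would be allowed to see.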
The nodes with known timestamps are referred to as information nodes, i ∈ I. In the case of partial information, the objective is to determine the set of links which spread the infection, while simultaneously determining the time at which zero-information nodes were infected, if at all. What follows is a non-linear integer programming formulation:

s.t. $x_{ij} \le \sum_{k \in N} x_{ki} \quad \forall i \in N \setminus I$ (14)

The two decision variables are i) $x_{ij}$, which is set to 1 if link (i,j) is included in the infection tree and 0 otherwise, and ii) $t_i$, which is the timestamp assigned to node i. The objective (5) enforces that the set of links included in the final spanning tree maximizes the likelihood of the tree. The first constraints (6)-(8) provide the definition of $\Delta t$. The next two constraints enforce consistency between the x and t variables: if $x_{ij} = 1$, meaning i is the predecessor of j in the infection tree, then (9) guarantees that the infection time of j will be later than that of i. Constraint (10) ensures that the infection time of j will be within the time period when i is infectious, i.e., within one infectious period of the infection time of i. Constraint (11) fixes the timestamp variable $t_i$ for all information nodes. Constraints (12)-(14) enforce the spanning tree structure of the solution. Constraint (12) ensures that every known infected node, except the source node, is infected exactly once, i.e., has exactly one incoming link. Constraint (13) allows zero-information nodes to be part of the infection tree, but restricts them to have at most one predecessor. Constraint (14) ensures that only zero-information nodes that have been previously infected will be able to in turn infect other nodes. Constraints (15) and (16) force the decision variables $x_{ij}$ to be binary and $t_i$ to be integer. The objective function (5) can be transformed from a product of terms to an equivalent summation of terms by maximizing the natural logarithm of the objective function.
The last term of this expression represents the penalty associated with link (i,j) resulting in no infections based on the infection of node i. The individual terms of this summation can be redistributed among the expressions corresponding to each incoming link to i. Redistributing the terms, and expanding the first term, we obtain the following expression:

We can further simplify the above expression by examining the behavior of the term

Lemma 1 If the x variables are integer, the behavior of the expression

Proof. If $x_{ij} = 1$, because of constraints (13) and (14),

Using Lemma 1, we can simplify the formulation and cancel similar terms, arriving at the objective function shown in Eq. 17. The new formulation features additive terms rather than multiplicative terms, although it remains nonlinear due to interaction terms between the x and t variables. The set of constraints remains the same, resulting in the formulation below:

s.t. $x_{ij} \le \sum_{k \in N} x_{ki} \quad \forall i \in N \setminus I$ (24)

We will refer to constraints (18)-(26) as the IP constraints.

The full information case is a special, more tractable version of the partial information case presented above. Under the assumption that I = N, i.e., we know the full set of infected nodes, we need not optimize over the variables t, since they are all fixed, which allows us to make the problem linear. The simplified formulation allows us to exploit specific properties to develop a much more efficient solution method than solving this linear program directly. The properties of the full information case result in a spanning tree that branches to every node i ∈ I. The general problem of finding a directed maximum branching tree can be solved using a simplified version of the algorithm developed by Edmonds (1967). Edmonds' algorithm consists of maintaining an optimal sub-network that reaches every node, and works towards feasibility by replacing links that form a cycle in that sub-network.
As such, the cycle finding subroutine of the algorithm is the most computationally taxing part of the algorithm. A significantly more efficient algorithm can be developed for the full information version of the problem by pruning the set of links to be considered based on constraints (19) and (20). The resulting network is acyclic, which greatly simplifies the maximum branching procedure. Let the set of feasible links (i,j) ∈ L be such that t i t i ,t j