key: cord-150218-javbnjrg
authors: Gupta, Prateek; Maharaj, Tegan; Weiss, Martin; Rahaman, Nasim; Alsdurf, Hannah; Sharma, Abhinav; Minoyan, Nanor; Harnois-Leblanc, Soren; Schmidt, Victor; Charles, Pierre-Luc St.; Deleu, Tristan; Williams, Andrew; Patel, Akshay; Qu, Meng; Bilaniuk, Olexa; Caron, Ga'etan Marceau; Carrier, Pierre Luc; Ortiz-Gagn'e, Satya; Rousseau, Marc-Andre; Buckeridge, David; Ghosn, Joumana; Zhang, Yang; Scholkopf, Bernhard; Tang, Jian; Rish, Irina; Pal, Christopher; Merckx, Joanna; Muller, Eilif B.; Bengio, Yoshua
title: COVI-AgentSim: an Agent-based Model for Evaluating Methods of Digital Contact Tracing
date: 2020-10-30
journal: nan
DOI: nan
sha: 
doc_id: 150218
cord_uid: javbnjrg

The rapid global spread of COVID-19 has led to an unprecedented demand for effective methods to mitigate the spread of the disease, and various digital contact tracing (DCT) methods have emerged as a component of the solution. In order to make informed public health choices, there is a need for tools which allow evaluation and comparison of DCT methods. We introduce an agent-based compartmental simulator we call COVI-AgentSim, integrating detailed consideration of virology, disease progression, social contact networks, and mobility patterns, based on parameters derived from empirical research. We verify by comparing to real data that COVI-AgentSim is able to reproduce realistic COVID-19 spread dynamics, and perform a sensitivity analysis to verify that the relative performance of contact tracing methods is consistent across a range of settings. We use COVI-AgentSim to perform cost-benefit analyses comparing no DCT to: 1) standard binary contact tracing (BCT) that assigns binary recommendations based on binary test results; and 2) a rule-based method for feature-based contact tracing (FCT) that assigns a graded level of recommendation based on diverse individual features. We find all DCT methods consistently reduce the spread of the disease, and that the advantage of FCT over BCT is maintained over a wide range of adoption rates. Feature-based methods of contact tracing avert more disability-adjusted life years (DALYs) per socioeconomic cost (measured by productive hours lost). Our results suggest any DCT method can help save lives, support re-opening of economies, and prevent second-wave outbreaks, and that FCT methods are a promising direction for enriching BCT using self-reported symptoms, yielding earlier warning signals and a significantly reduced spread of the virus per socioeconomic cost.

DCT, however, is not free of disadvantages. Due to a wide range of privacy concerns about smartphone communications, DCT suffers from poor adoption by the public [10].

Additionally, most countries using DCT have adopted a simple form which informs all digitally-recorded contacts of cases confirmed through testing and recommends that they quarantine. We call these systems Binary Contact Tracing (BCT) because they recommend users either to quarantine or not (binary decisions) based on whether a past contact took place with a confirmed index case (binary input feature). COVID-19 is a challenging disease to mitigate with BCT for two primary reasons: (i) BCT currently relies on reverse transcriptase PCR (RT-PCR) tests, which have high false negative rates that depend on the disease phase; moreover, these tests are expensive, and results may take a long time to obtain [11, 12]; and (ii) the majority of transmissions of SARS-CoV-2 take place before the infector shows any symptoms, thereby reducing the likelihood that a potential infector would have been tested before transmission [13].

We observe that there is a wide variety of clues potentially available to a contact tracing app that would allow for non-binary, individualized recommendations, thereby offering significant improvements over BCT. We call these methods feature-based contact tracing (FCT), and hypothesize they could provide an important and effective means of reducing the spread of the disease, perhaps even more effectively than BCT at lower adoption rates.

Recognizing this potential, we propose COVI-AgentSim, a software testbed to design, evaluate and benchmark DCT methods using cost-benefit analysis in terms of lives saved, reduction in effective reproductive number (R_t) of the virus, disability-adjusted life years (DALYs) averted, and productive hours lost. By using an agent-based model (ABM) as the foundation of this testbed, we are able to simulate a rich set of individual-level input features. COVI-AgentSim can be adapted to a region of interest by providing appropriate demographics and contact pattern information for that region. It can then be calibrated to match published data for that region of interest.

We calibrate COVI-AgentSim to reproduce COVID-19 case and hospitalization data for the region of Montreal, Canada. In order to ensure the simulator is a fair and reliable testbed, we also check that the relative ordering of methods is preserved across wide ranges of simulator parameters and over several metrics. We propose a simple rule-based FCT method which leverages individual-level features to make non-binary recommendations, and compare this approach to BCT, and both to no DCT, via cost-benefit analyses. We find that both BCT and FCT methods are able to reduce spread of the disease, and our results echo those of recent research [14] suggesting that DCT methods can still save lives even at low adoption rates. We find evidence that FCT approaches, which leverage rich individual-level features to make graded recommendations, are promising for improving DCT even further.

Additionally, by stratifying DALYs over age groups, we observe the most DALYs averted per person for those over 80 years of age, even with low app adoption rates in that age group, thus showing the protective effects of younger people using DCT. These results are conservative in estimating the benefits for the most vulnerable populations, since we randomly assign DCT app usage proportional to smartphone usage, yet more vulnerable people (or those close to more vulnerable people) may be more likely to use DCT. Our results thus strongly support the usage of DCT methods as a component of effective public health strategies, and we hope COVI-AgentSim will be a useful resource for development, benchmarking, evaluation, and improvement of DCT methods.

Agent-based models (ABMs) are frequently used to study geospatial and other patterns of disease which vary at an individual level (e.g. [15, 16]). They are thus often useful for studying differential effects of policy decisions and interventions on different subgroups of the population; for instance, [17] use an ABM to study which post-lockdown measures most effectively protect the most vulnerable, in terms of disease incidence, mortality, and ICU occupancy, and [18] and [19] study patterns of COVID-19 spread in different representative locations (university, workplace, and high schools), and the impact of different intervention scenarios in each of these locations. One of the interventions studied by these works is DCT, and both find evidence that it can reduce ICU admissions and help curb the spread of the disease. ABMs are also useful for modeling the impact of outlier individuals or events such as super-spreading, e.g. [20], which are not easily captured by mathematical models of population-level spread.

Several works have studied the use of smartphone apps in epidemic management, e.g. [21, 22], and some work has begun specifically on COVID-19. For instance, Ferretti et al. [23] propose a mathematical model of infectiousness based on early epidemic data in China and compare binary contact tracing to manual tracing. They quantify the contribution of different transmission patterns (infection through symptomatic individuals, presymptomatic individuals, and from the environment) and the requirements for effective contact tracing. Assuming a 3-day delay in notification (and thereby in quarantine of the individual), the authors demonstrate that manual contact tracing (MCT) could not bring R_t below 1 and hence could not control the epidemic, whereas instantaneous contact tracing by a digital tracing application on smartphones could do so (R_t < 1).

Shamil et al. [24] follow a similar approach, but with an ABM taking into account realistic contact patterns, studying the potential efficacy of BCT in controlling the spread of the disease. They find strong dependence of the efficacy of digital contact tracing on app adoption, suggesting that BCT alone is insufficient to control a pandemic unless over 60% of the population is using the app. [14] find similar results, emphasizing however that even at very low adoption rates DCT is able to save lives. This suggests that DCT should be considered an important component of public health strategy for mitigating COVID-19. We find similarly that BCT and FCT are unlikely to control a pandemic on their own at low adoption rates (see Section 8), showing that even at 60% adoption rate these methods must be combined with other strategies to contain the disease.

Perhaps most similar to our work is Hinch et al. [25], who also propose an open-source ABM which allows manual and digital contact tracing methods to be compared, with benefits stratified across age. Their simulator was developed concurrently with ours, and similarities between the two approaches highlight the importance of several design decisions made independently but converging to the same solutions, e.g. the use of ABMs, a Python interface, and the need for empirical testbeds of this nature. A key difference in our simulator is the rich set of individual-level features (including e.g. pre-existing medical conditions), which allow us to benchmark feature-based contact tracing methods, and also allow for stratification over a larger variety of subgroups. The cost of this level of detail is computation; our simulator models smaller populations at higher fidelity for the same computational budget. We perform a scaling analysis (Appendix K) in order to ensure the dynamics we produce on these smaller populations are representative of larger populations. However, the simulator of [25] may be preferable for studying binary-only contact tracing methods, or when faster computation is needed relative to individual-level detail.

The simulator is an agent-based compartmental model [26] implemented in Python [27] and C [28], using SimPy [29], a process-based discrete-event simulation framework. For each agent the simulator tracks transitions through Susceptible, Exposed, Infectious, and Recovered (SEIR) states, as well as a variety of individual characteristics, including pre-existing medical conditions, self-reported symptoms, and test results. This rich set of individual features enables the simulation of contact tracing apps which make use of such features. At the same time, we parameterize our simulator using real-world data when available, and when no data is available we make weak assumptions and investigate the sensitivity of our results with respect to these assumptions (see Section 5.3).

COVI-AgentSim simulates the spread of the SARS-CoV-2 virus in a city through contagion events between agents. The simulator is initialized with a synthetic population along with mobility and contact patterns informed by census and empirically derived data. It can be configured easily for any region of interest (see Appendix F). Each agent i in the simulation has individual characteristics (e.g. age, sex, pre-existing medical conditions) denoted X_i. Dwelling characteristics, workplace association, and contact patterns are derived from age-stratified surveys and empirical studies (see Appendix E.2).

At the start of a simulation, a fraction α of the agent population is randomly exposed to COVID-19. Infection spreads through communities via contagion events at households, workplaces, schools and other random locations. Agents move around the city, transitioning between locations such as households, workplaces and other locations. The pattern of each individual's mobility (i.e., which locations they visit, how often they visit them, and how they interact with other individuals at these locations) is set according to [30, 31]. While at a location, agents sample contacts according to age-stratified contact matrices derived from [32]. A detailed discussion of agents' mobility patterns and location-dependent contact patterns is provided in Appendix E. Figure 1 compares simulated age-stratified contact patterns with surveyed matrices.

Virus transmission can take place any time an infectious and a susceptible individual are within 2 meters of each other for at least 15 minutes, thereby possibly transmitting viral load to a newly infected agent. We model the probability of COVID-19 transmission P according to [23]. Borrowing notation from the authors, briefly, this probability is proportional to the age-dependent susceptibility S_a of the susceptible agent with age a, a location-dependent multiplicative factor B_n (for location n), a ratio A_s that depends on the symptom status (asymptomatic, mild, severe) of the infectious agent, and a surrogate for the cumulative viral load (EVL) transmitted from the infectious agent over the duration δt. We discuss EVL in the next section. A proportionality factor r is used to calibrate the reproductive number of the disease spread. For the sake of completeness, a mathematical form of this transmission model is presented in Eq 1 and Eq 2.

where P(δt, S_a, A_s, n) is the probability of the contagion event. We use the same values for constants as used by the authors in their open-sourced code.
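As a rough illustration of the proportionality described above, the following Python sketch multiplies the factors named in the text and maps the product to a probability. The mapping 1 − exp(−λ) is an assumption made here for illustration (a common way to turn a rate into a probability); it is not a reproduction of Eq 1 and Eq 2. The function name and all numeric values are hypothetical.

```python
import math


def transmission_probability(r, s_a, a_s, b_n, evl_dose):
    """Hypothetical contagion probability for a single encounter.

    r        : calibration factor for the reproductive number
    s_a      : age-dependent susceptibility of the susceptible agent
    a_s      : symptom-status ratio of the infectious agent
    b_n      : location-dependent multiplicative factor
    evl_dose : surrogate for cumulative viral load over the encounter
    """
    # Product of the factors named in the text (assumed rate lambda).
    lam = r * s_a * a_s * b_n * evl_dose
    # Assumed mapping from rate to probability; keeps the result in [0, 1).
    return 1.0 - math.exp(-lam)


# Illustrative values only: a longer exposure carries a larger dose.
p_short = transmission_probability(r=0.5, s_a=1.0, a_s=0.3, b_n=1.0, evl_dose=0.2)
p_long = transmission_probability(r=0.5, s_a=1.0, a_s=0.3, b_n=1.0, evl_dose=0.8)
```

Under this assumed form the probability grows with each factor and saturates below 1, which is the qualitative behavior the text describes.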

After such a contagion event, we sample for the infected agent the variables controlling the course of the disease, including symptoms and severity. Additionally, we sample a time-series of a quantity proportional to viral load as measured by [33, 34], which we call effective viral load (EVL); this quantity represents the interaction of the virus with the host's immune system. We further discuss EVL, its dependence on age and pre-existing conditions, and how it is sampled in Appendix B. A model for sampling symptoms for each agent, conditional on whether the agent has a cold, flu, or COVID-19, is discussed in detail in Appendix J.

In COVI-AgentSim, we model RT-PCR testing and its relation to contact tracing applications. As discussed earlier, the disease trajectory for each agent depends on their age and pre-existing conditions. Thus, agents experience symptoms on a spectrum from none (asymptomatic) to severe, and agents who experience more severe symptoms are more likely to seek an RT-PCR test. Given the limited testing capacity at the onset of the pandemic, we model a testing facility with a fixed maximum capacity. For an infectious agent, the outcome of a test is modeled according to disease-phase-dependent false negative rates as per [35]. As an example, 4 days after a SARS-CoV-2 infection, the infected agent will have a false negative rate of 67%. Upon receiving a positive test result, the agent is put into self-isolation for d_max days, with the probability of not following this recommendation modeled via dropout parameters. We discuss details about testing in Appendix C and hospitalization in Appendix D.

Unlike existing COVID-19 ABMs, our agents are designed to follow varying levels of contact patterns. For example, the number of contacts sampled by an agent in level k at location l corresponds to a fractional reduction γ^l_k in contacts with respect to the pre-pandemic number of contacts for that location. These pre-pandemic numbers of contacts are available through surveyed studies [32], which we discuss in Appendix E. Thus, if there are n + 1 such levels 0, 1, ..., n, an agent at location l in level 0 will draw contacts using the pre-pandemic number of contacts (γ^l_0 = 0), and an agent at location l in level n is in quarantine, i.e. samples no contacts (γ^l_n = 1). The fractions γ^l_k for intermediate levels k ∈ {1, ..., n − 1} are obtained by an interpolation scheme (e.g. linear) between γ^l_0 and γ^l_n. Our choice is motivated by the desire to have a simple model grouping together the effect of choices individuals can make to reduce their likelihood of becoming infected, such as washing hands, wearing a mask, and physical distancing.
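The level scheme above can be sketched in a few lines of Python. This is a minimal illustration assuming linear interpolation (one of the schemes the text allows) and a single location; the function names and the example of 12 pre-pandemic contacts are hypothetical.

```python
def reduction_fractions(n_levels, gamma_first=0.0, gamma_last=1.0):
    """Linearly interpolate the reduction fraction for levels 0..n.

    n_levels is n + 1 and must be at least 2. Level 0 keeps pre-pandemic
    contacts (fraction 0) and level n is quarantine (fraction 1).
    """
    n = n_levels - 1
    step = (gamma_last - gamma_first) / n
    return [gamma_first + step * k for k in range(n_levels)]


def expected_contacts(pre_pandemic_contacts, gamma):
    """Contacts remaining after a fractional reduction gamma."""
    return (1.0 - gamma) * pre_pandemic_contacts


gammas = reduction_fractions(5)                       # levels 0..4
contacts = [expected_contacts(12, g) for g in gammas]
```

With five levels and 12 pre-pandemic contacts, the expected contacts fall in equal steps from 12 at level 0 to 0 at level 4.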

Each of these levels is further associated with a dropout parameter that represents the fraction of time an agent in level k will drop to level 0, returning to a higher level of activity (specifically, pre-pandemic numbers of contacts). COVI-AgentSim can be configured to quarantine individuals due to any of the following triggers: (a) a confirmed positive test, (b) self-reporting of symptoms, (c) a recommendation by the app, and (d) being a household member of any of the above cases. Additionally, in order to model population-wide mobility restrictions, we use a Bernoulli distribution with parameter β ∈ [0, 1] to sample contacts. Thus, an agent that usually draws 12 contacts will now draw β × 12 contacts on average. This modeling choice is a simplification of the varying degrees of government-imposed mobility restrictions, controlled by a population-wide β.
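The following sketch illustrates the two sampling mechanisms just described: Bernoulli thinning of contacts with parameter β, and dropout from a recommended level back to level 0. It is a simplified stand-in rather than the simulator's code; the function names, the seed, and the number of repetitions are assumptions.

```python
import random


def thin_contacts(candidate_contacts, beta, rng):
    """Keep each candidate contact independently with probability beta."""
    return [c for c in candidate_contacts if rng.random() < beta]


def effective_level(level, dropout, rng):
    """With probability `dropout`, fall back to level 0 (pre-pandemic contacts)."""
    return 0 if rng.random() < dropout else level


rng = random.Random(0)
# An agent that usually draws 12 contacts, thinned with beta = 0.5.
kept = [len(thin_contacts(range(12), beta=0.5, rng=rng)) for _ in range(10_000)]
mean_kept = sum(kept) / len(kept)   # close to beta * 12 = 6 on average
```

Averaged over many draws, the number of retained contacts approaches β × 12, matching the example in the text.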

DCT methods rely on capturing high-risk contacts and communicating risk information. The population-level app adoption rate is a key parameter because the fraction of contacts captured by this system will be proportional to the square of the app-using fraction of the population. To distribute apps throughout our simulated population, we use an age-dependent distribution across smartphone owners. We use an age-based breakdown of smartphone users as in [23], and use an UPTAKE parameter to vary the population-level adoption rate. If an agent is assigned an app, there is a provision to report individual characteristics like age and pre-existing conditions as well as daily symptoms. We further model dropout rates for reporting symptoms as well as a "drop-in" rate for falsely reporting symptoms, to account for malicious users and other confounding factors that produce symptom inputs, such as colds or flu.
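To make the quadratic dependence on adoption concrete, the short sketch below computes the fraction of contacts in which both parties carry the app. It assumes uniform mixing and a proportionality constant of 1, which ignores the age-dependent app distribution used in the simulator; the function name and the example adoption rates are chosen for illustration.

```python
def captured_contact_fraction(adoption_rate):
    """Fraction of contacts where both agents use the app (uniform mixing assumed)."""
    return adoption_rate ** 2


# Example adoption rates, chosen for illustration.
fractions = {a: captured_contact_fraction(a) for a in (0.30, 0.45, 0.60)}
```

Under these assumptions, 60% adoption captures roughly 36% of contacts, and halving adoption to 30% cuts the captured fraction to about 9%.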

When app users i and j have a significant encounter on day d (which can lead to contagion) the CT method can exchange messages between them, on the same day or later. We have considered two kinds of messages: a contact message exchanged on day d allows i and j to keep track of the encounter event (such as when it happened) as well as a handle allowing them to send each other messages later (potentially via some server), while minimizing breaches in privacy. A contact message could also contain information about the current degree of contagiousness risk they pose to each other on day d. Later, if i observes new clues suggesting their contagiousness on day d was actually higher, i can send an update message to j, with information which can help j update their own evaluation of contagiousness.

We denote by M^d_{i,j}(t) the risk message sent on day t by i to j regarding their encounter on day d. If t = d this is the initial (contact) message, whereas for t > d this is used to update the information j will have at day t about their encounter with i on day d. Note that i may send multiple update messages to j, as they acquire more information about having been infectious on day d. Additionally, we restrict the frequency of communication: if t is the time of the last risk message sent by i about some encounter, the next message is sent

The clues which agent i may use to come up with its risk messages include symptoms, test results, pre-existing conditions and received risk messages. The way to come up with these risk messages, as well as how to adjust behavior based on estimated risk, is specified by the CT method, such as those described below.

In digital contact tracing we have two goals: to reduce the individual's spread of disease (by recommending a reduction in mobility to risky individuals), and to inform contacts of additional risk. An agent's adherence to a recommendation level k entails sampling location-specific contact patterns according to that level (see Section 3.6). We denote agent i's behavior level on day d by ζ^i_d such that for a total of n + 1 behavior levels, ζ^i_d ∈ {0, 1, ..., n}. This recommendation level is determined by a risk estimator that uses a rich suite of features to evaluate agent i's risk history on day t. That risk history is denoted r^i_t, where we constrain risk levels to be non-negative integers with a maximum value of R_max. We use r^i_{t,d} to denote the estimated risk of agent i for day d such that t − d_max < d < t. If agent i had an encounter with agent j on day d such that t − d < d_max, then r^i_{t,d} is sent to j as a risk message M^d_{i,j}(t) = r^i_{t,d} (see Section 3.7). This determines what information is available to the agent's contacts so that they may modify their own behavior or propagate risk further.

Useful notation. We use N(i) to denote the set of indices of agents that had at least one digitally recorded contact with agent i in the past d_max days.

Further, we use the colon symbol (:) to iterate through all possibilities for the variable at that position. For example, M^d_{i,:}(:) represents risk messages from agent i to all agents j ∈ N(i) for the encounter on day d, if there was one. Similarly, M^:_{i,:}(:) represents risk messages sent by agent i to all agents in N(i) within the past d_max days.


The most common class of digital tracing methods, Binary Contact Tracing (BCT), can be viewed as a binary classifier whose final decision (behavior recommendation) is whether the agent should quarantine or not. Most often, the decision boundary is simple: did the individual have a recent contact with someone who received a positive RT-PCR test? We refer to these methods as Test-based BCT, which we describe formally next.

In the Test-based BCT method, for an agent i with a positive test result on day d (T^i_d = +1), agents j ∈ N(i) are notified and recommended to quarantine themselves as discussed in Section 3. We call this particular method BCT1 because it only affects the immediate contacts (and their household members) of individuals with positive test results; in Section 5.3 we show results for this baseline and for BCT2, where second-order contacts may also be quarantined.
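A first-order rule of this kind can be sketched as follows. This is a minimal illustration, assuming contacts and households are available as plain dictionaries of sets; the function name and data layout are hypothetical and do not reflect the simulator's message-passing implementation.

```python
def bct1_quarantine_set(positive_agents, contacts, household):
    """Agents recommended to quarantine under a first-order (BCT1-style) rule.

    positive_agents : agents with a positive test
    contacts        : agent -> set of digitally recorded contacts, N(i)
    household       : agent -> set of household members
    """
    to_quarantine = set()
    for i in positive_agents:
        for j in contacts.get(i, set()):
            to_quarantine.add(j)                      # immediate contact
            to_quarantine |= household.get(j, set())  # and their household
    return to_quarantine


contacts = {"a": {"b", "c"}, "b": {"a", "d"}}
household = {"b": {"e"}, "c": set()}
recommended = bct1_quarantine_set({"a"}, contacts, household)
```

In this toy example agent `d` is only a second-order contact of `a` and is therefore left out, which is the distinction between BCT1 and BCT2 drawn in the text.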

We describe a class of methods we call Feature-based Contact Tracing (FCT), which leverage the potentially rich set of features available on a smartphone to make graded recommendations. As discussed earlier, we make use of the following information available on an app to update agent i's estimated risk history r^i_d on day d:

M^:_{i,:}(:), and (e) previous estimated risk history r^i_{d−1}. The agent's estimated risk for the past d_max days is then propagated as discussed in Section 3.7. In addition to risk estimation, the agent's behavior is set to a level ζ^i_d based on the estimated risk for that day, i.e. r^i_{d,d}. In this section, we describe one such rule-based implementation of FCT, Heuristic-FCT, which forms the basis of our experiments.

With the help of available information about how COVID-19 spreads and manifests itself in an individual in the form of symptoms, we designed a rule-based FCT method. Specifically, for every available aforementioned feature type, we determine the agent's risk history for the past d max days. The agent's risk history on each day d is taken as the maximum across these per-feature estimated risks.

Broadly, Heuristic-FCT uses the following rules:

• Test results, T^i_d: the agent's risk is set to r_MAX = R_max if there is a positive test result in the past d_max days. This rule takes priority over any other rules (assuming that a positive test gives us maximum certainty about being in the top risk level).

• Symptoms, S^i_d: we identify three categories of symptoms based on how indicative they are of COVID-19. The presence of a highly informative symptom in S^i_d results in a high risk level r_HIGH; a moderately informative symptom results in a moderate risk level r_MODERATE; and a mildly informative symptom results in a mild risk level r_MILD. We assign these risk level values for the past d_max/2 days in r^i_d, similarly to [25].

• Risk messages, M^:_{i,:}(:): the risk of an agent receiving a risk message r_MAX is estimated to be ρ = r_HIGH, while one receiving r_HIGH is estimated to be at ρ = r_MODERATE, and so on. The rationale is that the level of risk decreases rapidly as we move away from one agent to its contacts, to the contacts of these contacts, etc. We then compute the duration of time when agent i could have been infectious if this contact had caused their infection as d + 1 < d < d. Thus, we set the agent's risk to this value ρ until d days in the past.

• Other rules: Some rules are designed to override the above rules. For example, an agent with a negative test on day d is assigned a risk of 0 from that day onwards. Further, an agent is estimated to be at risk level 0 if there has been no positive test in d_max days, no symptoms in the last d_max/2 days, and no high, moderate or medium risk messages in a certain past time horizon.

We present a formal description of this risk estimator in Appendix I.
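The rules above can be condensed into a small Python sketch. This is a simplified illustration and not the formal estimator of Appendix I: it computes a single-day risk rather than a full risk history, the numeric risk values are hypothetical, and continuing the "and so on" downgrade one step per level down to 0 is an assumption.

```python
# Hypothetical integer risk levels; only their ordering matters here.
R_MILD, R_MODERATE, R_HIGH, R_MAX = 1, 2, 3, 4

# Assumed downgrade applied to a received message: one level lower.
DOWNGRADE = {R_MAX: R_HIGH, R_HIGH: R_MODERATE, R_MODERATE: R_MILD, R_MILD: 0}


def heuristic_risk(positive_test, negative_test, symptom_risk, received_messages):
    """Single-day risk level under a simplified version of the rules above."""
    if positive_test:        # a positive test takes priority over all other rules
        return R_MAX
    if negative_test:        # override rule: a negative test resets risk to 0
        return 0
    # Each received message contributes a risk one level below its content.
    message_risk = max((DOWNGRADE.get(m, 0) for m in received_messages), default=0)
    # The day's risk is the maximum across the per-feature estimates.
    return max(symptom_risk, message_risk)


example = heuristic_risk(False, False, R_MILD, [R_MAX, R_MILD])
```

In the example, a mildly informative symptom combined with a received r_MAX message yields r_HIGH, since the maximum across per-feature risks is taken.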

In this section we seek to provide the following evidence: (1) our simulator produces a reasonable approximation of real-world COVID-19 dynamics, and (2) it is a reliable testbed for comparing contact tracing methods. We address (1) by checking that the output epidemiological characteristics match published literature (see Figure 2 and Table 1), and by checking that the hospitalization and mortality statistics are well-aligned with those found in real-world data (Figure 3). We address (2) by performing a sensitivity analysis of the simulator to a wide range of parameters, and checking that the ranking between contact tracing methods is preserved over these settings, for different metrics.

We calibrate the simulator so that the observed statistics in the simulator are similar to what is observed for COVID-19, plotting SEIR curves in Figure 2 and comparing statistics to published data in Table 1.

Reported simulator values are the mean µ in days and the corresponding one standard deviation σ, computed over 10 random seeds with a population size of 3000. It is important to note that these statistics are a result of the many processes happening within the ABM; there are no parameters that specifically encode these values.

We run 10 simulations using random seeds and plot the mean and standard deviation of the hospitalization and mortality statistics over 100 days. We compare these results to the real data reported in Montreal from the same time period. Our simulations use 30,000 people and we report results as a proportion of population. We see that under these settings the proportion of the population that is hospitalized or deceased aligns with the data from Montreal (see Figure 3).

A primary contribution of this work is the creation of an ABM which can act as a testbed for comparing COVID-19 contact tracing methods. While the majority of the parameters in our ABM are chosen according to published literature, much about COVID-19 is still unknown or uncertain. For this reason, we conduct a sensitivity analysis of some key parameters which exhibit high variance across different studies.

Specifically, we study the impact of the asymptomatic proportion of the population (see Figure 4), which is difficult to measure without widespread serological testing, and the initially infected proportion of the population (Figure 5). We observe that the relative efficacy of different methods (i.e. the ranking of methods) holds across a wide range of settings, a desirable characteristic for a comparison testbed.

We compare against the real hospitalization and mortality data (as a percentage of population) during the first wave of COVID-19 in Quebec. It is important to note that we only report a post-lockdown scenario. We report results with our simulator from 10 runs with a population of 30,000 people.

Figure 4: Sensitivity to asymptomaticity. These figures show a comparison between 4 contact tracing methods (including the No Tracing setting) while varying the proportion of the asymptomatic population between 0 and 100%. We plot two views of the same data with different Y-axes: fraction infected and R̂_t. The relative performance of the methods is consistent across rates of asymptomaticity and choice of Y-axis. We report the mean and standard deviation across 5 seeds with a population of 3000 people.

On the right, we see a counter-intuitive result: the R̂_t at the end of the simulation is lower for simulations which started with a larger initially infected population. This can be explained by noting that R̂_t in some sense reports the instantaneous change in rate of spread of the disease, while the fraction infected reports the absolute extent of the spread. Crucially, the ranking between the methods remains similar across choice of visualization and initially infected population. We report the mean and standard deviation for 5 runs under each condition with a population of 3000 people.

We focus our analysis on the region of Montreal for the period between March and June. Our agents are initialized with dwelling and workplace characteristics informed by publicly available census data [38]. We use n + 1 = 4 recommendation levels γ_i, with γ_0 = 0 (full pre-pandemic mobility), γ_4 = 1 (full quarantine), γ_3 as per post-lockdown contact patterns reported in [39], and the rest of the levels as γ_k = γ_{k+1}/2. Given the memory-intensive infrastructure required for managing risk messages, we run simulations on a smaller population size of 3000 people. We initialize each simulation with 0.2% of the population as infected (6 infections), and run 10 different seeds for each value of β ∈ {0.25, 0.275, 0.30, ..., 0.85}, resulting in a total of 420 runs. These values are chosen so that we can get estimates ∆R̂ of the change in the reproduction number of the virus, as discussed in Section 6.3.

We compare the Heuristic-FCT method proposed in Section 4.2 with the Test-based BCT method discussed in Section 4.1, and a No Tracing scenario. The No Tracing scenario corresponds to agents initialized at level 1 (post-lockdown contacts with some restriction advised) instead of level 0 (pre-pandemic contacts), in order to compare our methods in the context of a scenario where economies are gradually reopening. We use Test-based BCT1 and Test-based BCT2 to distinguish between the two variants when necessary; otherwise we use Test-based BCT to mean Test-based BCT1, which is the method being considered by most countries.

We compute the following metrics to evaluate various scenarios:

• Average number of contacts per day per human (C): We empirically compute the average number of daily contacts per agent in simulation.

• Proxy R (R̂): We use an empirical calculation to estimate the reproductive number R; we call this estimate R̂. At any timepoint in the simulation we may approximate R_t by R̂_t, by computing the infection tree and taking the ratio (number of children) / (number of parents), where parents are recovered agents. This ratio gives an approximate rate of growth of the tree. We use R̂_t at the end of the simulation as our R̂.

• Daily cases (%): Percentage of population exposed, i.e., new cases on any given day of a simulation.

• Cumulative cases (%): Percentage of agents in exposed, infectious, or recovered state up until a particular day of a simulation.

• Prevalence (%): Percentage of population in infectious or exposed state on any given day of a simulation.

• Incidence (%): Number of agents exposed per 1000 susceptible agents on any given day of a simulation. It is also referred to as attack rate.
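Two of the metrics above can be sketched directly from their definitions. The sketch assumes the infection tree is stored as a list of (infector, infectee) pairs and that "parents" means all recovered agents, including those who infected no one; both are interpretive assumptions, and the function names and toy data are hypothetical.

```python
def proxy_r(infection_edges, recovered):
    """Ratio of infections caused by recovered agents to the number of recovered agents."""
    parents = set(recovered)
    if not parents:
        return float("nan")  # undefined before anyone has recovered
    children = sum(1 for infector, _ in infection_edges if infector in parents)
    return children / len(parents)


def prevalence_pct(states):
    """Percentage of agents currently in the exposed (E) or infectious (I) state."""
    active = sum(1 for s in states.values() if s in ("E", "I"))
    return 100.0 * active / len(states)


# Toy infection tree: a infected b and c; b infected d; c infected e.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "e")]
r_hat = proxy_r(edges, recovered={"a", "b"})
states = {"a": "R", "b": "R", "c": "I", "d": "E", "e": "E",
          "f": "S", "g": "S", "h": "S"}
prev = prevalence_pct(states)
```

In the toy tree the two recovered agents caused three infections, so the proxy is 1.5, and three of eight agents are exposed or infectious, giving a prevalence of 37.5%.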

As the contact tracing methods proposed in this paper change agents' behaviour to varying degrees, it is crucial to compare different methods across similar social mobility restrictions. For example, BCT can only use levels 1 and n while FCT can use all levels from 1 to n. Therefore, at the same value of β, BCT will likely underperform as it cannot reduce infections by using intermediate levels. Thus, for a fair comparison of BCT with FCT, we run simulations with varying values of β ∈ (0, 1) and match them for average number of contacts.

To compare the performance of different methods for the same mobility restriction we empirically compute the pairwise difference in mean $\hat{R}$ for a fixed number of contacts C. This is achieved by obtaining the performance in terms of $\hat{R}$ of a method across a spectrum of values of C (by varying β), and fitting a Gaussian Process (GP) regression [40] to obtain a functional dependence for each method between C and $\hat{R}$. We denote this fitted regression for method A by $\hat{R}^{GP}_A$. To compute the advantage of method A over B, we find

Note that R = 1 is the threshold between exponential growth and exponential decay of the virus and serves as a good point of interest when comparing methods since the objective is to choose a method which brings R below 1, all else being equal (e.g., general restrictions on social mobility). Figure 6 shows a comparison of the GP regression fits between aforementioned DCT methods at 60% adoption rate, and a No Tracing scenario, across a range of values of C. We observe that Heuristic-FCT significantly improves over Test-based BCT by reducing $\hat{R}$ by 6.7%. Of note is the region around C = 5.61 ± 0.5 where $\hat{R} = 1.2$ for No Tracing. We compare the performance of DCT methods in this region to set our comparisons in the context of current scenarios (partial lockdown) where government-imposed restrictions keep R under control.

To further investigate the reason for the improvement of Heuristic-FCT over Test-based BCT, we examine the simulations in the concerned region of C. Figure 7 shows that on average, Heuristic-FCT exhibits lower incidence as well as prevalence on any given simulation day. This is expected as Heuristic-FCT makes use of far richer features than the binary test results used by Test-based BCT to evaluate an individual's risk of infecting others.

Finally, due to network effects, it is likely that adoption rate plays an important role in the performance of DCT methods. Thus, we evaluate DCT methods at various adoption rates. To do this, we run simulations of DCT methods at different adoption rates in a similar way as explained in Section 6.3, and focus our analysis on simulations chosen according to the selection criterion described above. Figure 8 shows the performance of the considered DCT methods at different adoption rates. As expected, the performance diminishes with lower adoption rates, for all the methods. However, we observe that Heuristic-FCT retains its advantage over BCT methods even at the lower adoption rates. At the same time, we also note that poor adoption brings both the DCT methods close to No Tracing, thereby making it more difficult to measure significant advantages over the partial lockdown scenario. Thus, we argue that by continuity while at any adoption rate DCT methods can save lives, adoption rate is crucial for their efficacy.

Figure 6: Pareto front. We compare DCT methods at 60% adoption rate, and a No Tracing scenario. For each method, a GP regression is fit as discussed in Section 6.3, approximating a trade-off between mobility and spread of disease. We plot the mean fitted function along with 95% confidence intervals. Relative to the No Tracing scenario, we observe a statistically significant 17.4% reduction in $\hat{R}$ by Heuristic-FCT as compared to 10.7% by Test-based BCT method.

Figure 7: Case curves. We investigate the dynamics of the simulations chosen as per the criterion mentioned in Section 6.4 to compare the performance of DCT methods at 60% adoption rate in the context of partial-lockdown. Shown are the mean with one standard error band of the metrics described in Section 6.3. Simulations under Heuristic-FCT exhibit lower attack rates (incidence) as compared to Test-based BCT method, thereby explaining the advantage visible in Figure 6.

Figure 8: Effect of adoption rates. Performance of DCT methods is heavily dependent on adoption rates. However, Heuristic-FCT retains its advantage over Test-based BCT across this spectrum.

COVID-19 has been a challenge for health agencies and economic policy decision makers alike. Countries around the world have taken unprecedented measures to prevent the collapse of health and economic welfare of people. Yet, at times, these two objectives have seemed to be at odds with each other in efforts to contain the pandemic. Therefore, we acknowledge that the evaluation of DCT methods should also stand on sound assumptions of utility maximization in economic theory. In this section, we attempt to relate simulation dynamics to economic outcomes that policy makers can use for decision making.

To assess the socio-economic burden of COVID-19, we examine the following metrics: (a) Disability-Adjusted Life Years (DALYs) averted [41] , a measure of lost years of healthy life, and (b) Temporary Productivity Loss (TPL), a measure of economic cost to society of restrictive measures. Of particular interest is the trade-off between these two measures. We use TPL per DALY averted to compare the cost-effectiveness of DCT methods with respect to the No Tracing baseline. This can be thought of as an Incremental Cost-Effectiveness Ratio (ICER), which answers the following question: how much more does each unit of additional benefit (in averted DALYs) cost with respect to the No Tracing baseline? A breakdown of the methodology used to calculate DALYs and TPL can be found in Appendix L. One assumption made in computing TPL is that total foregone work hours due to quarantine are scaled by a factor of 0.49 [42] to account for the proportion of agents able to work from home.

To assess the impact of the aforementioned CT methods on DALYs and TPL, we run 10 simulations of 3000 agents for each method. For a reliable economic analysis, it is necessary to have data on the full trajectory of simulations reaching a post-epidemic steady state comprised of agents that are either susceptible or recovered. Such a trajectory would help assess whether the DCT methods actually avert the loss of life years, or simply delay it. Since CT methods are used to stop individuals from spreading the disease, it is important to know whether the CT methods have successfully reduced the overall proportion of infected individuals, or if they have simply delayed these infections.

If the simulations are only run for 60 days with 0.2% infections seeded at the start, then the epidemic has still not reached a post-pandemic steady state, resulting in a possible overestimation of the benefits of the aforementioned CT methods. Thus, we run longer simulations of 90 days with 5% infections seeded at the start of simulations. This particular choice lets us draw preliminary insights into the cost-effectiveness of CT methods without having to scale up the simulations which are extremely compute intensive.

As a first step towards understanding simulation dynamics in terms of healthcare and economic costs, we compute DALYs averted and TPL for each scenario considered in Section 6 under the conditions described above. We use No Tracing as a reference scenario to independently evaluate differential benefit of DCT methods over doing nothing.

ICER for No Tracing. Under the assumptions discussed at the onset of this section, our experiments (results in Table 2) suggest that Heuristic-FCT saves almost six times more DALYs than Test-based BCT, a significant number of life years saved. However, we note that the TPL of implementing Test-based BCT is 30% that of Heuristic-FCT: higher proportions of agents are quarantined under Heuristic-FCT. Finally, we note that the cost per DALY averted of Heuristic-FCT is 58% of that of Test-based BCT. Thus, the cost per healthy year of life saved by Heuristic-FCT is lower than the Test-based BCT method.

ICER for Test-based BCT. To quantify the cost-effectiveness advantage that Heuristic-FCT provides over Test-based BCT, we also calculate the ICER of Heuristic-FCT over Test-based BCT. On average, Heuristic-FCT saves (58.12 − 10.01) = 48.11 more DALYs than Test-based BCT. However, it costs ($755K − $223K) = $532K more as well. Therefore, the ICER of Heuristic-FCT over Test-based BCT is $532K / 48.11 ≈ $11K per DALY averted.

DALYs per age group. Since there is evidence in the literature that COVID-19 disproportionately affects the elderly [43], we stratify the DALYs per person by age in Figure 9. It is highest among people aged 80-89 years for both CT methods, which is consistent with the literature. It is worth noting that the mean of Heuristic-FCT performs at least as well as Test-based BCT across all age ranges, and notably better for agents aged 60 and over.

Figure 9: Impact of different CT methods on DALYs, binned by age group. DALYs per person are total DALYs per age group over the size of each group. Heuristic-FCT consistently outperforms Test-based BCT across older age groups, most vulnerable to COVID-19.
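The cost-effectiveness arithmetic reported in this section can be reproduced with a short sketch. It uses only the TPL and DALY figures quoted in the text; the function and variable names are illustrative assumptions rather than part of the simulator.

```python
def icer(cost_a, cost_b, dalys_averted_a, dalys_averted_b):
    """Incremental cost-effectiveness ratio of method A over method B."""
    delta_dalys = dalys_averted_a - dalys_averted_b
    if delta_dalys == 0:
        raise ValueError("ICER is undefined when both methods avert the same DALYs")
    return (cost_a - cost_b) / delta_dalys


# Figures quoted in the text: TPL in thousands of dollars, DALYs averted.
FCT_TPL_K, FCT_DALYS = 755.0, 58.12
BCT_TPL_K, BCT_DALYS = 223.0, 10.01

# ICER of Heuristic-FCT over Test-based BCT, in $K per DALY averted.
icer_fct_over_bct = icer(FCT_TPL_K, BCT_TPL_K, FCT_DALYS, BCT_DALYS)

# Cost per DALY averted of each method, and their ratio.
cost_per_daly_fct = FCT_TPL_K / FCT_DALYS
cost_per_daly_bct = BCT_TPL_K / BCT_DALYS
cost_ratio = cost_per_daly_fct / cost_per_daly_bct
```

With these inputs the ICER comes to roughly $11K per DALY averted and the cost ratio to roughly 58%, matching the values stated above.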

We attempted to create an easily configurable platform to assist in the evaluation of DCT methods and variants thereof. Further, we proposed an FCT method that has the potential to improve upon BCT methods, which are widely used in practice. However, we would like to point out some of the limitations of our work.

First, although most of the parameters in COVI-AgentSim are informed by published literature, there are assumptions we had to make in the absence of available data. In this paper, we have listed most of these parameters and the corresponding references that inform them. It is important to note that as more becomes known about COVID-19 these parameters are subject to change, thereby affecting the results in Section 6. Additionally, we introduced intermediate levels of user behavior to contrast FCT with BCT. This is done by introducing a factor $\gamma_n$ for level n that represents the fractional reduction in number of contacts relative to the pre-pandemic number of contacts ($\gamma_0$), empirical estimates of which have been widely used in epidemiological modeling. To obtain intermediate levels, we used interpolation such that $\gamma_k = \gamma_{k+1}/2$. Although it is trivial to experiment with various interpolation schemes, in the absence of user-behavior research, little can be done in this regard other than providing a sensitivity analysis of the parameters involved. In addition, we foresee such assumptions being translated into government policies such that the resulting number of contacts approximates these assumptions.

Second, our modeling framework of FCT relies on certain assumptions about technology which might not hold in practice. For example, our assumptions of proximity detection using Bluetooth signals (Appendix G) might be unrealistic. However, the work in [44] on improving the reliability of Bluetooth signals can be a way to address this. Further, we assume the app to be active (in the foreground) so that it can communicate with nearby phones all the time, an assumption that might not hold depending on the app design. There are a variety of such technological and ethical considerations, discussed in [45], in designing a successful peer-to-peer FCT app. COVI-AgentSim incorporates most of the technological assumptions in [45]; however, with additional effort, other assumptions can also be incorporated in COVI-AgentSim to evaluate DCT methods in different settings.

Third, we designed the Heuristic-FCT method using rules that were informed by domain knowledge about COVID-19's spread characteristics. We acknowledge that there is room for improvement in these rules, which can be brought about by combining ideas from disciplines such as epidemiology, user behavior research, computer science, and statistics, to name a few. Alternatively, machine learning methods could be used to learn such rules. Thus, we think that a unified mathematical framework to analyze FCT methods might further help in the development of better FCT methods.

This work presents COVI-AgentSim, a simulation testbed for evaluation of DCT methods. COVI-AgentSim is an agent-based compartmental model that is initialized with a synthetic population sampled from the census data. Daily activities as well as interactions for each agent are sampled according to the empirically derived contact matrices. We calibrate COVI-AgentSim to approximate the spread of the COVID-19 virus in the region of Montreal; however, the simulator can be easily configured for other regions by changing the appropriate configuration file.

Finally, we propose the FCT class of contact tracing methods that utilize a richer set of input features as compared to BCT methods (which rely on binary signals like presence or absence of a positive test result). In doing this, we aim to provide infected individuals with a warning signal earlier than BCT methods. To put FCT in practice, we designed Heuristic-FCT which uses hand-designed rules to inform an individual's risk of infection and infectiousness to others.

Our empirical results show that Heuristic-FCT results in 6.5% improvement in $R_t$ over BCT methods, and both the methods themselves provide a significant improvement over a partial lockdown scenario. Experiments with varying adoption rates suggest that the efficacy of DCT methods is heavily dependent upon adoption rates. It is, however, observed that Heuristic-FCT retains its benefit over Test-based BCT method across the adoption rate spectrum, but this advantage was not statistically significant in the face of very low adoption rates, at the scale of our simulations.

Using an agent-based compartmental model as the foundation of this testbed allows us to simulate a rich set of individual-level features, which we show can potentially be leveraged by DCT methods to improve over the existing BCT methods. We hope that the baselines established in this work will encourage and enable the informed development of DCT methods as a first step in their responsible deployment as an epidemic intervention tool, potentially saving lives at lower economic cost during deconfinement and/or second-wave prevention.

Finally, this work joins a growing body of work in considering novel methodologies for rigorous evaluation of interdisciplinary technologies. Epidemiology is fundamentally an intersectional science, touching sociology, biology, behavioural psychology, geography, political science, ecology, mathematical and computational modeling, and many other fields, in a society which is increasingly digitized and globalized. By working together across fields, with careful empirical study, we have hope in dealing with the important issues we face.

A major direction for future work is to benchmark a wider variety of CT methods, including probabilistic and machine-learning based methods which could make even better use of the features our simulator provides than the Heuristic-FCT proposed here. The data generated by our simulator are potentially of interest for training such models to estimate individual-level characteristics which are predictive of the spread of the disease.

Our cost-benefit analysis using DALYs and TPL is a first step towards an integrated framework to help policy makers in their decision process. We see imbuing economically sound decisions in our simulated agents as a step in creating an integrated framework for richer evaluation of DCT methods.

Although COVI-AgentSim is designed to evaluate DCT methods, we foresee directions for the simulator to investigate the impact of a gamut of non-pharmaceutical interventions on containing COVID-19. For example, with some work and expertise in manual CT methods, one can compare and evaluate variants thereof. Another example is to analyse various COVID-19 testing strategies in conjunction with DCT methods. Studies of this nature could potentially help public health agencies as well as policy makers in their decision process. Further afield, though worth mentioning, COVI-AgentSim has the essential components for simulating an outbreak, thereby enabling adaptation to other infectious diseases like influenza or tuberculosis.

[19] Giulia Cencetti, Gabriele Santin, Antonio Longa, Emanuele Pigani, Alain Barrat, Ciro Cattuto, Sune Lehmann, and Bruno Lepri. Using real-world contact networks to quantify the effectiveness of digital contact tracing and isolation strategies for COVID-19 pandemic. medRxiv, 2020.

[20] Yunhwan Kim, Hohyung Ryu, and Sunmi Lee. Agent-based modeling for super-spreading events: A case study of MERS-CoV transmission dynamics in the Republic of Korea.

a test, this agent is added to the queue. For different symptom severities, we assign a different probability of seeking a test. At some points during the pandemic, Montreal experienced a restricted testing capacity. In order to model that setting, we limit the daily testing capacity to a proportion of the population (generally set to 0.1%).

To determine which people get tested under such a restriction, we allocate tests to people with more severe symptoms and app-based recommendations. After a test has been conducted, there is a delay before results are returned (2 days in our experiments). During this period, the individual who is being tested is isolated, and if the test returns positive then an additional 12 days of isolation is recommended. Any quarantine which is applied to the person receiving the test is also applied to other members of their household.
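The capacity-limited testing described above can be sketched as follows. The priority ordering (symptom severity first, app recommendation second), the tuple layout, and all names are assumptions made for illustration; only the 0.1% capacity, the 2-day result delay, and the additional 12 days of isolation come from the text.

```python
RESULT_DELAY_DAYS = 2          # delay before test results are returned
EXTRA_ISOLATION_DAYS = 12      # additional isolation after a positive result


def allocate_tests(candidates, population_size, capacity_fraction=0.001):
    """Pick which queued agents are tested today.

    candidates: list of (agent_id, symptom_severity, app_recommended) tuples,
                a hypothetical representation of the test queue.
    Agents with more severe symptoms, then those with an app recommendation,
    are served first (assumed tie-breaking rule).
    """
    capacity = int(population_size * capacity_fraction)
    ranked = sorted(candidates, key=lambda c: (c[1], c[2]), reverse=True)
    return [agent_id for agent_id, _, _ in ranked[:capacity]]


def isolation_days(test_positive):
    """Total isolation for a tested agent: waiting period plus any extension."""
    return RESULT_DELAY_DAYS + (EXTRA_ISOLATION_DAYS if test_positive else 0)


# Toy queue for a population of 3000 agents (capacity of 3 tests per day).
queue = [("a", 1, False), ("b", 3, False), ("c", 2, True), ("d", 2, False), ("e", 0, True)]
tested_today = allocate_tests(queue, population_size=3000)
```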

We adopt a simple model of hospitalization and of interactions within hospitals. To simulate post-lockdown scenarios, we assume no infections in hospitals. We adopt a probabilistic model of hospitalization where the likelihood of being admitted to the hospital or ICU depends on symptom severity and age. Mortality rates are conditioned on age group following data from Quebec public health [59]. The duration of a hospitalization, likelihood of requiring critical care, and mortality rate given critical care by age follow nationally conducted surveys available publicly 5 . The number of hospitals is defined in relation to the population, using the same ratio of hospitals to people as is found in Quebec (1.99 hospitals per 100,000 people). The number of hospital beds per capita and occupancy ratios are taken from 6 , and the number of ICU beds per capita and occupancy ratios are taken from 7 . Hospitals are staffed by doctors and nurses, who are modelled as people with a profession that requires they work at the hospital and have protected interactions with patients.

Literature on the link between underlying medical conditions and COVID-19 encompasses risk of being hospitalized due to COVID-19, risk of severe complications (e.g. mechanical ventilation) and risk of mortality. Preexisting medical conditions were used to model: 1) hospitalizations and deaths outcomes and 2) conditional probability of symptoms. To model hospitalizations and deaths in the population, we used risk ratio estimates from studies focusing on risk of hospitalisation and risk of death as outcomes. Risk ratios adjusted for other individual characteristics were preferred over crude estimates.

To inform the hospitalisation and death outcome model of the simulator, we selected the following risk ratios: diagnosis of heart disease, 1.17 [48]; stroke history, 2.16 [48]; asthma with recent oral corticosteroid use, 1.13 [48]; COPD, 1.08 [47]; cancer (excluding haematological malignancy), 1.72 [48]; diabetes, 2.24 [47]; obesity stages 1 and 2 (body mass index (BMI) = 30-39.9 kg/m²), 1.8; obesity stage 3 (BMI > 40 kg/m²), 2.45 [47]; CKD, 2.60 [47]; immuno-suppressed because of asplenia, 1.34, and because of immunosuppressive conditions (excluding asplenia and haematological malignancy), 1.70 [48]. Given the uncertainty on the association between smoking and COVID-19 prognosis [56], we did not consider this risk factor in the simulator.

In this section we describe contact patterns in the pre-pandemic situation. Scenarios of lockdown and contact patterns in intermediate behavior levels are modified by a factor $\gamma_n$ as discussed in Section 3.

We use the matrices empirically derived in 2017 for Canada from [32], which we further project onto Montreal's demographic structure. The projection of the country-wide matrix to a regional matrix is done via the method described in [60]. However, the ABM can be configured to bypass the step of regional projection of contact matrices. Given a discrepancy between the population-wide mean daily contacts inferred from the projected matrices and Montreal's number of contacts reported in a 2020 survey [39], we scale the projected matrices appropriately. To obtain the simulated contact patterns in this section, we ran 12 simulations with no infected agent, i.e. α = 0, to observe the pre-pandemic contact patterns. Additionally, the simulated contact patterns shown in this section are descaled with the same multiplicative factor that is used to scale the projected matrices.

As discussed in the main section, agents are grouped into houses according to census data [61]. Thus, we simulate the dwelling characteristics of the city of Montreal. However, the simulator can be configured easily for any other city by using appropriate parameter values. We consider houses of sizes ranging from 1 to 5. The age distribution of agents living solo also follows census data. Further, houses of sizes 2 to 5 fall into three broad categories of dwelling characteristics: (a) couple with x kids, (b) single parent with x kids, (c) random allocation, where x represents the number of kids required to complete the house size. For example, for a house with a single parent and of size 5, x = 4. The distribution of these characteristics also follows from census data. Finally, we also consider senior residencies where a proportion of agents above age 65 live. We inform this proportion from the census data as well 8 . We explicitly model older adults living in assisted care, resulting in oversampling of contacts in that age group. We discuss this further in Appendix E.2.
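The relation between house size and the number of kids x in categories (a) and (b) can be written as a small helper. This is a sketch under the assumption that a couple contributes two adults and a single parent one; the names are hypothetical.

```python
# Number of adults implied by each dwelling category (assumed values).
ADULTS_PER_CATEGORY = {"couple": 2, "single_parent": 1}


def kids_needed(house_size, category):
    """Return x, the number of kids required to complete a house of this size."""
    if not 2 <= house_size <= 5:
        raise ValueError("categories (a) and (b) apply to house sizes 2 to 5")
    return house_size - ADULTS_PER_CATEGORY[category]


# The example from the text: a single parent in a house of size 5 has x = 4.
x_single_parent = kids_needed(5, "single_parent")
x_couple = kids_needed(5, "couple")
```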

As a result of the housing allocation discussed above, we obtain a contact pattern as shown in Figure E.12. We make two observations: (a) there is an oversampling of contacts towards the older age groups, because older agents grouped in collectives like senior residencies are modelled explicitly. This choice was motivated by [62], which suggests inclusion of collectives in a proper response to the COVID-19 pandemic. (b) The slight discrepancy we observe in the intensity of the main diagonal is due to insufficient social gatherings at households.

We consider an age-dependent workplace allocation such that agents in each age group have a probability of attending a school or a workplace. We consider schools for the following age groups: (a*) 2-4 years old (y.o.), (b) 4-5 y.o., (c) 5-12 y.o., (d) 12-17 y.o., (e*) 17-19 y.o., (f*) 19-24 y.o., and (g*) 25-29 y.o., where * marks the age groups in which only a fraction (informed by census data) of the agent population was allocated to schools. Further, we assume 100% employment so that all agents older than 17 y.o. and younger than 65 y.o. were allocated a workplace. Agents in senior residencies are allocated a common room as their workplace where they get together during working hours. Such allocation of younger agents to schools gives a contact pattern as shown in Figure E

To model infections at locations other than workplace and house, we consider locations where agents remain for a relatively shorter duration as compared to house and workplace. Specifically, we model interactions at locations like restaurants, grocery stores, and parks. Note that this category of locations is also termed "other" in [32]. Further, as the mean number of contacts at a house is greater than the number of residents, it was important to consider socializing activities organized at houses. To do this, we maintain a pool of agents that an agent interacts with, and bring them together for a social activity at either a restaurant or the agent's house. We discuss the scheduling of these activities next. The contact pattern resulting at these random locations is shown in Figure E.15.

We consider adult supervision for agents below 14 y.o., i.e. except when the agent goes to school, at least one adult agent (older than 14 y.o.) has to be present all the time. Thus, we pre-plan the schedule of agents older than 14 y.o. at the start of the simulation, and plan the schedule of agents younger than 14 y.o. during the simulation. Planning the schedule takes into account workplace opening hours as well as regularly scheduled activities like social gatherings 9 , exercising, and grocery shopping 10 . Thus, an agent who has gone to a grocery store or a restaurant on one day will be less likely to go again during that same week, and so on. The schedule additionally depends on the day of the week. For instance, agents with school as their workplace are scheduled to be at school on weekdays, whereas most of their time will be spent at home on weekends.

On the day of the activity, however, these activities might be cancelled due to sickness, quarantining requirements, or hospitalization. In these situations, the location of the activity is appropriately changed for the required duration. At the same time, if an agent requiring adult supervision is sick, has to quarantine, or is hospitalized, an adult from the same house has to cancel their activity to stay with the agent. Of note, an agent's mobility, i.e. presence in locations other than the house, is reduced when they are experiencing symptoms (to a degree proportional to symptom severity). Note that we do not change the schedule unless the agent is quarantining, i.e. normal mobility is maintained all the time unless the agent is put in level n. Figure E.16 shows a breakdown of contacts at each location on weekdays and weekends.

We implement age-stratified contact sampling as informed by the empirically derived contact matrices described in Appendix E.1. Specifically, for each agent we draw the number of contacts as per the location-specific age-stratified number of contacts obtained from the contact matrices. We use a negative binomial distribution [64] to draw the number of contacts. Further, we use these matrices to infer the probability of interaction with other agents in each age group, thereby implementing location-dependent assortativity in interactions. We call these interactions encounters. Finally, we also draw the amount of time spent in each encounter as per the survey conducted in California [65], standardized to the demographics of Montreal. Thus, we obtain an aggregated contact pattern as shown in Figure E.17.

Figure E.16: Simulated mean daily contacts on weekdays and weekends broken down by age groups. Agent activities are scheduled such that the mean number of contacts on work and non-work days follow surveyed data as reported in [63, 37].
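The two sampling steps in this section (a negative binomial number of contacts, then an age group for each contact) can be sketched as below. The parameterization by mean and an integer dispersion, the toy contact-matrix row, and all function names are assumptions for illustration; the simulator's actual parameters are not reproduced here.

```python
import random


def sample_num_contacts(mean, dispersion, rng):
    """Draw a negative binomial count with the given mean.

    Uses the "failures before `dispersion` successes" construction with
    success probability p = dispersion / (dispersion + mean), so that the
    expected number of failures equals `mean`. `dispersion` must be a
    positive integer in this simple sketch.
    """
    p = dispersion / (dispersion + mean)
    failures = 0
    successes = 0
    while successes < dispersion:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures


def sample_contact_age_groups(contact_row, n_contacts, rng):
    """Pick an age-group index for each contact, weighted by one matrix row."""
    groups = list(range(len(contact_row)))
    return rng.choices(groups, weights=contact_row, k=n_contacts)


rng = random.Random(0)
# Hypothetical row of a location-specific contact matrix for one age group.
toy_row = [0.5, 3.0, 1.5]
n = sample_num_contacts(mean=sum(toy_row), dispersion=3, rng=rng)
encounter_age_groups = sample_contact_age_groups(toy_row, n, rng)
```

Weighting the age-group draw by the matrix row is what produces assortative mixing: age groups with larger entries are chosen more often.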

We briefly describe the required steps for customization of the simulator. Location-specific demographic and contact data may be modified simply by adding a new configuration file to the configs folder. Configuration files are written using YAML, a human friendly data serialization standard. Essentially, these files contain key-value pairs. The values for a new region must be specified and contained in the new configuration file. Examples of modifications required to model a new region include: population-level distribution of age, housing distribution i.e. number of houses from size 1 to 5, occupation characteristics including age for kids to go to schools, retirement age, etc. and, finally, how often people go out to stores, socialize,

We provide details of the messaging protocol that uses Bluetooth signals to exchange tokens. The privacy protocol which provides anonymous message exchange between phones which have exchanged these tokens is covered in more depth in another paper 11 . In the context of the current work, we focus on the description of how we model the Bluetooth communication range of phones and on the description of the clustering algorithm we use to prepare received messages for input to a risk prediction model.

Given a ground-truth distance and duration for an encounter, we want to determine whether two app users should exchange encounter messages. We only exchange encounter messages if the perceived distance is below 2 meters and the duration is greater than 15 minutes. To compute this, we use a naive model of the Bluetooth noise which uses a per-person noise and a per-encounter noise. Each smartphone has a different noise sampled uniformly between 0 and 1. Each encounter takes the mean of this value across both users. We then apply a relative offset to the real encounter distance by multiplying the combined user noise and a uniform random variable centered at 0 with range [−0.5, 0.5]. The magnitude of the distance offset is up to 2 metres when the real distance is 2 metres, and up to 0.5 metres when the ground truth distance is 1 metre.
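One way to read the noise model above is as a multiplicative perturbation of the true distance. The sketch below follows that reading, which is an assumption: the exact functional form used in the simulator may differ, and all names are hypothetical. Under this reading the offset magnitude is at most half of the true distance, scaled by the combined noise.

```python
import random


def perceived_distance(true_distance_m, noise_a, noise_b, rng):
    """Perturb the true distance by a relative offset (assumed form).

    noise_a, noise_b: per-phone noise levels in [0, 1].
    The per-encounter noise is a uniform draw in [-0.5, 0.5].
    """
    combined_noise = (noise_a + noise_b) / 2
    encounter_noise = rng.uniform(-0.5, 0.5)
    return true_distance_m * (1 + combined_noise * encounter_noise)


def should_exchange(true_distance_m, duration_min, noise_a, noise_b, rng,
                    max_distance_m=2.0, min_duration_min=15.0):
    """Exchange messages if perceived distance < 2 m and duration > 15 min."""
    distance = perceived_distance(true_distance_m, noise_a, noise_b, rng)
    return distance < max_distance_m and duration_min > min_duration_min


rng = random.Random(42)
# Each phone gets its own noise level once, sampled uniformly in [0, 1].
alice_noise, bob_noise = rng.random(), rng.random()
exchanged = should_exchange(1.2, 20.0, alice_noise, bob_noise, rng)
```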

An encounter message $m_{enc}$ is composed of the day the encounter occurred $d$ and the sender's quantized risk at the time of the encounter $r_d$. The formal definition of a risk message is:

An app user (say "Alice") receives new encounter messages four times per day in batches; we call the set of new encounter messages on day d, $M_{new}$. The risk messages which Alice has already clustered are noted $M_{enc}$ (which is an empty set when Alice first installs the app). It is useful to think of encounter messages as database records. Alice inserts records into her database whenever she encounters other app users ("Bobs") with their current risk estimate. Update messages are like database update statements, and are used to change the risk values in old encounter messages.

11 Full citation to be provided in the camera-ready version.

If a user (say "Bob") had many encounters with Alice, and Bob subsequently receives a positive test result, becomes sick (reporting symptoms), or otherwise re-evaluates his risk, then Bob will send risk update messages to Alice. Similar to encounter messages, these risk update messages may be sent through the server up to four times per day. More specifically, if Bob updates his estimate of his risk on some previous day $d_{old}$, then Bob will construct a risk update message for every encounter message that he had on $d_{old}$ and these messages are sent to the users with whom Bob had a contact. If this update message is sent on day $d$, it is composed of the current day $d$, his new risk $r_{new}$, his previous risk estimate for that day $r_{old}$ and the day of the encounter $d_{old}$. Therefore, the formal definition of a risk update message is:

$$m_{update} = (r_{new}, d, r_{old}, d_{old}) \in M_{updates}$$

By virtue of the strong privacy protocol in place, we are not able to create message clusters that correspond to the true chains of contacts for the encountered users. The only information we have to create clusters is the day and risk level of the encounters. Our hypothesis is that a user's contact patterns can provide a rough idea of the number of individual people they encounter on each day.

All encounter messages received on a given day with a similar risk level are put into the same cluster. Given that there are 16 risk levels, it is only possible to create 16 new clusters on a given day, unless the user receives update messages. Every update message is created for exactly one encounter message. If there are fewer update messages for a given day / risk combination than there are encounter messages for that day / risk, then we split the existing cluster for that day / risk into two, with one cluster containing the newly updated encounter messages, and another containing the rest of the messages.
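The clustering rule described above can be sketched as follows. The message and cluster types, the field order, and the handling of the case where every message in a cluster is updated (the cluster's risk is changed in place) are assumptions for illustration, not the app's actual data structures.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, NamedTuple


class EncounterMessage(NamedTuple):
    risk: int  # sender's quantized risk level (0..15)
    day: int   # day the encounter occurred


class UpdateMessage(NamedTuple):
    new_risk: int
    day: int       # day the update was sent
    old_risk: int
    old_day: int   # day of the original encounter


@dataclass
class Cluster:
    day: int
    risk: int
    size: int  # number of encounter messages in the cluster


def cluster_encounters(messages: List[EncounterMessage]) -> List[Cluster]:
    """Group encounter messages by (day, risk level)."""
    counts = Counter((m.day, m.risk) for m in messages)
    return [Cluster(day, risk, n) for (day, risk), n in sorted(counts.items())]


def apply_updates(clusters: List[Cluster], updates: List[UpdateMessage]) -> List[Cluster]:
    """Apply risk updates, splitting a cluster when only part of it is updated."""
    grouped = Counter((u.old_day, u.old_risk, u.new_risk) for u in updates)
    for (old_day, old_risk, new_risk), n_updates in grouped.items():
        target = next((c for c in clusters
                       if c.day == old_day and c.risk == old_risk), None)
        if target is None:
            continue  # no matching cluster; ignored in this sketch
        if n_updates < target.size:
            # Fewer updates than messages: split off the updated messages.
            target.size -= n_updates
            clusters.append(Cluster(old_day, new_risk, n_updates))
        else:
            # The whole cluster was updated (assumed behaviour).
            target.risk = new_risk
    return clusters


received = [EncounterMessage(2, 1)] * 3 + [EncounterMessage(5, 1)]
clusters = cluster_encounters(received)
clusters = apply_updates(clusters, [UpdateMessage(9, 3, 2, 1)])
```

In this toy run, one of the three day-1 messages at risk level 2 is updated to level 9, so the original cluster keeps two messages and a new single-message cluster is created.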

We do not claim that this clustering algorithm is optimal, that the input format to the neural network is optimal, or that this messaging protocol is optimal. There exists an interesting trade-off between privacy and risk precision which we hope will be explored in future work.

Agent-Agent Transmission. All encounters are assumed to be at a distance less than or equal to two meters. Transmission events can only take place between infectious and susceptible individuals. Although the ABM models all encounters taking place between individuals, a susceptible individual is only considered exposed (i.e. the model considers that an effective contact has taken place) when the encounter lasts a minimum of 15 minutes and a Bernoulli draw with probability determined via Equation 2 returns 1. The likelihood of viral transmission during a given effective contact is proportional to the time duration of the encounter, and depends broadly on characteristics of both the infected and the susceptible individual.

We use the transmission model explained in [23]. Encounters taking place in certain locations (i.e. senior residences and households) are also inherently more prone to result in a transmission event, all other factors being equal. This is modeled via $B_n$. At the same time, characteristics affecting the infectiousness of the infected individual include the progression of the disease (i.e. effective viral load, EVL) and whether (and to what extent) they are symptomatic ($A_s$). Susceptibility of the exposed individual depends notably on age ($S_a$).

Environment-Agent Transmission Note that the probability of environmental transmission is not supported by any published study, and we consider zero environmental transmissions in the experiments in this paper. However, the ABM can be configured to simulate environmental transmission. Empirical estimates put such transmissions at 10% as per [23]; we therefore provide a transmission model that captures environmental infections in a pre-pandemic scenario. Given the lack of such estimates for a post-lockdown scenario, we do not consider any such transmissions.

We model these transmissions by considering a linearly time-decaying probability of a location being contaminated, which is triggered by the presence of an infectious individual. The initial magnitude of contamination depends on the agent's current phase of the disease. Further, the duration of such contamination is informed by an experimental study [66], which lists surfaces and the duration for which the virus survives on them. We consider an experimental environmental transmission model that estimates the probability of infecting an agent as proportional to (a) the contamination strength of the location and (b) the susceptibility of the agent. The proportionality factor is the environmental transmission control knob, which lets us fit the modeled disease spread to the observed data.
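A minimal sketch of this environmental model is given below, under stated assumptions: the function names, the use of hours as the time unit, and the capping of the result at 1 are illustrative choices, not details taken from the simulator.

```python
def contamination_strength(initial_strength, hours_since_visit, max_duration_hours):
    """Linearly decaying contamination of a location.

    initial_strength   -- depends on the infectious agent's disease phase
    max_duration_hours -- how long the virus survives on the surface
    """
    if hours_since_visit < 0 or hours_since_visit >= max_duration_hours:
        return 0.0
    return initial_strength * (1.0 - hours_since_visit / max_duration_hours)


def environmental_infection_probability(strength, susceptibility, knob):
    """Probability proportional to contamination strength and susceptibility.

    knob is the environmental transmission control factor; the cap at 1.0
    is an assumption added so the result stays a valid probability.
    """
    return min(1.0, knob * strength * susceptibility)
```

Setting `knob` to 0 reproduces the configuration used in the paper's experiments, in which no environmental transmissions occur.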

We denote by $S_{\text{HIGH}}$ the set of row indices of $S^{i}_{d} \in \{0, 1\}^{N_{\text{symptoms}} \times d_{\max}}$ such that the symptoms at those indices are highly informative (see Sec. J.8). Similarly, we obtain the index sets $S_{\text{MODERATE}}$ and $S_{\text{MILD}}$.

We used $n + 1 = 4$ recommendation levels, thereby allowing an agent's behavior level to be any of {1, 2, 3, 4}. We also denote the set of days $\{d_1, d_2, \ldots, d_{\max}\}$ as $D$. The Heuristic-FCT algorithm is implemented as Algorithm 1. We next describe each function of the Heuristic-FCT.

In each of the following sections we detail the probability of a given symptom in each of the 5 phases of the disease.

The probability of a person having fatigue as a symptom is given in the first column, while the probabilities of having the other symptoms given that this person has fatigue are given in the columns following the Fatigue column. In order to have unusual symptoms, the person must also be over 75 years of age; that is, the probabilities in the table given for unusual symptoms are the probabilities of having unusual symptoms given this person is over 75 years of age and is experiencing fatigue.

The probability of having trouble breathing as a symptom is given in the TB column for the corresponding COVID-19 phases. For an individual to experience severe chest pain (SCP), the individual must also be extremely sick; that is, the probabilities given for having SCP as a symptom are the probabilities of having SCP given an individual has trouble breathing and is extremely sick.

The probability of having light/moderate/heavy trouble breathing is given by the probability of an individual having light/moderate/severe symptoms of COVID-19 and the probability of having trouble breathing as a symptom.

The probability of having loss of taste as a symptom of COVID-19 is 25% during the onset phase, 35% during the plateau phase, and 0% for all other phases of COVID-19.

Aches are not caused by COVID-19 in this simulator, but are caused by the flu. The probabilities of having aches on the first and last days of the flu are 30% and 80% respectively, while the probability of having aches for all other days with the flu is 50%.
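The symptom rules stated above can be written as simple lookups. This sketch only encodes the numbers given in the text; the function names and the zero-based day index are assumptions. Reading "given by ... and ..." as a product of the two probabilities is an interpretation.

```python
# Probabilities stated in the text; all other COVID-19 phases have probability 0.
LOSS_OF_TASTE_PROB = {"onset": 0.25, "plateau": 0.35}


def p_loss_of_taste(phase):
    """Probability of loss of taste in a given COVID-19 phase."""
    return LOSS_OF_TASTE_PROB.get(phase, 0.0)


def p_aches_flu(day_index, n_flu_days):
    """Probability of aches on a given day of the flu (day_index starts at 0).

    If the flu lasts a single day, the first-day value is used (an assumption).
    """
    if day_index == 0:
        return 0.30
    if day_index == n_flu_days - 1:
        return 0.80
    return 0.50


def p_trouble_breathing_severity(p_severity, p_trouble_breathing):
    """Light/moderate/heavy trouble breathing, read here as a product."""
    return p_severity * p_trouble_breathing
```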

We wish to verify that the dynamics of the simulator at the population sizes we model are representative of larger populations. Because of the computational demands of an agent-based model, particularly with messaging between agents, it is more efficient to model smaller populations, as long as the dynamics remain representative. As shown in Figure K.19, we find population sizes above 2k to be representative of the dynamics of larger populations across a range of metrics. We thus ensure all experiments are run with populations of 2k and over.

Disability-Adjusted Life Years (DALYs) are a summary measure of the public health burden associated with a specific cause's premature mortality and morbidity. To calculate DALYs, we separately compute the years of life lost due to premature mortality (YLLs) [67] for agents that died during the simulation, and the years lived with disability (YLDs) [68] for agents that were infected and symptomatic. Disability weights (DW) are taken from the 2017 Global Burden of Disease Study [69]: they represent health preferences such that a DW of 0 is perfect health and a DW of 1 is equivalent to death. Hence, the higher the DALYs, the worse the health outcomes. DALYs are calculated without discounting or age-weighting, following the WHO methodology [70]. For each agent i, DALYs are the sum of that agent's YLLs and YLDs.

TPL Temporary Productivity Loss (TPL) is the loss in productivity due to absenteeism from work. To calculate TPL, we extract from the simulator the number of work hours that agents aged 25 to 65 years had to forego due to quarantine, taking care of a dependent (supervision), or illness to the point of being unable to work. We then multiply this quantity of foregone work hours by the 2019 average hourly wage in Montreal [71] to obtain total TPL. We follow the methodology for calculating TPL presented in [43].
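As a rough illustration of the two quantities defined above, the following sketch computes DALYs for one agent and TPL for a population. It assumes the standard undiscounted form in which YLD is the disability weight multiplied by the duration of illness in years; the function names and the example wage are hypothetical and not taken from the paper.

```python
def agent_daly(yll_years, disability_weight, days_symptomatic):
    """DALYs for one agent: YLL plus YLD, without discounting or age-weighting.

    yll_years is 0 for agents that survived; YLD is taken as
    disability weight x illness duration expressed in years.
    """
    yld = disability_weight * days_symptomatic / 365.0
    return yll_years + yld


def total_tpl(foregone_hours, ages, hourly_wage):
    """Temporary Productivity Loss over agents aged 25 to 65 (inclusive)."""
    hours = sum(h for h, age in zip(foregone_hours, ages) if 25 <= age <= 65)
    return hours * hourly_wage


# Example with made-up numbers: only the 30-year-old counts toward TPL.
example_tpl = total_tpl([8, 8, 8], [20, 30, 70], hourly_wage=25.0)
```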

Disability weights COVID-19 is a novel disease, and there are no published disability weights for different levels of severity of the disease. Therefore, we use similar conditions as proxies for the different health states of the agents. Agent hospitalization status is used as a proxy for actual health status, and can be divided into three categories: agents that are symptomatic and not hospitalized, agents that are hospitalized but not in critical care, and agents that are in critical care. The disability weights, as well as their equivalent causes in the GBD 2017 study, can be found in Table L.3.

Full trajectory Due to the computational strain of the message-passing within the simulations, observing the full trajectory under binary contact tracing and feature-based contact tracing is currently infeasible. Future work will consider longer simulations that reach a post-pandemic steady state.

PPL In addition to Temporary Productivity Loss (TPL) due to absenteeism from work, cost-benefit analyses following the Human Capital Approach (HCA) [43] typically include a Permanent Productivity Loss (PPL) component due to premature mortality. When longer simulations become possible, future work will take PPL into account.

To properly evaluate the impact of different tracing strategies on the socio-economic burden of a disease, it is important to evaluate whether the proposed strategies avert DALYs, or simply delay them [72]. Such evaluations require a longer trajectory that arrives at a post-pandemic steady state, which we will consider in future work.

The focus of this paper is not the cost-benefit analysis of the outcomes of the simulator, but rather the simulator itself; a sensitivity analysis of the cost-benefit results has therefore been relegated to future work.

Table L.5: YLL, YLD and DALYs ×1000 for different CT methods. In all three cases, the bulk of DALYs is due to years of life lost due to premature mortality (YLL), rather than years lived with disability (YLD). Women are also disproportionately affected in all three scenarios.

As can be seen in Figure L.20, under No Tracing the total DALYs are 129.42 and the TPL is $1.370M. Test-based BCT changes the health and economic outcomes only slightly: the total DALYs are 119.42 and the TPL is $1.591M. Heuristic-FCT, in contrast, has a comparatively large effect on total DALYs, which drop to 71.31. This is accompanied by a rise in the TPL to $2.122M: health outcomes are drastically improved at the cost of a greater drop in productivity. Of note is the difference in impact between the two tracing methods: whereas Test-based BCT has very little effect on the health and economic outcomes, Heuristic-FCT reduces total DALYs by 44.90% at the expense of a 54.89% increase in TPL, both relative to No Tracing.
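The percentages quoted above follow from the reported totals when No Tracing is taken as the baseline. The short check below recomputes them; the helper name `relative_change` is illustrative.

```python
def relative_change(baseline, value):
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


# Totals reported in the text (DALYs, and TPL in millions of dollars).
DALYS = {"No Tracing": 129.42, "Test-based BCT": 119.42, "Heuristic-FCT": 71.31}
TPL_M = {"No Tracing": 1.370, "Test-based BCT": 1.591, "Heuristic-FCT": 2.122}

for method in ("Test-based BCT", "Heuristic-FCT"):
    d = relative_change(DALYS["No Tracing"], DALYS[method])
    t = relative_change(TPL_M["No Tracing"], TPL_M[method])
    print(f"{method}: DALYs {d:+.2f}%, TPL {t:+.2f}%")
```

For Heuristic-FCT this gives a 44.90% reduction in DALYs and a 54.89% increase in TPL, matching the figures in the text.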

Bootstrapping DALYs and TPL are computed for each seed separately before being aggregated to obtain a mean and standard error. When measures are calculated across subgroups of the simulation's population, such as across age groups, bootstrapping is used.

Figure L.20: Impact of different CT methods on DALYs and total foregone work hours. No Tracing foregoes the least amount of work, but results in a high amount of DALYs. Test-based BCT foregoes more work, but still results in a large loss of health. Heuristic-FCT foregoes the most work, but results in a large decrease in DALYs, alleviating the health burden. Total foregone hours due to quarantine, aggregated across individuals, are scaled by a factor of 0.49 [42] to account for the proportion of agents able to work from home. Standard errors are computed by bootstrapping 100 samples of 6 runs over 10 seeds.
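One way to read the bootstrap described in the Figure L.20 caption is sketched below: draw 100 resamples of 6 runs from the 10 per-seed values and take the standard deviation of the resample means as the standard error. Sampling with replacement and the function name are assumptions; the paper does not spell out these details here.

```python
import random
import statistics


def bootstrap_standard_error(per_seed_values, n_samples=100, sample_size=6, seed=0):
    """Bootstrap estimate of the standard error of the mean.

    per_seed_values -- one metric value (e.g. total DALYs) per simulation seed
    Each resample draws sample_size values with replacement (an assumption).
    """
    rng = random.Random(seed)
    means = [
        statistics.mean(rng.choices(per_seed_values, k=sample_size))
        for _ in range(n_samples)
    ]
    return statistics.stdev(means)
```

Fixing `seed` makes the estimate reproducible; identical per-seed values give a standard error of zero.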

Lockdown timing and efficacy in controlling COVID-19 using mobile phone tracking

Lockdown-type measures look effective against COVID-19

COVID-19 pandemic and lockdown measures impact on mental health among the general population in Italy

Economic and social consequences of human mobility restrictions under COVID-19

Mitigate the effects of home confinement on children during the COVID-19 outbreak

The economic cost of COVID lockdowns: An out-of-equilibrium analysis. Economics of Disasters and Climate Change

A national plan to enable comprehensive COVID-19 case finding and contact tracing in the US

Temporal dynamics in viral shedding and transmissibility of COVID-19

Modes of contact and risk of transmission in COVID-19 among close contacts. medRxiv

Rapid review of contact tracing methods for COVID-19

Variation in false-negative rate of reverse transcriptase polymerase chain reaction-based SARS-CoV-2 tests by time since exposure

Fast coronavirus tests: what they can and can't do

COVID-19 pandemic planning scenarios

Modeling the combined effect of digital exposure notification and non-pharmaceutical interventions on the COVID-19 epidemic

An agent-based approach for modeling dynamics of contagious disease spread

Agent-based modeling and simulation

A stochastic agent-based model of the SARS-CoV-2 epidemic in France

Modelling the impact of testing, contact tracing and household quarantine on second waves of COVID-19

An Introduction to Agent-Based Modeling: Modeling Natural, Social, and Engineered Complex Systems with NetLogo

Python 3 Reference Manual. CreateSpace

The C programming language

Frequency of food store visits in Canada as of 2018, by province

Canadians' connections with family and friends

Projecting social contact matrices in 152 countries using contact surveys and demographic data

Potency and timing of antiviral therapy as determinants of duration of SARS-CoV-2 shedding and intensity of inflammatory response. medRxiv

The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application

False-negative results of real-time reverse-transcriptase polymerase chain reaction for severe acute respiratory syndrome coronavirus 2: role of deep-learning-based CT diagnosis and insights from two cases

Estimating effective reproduction number using generation time versus serial interval, with application to COVID-19 in the Greater Toronto Area, Canada. medRxiv

Social contacts and mixing patterns relevant to the spread of infectious diseases

Census profile, 2016 census - Montreal, Quebec and Canada

Epidémiologie et modélisation de l

Global, regional, and national disability-adjusted life-years (DALYs) for 359 diseases and injuries and healthy life expectancy (HALE) for 195 countries and territories, 1990-2017: a systematic analysis for the Global Burden of Disease Study 2017

Remote work and employment dynamics under COVID-19: Evidence from Canada

Impact of the burden of COVID-19 in Italy: Results of disability-adjusted life years (DALYs) and productivity loss

Risk scoring calculation for the current NHSX contact tracing app

Report from the Canadian Chronic Disease Surveillance System: Heart Disease in Canada

Factors associated with hospital admission and critical illness among 5279 people with coronavirus disease 2019 in New York City: prospective cohort study

Factors associated with COVID-19-related death using OpenSAFELY

Characteristics associated with hospitalization among patients with COVID-19 - metropolitan Atlanta, Georgia

Determinants of COVID-19 disease severity in patients with cancer

Public Health Agency of Canada. Diabetes in Canada

Predicting Mortality Due to SARS-CoV-2: A Mechanistic Score Relating Obesity and Diabetes to COVID-19 Outcomes in Mexico

Overweight and obesity based on measured body mass index, by age group and sex

Prevalence estimates of chronic kidney disease in Canada: results of a nationally representative survey

World Health Organization. Smoking and COVID-19

Age-dependent effects in the transmission and control of COVID-19 epidemics

Temporal profiles of viral load in posterior oropharyngeal saliva samples and serum antibody responses during infection by SARS-CoV-2: an observational cohort study. The Lancet Infectious Diseases

Quebec public health

Projecting social contact matrices to different demographic structures

The need to include assisted living in responding to the COVID-19 pandemic

Contacts in context: large-scale setting-specific social mixing matrices from the BBC Pandemic project. medRxiv

Wikipedia contributors. Negative binomial distribution

Using time-use data to parameterize models for the spread of close-contact infectious diseases

Sustainability of coronavirus on different surfaces

Global, regional, and national age-sex-specific mortality for 282 causes of death in 195 countries and territories, 1980-2017: a systematic analysis for the Global Burden of Disease Study

Global, regional, and national incidence, prevalence, and years lived with disability for 354 diseases and injuries for 195 countries and territories, 1990-2017: a systematic analysis for the Global Burden of Disease Study

Global Burden of Disease Collaborative Network. Global Burden of Disease Study 2017 (GBD 2017) disability weights

World Health Organization. WHO methods and data sources for global burden of disease estimates

Weekly and hourly earnings of employees by sex, Montréal and all of Québec

COVID-19 and the health policy recession: whatever it takes, grandma or the economy or what makes sense?

Information on the Montreal population regarding age, sex and occupation distribution was retrieved from Canadian Census data [38]. Prevalences of selected medical conditions considered as COVID-19 infection and prognosis risk factors were determined based on prevalence estimates from nationally representative surveys or medical surveillance programs in Canada: heart disease [46, 47, 48, 49], stroke [48], asthma [47, 48], chronic obstructive pulmonary disease (COPD) [47, 48, 49], cancer [47, 48, 50], diabetes [47, 48, 49, 51], obesity [47, 48, 49, 52, 53], chronic kidney disease (CKD) [47, 48, 49, 54], immuno-suppressed conditions [48] and smoking [55, 56]. National prevalence estimates were extracted based on age group (<10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80 and >80 years of age) and sex. We determined the conditional probability of developing COVID-19 in the ABM based on symptoms and risk factors associated with COVID-19 in the published literature. A mathematical modelling study of the epidemic with Canadian-specific estimates [57] was used to model COVID-19 susceptibility in the pediatric population of the simulator.

We model inoculum, the amount of virus transmitted during an exposure event, as a random variable uniformly distributed between 0 and 1. The magnitude of the inoculum is used to determine the type and severity of symptoms. We model what we call effective viral load (EVL) as a piecewise linear function, the parameters of which are sampled separately for each individual. This approximation follows empirical studies on viral load progression [58, 34]. Figure B.11 shows the mean of the sampled effective viral load curves.
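To make the idea of a per-individual piecewise linear EVL concrete, the sketch below samples an inoculum and a set of knots, then evaluates the curve by linear interpolation. The rise, plateau, and decay shape, the sampling ranges, and all names are hypothetical placeholders; the simulator's actual parameterization is not given in this section.

```python
import random


def sample_inoculum(rng):
    """Inoculum drawn uniformly between 0 and 1, as stated in the text."""
    return rng.uniform(0.0, 1.0)


def sample_evl_knots(rng):
    """Sample (day, EVL) knots for one individual (illustrative ranges only)."""
    t_start = rng.uniform(1.0, 3.0)               # viral load starts rising
    t_peak = t_start + rng.uniform(2.0, 4.0)      # plateau begins
    t_plateau_end = t_peak + rng.uniform(3.0, 5.0)
    t_end = t_plateau_end + rng.uniform(5.0, 10.0)
    peak = rng.uniform(0.5, 1.0)
    return [(0.0, 0.0), (t_start, 0.0), (t_peak, peak),
            (t_plateau_end, peak), (t_end, 0.0)]


def evl(t, knots):
    """Evaluate the piecewise linear EVL curve at day t."""
    if t <= knots[0][0] or t >= knots[-1][0]:
        return 0.0
    for (t0, v0), (t1, v1) in zip(knots, knots[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return 0.0
```

Averaging `evl` over many sampled individuals at each day would give a mean curve analogous to the one the text attributes to Figure B.11.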

We model RT-PCR test allocation as a priority queue. When an agent experiences symptoms, or is advised by a contact tracing application to seek