key: cord-0159543-qi42ctm3
authors: Liu, Shangching; Liu, Koyun; Chiang, Hwaihai; Zhang, Jianwei; Systems, Tsungyao Chang Synergies Intelligent; Inc.,; Hamburg, University of
title: Continuous Learning and Inference of Individual Probability of SARS-CoV-2 Infection Based on Interaction Data
date: 2020-06-08
journal: nan
DOI: nan
sha: 46f8993989838d816958317de923eb779405abb9
doc_id: 159543
cord_uid: qi42ctm3

This study presents a new approach to determine the likelihood of asymptomatic carriers of the SARS-CoV-2 virus by using interaction-based continuous learning and inference of individual probability (CLIIP) for contagious ranking. This approach is developed based on an individual directed graph (IDG), using multi-layer bidirectional path tracking and inference searching. The IDG is determined by the appearance timeline and spatial data that can adapt over time. Additionally, the approach takes into consideration the incubation period and several features that can represent real-world circumstances, such as the number of asymptomatic carriers present. After each update of confirmed cases, the model collects the interaction features and infers the individual person's probability of getting infected using the status of the surrounding people. The CLIIP approach is validated using the individualized bidirectional SEIR model to simulate the contagion process. Compared to traditional contact tracing methods, our approach significantly reduces the screening and quarantine required to search for the potential asymptomatic virus carriers by as much as 94%.

The pandemic of SARS-CoV-2, the virus that causes COVID-19, has had a significant global impact, especially on human life and economic activities. As resources are limited, current policies have difficulty identifying and quarantining asymptomatic virus carriers. As a result, it is much harder to control the spread of the virus. To prevent further spread of COVID-19, immediate action is needed. Contact tracing is a method that helps patients recall with whom or where they have been. Identifying contacts and ensuring they do not have a chance to interact with others is critical to slowing down the pandemic NCIRD (2019). This paper is the first in which an approach with continuous learning capabilities is used to analyze the probability of asymptomatic carriers of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).

* Corresponding author: Jianwei Zhang, zhang@informatik.uni-hamburg.de
† Corresponding author: Tsungyao Chang, m@sis.ai
arXiv:2006.04646v2 [cs.SI] 14 Jun 2020

To this end, we compute a ranking model with city GPS spatial dynamics data Tang et al. (2018). The approach is a framework for finding and ranking the source of infection among a moving crowd and can be easily applied to the dynamic modelling of the spread of the SARS-CoV-2 virus. It efficiently calculates rich interaction features from continuous data to approximate the individual probability of being infected, since Monte Carlo tree search (MCTS) on the IDG reduces the time needed to search for the important center-surround features. The infection probability of each person exposed in a crowd over time can be quickly obtained by the CLIIP. Moreover, even a superspreader (active in motion, high viral titer, asymptomatic) can be found when we use backward and forward tracking at the same time. Backward tracking [1] searches backward for the day on which a person possibly got infected, and forward tracking [1] runs inference detection through each of the possible days.

Figure 1: ... learning while the newly infected people are being recorded. The incubation period requires us to group the patients by their actual contagious time according to the distribution of (a) from time t to range (t−n, t−1), where n is the maximum incubation time. The arrangement describes the possible time of the latent infection, which lies in the infector's incubation time. The inference model of every t is represented by an individual directed graph (IDG), and we use a day as t in our simulation. The red circles denote confirmed infected people, the hollow squares denote exposed people, and a filled square is an individual who lies on the path from one infector to the others and might be an asymptomatic virus carrier. A hollow green diamond denotes a healthy person. The arrows denote the possible paths of transmission derived from people's locations and staying times. The virus stays in the same place for a while van Doremalen et al. (2020), which puts a person who leaves a specific place later at high risk of getting infected. Also, we define the layer as the number of edges between two nodes. For instance, B is a center-surround node in the fourth layer of A.

Contact tracing is currently the most common way for public health institutions to track infected people and the sources of the virus Eames and Keeling (2003); Scutchfield and Keck (2003). This method can locate infected individuals and minimize the spread of the virus by isolating them and their at-risk contacts from the public. In past decades, it has been used not only for controlling diseases but also as a critical tool for investigating new diseases or unusual outbreaks; for example, SARS and H1N1, two previous pandemics, were suppressed with the help of contact tracing. Governments and health institutes have adopted or proposed contact tracing Kiss et al. (2005); Lalvani et al. (2001); Zastrow (2020) to follow the daily routes of residents and decrease the likelihood of contact between infected and healthy people.

Recently, in order to determine the contact paths of infected people more quickly, the method has advanced from manual recording to tracking people's mobile phones via Bluetooth or GPS techniques Apple.com (2020); Chan et al. (2020); Cho et al. (2020); Ian Sherr (2020); Wuhan (2020). Moreover, Hellewell et al. (2020) used a model to quantify the potential effectiveness of contact tracing and isolation of confirmed cases in controlling outbreaks of a severe acute respiratory syndrome coronavirus like SARS-CoV-2. Peng et al. (2020) developed a method with a trinary split into red, yellow, and green states to track infectors. Recent contact tracing methods, such as Zhou et al. (2020), use mobile data with regional infection numbers to predict an individual's probability of getting infected.

However, contact tracing cannot estimate the probability that a person is an asymptomatic carrier and is not always the most efficient method of addressing infectious diseases. Under the current limitation of medical resources, governments can only isolate the people in direct contact with confirmed cases as the primary way to control the spread of the SARS-CoV-2 virus.

As the current speed and capacity of virus testing still cannot meet the demand, the COVID-19 outbreak is difficult to bring under control. So far, the most feasible way for countries and cities to lessen the spread of infection is to enforce a lockdown or stay-at-home order to stop unnecessary social interactions of residents. However, the longer a lockdown or quarantine is in place, the greater its impact on a country's economy, people's mental health, and many other aspects of their lives. The non-ranking, exhaustive inspection method of contact tracing with only the confirmed cases is not efficient enough to suppress the outbreak of COVID-19 and its recurrence, especially after the re-opening of a city or country.

The detection of asymptomatic infected people, along with appropriate social distancing, effective medical treatments, and the development of vaccination, will greatly determine the extent a current or new disease outbreak can be controlled.

As a result, we propose a machine-learning algorithm to predict the spread of the SARS-CoV-2 virus and reduce the time needed to locate infected people. After the individual states are updated through an IDG, we use a gradient boosting ensemble tree model, LightGBM Ke et al. (2017), to calculate the probability, and continuous learning keeps improving the model. It can obtain a better result without parameter adjustment. The CLIIP is an innovative approach that combines temporal difference learning, which learns by bootstrapping, with value function approximation to predict the probability of getting infected under real circumstances. To continuously measure real-world physical activity with machine intelligence, both the approximation of the value and the professional inference are essential, and our approach bridges the gap between theory and reality.

We develop a framework with the inference model, which is a more efficient and precise method to ...

Figure 2: The CLIIP temporal learning framework has two inputs. The first is continuous spatial data for building an IDG, and the other is a set of labels that provide people's infected states. Combining the interaction data with the states and the relations, we can train this model to learn continuously as new data comes into the framework. The IDG is updated by comparing the places where two people stop and their overlap time, which defines the relation between the two people. An arrow points to the person who stayed longer at a waypoint than the other.

• Definition of input 1: According to the path of virus transmission Salje et al. (2016), people's continuous spatial data are a set of essential interaction features, $X_i$, $i = 1, 2, 3, \ldots$, which we take from mobile location data and call interaction data. In this paper, we use location and timestamp as the key interaction features.

• Definition of input 2: At each time unit, everyone has a label that indicates their state: $\{p_i\,\mathrm{state}(t)\} = \{\mathrm{state}_1, \mathrm{state}_2, \ldots, \mathrm{state}_q\}$, where $q$ is the number of kinds of infected states at time $t$.

We use seven kinds of states: susceptible S, susceptible_and_quarantined Sq, exposed E, exposed_and_quarantined Eq, infected I, hospitalized H, and recovered R. There are dependencies between these states of a SEIR model Younsi et al. (2015).

The system aims to output a ranking by order of infection priority, as described in Fig. 2. We start from people's interaction features over time as an input to the framework. The interaction data is filtered from standard spatial data, with accuracy improved through the map-matching work of Newson and Krumm (2009), or by combining it with other data such as credit card transaction data or check-in data as in Limited (2020). By reconnecting the paths of all people, it becomes a social interaction network in the form of an IDG that we use for further research. To build the IDG from the interaction data, we extract the key interaction features describing the dynamic behavior of each person $p_i$ (Fig. 2 step (1)) from continuous spatial data, from which we can derive the frequency and distance of people's contacts. Another input comes from the SEIR model, which describes people's states, such as "infected" or "recovered", updated at each time t. To demonstrate the effectiveness of the model, we use the dynamic spatial GPS data of a crowd in a city, City GPS spatial data Tang et al. (2018) (Table 1), and convert it to approximate the interaction data for 30 consecutive days as input 1. We calculate the spread of the virus in the city using the agent-based simulation of the improved SEIR model for SARS-CoV-2 as input 2 to prepare the infected environment.

The SEIR model has four states: S holds susceptible people; E contains exposed people incubating the disease (and possibly some who are infectious but have not been confirmed as infected); I holds confirmed infected people; and R holds recovered people. The additional states susceptible quarantined Sq, exposed quarantined Eq, and hospitalized H are also taken into consideration, as shown in Fig. 3.

With key interaction features from input 1, we generate an IDG at an updated time t (Fig. 2 step (2)), which is a directed acyclic graph used as a model of people's connections. We treat each node in the IDG as a person, and each directed edge as a spreading relation between two people who stayed at the same location for a certain time. The direction of an edge runs from infection source to destination: the arrow points to the person who left a place later, since he/she is more likely to have been infected by the one who left earlier. With input 2, we label people's states in the IDG and, at the same time, update the previous IDGs in the incubation period [t − n + 1, t − 1] Makar et al. (2018) (Fig. 2 step (3)). After obtaining the updated IDGs for the period [t − n + 1, t − 1] and for t, we compute the probability and ranking of each person, including S and E. Using the IDG (Fig. 2 step (4)) and the SEIR states, we generate each individual's status and calculate the features to feed the model. The learning process enhances the capability to search for asymptomatic carriers. Finally, we update the probability and ranking of each person in the period. We then introduce an algorithm that uses a very simple yet highly efficient searching strategy for training a LightGBM model with data derived from running the SEIR model and updating the relation graph.
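The edge rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Stay` record, the `build_idg` function, and the `min_overlap` threshold are hypothetical names and values chosen for illustration, and the sketch does not enforce acyclicity or resolve ties when two people leave at the same moment.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Stay:
    """One visit of a person to a place (hypothetical record layout)."""
    person: str
    place: str
    arrive: float  # hours since the start of the day
    leave: float


def build_idg(stays, min_overlap=0.25):
    """Return {source: {target: overlap_hours}}.

    An edge points from the person who left the shared place earlier
    to the person who left later. min_overlap is an assumed threshold.
    """
    by_place = defaultdict(list)
    for stay in stays:
        by_place[stay.place].append(stay)

    edges = defaultdict(dict)
    for visits in by_place.values():
        for i, a in enumerate(visits):
            for b in visits[i + 1:]:
                if a.person == b.person:
                    continue
                overlap = min(a.leave, b.leave) - max(a.arrive, b.arrive)
                if overlap < min_overlap:
                    continue
                src, dst = (a, b) if a.leave <= b.leave else (b, a)
                previous = edges[src.person].get(dst.person, 0.0)
                edges[src.person][dst.person] = max(previous, overlap)
    return {person: dict(targets) for person, targets in edges.items()}


stays = [
    Stay("A", "cafe", 9.0, 10.0),
    Stay("B", "cafe", 9.5, 11.0),
    Stay("C", "park", 9.0, 10.0),
]
idg = build_idg(stays)
```

Here A and B overlap for half an hour in the same place and B leaves later, so the only edge is A to B; C never shares a place with anyone and stays isolated.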

In the IDG, we label infected people as red nodes, susceptible people S who may be healthy as green nodes, and exposed people who may be infected or virus carriers but are not confirmed as yellow nodes. When newly infected people are confirmed from input 2 at time t, we use the incubation period distribution Bays et al. (2020) to assign the actual time at which they became infected through a discrete probability for each day. This gives us a way to update states in the IDG over [t − n + 1, t], with n being the duration of the incubation period. The SEIR model is updated every 2 hours between t and t + 1, following the steps in 4.1 below.
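As a rough illustration of assigning a discrete probability to each candidate infection day, the following Python sketch spreads a confirmed case at time t over the preceding days. The function names and the example weights are hypothetical; the sketch assumes the window [t − n, t − 1] and a user-supplied incubation distribution, not the specific distribution of Bays et al. (2020).

```python
import random


def infection_day_probabilities(t, incubation_pmf):
    """Map each candidate infection day in [t - n, t - 1] to a probability.

    incubation_pmf[k] is the (assumed) weight of an incubation lasting
    k + 1 days, so n = len(incubation_pmf) is the maximum incubation time.
    """
    total = sum(incubation_pmf)
    if total <= 0:
        raise ValueError("incubation_pmf must have positive mass")
    return {t - (k + 1): w / total for k, w in enumerate(incubation_pmf)}


def sample_infection_day(t, incubation_pmf, rng):
    """Draw one infection day according to the discrete distribution."""
    probs = infection_day_probabilities(t, incubation_pmf)
    days = list(probs)
    return rng.choices(days, weights=[probs[d] for d in days], k=1)[0]


# A case confirmed on day 10 with illustrative weights for 1, 2, and 3 days.
probs = infection_day_probabilities(10, [1, 2, 1])
day = sample_infection_day(10, [1, 2, 1], random.Random(0))
```

With these weights, day 8 receives half of the probability mass and days 7 and 9 a quarter each.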

Therefore we end up with 530 infected environments in 30 days in the city.

The CLIIP learning algorithm (Fig. 2) while 

If we know the people newly infected from input 2, we backtrack the route of transmission, using the incubation period time distribution to begin searching in the range of [t − n, t − 1] days ago. Then we do the forward tracking as shown in Fig. 4. If we find the source of the virus in the first layer, the search stops, and we rebuild all relations of the IDG. We then predict the probability for people in order, with E first and then S. Otherwise, if we find the source in another layer of the search, we put the people on the path into the first group, add the remaining E to the second group, and collect S into the third group. The ordinal numbers of the groups give the ranking order for calculating the probability with the LightGBM model. The input interaction features of the model are $[X_{\Delta Time}, X_{\Delta Distance}, X_{Infectedpeople\_around}, X_{Exposed\_around}]$, the annotation between (3) and (4) in Fig. 2. $X_{\Delta Time}$ and $X_{\Delta Distance}$ are the duration and the closest distance between two IDs in the data. The other interaction features, $X_{Infectedpeople\_around}$ and $X_{Exposed\_around}$, are the numbers of infected people and exposed people around them. The label Y is the state generated from the SEIR model simulation. The use of these interaction features is motivated by the inference logic that a virus must come from the people around an infected person.
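The grouping logic of the backward search can be sketched in Python as follows. This is an illustrative breadth-first search, not the paper's MCTS-based procedure: `rank_groups`, the `pred` mapping (each person to the people with an edge pointing at them), and the `max_layers` cutoff are assumptions made for the example.

```python
from collections import deque


def rank_groups(pred, states, new_case, max_layers=4):
    """Return [path_group, remaining_E, S] for one newly confirmed case.

    pred maps a person to the people whose IDG edges point to that person.
    The search walks backward layer by layer until it meets a confirmed
    infector (state "I"). If the infector sits in the first layer, the
    path group is empty and the ranking is simply E first, then S.
    """
    parent = {new_case: None}
    depth = {new_case: 0}
    queue = deque([new_case])
    source = None
    while queue:
        node = queue.popleft()
        if node != new_case and states.get(node) == "I":
            source = node
            break
        if depth[node] == max_layers:
            continue
        for p in pred.get(node, []):
            if p not in parent:
                parent[p] = node
                depth[p] = depth[node] + 1
                queue.append(p)

    on_path = []
    if source is not None and depth[source] > 1:
        node = parent[source]
        while node is not None and node != new_case:
            on_path.append(node)
            node = parent[node]

    taken = set(on_path)
    exposed = sorted(p for p, s in states.items() if s == "E" and p not in taken)
    susceptible = sorted(p for p, s in states.items() if s == "S" and p not in taken)
    return [on_path, exposed, susceptible]


# Chain A -> B -> C -> D, where D is newly confirmed and A is a known infector.
pred = {"D": ["C"], "C": ["B"], "B": ["A"]}
states = {"A": "I", "B": "E", "C": "S", "D": "I", "F": "E", "G": "S"}
groups = rank_groups(pred, states, "D")
```

In this example the infector A is found in the third layer, so B and C, who lie on the path, form the first group, followed by the remaining exposed person F and the susceptible person G.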

Fig. 4 shows a computed example of the result. Finally, for each person at a specific time, we have not only the infected states indicating who got infected or exposed but also the probability of being an asymptomatic carrier.

The epidemic data used in this paper comes from Shi et al. (2020). Given the limited data, we assume that there are 100 infected people in the group. The other states then follow the same ratios as in Shi et al. (2020), except for S, which is the number of IDs recorded in the data set that are not assigned to other states.

Step 1: Load the new model in the next time step.

Step 2: If (member of $S_t$ > member of $S_{t-1}$), get $\Delta S$ from $Sq_{t-1}$ to $S_t$.

Step 3: If (member of $Sq_t$ > member of $Sq_{t-1}$), get $\Delta Sq$ from $S_{t-1}$ to $Sq_t$.

Step 4: If (member of $E_t$ > member of $E_{t-1}$), get $\Delta E$ from possible_list_of_E generated by $I_{t-1}$ with the relation built by a certain $\Delta Time$ and $\Delta distance$.

Step 5: If (member of $Eq_t$ > member of $Eq_{t-1}$), get $\Delta Eq$ from {possible_list_of_E − E}.

Step 6: If (member of $I_t$ > member of $I_{t-1}$), get $\Delta I$ from $E_{t-1}$; the probability of being chosen depends on the individual incubation day of $E_{t-1}$.

Step 7: If (member of $H_t$ > member of $H_{t-1}$), get $\Delta H$ from $\Delta Eq_{t-1}$ and $I_{t-1}$ at random.

Step 8: If (member of $R_t$ > member of $R_{t-1}$), get $\Delta R$ from $I_{t-1}$ and $H_{t-1}$ at random. (This could be improved by making it depend on the day of cure.)
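Steps 4 and 5 can be sketched in Python as follows. The edge tuple layout, the function names, and the thresholds `min_time` and `max_distance` are hypothetical, since the source only speaks of a "certain" ∆Time and ∆distance; the sketch should be read as one possible reading of the rule.

```python
import random


def candidate_exposed(edges, infected_prev, min_time=0.25, max_distance=2.0):
    """Build possible_list_of_E from the contacts of I at t - 1.

    edges is an iterable of (source, target, delta_time_h, delta_distance_m).
    A target qualifies if its source was infected at t - 1 and the contact
    lasted long enough and came close enough (assumed thresholds).
    """
    infected_prev = set(infected_prev)
    return sorted({
        dst
        for src, dst, dt, dd in edges
        if src in infected_prev
        and dst not in infected_prev
        and dt >= min_time
        and dd <= max_distance
    })


def draw_new_exposed(candidates, n_new_e, n_new_eq, rng):
    """Step 4: draw Delta E from the candidates; Step 5: Delta Eq from the rest."""
    pool = list(candidates)
    rng.shuffle(pool)
    new_e = pool[:n_new_e]
    new_eq = pool[n_new_e:n_new_e + n_new_eq]
    return sorted(new_e), sorted(new_eq)


edges = [
    ("I1", "a", 1.0, 1.0),   # qualifies
    ("I1", "b", 0.1, 1.0),   # contact too short
    ("I2", "c", 0.5, 5.0),   # too far apart
    ("x", "d", 1.0, 1.0),    # source not infected
    ("I2", "e", 0.5, 1.5),   # qualifies
]
candidates = candidate_exposed(edges, infected_prev={"I1", "I2"})
new_e, new_eq = draw_new_exposed(candidates, 1, 1, random.Random(0))
```

Only `a` and `e` survive the filter; one of them becomes newly exposed and the other newly exposed-and-quarantined, depending on the random draw.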

The IDG from Section 3 is the foundation for building the possible_list_of_E inside the state update.

To simulate the situation more realistically, we made some arrangements regarding the initial individuals.

First, we randomly sorted people into the Eq and Sq groups. For state I, we split the 100 initial cases into two groups; one group was chosen randomly, and the other was selected from the first and second layers of the first group. This process yields the primary connections among the first group of infected people. The state E people are then picked from the group connected to state I, and the rest of the people become S. Then we applied the update rule from the previous section. From here, we obtain people's states as the X input, for which we currently consider only the interaction time, the interaction distance, the numbers of infected people in the first, second, and third layers, and the number of exposed people. Moreover, we attribute the label Y of a specific state to the IDG. This will be updated with future data.

Table fragment: Average AUC of our model; result [745, 205, 344, 356]

With this infected environment model, we create a perfect fit in the individual SEIR model. First, we used the incubation period time distribution to begin searching in the range of the previous 5-7 days. Then we continued the search until an infected person was found and started ranking the people on the path. In the real world, things become more complex, as the virus can spread by means other than contact between people; there will thus be more missing nodes on the disease spread map, and we will need to consider more layers in this condition. However, we can claim that the method can locate about 96% (Table 2) ... We estimate the average precision of the model and, from the interaction feature importance in Table 2, we claim that the model found the rules inside the SEIR model, such as the transmission rule.

As resources are limited in the real world, there should be some priority in ranking the crowd. As in Fig. 1, the first level of people exposed to the infector has higher priority than the second level, and so on. ... shows the primary contact tracing performance compared with our method. Generally, contact tracing needs to search until the end to make sure that no person is missed.

Moreover, we use this model to test a larger group of people with more healthy people in the test group.

The CLIIP model can cover most infected people when checking the same number of people because the ranking order speeds up the testing. We plot thousands of points to present the result. Furthermore, the baseline we compare against is the average performance of contact tracing. Thus the CLIIP model can find infected people more precisely and decrease the required social and medical resources. If testing follows the order of probability generated by our approach for a bigger group of people than the purple lines demonstrate, $1500/24781 \approx 6\%$ of people need to be tested in comparison with contact tracing, which means reducing screening resource usage by up to 94%.

We propose a novel interaction-based inference learning approach and the major advantage lies in calculating the individual probability of getting infected from interactions along a timeline. In addition, the learning algorithm allows us to employ multi-modal datasets and interactive features such as weather, subjective feelings of individuals, wearing masks Kai et al. (2020) , hand washing, and other health-related factors. This could further increase the accuracy in calculating and ranking the infection probability. Our approach can be further applied to more real world scenarios:

Ranking the probability of potential asymptomatic carriers of the crowd by our approach helps with precisely controlling the spread of the SARS-CoV-2 virus. This approach simulates very well under the condition of sufficient spatial mobile data during citywide outbreaks. Healthcare officials can develop a more precise control or quarantine strategy toward the affected regions, areas, or individuals than a citywide lockdown. Furthermore, an adaptive and flexible "exit" strategy can also facilitate the re-opening and maintain normal economic activities with a limited quarantine.

• Searching for superspreaders: The disease spreading map in our IDG makes the ranking of superspreaders possible. Following the states of the people in contact, superspreaders are most likely to be on the path between two infectors. Using our approach to analyze the individuals in the layers surrounding a spreader, the possibility of being a superspreader can be described by the equation below:

$$\mathrm{possibility} = \mathrm{layer}_1\_\mathrm{infectors} \cdot w_1 + \mathrm{layer}_2\_\mathrm{infectors} \cdot w_2 + \cdots + \mathrm{layer}_n\_\mathrm{infectors} \cdot w_n, \quad (1)$$

which guides the search for superspreaders and creates more learning samples for further finding actions.

This enhances the learning precision and accelerates the inference process significantly.
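Reading Equation (1) as a weighted sum over layers (an assumption about the intended form), the score can be computed with a few lines of Python. The weights below are hypothetical; the source does not specify how they are chosen.

```python
def superspreader_score(layer_infectors, weights):
    """Weighted sum of infector counts per surrounding layer.

    layer_infectors[k] is the number of infectors in layer k + 1 around
    the candidate; weights[k] is the (hypothetical) weight w_{k+1}.
    """
    if len(layer_infectors) != len(weights):
        raise ValueError("need exactly one weight per layer")
    return sum(count * w for count, w in zip(layer_infectors, weights))


# Three infectors in layer 1, two in layer 2, one in layer 3,
# with weights that decay by half per layer (illustrative choice).
score = superspreader_score([3, 2, 1], [1.0, 0.5, 0.25])
```

Decaying weights express the intuition that infectors in nearer layers say more about a candidate than infectors further away, but any other weighting scheme fits the same formula.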

The approach can simulate the situation after a policy is executed. The individual model of the virus spread could give suggestions on, for example:

-Disinfection, sterilization, and preservation.

Based on the distribution of the spread probability in outdoor areas, indoor simulation is also possible. By combining surveillance camera data with the CLIIP to give an infection index for each contact region, precise disinfection of, for instance, elevator buttons becomes possible once the accumulated probability of having been touched by high-risk people reaches a threshold.

-Optimal testing times.

With testing methods such as nucleic acid tests, PCR-based tests, antigen tests, and serology tests, we could add the test failure probability as a feature and recalculate the individual infection probability in our approach. Considering all individuals in society, our approach makes it possible to calculate R0, a mathematical term that indicates how contagious an infectious disease is. However, we would need to rebuild the CLIIP model and create labels such as R0 to train the new model. Decision makers can refer to R0 to assess the degree of infection in an area and thus decide on testing times and methods.
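One simple way to recalculate an individual infection probability after a test that may fail is a Bayesian update. The Python sketch below is an assumption about how such a recalculation could look, not a procedure described in the paper; the function name and the example error rates are hypothetical.

```python
def posterior_after_negative(prior, false_negative_rate, false_positive_rate=0.0):
    """Infection probability after one negative test, by Bayes' rule.

    prior is the infection probability before the test (for example the
    ranking model's output); the error rates describe the test itself.
    """
    p_negative_if_infected = false_negative_rate
    p_negative_if_healthy = 1.0 - false_positive_rate
    evidence = (prior * p_negative_if_infected
                + (1.0 - prior) * p_negative_if_healthy)
    if evidence == 0:
        raise ValueError("a negative result is impossible under these rates")
    return prior * p_negative_if_infected / evidence


# A person ranked at 50% who tests negative on a test that misses
# 20% of infections still has about a 16.7% chance of being infected.
posterior = posterior_after_negative(0.5, 0.2)
```

A residual probability of this size suggests why a single negative result may not be enough to remove a highly ranked person from the testing list.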

To counter the reinfection of SARS-CoV-2, the CLIIP can reuse the data from the first infection model to predict the probability of reinfection for the rest of the people. Although the source of exogenous people is unclear in the transmission route, especially regarding the latency of SARS-CoV-2, our approach is able to observe each person in society to calculate the individual probability of reinfection.

Beyond the spread of viruses, our approach can be applied to modelling, learning, and inference at the individual level of general latent influence networks, such as P2P e-commerce, searching for terrorists, predicting digital security risks, and so on. In social networks, for example, people send diverse comments to each other, influencing others via their mood and intent and thus generating an individualized relation graph. By measuring the mood and intent of the center-surround comments around each person and by continuous learning, people's decision-making policies on issues such as purchasing behaviours, terrorism, and the spread of digital viruses can be gradually modelled, and their future actions can be precisely predicted.

Applying probability-weighted incubation period distributions to traditional wind rose methodology to improve public health investigations of legionnaires' disease outbreaks

Pact: Privacy sensitive protocols and mechanisms for mobile contact tracing

Contact tracing mobile apps for covid-19: Privacy considerations and related trade-offs

A note on two problems in connexion with graphs

Contact tracing and disease control

Quantifying sars-cov-2 transmission suggests epidemic control with digital contact tracing

Feasibility of controlling covid-19 outbreaks by isolation of cases and contacts. The Lancet Global Health

Apple and google are building coronavirus tracking tech into ios and android - the two companies are working together, representing most of the phones used around the world

Universal masking is urgent in the covid-19 pandemic: Seir and agent based models, empirical validation

Lightgbm: A highly efficient gradient boosting decision tree

Disease contact tracing in random and clustered networks

Enhanced contact tracing and spatial tracking of mycobacterium tuberculosis infection by enumeration of antigen-specific t cells

Checkin-19 touchless guest register

Learning the probability of activation in the presence of latent spreaders

The Shortest Path Through a Maze. Bell Telephone System. Technical publications. monograph. Bell Telephone System

Coronavirus disease 2019

Hidden markov map matching through noise and sparseness

Lessons to europe from china for cancer treatment during the covid-19 pandemic

Estimating infectious disease transmission distances using the overall distribution of cases

Principles of public health practice

Seir transmission dynamics model of 2019 ncov coronavirus with considering the weak infectious ability and changes in latency duration

Visual analysis of traffic data based on topic modeling

Aerosol and surface stability of sars-cov-2 as compared with sars-cov-1

Inside china's smartphone 'health code' system ruling post-coronavirus life

Seir-sw, simulation model of influenza spread based on the small world network

South korea is reporting intimate details of covid-19 cases: has it helped? Nature

Detecting suspected epidemic cases using trajectory big data