key: cord-251676-m8f6de33 authors: Trivedi, Amee; Zakaria, Camellia; Balan, Rajesh; Shenoy, Prashant title: WiFiTrace: Network-based Contact Tracing for Infectious Diseases Using Passive WiFi Sensing date: 2020-05-25 journal: nan DOI: nan sha: doc_id: 251676 cord_uid: m8f6de33

Contact tracing is a well-established and effective approach for containing the spread of infectious diseases. While Bluetooth-based contact tracing methods using phones have become popular recently, these approaches need a critical mass of adoption in order to be effective. In this paper, we present WiFiTrace, a network-centric approach for contact tracing that relies on passive WiFi sensing with no client-side involvement. Our approach exploits WiFi network logs gathered by enterprise networks for performance and security monitoring and uses them to reconstruct device trajectories for contact tracing. Our approach is specifically designed to enhance the efficacy of traditional methods, rather than to supplant them with a new technology. We design an efficient graph algorithm to scale our approach to large networks with tens of thousands of users. We have implemented a full prototype of our system and deployed it on two large university campuses. We validate our approach and demonstrate its efficacy using case studies and detailed experiments using real-world WiFi datasets.

approaches use Bluetooth for proximity sensing, sometimes in combination with GPS and other locationing techniques present on the phone for location sensing [9]. In this paper, we present an alternative network-centric approach for phone-based contact tracing. In contrast to client-side approaches that depend on the use of Bluetooth and mobile apps, a network-centric approach does not require data collection to be performed on the device or apps to be downloaded by the user on the phone.
Instead, users use their phone or mobile device normally and the approach uses the network's view of the user to infer their location and proximity to others. Our approach is based on WiFi sensing [24, 43] and leverages, for contact tracing, data such as system logs ("syslogs") that are already generated by enterprise WiFi networks. Although our approach does not require the use of WiFi location [30], such techniques, where available, can further enhance the efficacy of our approach. Our network-centric approach to contact tracing offers a different set of trade-offs and privacy considerations than Bluetooth-based client-centric methods; one of the goals of our work is to carefully analyze these trade-offs.

The following scenario presents an illustrative use case of how our approach works. Consider a student who visits the university health clinic and is diagnosed with an infectious disease. The university health clinic officials decide to perform contact tracing and seek the consent of the student for network-based contact tracing. Since the student could have transmitted the disease to others over the past several days, it is important to determine which campus buildings and specific locations within each building the student visited during that period and which other users were in the proximity of the student during those periods. The health officials input the WiFi MAC address of the student's phone into the network-centric contact tracing tool. The tool analyzes WiFi logs generated by the network, specifically association and disassociation log messages for this device at various access points on campus, to reconstruct the locations (building, room numbers) visited by the user. It further analyzes all other users who were associated with those access points at those times to determine users who were in proximity of the user and for how long. These location and proximity reports are used by health officials to assist with contact tracing.
Additional reports for each impacted user can be recursively generated.

In designing, implementing, deploying, and evaluating our network-centric contact tracing tool, our paper makes the following contributions:
• We present a network-side contact tracing method that involves passive WiFi sensing and no client-side involvement. We discuss why such an approach may be preferable in some environments, such as academic or corporate campuses, over client-side methods.
• We present a graph-based model and graph algorithms for efficiently performing contact tracing on passive WiFi data comprising tens of thousands of users.
• We implement a full prototype of our system and deploy it on two large university campuses on two different continents.
• We validate and experimentally evaluate our approach using anonymized data from two large university networks. Our results show the efficacy of contact tracing for three simulated diseases and highlight the need to judiciously choose WiFi session parameters to reduce both false positives and false negatives. Through case studies, we show the efficacy of judicious iterative contact tracing while avoiding an exponential increase in co-located users who need to be traced, and we also evaluate our approach for normal campus mobility patterns and mobility patterns under quarantine. We show that our graph-based approach can scale to settings with tens of thousands of users and also present the limitations of using WiFi sensing for contact tracing.

In this section, we provide background on contact tracing and present motivation for our network-centric approach.
[Figure: timeline from 8:00am to 8:00pm showing a user's arrivals and departures at Location 1 and Location 2, with the corresponding WiFiTrace location report and proximity report.]

Contact Tracing: Contact tracing is a well-established method that is used by health professionals to track down the source of an infection and take proactive measures to contain its spread [15]. The traditional method is based on questionnaires: upon diagnosis, the user is asked to list places visited and other people with whom they have had contact, and this information is used to iteratively contact these individuals, and so on [15]. The goals of contact tracing are two-fold: identify the potential source of infection for the diagnosed individual and determine others who may have been infected due to proximity or contact. Since there is often a 2 to 14 day incubation period between the time of infection and onset of the illness, infected users often need to rely on their recollection of where they have been over multiple days or weeks, a process that can be error-prone due to gaps in memory. The manual process is challenging to scale up to larger numbers of users, especially for larger outbreaks of disease.

Phone-based Contact Tracing: Since smartphones are now ubiquitous, the use of phone-based sensing for contact tracing has emerged as a key technology to automate and scale the contact tracing process [2, 9]. The most common method involves the use of Bluetooth to transmit a unique (and often anonymized) identifier from each phone. A phone also listens for such identifiers from other phones in its proximity. Thus, the device can determine which other users/phones are in its proximity at each time instant. When a particular user contracts an infection, their device id is used by others to determine if they have been in the proximity of the infected user.
This basic approach has been implemented by Apple and Google in their contact tracing API [2]. Many standalone contact tracing apps have also implemented this approach, which also involves having each phone upload its collected data to a server for contact tracing analysis [1, 3, 4, 6, 9]. We note that such a client-centric approach requires a user to first download a mobile app before contact tracing data can be gathered; users who have not downloaded the app (or have not opted in) are not visible to other phones that are actively listening for other devices in their proximity. Thus the overall effectiveness of the approach depends on the level of user adoption. This is seen as a key hurdle in the experience of Singapore's TraceTogether app [6], which has seen only 1.1 million downloads despite needing a critical mass of 4 million active users (around two-thirds of the population) to be effective [5]. Health experts have argued that while technology-based contact tracing solutions are useful, they should not be seen as a replacement for traditional means of contact tracing, which is still an effective approach [18].

Our network-centric approach is designed to address these issues. First, it is designed to help health professionals improve traditional contact tracing methods, rather than supplant manual contact tracing using technology. Our network-centric tool is designed to integrate into health professionals' contact tracing workflows; unlike some Bluetooth apps, it is not designed for end-users to self-monitor their proximity to infected users. Second, a network-centric approach overcomes the critical mass adoption hurdles faced by Bluetooth approaches, since it is based on passive WiFi sensing that does not require any app to be downloaded by users or require active client participation.
With near-ubiquitous availability of WiFi in environments such as offices, university campuses, and shopping areas, WiFi sensing has emerged as a popular approach for addressing a range of analytic tasks [24, 43]. WiFi sensing can be client-based (i.e., done on the mobile device) or network-based (i.e., done from the network's perspective). Performing triangulation via RSSI or time-of-flight measurements to multiple WiFi access points to localize a device's position is an example of client-side WiFi sensing [30]. In contrast, network-centric WiFi sensing involves using the network's view of one or more devices to perform analytics. The approach has been used for monitoring the mobility of WiFi devices by analyzing the sequence of the access points that see the same device over a period of time [24]. While mobility characterization and modeling using WiFi sensing has seen more than a decade of research [25, 27], more recent work has leveraged WiFi sensing for a range of analytic tasks such as tracking health [29], stress [41], retail analytics [42], and more.

We build on this prior body of work and focus on the network-centric approach for contact tracing. The key premise of the approach is that the mobility of a user's phone is visible to the network through the sequence of WiFi access point associations performed by the device as the user moves, which allows the network to determine the locations visited by the user's device and other co-located devices that were present at those locations by virtue of being associated with those APs. Thus, the approach relies on passive WiFi sensing: the network passively observes devices as they move through it. There are some key advantages of such an approach over a client-centric approach. First, unlike a client-based approach that needs a critical mass of users to opt in or download an app before proximity can be effectively determined, the wireless network can "see" all devices that are connected (associated) to it at all times.
Hence, a network-centric method is easier to deploy and scale to large numbers of users without any initial deployment hurdles. Second, the client-centric approach involves data collection on each device for proximity sensing. By its very nature, a network-centric approach does not require any data to be collected on the device. In many cases, the approach may not even require any additional data to be collected by the network. This is because this method relies on syslogs of network events, SNMP reports, or RTLS events that are routinely logged by many enterprise networks for purposes of performance and security monitoring. Our network-centric approach "mines" this already logged data for performing contact tracing. Of course, our approach does require network logging of AP events to be turned on if this information is not already being logged. Third, a client-centric method uses Bluetooth for proximity sensing and must use a second sensing modality such as GPS for sensing the location where those devices were seen. In contrast, a network-centric approach can use a single modality, WiFi sensing, to determine both location (based on the AP locations) and proximity (based on AP associations). Note that methods like GPS do not work well inside buildings, while passive WiFi sensing can provide AP-level locations of users even without any additional WiFi locationing technology.

However, the approach is not without challenges. Bluetooth-based approaches claim to sense other devices that are within a few feet of the user, which is then used for proximity analysis. By contrast, passive WiFi sensing is limited to coarse-grain proximity measurements (e.g., users co-located within the range of an access point). Coarse-grain proximity sensing can increase false positives; hence the approach uses the duration of proximate co-location as an indicator of risk of infection, and this duration can be determined accurately (the same as with Bluetooth).
Moreover, since we designed our approach to enhance traditional contact tracing, rather than replace it, coarse-grain proximity information, along with co-location duration, is still useful to health professionals for identifying users who should be subjected to traditional contact tracing checks. WiFi-based contact tracing only works in areas with WiFi coverage, which are largely indoor spaces and a few key outdoor spaces. This method does not work outdoors where no WiFi coverage is available. In contrast, Bluetooth methods work "everywhere", both indoors and outdoors, since they involve listening to other devices and do not depend on a network. While this is a key limitation of a network-centric approach, it is nevertheless effective in university campuses or corporate environments where employees spend a significant portion of their day. Finally, all contact tracing methods, whether client-based or network-based, raise important privacy concerns. However, privacy considerations of network-based methods are different from those of Bluetooth-based client methods [11]. We discuss these in detail in Section 4 and show how user privacy can be safeguarded in such methods.

The deployment of network-centric contact tracing technology raises privacy issues, which we discuss in Section 4. Ethical considerations that came up during the design of this technology are discussed here. Data collection for experimentally validating the efficacy of our approach has been approved by our Institutional Review Board (IRB) and is conducted under a Data Usage Agreement (DUA) with the campus network IT group that restricts and safeguards all the WiFi data collected. To avoid any leakage of private data, all the MAC ids and usernames in the syslogs are anonymized using a strong hashing algorithm. The hashing is performed before syslog data is stored on disk, under the guidance of the IT manager, who is the only person aware of the hash key of the algorithm.
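As an illustration of the anonymization step described above, the following is a minimal sketch that assumes the "strong hashing algorithm" with a "hash key" is a keyed hash such as HMAC-SHA256. The source does not name the algorithm, so the function name `anonymize`, the normalization step, and the choice of HMAC are all assumptions made for this example.

```python
import hashlib
import hmac


def anonymize(identifier: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest (hex) of a MAC address or user name.

    Assumption: identifiers are normalized (trimmed, lower-cased) first so
    that the same device always maps to the same hashed value.
    """
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


if __name__ == "__main__":
    secret = b"held-only-by-the-IT-manager"  # hypothetical key
    print(anonymize("AA:BB:CC:DD:EE:FF", secret))
```

Because the digest depends on the key, someone who holds only the hashed logs cannot recompute the hash of a known MAC address without also holding the key.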
Any data analysis that results in the de-anonymization of the users is strictly prohibited under the IRB and signed DUA. Users on the USA campus involved in the data collection consent to an acceptable-use IT policy that permits the campus IT department to collect network-level syslog data events for system diagnosis or analysis of cyber-attacks on the enterprise network. Additionally, all researchers sign a form of consent to adhere to the signed IRB and DUA and undergo mandatory ethics training. In short, the data used to validate and evaluate our approach prior to its actual deployment is anonymized and subjected to multiple safeguards as part of an IRB-approved study.

This section presents an overview of our approach, followed by the details of our graph-based contact tracing algorithm.

Fig 2 depicts the architectural overview of our contact tracing system. The system uses a three-tier pipelined architecture. The data collection tier uses network logging capabilities that are already present in enterprise WiFi systems to collect the WiFi logs of device associations to access points within the network. Many enterprise IT administrators already collect this data for network monitoring, in which case this data can simply be fed to the next tier in the pipeline. Otherwise, the IT admins need to turn on logging to start gathering this data. The next tier in the pipeline ingests this raw data and converts it into a standard intermediate format. In other words, this tier performs pre-processing of the data. Since the raw log files have vendor-specific formats, this tier implements pre-processing modules that are specific to each WiFi manufacturer and its logging format. This tier processes log files periodically in batches and generates data in the intermediate form. Our final tier ingests the data produced by the vendor-specific pre-processor and creates a graph structure that captures the trajectories of user devices.
It exposes a query interface for contact tracing. For each query, it uses the computed trajectories over the query duration to produce (i) a location report listing locations visited by the infected user and (ii) a proximity report listing users who were in proximity of that user and for how long. As discussed below, this tier uses time-evolving graphs and efficient graph algorithms to intersect the trajectories of a large number of devices (typically the tens of thousands of users that may be present on a university campus) to produce its reports.

Consider a WiFi network with N wireless access points that serves M users with D devices. We assume that the N access points are distributed across buildings and other key spaces in an academic or corporate campus and that the location of each access point (e.g., building, floor, room) is known. Large enterprises such as a residential university will comprise thousands of access points (our work is based on deployment and data from two large universities, one based in the Northeastern USA that comprises 5500 access points and one based in Singapore that comprises 13,000 access points). The number of users and devices seen in such networks is typically in the order of tens of thousands. To manage such a large network, enterprise WiFi networks use controller nodes that have the capability to administer and manage the APs and the network traffic, along with detailed logging and reporting capabilities.

Fig. 3. An example contact tracing report produced by our tool: (a) Patient Report (b) Proximity Report.
(a) Overview: User Name: JaneDoe; Start Time: 10:00am 10/Jan/2020; End Time: 11:59pm 10/Jan/2020. Showing all locations visited for 10 mins or higher, followed by visit details.
(b) Overview: User Name: JaneDoe; Start Time: 10:00am 10/Jan/2020; End Time: 11:59pm 10/Jan/2020. Displaying co-located users in descending order of total co-location time. Number of users co-located: 6. Alice 180 ...
As a user moves from one location to another, their mobile device (typically a phone) associates with a nearby access point. Since the locations of APs (building, floor, room) are known, the sequence of AP associations over the course of a day reveals the trajectory of the user and the visited locations. To reconstruct this trajectory, we assume that the WiFi network logs contain association and disassociation events as seen by each AP. Typically this information is of the form: timestamp, AP MAC address, Device MAC, optional user-id, event-type, where event-type can be one of association, disassociation, reassociation, authorization, and unauthorization. Typically, when a device switches to a new AP due to user mobility, this is visible to the network in the form of a disassociation with the previous AP and an association with a new AP.

Given this log information, contact tracing of a user involves two steps: (1) determine all APs visited by the user in the specified time period and (2) determine all users who were associated with each of those APs concurrently with the infected user. To do so, we can analyze the log to first construct the time-ordered sequence of AP sessions of the concerned device (a session is the time period represented by an association followed by a disassociation). Since AP locations are known, this session list represents the locations visited by the user and the time spent at each. Next, for each AP session in the above user trajectory, we can analyze the log to determine overlapping sessions of all other users at that AP. These are users (i.e., their devices) who were present in the proximity of the infected user. Of course, the WiFi log does not reveal the distance between the two users or whether physical contact occurred. Nevertheless, it enables us to determine users at risk by computing the duration for which the two users were in proximity of one another.
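The two steps above can be sketched directly over the raw event stream. This is a minimal, unoptimized illustration and not the paper's implementation: the names `Event`, `Session`, `build_sessions`, and `colocation_seconds` are hypothetical, timestamps are assumed to be numeric seconds, and only association and disassociation events are handled (reassociation and authorization events are ignored for brevity).

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    timestamp: float  # seconds (assumed unit)
    ap: str
    device: str
    event_type: str   # "association" or "disassociation"


class Session(NamedTuple):
    device: str
    ap: str
    start: float
    end: float


def build_sessions(events):
    """Pair each association with the next disassociation for the same (device, AP)."""
    open_since = {}
    sessions = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.device, ev.ap)
        if ev.event_type == "association":
            open_since.setdefault(key, ev.timestamp)
        elif ev.event_type == "disassociation" and key in open_since:
            sessions.append(Session(ev.device, ev.ap, open_since.pop(key), ev.timestamp))
    return sessions


def colocation_seconds(sessions, target):
    """Total overlap between `target` and every other device, summed over shared APs."""
    by_ap = defaultdict(list)
    for s in sessions:
        by_ap[s.ap].append(s)
    totals = defaultdict(float)
    for mine in (s for s in sessions if s.device == target):
        for other in by_ap[mine.ap]:
            if other.device == target:
                continue
            overlap = min(mine.end, other.end) - max(mine.start, other.start)
            if overlap > 0:
                totals[other.device] += overlap
    return dict(totals)
```

This direct scan touches every session at every visited AP, which is the cost that motivates a graph-based index on logs with billions of events.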
In some cases, the location where they were co-located may reveal the degree of risk (e.g., an hour-long meeting in a small conference room or a lecture classroom). To enable health professionals to further assess the risk during contact tracing, we generate a location report, showing locations visited by the user and for how long, as well as a proximity report of co-located users at each location and the duration of co-location. Figure 3 depicts a sample report resulting from the process.

An enterprise network with thousands of APs and tens of thousands of devices will generate very large log files (for example, the log file from one of our campuses contains more than 9 billion events over a 4-month semester period), so scanning the log to compute the location and proximity reports can be slow and inefficient. Consequently, we present an efficient graph-based algorithm based on time-evolving graphs in the next section.

To efficiently process contact tracing queries, we model the data as a bipartite graph between devices and APs. Each device in the WiFi log is modeled as a node in the graph; each AP in the network is similarly modeled as a node. An edge between a device node and an AP node indicates that the device was associated with that AP. Each edge is annotated by the time interval (t_1, t_2) that denotes the start and end times of the association session between that device and the AP. Note that data is continuously logged to the log files, which causes new edges to be added to the graph as new associations are observed and new nodes to be added as new devices are observed in the logs. Thus, our bipartite graph is a time-evolving graph. For computational efficiency, each device and AP node in the graph is limited to a time duration, say an hour or a day. This is done to limit the number of edges incident on each node, which would otherwise keep growing as devices associate with new APs or APs see new association sessions.
As a result of associating a time duration with each node, each device or AP is represented by multiple nodes in the graph, one for each time duration where there is activity. In this case, we can view the node ID as the MAC address concatenated with the time duration. For example, MAC_1[10:00,10:59] and MAC_1[11:00,11:59] represent two nodes for the same device, each capturing the AP association edges seen within that period. In the case of AP nodes, a node captures all device associations to that AP within the corresponding time period (see Figure 4). The duration for partitioning each node's activity in the graph is a configurable parameter, and this duration can be chosen independently for device nodes and AP nodes if needed.

Given such a bipartite graph, a contact tracing request is specified by providing a device MAC address and a duration (T_start, T_end) over which a contact trace report should be generated. The query also takes a threshold τ that specifies that only AP sessions of duration longer than τ should be considered. The graph algorithm first identifies all device nodes corresponding to this user that lie within the (T_start, T_end) interval and identifies all edges from these nodes. These edges represent all AP locations visited by the user, and session durations represent the time spent at each location. Only edges that satisfy the following constraints are considered: (1) the session must lie within the query time interval, i.e., [t_1, t_2] ⊆ [T_start, T_end], and (2) the session duration must be at least τ, i.e., (t_2 − t_1) ≥ τ. Edges that fail either criterion are ignored, and the remaining edges are used to enumerate the AP locations visited by the device and the time spent at each location. To compute the proximity report, the algorithm traverses each edge and examines the corresponding AP node. For each AP node, the list of incident edges corresponds to all devices that had active sessions with that AP.
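The graph model and the location-report filter described so far can be sketched as follows. This is an illustrative simplification rather than the paper's Algorithm 1: the class `TraceGraph`, the helper `node_id`, and the one-hour `BUCKET` are hypothetical names and defaults, times are assumed to be numeric seconds, and each session is indexed only under the time bucket in which it starts.

```python
from collections import defaultdict

BUCKET = 3600  # node time duration in seconds (one hour; an assumed default)


def node_id(name, t):
    """Node ID = identifier paired with the start of the time bucket containing t."""
    return (name, int(t // BUCKET) * BUCKET)


class TraceGraph:
    """Bipartite device/AP graph whose nodes are partitioned by time bucket."""

    def __init__(self):
        self.device_edges = defaultdict(list)  # device node -> [(ap, t1, t2)]
        self.ap_edges = defaultdict(list)      # AP node -> [(device, t1, t2)]

    def add_session(self, device, ap, t1, t2):
        # Simplification: index the edge under the bucket where the session starts.
        self.device_edges[node_id(device, t1)].append((ap, t1, t2))
        self.ap_edges[node_id(ap, t1)].append((device, t1, t2))

    def location_report(self, device, t_start, t_end, tau):
        """Sessions of `device` that lie inside [t_start, t_end] and last at least tau."""
        visits = []
        bucket = int(t_start // BUCKET) * BUCKET
        while bucket <= t_end:
            for ap, t1, t2 in self.device_edges.get((device, bucket), []):
                if t_start <= t1 and t2 <= t_end and (t2 - t1) >= tau:
                    visits.append((ap, t1, t2))
            bucket += BUCKET
        return sorted(visits, key=lambda v: v[1])
```

Only the device nodes whose buckets fall inside the query interval are visited, so the cost of a query depends on the query duration and the subject's own activity rather than on the size of the whole log.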
The session interval [t_3, t_4] on each edge is compared to the infected user's session [t_1, t_2], and the edge is included only if the two sessions overlap. This process yields a list of all other users who had an overlapping session with the infected user. The algorithm can also take an optional parameter w that specifies the minimum duration of co-location necessary for a user to be included in the proximity report, i.e., |[t_1, t_2] ∩ [t_3, t_4]| ≥ w. Algorithm 1 lists the pseudo code for our graph algorithm. Thus, a time-evolving bipartite graph allows for efficient processing of contact tracing queries over a large dataset.

Since contact tracing technologies use location and proximity information of users, they raise important privacy concerns. Privacy concerns for client-side Bluetooth-based applications are well-known [11]. Since network-centric contact tracing is an alternative approach that raises a different set of concerns, we discuss these issues in this section and describe techniques used in our ongoing deployments to mitigate them.

First, our network-centric tool is aimed at health-care and medical professionals who perform contact tracing and is not an end-user focused tool for self-monitoring prior contacts. Contact tracing is a well-established approach that has traditionally been performed manually through questionnaires [15]. Our tool has been designed to fit into this workflow and serves as an additional source of information, in addition to interviews, for professionals engaged in contact tracing. Unlike some Bluetooth-based apps, it does not allow end-users to look up information about themselves or anonymous infected users. By focusing on health professionals and not end-users, our tool avoids some of the privacy pitfalls of giving end-users access to anonymous proximity data.
Second, even though data access is limited to health professionals, the data contains sensitive location information and is still prone to privacy misuse. There are two approaches to handling this problem. First, we recommend that operational control of the tool be in the hands of the organization's IT security group. Recall that the approach is based on WiFi network monitoring data that is already routinely gathered by IT departments for network performance and security monitoring. For example, our campus uses such data to track down compromised devices that are connected to our WiFi network and may be responsible for DDoS attacks from inside. Another example is tracking down student hackers, since the hacking of university computers (e.g., to change course grades) is a common exploit on university campuses. Audit and compliance laws in many regions also necessitate gathering network logs for subsequent analysis and audits. To address these issues, IT departments routinely collect detailed network logs and use them for optimizing performance or handling security incidents. Since the IT department already has access to the raw data used by our network-centric tool, deploying the tool within the IT department does not increase privacy risks (this raw data is already prone to the same privacy risks independent of our tool, and IT departments have strict safeguards in place to protect such data and limit access to it). Here, limiting operational control of the tool to the organization's IT group can provide good privacy protection in practice.

However, it may not always be feasible to limit control of the tool to IT professionals alone. For instance, larger outbreaks of disease may require allowing direct access to health officials who are performing contact tracing. In this case, we can address privacy concerns by not storing user identities or real MAC addresses with the tool itself.
Instead, user names and device MAC addresses are anonymized by a cryptographic hash (e.g., SHA-2). All queries on the tool are done using hashed identities and not the real ones. The actual mapping of user names and device MAC addresses to their hashed values is stored separately from the tool, and this information is accessible only to a small trusted group. To perform contact tracing on an individual, a trusted person from this group needs to authorize it (e.g., once user consent is obtained) by releasing the mapping of the actual name of the user and their device MAC to the hashed values. The tool can then be queried using the hashed values of that user's information. Similarly, once proximity reports are generated, they can be sent to the trusted person, who can then de-anonymize that information using the mapping table. In this manner, it is not possible to query the tool to track the activities of an arbitrary user unless first authorized, which prevents misuse. Our current campus deployment uses this anonymized data approach for additional safety.

Finally, many countries have strict privacy laws that require user consent before collecting sensitive data. To comply, many organizations require users to consent to an IT policy that enables them to gather network data for critical safety operations, a prerequisite for such network monitoring. Further, health care professionals are required to obtain user consent to perform manual contact tracing, a process that can be used for network-centric contact tracing as well (see footnote 2).

This section presents our system implementation. We have implemented our system using Python and Perl. Our tool is available as open-source code to researchers and organizations who wish to deploy it (source code is available at http://wifitrace.github.io). As shown in Figure 2, our implementation uses a three-tier architecture. The first tier is based on the logging capabilities that are already supported by enterprise-grade WiFi networks.
Our system simply uses these capabilities and implements only the next two tiers.

Footnote 2: These professionals can decide whether to proactively notify co-located individuals who are deemed to be at risk during an outbreak or to publicize a list of locations and times visited by an infected individual(s) and request other users to contact them if they are impacted. In the latter case, the proximity report data is used for further contact tracing once co-located users contact health officials. The latter approach is presently used on our USA campus.

Our system currently supports WiFi access points from Cisco and HP/Aruba, two large vendors of enterprise WiFi equipment. We have implemented pre-processing code for both vendors that takes raw monitoring data and converts it to a standard intermediate data format for our second tier. For HP/Aruba networks, our tool supports the processing of both syslogs (generated by Aruba's WiFi controllers) and RTLS logs generated by Aruba APs. Both types of logs provide association and disassociation information. In the case of Aruba RTLS, we log WiFi data directly from the controller nodes using the real-time location services (RTLS) APIs [31]. In the case of Aruba syslogs, we periodically copy the raw syslogs generated by the controller and pre-process this raw data. For the Cisco networks, we log WiFi data directly from the network using the Cisco Connected Mobile Experiences (CMX) Location API v3 [37]. All of these pre-processor scripts convert raw logs into the following standard record format:

Timestamp, AP Name or Id, Device MAC Id, event type, (optional) User Name

By default, we assume anonymized (or hashed) device MACs and user names and assume a separate secure file containing a mapping of real names to hashes. Our third tier implementation then uses this data to support contact trace querying. A query is of the form: (hashed) username or device MAC, start time, end time, threshold τ, and co-location threshold w.
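To make the intermediate record and the query form concrete, here is a small sketch. It assumes the records are serialized as comma-separated text, which the source does not specify; the names `Record`, `TraceQuery`, and `parse_records`, and the choice of minutes as the unit for τ and w, are hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    timestamp: str
    ap: str
    device_mac: str            # assumed to be already hashed
    event_type: str
    user_name: Optional[str] = None  # optional field, also hashed


@dataclass(frozen=True)
class TraceQuery:
    subject: str        # hashed username or device MAC
    start: str
    end: str
    tau_minutes: float  # minimum AP session length
    w_minutes: float    # minimum co-location time


def parse_records(text):
    """Parse comma-separated intermediate records, skipping malformed rows."""
    records = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) < 4:
            continue  # fewer than the four mandatory fields
        fields = [f.strip() for f in row]
        user = fields[4] if len(fields) > 4 and fields[4] else None
        records.append(Record(fields[0], fields[1], fields[2], fields[3], user))
    return records
```

Keeping both dataclasses frozen makes records and queries hashable, which is convenient if they are later used as dictionary keys or cached.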
Internally, the data generated by the pre-processing code is represented as a bipartite graph, as discussed in Section 3.3. Our system supports a variety of queries on this graph through a graph API depicted in Table 2. This graph API is used to implement the graph algorithm described in Section 3.3. The algorithm yields a location report, which shows all locations (APs) visited by the user for longer than τ, and a proximity report, which shows all users who were connected to those APs for a duration greater than ω. Figure 3 shows a sample location and proximity report generated by our system. In addition to human-readable query reports, our system can optionally output query results in JSON format, which is convenient for visualization or subsequent processing. Our system also supports additional report types beyond location and proximity reports. For example, it can produce reports of additional users who visit a location after the infected user has departed from that location. This is useful when a location has high-contact surfaces that may continue to transmit a contagious disease even after an infected user departs. Such a report can be produced by specifying a window parameter, which sets the time window over which additional users are identified as being at risk at each location after the user departs.
We have deployed and operationalized our tool on both our university campuses (one in the northeastern USA and one in Singapore) through a collaboration with our universities' health and IT services. Both campuses have large WiFi networks, one with 5500 HP/Aruba APs and the other with a mixed Cisco/Aruba network of 13,000 APs. While our tool can be used for contact tracing of any infectious disease (we originally began developing it after an outbreak of meningitis on our campus), health officials on both campuses view it as a method for scalable contact tracing for Covid-19.
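As a rough illustration of the bipartite representation and of how location and proximity reports could be derived from it, consider the sketch below. It is not the algorithm of Section 3.3 or the graph API of Table 2; the class name `SessionGraph`, the session tuple layout, and the choice to measure co-location as the overlap between the target's session and another device's session at the same AP are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class SessionGraph:
    """Bipartite device/AP graph; each edge carries a session interval."""

    def __init__(self, sessions):
        # sessions: iterable of (device, ap, start, end) tuples
        self.by_device = defaultdict(list)  # device -> [(ap, start, end)]
        self.by_ap = defaultdict(list)      # ap -> [(device, start, end)]
        for device, ap, start, end in sessions:
            self.by_device[device].append((ap, start, end))
            self.by_ap[ap].append((device, start, end))

    def location_report(self, target, start, end, tau):
        """APs where the target stayed at least tau inside [start, end]."""
        visits = []
        for ap, s, e in self.by_device.get(target, []):
            s, e = max(s, start), min(e, end)  # clip to the query window
            if e - s >= tau:
                visits.append((ap, s, e))
        return sorted(visits, key=lambda v: v[1])

    def proximity_report(self, target, start, end, tau, omega):
        """Other devices overlapping the target's visits by at least omega."""
        total = defaultdict(timedelta)
        for ap, s, e in self.location_report(target, start, end, tau):
            # Only edges incident to this AP are examined, not the whole log.
            for other, o_start, o_end in self.by_ap[ap]:
                if other == target:
                    continue
                overlap = min(e, o_end) - max(s, o_start)
                if overlap >= omega:
                    total[other] += overlap
        # Most co-located time first
        return sorted(total.items(), key=lambda kv: kv[1], reverse=True)
```

Indexing sessions by both device and AP means a query touches only the target's edges and the edges of the APs it visited, which is the intuition behind preferring a graph over a sequential scan of the logs.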
While our tool has been operational for several months, fortunately, as of May 2020, neither campus had seen any Covid cases that required the use of our tool. This is largely because residential universities such as ours switched to online learning in March 2020, asked most students to vacate their dorms, and enforced a work-from-home policy for faculty and staff. Except for a small number of students who were unable to return to their home countries (due to global lockdowns), the campuses have been largely empty. One of our campuses saw a single employee case of Covid-19, but initial (manual) contact tracing determined that the employee worked in a setting with limited contact with others, and university health professionals did not see a need to perform additional contact tracing, manually or using our tool.
WiFi sensing has been used by researchers for mobility studies since the early 2000s [25, 27], and it is well established among researchers that WiFi devices reveal user mobility patterns. Our work builds on this WiFi-based mobility research, and in this section, we validate its use for network-centric contact tracing. We conducted a small-scale user study to gather ground truth data to validate three questions related to the use of passive WiFi sensing for contact tracing: (1) How accurately do WiFi access point associations reveal true user locations? (2) How accurately do WiFi session durations reveal the true durations of time spent at a location? and (3) How accurately do co-located WiFi device sessions reveal co-located users at those locations? To answer these questions, we had a group of volunteers walk around our campus to visit multiple locations for varying durations while carrying their mobile devices. Each user manually logged the entry and exit times at each location as well as the path used to walk from one location to another.
The trajectories of some of the users' devices were correlated, which meant the users were co-located whenever the devices were connected to the same AP concurrently. Our user study produced a ground truth dataset that includes seven devices that visited a total of 19,000 distinct locations over the course of ten days. For each user, we computed a location report containing all visited locations (assuming a threshold τ = 0) and compared the locations as seen by the WiFi network to the ground truth locations recorded by each user. Figure 5(a) shows the confusion matrix, with a precision of 0.93, recall of 0.94, and a high F1-score of 0.93. As can be seen, the inferred location matches the ground truth location with high accuracy. The errors mainly occur when a user is walking (in all cases, these involved short sessions of tens of seconds to 3 minutes). When a user is in transit between locations, their mobile device makes an AP transition by disassociating from the previous AP and associating with a more proximate AP. The threshold for switching APs and the aggressiveness of these switches vary across mobile phone makes, models, and manufacturers. As a result, some mobile phones stay connected to an earlier AP even though there is a nearby AP with better connectivity; this can result in a location error, where the ground truth location is some distance away from the location implied by the more distant AP. In almost all cases, if the user stays at the new location for more than a few minutes (2 to 3 minutes in our observations), their phone switches to the closer AP, which has a stronger signal. Hence, for very short sessions during walks, the true location may be off from the inferred location by up to one AP "cell." Figure 6 depicts the accuracy of the inferred location for varying session lengths observed across four of the devices (namely, iPhone, Samsung, Motorola, and LG phones) used in our user study.
As can be seen, once the session length exceeds around 3 minutes, the accuracy rises to 100%. For contact tracing, we are typically interested in locations that are visited by a user for a few tens of minutes; as shown in the figure, the approach provides high accuracy for such cases. Figure 5(b) shows a scatter plot of session duration as reported by our tool against the ground truth. As can be seen, there is a good match between the reported and ground-truth session durations; the small errors occur at location entry or exit due to the lag in the mobile device switching to the nearest AP. Next, we validate the accuracy of co-locations. We use our tool to generate the proximity report for each device and compare it to the ground truth trajectories reported for each device. Figure 6 shows the accuracy of the co-located devices as seen by our approach. We see that our approach can capture co-located devices (and users) with high accuracy for sessions exceeding 3 minutes. As noted above, short transitions are often off by one AP cell, which implies that two devices that are near one another may be seen by the network as being connected to adjacent APs rather than the same one. Fortunately, these effects do not hamper the efficacy of contact tracing, since two users need to be near one another for a period of time (e.g., 15 minutes or more) to be considered at risk. As can be seen, longer sessions are captured with high accuracy. Finally, we conduct a validation experiment where we count the number of users entering and leaving a room in the library and compare it to the number of devices (users) reported by our approach at that location. As shown in Figure 7, the WiFi-based occupancy closely follows the ground truth manual count. The slight mismatch occurs for short WiFi sessions when a user is present only for a brief period (and when their devices have not switched from the previous AP to the one in the room).
The user counts are accurate for all sessions that exceed a few minutes, since the devices eventually switch to the closest AP. Together, these results validate the efficacy of using passive WiFi sensing for location and proximity sensing for contact tracing.
In this section, we describe case studies that evaluate the efficacy of our contact tracing tool and also present results on the efficiency of our graph algorithms and general limitations of our WiFi sensing approach. We first describe our dataset and then our results. Since our tool has been deployed on two university campuses, we use production WiFi logs from the university WiFi networks for our experimental evaluation. This is the same data that would be used by health professionals for their contact tracing, except that we use a fully anonymized version of this data for our experiments. Table 3 depicts the characteristics of the WiFi logs. The US university has an Aruba network of 5500 APs deployed across 230 buildings. It has 38,000 users comprising 30,000 students and 8,000 faculty/staff (figures are rounded to the nearest thousand). The dataset spans Jan 2020 to May 2020, which includes the COVID-19 lockdown that began during Spring break (mid-March). The Singapore university has a mixed Aruba and Cisco network comprising 13,000 APs deployed across 240 buildings. It has 50,000 users comprising 40,000 students and 10,000 faculty/staff. The dataset spans Feb 2020 to May 2020 and also includes the COVID-19 quarantine, which was progressively announced by the government, ending with a full lockdown like that at the US university. We randomly choose a user from our dataset, assume they are infected with one of the above diseases, and use our tool to compute the number of locations visited by the user over that period and the number of co-located users.
We perform contact tracing assuming τ = ω = 10 mins and τ = ω = 30 mins, which implies locations visited for at least 10 (or 30) minutes and co-location of at least 10 (or 30) minutes. For each disease, we repeat each contact tracing experiment for 50 randomly selected students, and then for 50 randomly chosen faculty or staff users. Figure 8 depicts our results. As can be seen, the number of locations visited by an infected user grows as the duration of contact tracing grows from 2 days for Flu to 10 days for Measles. We find that the number of location visits is insensitive to τ for τ > 10 mins (as discussed in more detail in the next section). A student visits ≈ 37 locations per day, while a faculty/staff user is somewhat less mobile and visits ≈ 15 locations per day. Figure 9 depicts the proximity results from our contact tracing experiment. As shown, for a student, τ = 10 min yields a large number of colocated users: 500 colocated users for Flu over a 2-day period, rising to over 800 users for Measles over a 10-day period. For τ = 30 min, the number of colocated users is lower but still high: 300 for Flu and 400 for Measles. The colocation count is lower for faculty/staff users (Figure 10(b)) but is still quite high (200 to 500) for τ = 10 mins and substantially lower (between 100 and 200) for τ = 30 mins. These results yield the following insights:
• First, we note that the number of colocators does not increase linearly with an increase in contact trace duration. The growth is sublinear, indicating that users have a social circle and that there are repeated interactions with the same set of users over different days.
• Second, it is infeasible to manually contact trace several hundred users for each infected user. This can be addressed by carefully selecting the parameters τ and ω and also carefully considering the tool output for subsequent manual contact tracing. In particular, τ = 10 min is too low due to a high rate of chance co-location.
Choosing τ = 30 mins and ω of 15 or 30 mins may yield better results. Further, our results show that common areas such as dining halls and cafeterias add substantially to the colocation counts. It is straightforward to filter out those AP sessions to determine users with higher risk. Figure 9 (a) and (b) show that the number of co-locators drops substantially once cafeteria visits are excluded. Finally, our report (see Figure 3) provides the total time spent with each colocator in sorted order as well as the location where co-location occurred. It is possible to consider the top N (e.g., N = 15) users with the most proximity minutes, or to consider only specific locations such as a small conference room or a classroom, for subsequent manual tracing. Such strategies are already used by professional contact tracers to home in on the most probable at-risk co-locators while eliminating users who may be false positives. Tracing. While the above experiment involved a single level of contact tracing, in many cases contact tracing may have to be iterative, with each colocator subjected to contact tracing. Given that a user may come in contact with more than a hundred users in a single day (e.g., if they attend a few lectures in a classroom and visit a cafeteria), iterative tracing even for two iterations can be prohibitive. As explained in the previous section, the colocator list needs to be pruned at each step to identify the users most at risk. In the previous section, we suggested using carefully chosen τ and ω values, filtering out certain locations, or focusing on high-risk locations (e.g., a small conference room). These are subjective strategies and can yield errors and miss "true positives". An alternate strategy is to "test and trace", which combines testing with contact tracing, a strategy used by many countries for COVID-19.
In this case, each colocated user is administered a test to check if they are infected. Only infected users are subjected to iterative contact tracing, and the rest are filtered out. In this case, the number of users subject to contact tracing grows based on the rate of transmission (referred to as R in the medical literature). For example, if R = 2, then only 2 out of the several tens of users identified by our tool will be subjected to additional tracing in each step (we assume that all users are tested to find the R users who are infected). Table 4 depicts the number of users identified by this strategy for testing and tracing; as can be seen, the growth is much lower than with a naive iterative strategy. Tracing during Quarantine Periods. While the previous experiments performed contact tracing during pre-COVID semester periods where mobility patterns were "normal", we now examine how contact tracing results change in the presence of strict lockdown policies. Figure 10(a) shows the number of locations visited per day by different types of campus users. While users visited 20-80 locations per day for τ = 10 mins during the normal period, after March 25th the number of AP location visits drops sharply for all users due to lockdown policies. This will significantly alter contact tracing results from our tool for users who become ill during such a lockdown. Figure 10(b) shows the number of locations visited for a user subjected to COVID-19 contact tracing (duration of 4 days). As shown, the number of locations visited varies from 5 to 20 for τ = 10 mins, and it drops to 1-6 location visits for τ = 30 mins or greater. Figure 11 depicts the number of co-locators for τ = 30 mins for several users based on the pre-COVID and lockdown mobility patterns. As can be seen, social distancing and lockdown policies bring the colocator count to less than ten for all types of users, an order of magnitude reduction.
In such cases, comprehensive contact tracing of all colocators is feasible through manual means.
To evaluate the efficiency of our graph algorithm, we compare the execution time of a naive linear search approach and our graph-based algorithm across varying numbers of co-locators. Since different users display different amounts of mobility, the number of co-locators seen for each user will be different. Searching for co-locators using linear search requires a complete sequential scan of the entire dataset, resulting in a high overhead across all runs irrespective of the number of co-locators observed for a device. Additionally, as the number of nodes increases, the search overhead also increases. In contrast, our graph algorithm efficiently identifies the edges and nodes relevant to the specified query, thereby reducing the search space overhead. Also, adding the constraint of τ results in further pruning of edges, which shrinks the search space and reduces the time and space complexity of our algorithm. This behavior is depicted in Figure 12, which compares the execution overhead of the two approaches for our campus dataset. As shown, our graph-based implementation outperforms the naive sequential search by a significant margin.
WiFi sensing has well-known limitations, and this section analyzes the implications of these limitations for contact tracing. Multi-device Users: Researchers have previously studied the behavior of multi-device users and shown that it is very common for users to own two or more devices [39]. A key consequence of this result is that the device count seen by an AP does not equal the user count. While all WiFi logs record device association information, not all of them provide user ownership information. If such information is missing, RADIUS authentication logs should additionally be used to map devices to owners, to avoid double counting devices as separate users.
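The device-versus-user distinction can be sketched as follows. This is a minimal illustration rather than the tool's code: the function name `count_devices_and_users` is hypothetical, and the `owner_of` dictionary stands in for a device-to-owner mapping that would be derived from syslog user fields or RADIUS authentication logs (whose format is not modeled here).

```python
def count_devices_and_users(devices_at_ap, owner_of):
    """Return {ap: (unique_devices, unique_users)}.

    devices_at_ap: dict mapping AP id -> set of (hashed) device ids
    owner_of:      dict mapping device id -> (hashed) user id; assumed to
                   come from an ownership source such as authentication logs
    """
    counts = {}
    for ap, devices in devices_at_ap.items():
        # A device with no known owner is counted as its own user, so
        # missing ownership data over-counts rather than hides devices.
        users = {owner_of.get(d, "unmapped:" + d) for d in devices}
        counts[ap] = (len(devices), len(users))
    return counts
```

For example, a dorm AP that sees a phone and a laptop belonging to the same student, plus one other student's phone, would report three devices but two users.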
Figure 13 shows the number of unique devices seen by APs in different types of campus buildings and the corresponding user count (e.g., Aruba syslogs provide both types of information). As shown, locations like dorms and classrooms see between a 1.5X and 2X difference between unique devices and unique users (since users may connect a phone and a laptop to the network). Only dining areas (cafeterias) see low over-counting, since users are likely to carry only their phone when eating. This result highlights the importance of considering device ownership to avoid over-counting users by only considering connected devices. Unassociated Devices: Not all users may connect their mobile devices to the WiFi network. Such devices are visible to the network when they perform SSID scans using a randomized MAC address. Unassociated devices can cause multiple challenges. First, ignoring them altogether will undercount users in a location, but simply counting all devices can yield a large number of false positives. Figure 14(a) depicts the number of unassociated devices seen in four buildings on our Singapore campus. Since the buildings are next to a public road or public bus stop, the number of unassociated devices per day is 5X greater than the number of associated users. Figure 14(b) shows that enforcing a session duration of 15 minutes filters out most of these chance associations, and the number of such devices (likely visitors) is around 12% of the total number of associated devices. Impact of Session Duration: Our contact tracing tool uses two parameters, τ and ω, that are directly related to WiFi session durations. Judicious choice of these parameters can allow for a good tradeoff between eliminating false positives and retaining true positives. Figure 15 shows the number of AP locations visited by campus users for varying values of session length τ. The figure shows that the location visits stabilize around τ = 10 mins, yielding 20-40 location visits per day.
Small values of τ include locations visited when in transit, which should be ignored. Figure 16 shows the impact of varying values of τ and ω; the figure shows a decreasing gradient as both τ and ω are increased, for all user types. Finally, Figure 17 shows the number of colocated users for varying values of τ and ω. As shown, using values that are tens of minutes allows the tool to filter out overlapping sessions caused by users in transit. These results highlight the importance of carefully choosing τ and ω depending on the infectious nature of the disease while also avoiding false positives.
The prevalence of many infectious diseases in our society has increased the importance of contact tracing (the process of identifying people who may have come in contact with an infected person) for reducing disease spread and for containment [28, 35]. To perform contact tracing, the infected user needs to provide the places visited and the persons who were in proximity or direct contact [15]. While the traditional method relies on interviews, the COVID-19 pandemic has seen the use of methods such as GPS, Bluetooth [6], credit card records [13], and cellular locationing. Manual contact tracing as a mode of containment for diseases with a high transmission rate has proved to be too slow and cannot be scaled. Research [17, 18, 32] has shown that technology-aided contact tracing can help reduce the disease transmission rate through quicker, scalable tracing and help achieve quicker disease suppression. Bluetooth and Bluetooth Low Energy (BLE) based contact tracing has emerged as a possible method for proximity detection [33]. A handful of systems based on Bluetooth or BLE have been rolled out, a few of which have been supported by the governments of countries such as Singapore [6] and Australia [3].
The main limitation of these approaches is the need for mass adoption before they become effective [5] and their reliance on Bluetooth distance measurements, which may not always be accurate. Authenticity and privacy attacks are other key issues in using Bluetooth for contact tracing. Prior work [12] has shown that authenticity attacks can be easily performed on Bluetooth-based contact tracing apps. Such attacks can forge the locations visited and create a fake history for a user, introducing risk to society, as shown in [12]. Bluetooth apps suffer from privacy issues, as noted in [16, 38]. As a result, privacy issues for Bluetooth-based contact tracing have received significant attention [10, 11, 22]. Privacy-preserving methods include the use of homomorphic encryption for determining contacts [8] and the use of private messaging to notify possible contacts [40], to name a few.
Technology-aided contact tracing is becoming an increasingly important tool for quick and accurate identification of co-locators. While Bluetooth-based contact tracing methods using phones have become popular recently, these approaches suffer from the need for a critical mass of adoption in order to be effective. In this paper, we presented a network-centric approach for contact tracing that relies on passive WiFi sensing with no client-side involvement. Our approach exploits WiFi network logs gathered by enterprise networks for performance and security monitoring and utilizes them for reconstructing device trajectories for contact tracing. Our approach is specifically designed to enhance the efficacy of traditional methods, rather than to supplant them with a new technology. We presented an efficient graph algorithm to scale our approach to large networks with tens of thousands of users. We implemented a full prototype of our system and deployed it on two large university campuses.
We validated our approach and demonstrated its efficacy using case studies and detailed experiments using real-world WiFi datasets. Finally, we discussed the limitations and privacy concerns of our work and have made our source code available to other researchers under an open-source license.
Apple Google Partner Covid-19 Contact Tracing Singapore built a coronavirus app but it hasn't worked so far TraceTogether App Covid-19 Contact Tracing FluSense: A Contactless Syndromic Surveillance Platform for Influenza-Like Illness in Hospital Waiting Areas EPIC: Efficient Privacy-Preserving Contact Tracing for Infection Detection BlueTrace: A privacy-preserving protocol for community-driven contact tracing across borders Assessing Disease Exposure Risk with Location Data: A Proposal for Cryptographic Preservation of Privacy Contact Tracing Mobile Apps for COVID-19: Privacy Considerations and Related Trade-offs Afshan Amin Khan, and Roohie Naaz. 2020. Applicability of Mobile Contact Tracing in Fighting Pandemic (COVID-19): Issues, Challenges and Solutions. Cryptology ePrint Archive Development Finance Division. 2020.
MOEF: Korea Contact Tracing Johns Hopkins University COVID19 Dashboard Contact tracing and disease control Apps Gone Rogue: Maintaining Personal Privacy in an Epidemic Epidemic contact tracing via communication traces Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing Estimated Influenza Illnesses, Medical Visits, Hospitalizations, and Deaths Averted by Vaccination in the United States SARS basics fact sheet Quest: Practical and Oblivious Mitigation Strategies for COVID-19 using WiFi Datasets Feasibility of controlling COVID-19 outbreaks by isolation of cases and contacts Experiences & Challenges with Server-Side WiFi Indoor Localization Using Existing Infrastructure Extracting a mobility model from real user traces The effectiveness of contact tracing in emerging epidemics Analysis of a campus-wide wireless network Interrupting transmission of COVID-19: lessons from containment efforts in Singapore Wireless health monitoring using passive WiFi sensing Location determination using WiFi fingerprinting versus WiFi trilateration Aruba Networks. 2020. IT Analytics for Operational Intelligence Eryk Dutkiewicz, Symeon Chatzinotas, and Bjorn Ottersten. 2020. Enabling and Emerging Technologies for Social Distancing: A Comprehensive Survey Stefaan Verhulst, and Patrick Vinck. 2020. Mobile phone data and COVID-19: Missing an opportunity A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States COVID-19 epidemic in Switzerland: on the importance of testing, contact tracing and isolation A high-resolution human contact network for infectious disease transmission Cisco Systems. 2020. Cisco DNA Spaces Qiang Tang. 2020. 
Privacy-Preserving Contact Tracing: current solutions and open questions Empirical Characterization of Mobility of Multi-Device Internet Users How to Return to Normalcy: Fast and Comprehensive Contact Tracing of COVID-19 through Proximity Sensing Using Mobile Devices StressMon: Scalable Detection of Perceived Stress and Depression Using Passive Sensing of Changes in Work Routines and Group Interactions Analyzing Shopper's Behavior through WiFi Signals Sensorless sensing with WiFi