key: cord-0221618-twb7wptc
authors: Gotsman, Craig; Hormann, Kai
title: Secure Data Hiding for Contact Tracing
date: 2020-08-14
journal: nan
DOI: nan
sha: bb07ad56b5d21671ec9850affff98e0e22958f47
doc_id: 221618
cord_uid: twb7wptc

Contact tracing is an effective tool in controlling the spread of infectious diseases such as COVID-19. It involves digital monitoring and recording of physical proximity between people over time with a central and trusted authority, so that when one user reports infection, it is possible to identify all other users who have been in close proximity to that person during a relevant time period in the past and alert them. One way to achieve this involves recording on the server the locations, e.g. by reading and reporting the GPS coordinates of a smartphone, of all users over time. Despite its simplicity, privacy concerns have prevented widespread adoption of this method. Technology that would enable the "hiding" of data could go a long way towards alleviating privacy concerns and enable contact tracing at a very large scale. In this article we describe a general method to hide data. By hiding, we mean that instead of disclosing a data value x, we would disclose an "encoded" version of x, namely E(x), where E(x) is easy to compute but very difficult, from a computational point of view, to invert. We propose a general construction of such a function E and show that it guarantees perfect recall, namely, all individuals who have potentially been exposed to infection are alerted, at the price of an infinitesimal number of false alarms, namely, only a negligible number of individuals who have not actually been exposed will be wrongly informed that they have.

Contact tracing has proven to be an effective tool in controlling the spread of infectious diseases such as COVID-19. It involves digital monitoring and recording of physical proximity between people over time, so that when one user reports infection, it is possible to identify all other users who have been in close proximity to that person during a relevant time period in the past and alert them. These users would be required to monitor their health and isolate, allowing early treatment and preventing further spread. Contact tracing has been deployed successfully in countries such as China, South Korea, Singapore, Israel, Australia, and Germany, and seems to be the only effective way to detect and contain the spread early in the process.

There are two main approaches to contact tracing. The first is based on the relative distance between users. Using the Bluetooth sensor on a smartphone, it is possible to detect signals from other users with Bluetooth emitters who are physically close by (i.e. within a certain range) and record the proximity, either locally on the user's device, or at a central authority/server. This method, currently under development by Apple, Google and others [1], has the advantage that absolute locations of users are never disclosed, ensuring some degree of privacy. The disadvantages are the limited reliability of Bluetooth sensors, which may not work well under all relevant conditions (e.g. occlusion) and at all relevant ranges, and security concerns about the popular decentralized approach of storing this type of data on user devices.

The second approach to contact tracing involves recording on a central server the absolute locations, e.g. by reading and reporting the GPS coordinates of a smartphone, of all users over time. This obviously provides the server with more information to work with than the first approach, enabling not only alerts to nearby users, but also to identify geographic hotspots and other patterns of contagion. It also provides a historic record of the evolution of an epidemic which can be mined and analyzed in many other ways.

Despite the simplicity of the second approach, privacy concerns have prevented its widespread adoption. Many people do not want their location history to be known to any third party, thus would avoid using any software that explicitly discloses this information. Some have gone so far as to call contact tracing based on unprotected disclosure of location data illegal or unconstitutional [2] . A number of commercial contact tracing apps, which report and store explicit location data, have recently been found in violation of user privacy policies, having shared this data with unauthorized third parties [3] . Such privacy concerns must be addressed if automatic contact tracing is to be deployed, as it is not very effective unless adopted by a majority of the population.

Technology that would enable the "hiding" or "obfuscation" of location data could go a long way towards alleviating privacy concerns and enabling contact tracing at a very large scale. Since the outbreak of COVID-19, this has been the topic of recent research, incorporating cryptographic techniques such as private set intersection [4], private proximity testing based on an equality testing protocol [5] and homomorphic encryption [19]. We refer the interested reader to the comprehensive survey by Reichert et al. [20]. The objective of this article is to describe a very simple method to hide data, which can also be used to hide spatio-temporal data. By hiding, we mean that instead of disclosing a data value $x$, a user would disclose an "encoded" version of $x$, namely $E(x)$. For this to be useful, it should be easy for any user to compute $E(x)$ if given $x$, but be very difficult, from a computational point of view, to invert $E$, namely to recover $x$ when provided only with $E(x)$ (even for the user who encoded $x$). By "difficult" we mean it would require a prohibitive amount of storage or of computational resources, which would effectively deter any such attempt. Although quite distinct, as we will make clear later, these resemble in spirit one-way functions or cryptographic hash functions used in classical cryptography. In its simplest form, the function $E$ is deterministic and injective, as then it is easy to check if $x = y$ by simply checking if $E(x) = E(y)$. In the contact-tracing scenario, the data $x = (t, l)$ is a data value consisting of a concatenation of the time $t$ with the location $l$. Given the function $E$, a user with ID $u$ would periodically transmit to a central server the pair $(u, v)$, where $v = E(x)$ is the encoded version of $x$. The server would store these pairs in a database indexed by the second component. Given a query vector $v$ (of a detected infection), it should be easy to search this database to determine all pairs $(u', v')$ such that $v' = v$, namely identify which other users (having ID $u'$) were also at location $l$ at time $t$ and alert them.

We depart from traditional cryptographic techniques by not requiring the use of encryption keys of any sort, neither private nor public keys. This means that even the user who computed $E(x)$ from $x$ cannot recover $x$ from $E(x)$ unless she explicitly records the connection between the two or stores some additional information which might facilitate the recovery. While the basic embodiment of $E$ is deterministic, it is possible to add an extra layer of security by introducing a non-deterministic (probabilistic) element to $E$, namely $E(x)$ could assume more than one value for any given $x$. In this case we need to modify the database search to a matching procedure: given a query $v$, instead of searching for other vectors $v'$ such that $v' = v$, we search for all other vectors $v'$ such that $H(v, v') \le d$, where $H$ is the Hamming distance function between two vectors, namely the number of coordinates in which they differ, and $d$ is some threshold. These $v'$ are called matches of $v$. An exact match is, of course, the special case where $d = 0$. A judicious choice of the encoding function $E$ and the value $d$ will guarantee no false negatives (i.e. perfect recall), namely, given a query $v$ corresponding to some data $x$, we will always find all other matching values $v'$ corresponding to the same $x$. It will also guarantee a negligible (ideally zero) number of false positives (also called false alarms), namely, almost never report values $v'$ corresponding to a different data value $y \ne x$. In the contact tracing scenario, perfect recall is necessary so that all individuals who have potentially been exposed to infection are alerted. A tiny number of false positives is tolerable, as all this means is that a small number of individuals who have not actually been exposed will be informed that they have.

This article proposes encoding functions for spatio-temporal data. In a nutshell, it maps a 2D location $l$ and time $t$, combined and represented as a large integer $x$ in a discrete world, to an $n$-dimensional vector of integers $v$, where $n$ is quite large, e.g. 100. The range of the components of $v$ can be much larger than $n$, e.g. $\{0, \dots, 502\}$. The function is based on well-known number-theoretic techniques, the preferred one making use of polynomials over finite fields. First deployed in 1960 in Reed-Solomon error-correcting codes [6] and their variants (the most important being the BCH code), the technique has also found use in other cryptographic methods, such as Shamir's secret sharing method [7] and even blockchain [8]. The most important property of $E$ is that it transforms a very large integer into a long vector of much smaller integers in an injective way, which can be thought of as an embedding in a higher-dimensional space, and this transformation cannot be inverted unless a minimal number $k$ of the vector coordinates (and their indices in the vector) are known. We take advantage of this by sorting the vector coordinates so that their correspondence to the coordinate indices is lost, making it difficult to apply the standard decoding methods. An attacker has no choice but to try all possible permutations of subsets of size $k$ of the coordinates, making it computationally infeasible, even for relatively small values of $n$ and $k$. Another important property is that, although there are simple algebraic relationships between the coordinates of the vector, to the naked eye, and even to a statistical test, they look like random integers. Thus, the distribution of the encoded vectors in the embedding space is quite uniform, which will work in our favor.

Consider an integer domain (the "world"): $W = \{0, \dots, N-1\}$. Any integer $x \in W$ is a valid message and we may express it as a sequence of digits $x_0, \dots, x_{k-1}$:

$$x = \sum_{i=0}^{k-1} x_i p^i,$$

where $p$ is a prime number (or more generally a prime power) and $x_i \in \mathbb{Z}_p = \{0, \dots, p-1\}$. Note that this implies that $k \ge \log_p N$ and taking a larger $k$ is superfluous. Essentially, $W$ is synonymous with a subset of $\mathbb{Z}_p^k$, the set of all vectors of length $k$, where each coordinate is taken from $\mathbb{Z}_p$.

In the contact tracing application, the spatio-temporal world consists of two-dimensional (latitude and longitude) GPS coordinates at 1 meter resolution (or the Open Location "Plus" Codes [9]), which translates to a grid with $10^{14}$ points, and $10^{5}$ different time stamps for every 30 seconds over the past month, implying a "world" of size $N = 10^{19}$. If we use the prime $p = 503$, this would mean $k = 8$.
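The relation between the world size, the prime and the number of digits can be checked with a few lines of integer arithmetic. This is only a sketch: the world size `10**19` is an assumption chosen to agree with the digit count quoted for the prime 503, and `num_digits` and `to_base` are illustrative helper names, not part of the article.

```python
def num_digits(world_size, p):
    """Return the smallest k with p**k >= world_size, i.e. ceil(log_p(world_size))."""
    k, capacity = 0, 1
    while capacity < world_size:
        capacity *= p
        k += 1
    return k


def to_base(x, p, k):
    """Return the k base-p digits of x, least significant digit first."""
    digits = []
    for _ in range(k):
        x, digit = divmod(x, p)
        digits.append(digit)
    if x:
        raise ValueError("x does not fit in k base-p digits")
    return digits


N = 10**19  # assumed world size
k = num_digits(N, 503)
print(k, to_base(N - 1, 503, k))
```

Exact integer arithmetic is used instead of floating-point logarithms so that the ceiling is never off by one near a power of the prime.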

We propose the following non-deterministic encoding scheme:

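Let $W = \{0, \dots, N-1\}$ be an integer domain, $n$ a positive integer and $p$ a prime. Denote by $\mathbb{Z}_p^n$ the set of vectors with $n$ elements from $\mathbb{Z}_p$ and by $\Delta_p^n$ the set of vectors with $n$ elements from $\mathbb{Z}_p$ in non-decreasing order, also known as the ordered discrete simplex. The encoding function $E: W \to \mathcal{E} \subset \Delta_p^n$ has parameters $(n, p, k, e)$, where $e \ge 0$ and $k = \lceil \log_p N \rceil$.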

To compute $E(x)$ for a domain element $x \in W$:

1. Express $x$ in base $p$: $x = \sum_{i=0}^{k-1} x_i p^i$.
2. Compute the basic encoding $c = C(x) = (P_x(0), P_x(1), \dots, P_x(n-1)) \in \mathcal{C} \subset \mathbb{Z}_p^n$, where $P_x(t) = \sum_{i=0}^{k-1} x_i t^i \bmod p$ is a polynomial of degree $k-1$ over the finite field $\mathbb{Z}_p$.
3. Sort the coordinates of $c$ in non-decreasing order to $c' \in \mathcal{E}$.
4. Randomly modify $e$ arbitrary coordinates of $c'$, while preserving the increasing order of the coordinates, resulting in $v \in \mathcal{E}$.

Note that as a result of step 4, $e > 0$ implies that $E$ is non-deterministic, namely $E(x)$ may assume multiple values.
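The four steps can be sketched in Python as follows. This is a minimal illustration under assumptions, not a reference implementation: the function names are hypothetical, the polynomial is evaluated at the points 0, 1, ..., n-1 (which requires n ≤ p), and the corruption step simply redraws a chosen coordinate between its current neighbours, so a redraw may occasionally leave a value unchanged.

```python
import random


def to_base(x, p, k):
    """Step 1: the k base-p digits of x, least significant first."""
    digits = []
    for _ in range(k):
        x, digit = divmod(x, p)
        digits.append(digit)
    if x:
        raise ValueError("x does not fit in k base-p digits")
    return digits


def basic_code(x, n, p, k):
    """Step 2: evaluate the degree-(k-1) polynomial at 0..n-1 over Z_p (Horner's rule)."""
    digits = to_base(x, p, k)
    code = []
    for t in range(n):
        acc = 0
        for digit in reversed(digits):
            acc = (acc * t + digit) % p
        code.append(acc)
    return code


def corrupt(sorted_code, e, p, rng):
    """Step 4: redraw e coordinates while keeping the vector non-decreasing."""
    v = list(sorted_code)
    for i in rng.sample(range(len(v)), e):
        lo = v[i - 1] if i > 0 else 0
        hi = v[i + 1] if i + 1 < len(v) else p - 1
        v[i] = rng.randint(lo, hi)
    return v


def encode(x, n, p, k, e, rng=random):
    """Steps 1-4: basic code, sort (step 3), then corrupt at most e coordinates."""
    return corrupt(sorted(basic_code(x, n, p, k)), e, p, rng)


print(encode(1234567890123456789, n=100, p=503, k=8, e=10)[:10])
```

Because each call changes at most `e` coordinates of the same sorted basic code, two encodings of the same value differ in at most `2 * e` coordinates.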

The basic code space $\mathcal{C} \subset \mathbb{Z}_p^n$, defined as the set of all possible basic codes of world elements $\{C(x) : x \in W\}$, consists of $N$ vectors $c$ of length $n$, such that $c_i \in \mathbb{Z}_p$. It has the following properties:

1. $C$ is injective, namely $C(x) = C(y)$ iff $x = y$.
2. $\mathcal{C}$ has Hamming distance $n-k+1$, namely any two distinct codewords $c, c' \in \mathcal{C}$ differ from each other in at least $n-k+1$ coordinates: $H(c, c') \ge n-k+1$. This is because any polynomial of degree $k-1$ over a field is uniquely determined by $k$ of its values. So not only is $C$ an injective function (i.e. $H(c, c') > 0$), but it maps distinct world elements quite far apart from each other in $\mathcal{C}$.
3. $x$ may be recovered from $C(x)$ by a variety of efficient methods, including inverting a linear Vandermonde system.

The basic coding function $C$ described above was proposed by Reed and Solomon [6] as an error-correcting code to overcome corruption of $\lfloor (n-k)/2 \rfloor$ coordinates of $c$. When presented with $\tilde{c}$, which is a corrupted version of $c$, Property 2 guarantees that $c$ is the unique codeword in $\mathcal{C}$ such that $H(c, \tilde{c}) \le \lfloor (n-k)/2 \rfloor$, thus error-correction performed by replacing $\tilde{c}$ with the vector closest to it in $\mathcal{C}$ by the Hamming distance is well-defined and yields the correct result $c$. The corrected codeword may be found by efficient algorithms (e.g. [10]) which take into account the special algebraic structure of $\mathcal{C}$.
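Property 3 can be illustrated by solving the Vandermonde system directly. The sketch below assumes that the evaluation points (the coordinate indices) are known, which is exactly the information that the unsorted basic code exposes; `solve_vandermonde_mod_p` and `recover` are illustrative names, and plain Gauss-Jordan elimination is used rather than any specialised decoder.

```python
def poly_eval(digits, t, p):
    """Evaluate sum(digits[i] * t**i) mod p with Horner's rule."""
    acc = 0
    for digit in reversed(digits):
        acc = (acc * t + digit) % p
    return acc


def solve_vandermonde_mod_p(points, values, p):
    """Solve sum_j a_j * t**j = y (mod p) for the k coefficients a_j."""
    k = len(points)
    rows = [[pow(t, j, p) for j in range(k)] + [y % p]
            for t, y in zip(points, values)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], p - 2, p)  # inverse via Fermat's little theorem
        rows[col] = [(a * inv) % p for a in rows[col]]
        for r in range(k):
            if r != col and rows[r][col]:
                factor = rows[r][col]
                rows[r] = [(a - factor * b) % p
                           for a, b in zip(rows[r], rows[col])]
    return [rows[i][k] for i in range(k)]


def recover(points, values, p):
    """Rebuild the integer x from k (index, value) pairs of its basic code."""
    digits = solve_vandermonde_mod_p(points, values, p)
    return sum(d * p**i for i, d in enumerate(digits))


p, k = 503, 8
x = 1234567890123456789
digits = [(x // p**i) % p for i in range(k)]
points = [3, 17, 42, 56, 71, 80, 95, 99]  # any k distinct indices below p
values = [poly_eval(digits, t, p) for t in points]
print(recover(points, values, p) == x)
```

Once the coordinates are sorted, the `points` argument is no longer available to an attacker, which is the property the encoding relies on.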

Our non-deterministic encoding function $E$ is a variation on the theme of error-correction. In our scenario, we are presented with two vectors $v, w \in \mathcal{E}$ originating from $x, y \in W$. We would like to have a threshold $d$ such that $x = y$ iff $H(v, w) \le d$. To give the flavor of our approach, we remark that it is relatively easy to determine this threshold if the encoding procedure does not contain the sorting step 3, as the following lemma implies. Proof. From the definition of $E$, we have

While not incorporating the sorting step 3 is amenable to easy analysis and identification of $e$ and $d$, it also compromises the security of the encoding $E$, namely, it is then quite easy to recover $x$ from $E(x)$. This is essentially error-correction from $e$ errors, which, as mentioned above, is possible by a number of efficient algorithms, taking advantage of the special algebraic structure of $\mathcal{C}$ [10].
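The reason a threshold is easy to find without sorting can be sketched with the triangle inequality for the Hamming distance. The derivation below is a reconstruction under stated assumptions, not a quotation of Lemma 1: $v$ and $w$ are unsorted encodings obtained from the basic codes $c_x = C(x)$ and $c_y = C(y)$ by changing at most $e$ coordinates each, and the condition $4e < n-k+1$ is assumed.

```latex
\begin{align*}
x = y   &\;\Rightarrow\; H(v, w) \le H(v, c_x) + H(c_x, w) \le 2e, \\
x \ne y &\;\Rightarrow\; H(v, w) \ge H(c_x, c_y) - H(v, c_x) - H(w, c_y)
          \ge (n - k + 1) - 2e > 2e .
\end{align*}
```

Under these assumptions the threshold $d = 2e$ separates the two cases exactly. For example, with $n = 100$, $k = 8$ and $e = 10$ the second bound gives $H(v, w) \ge 73$, well above $2e = 20$.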

The advantage of introducing sorting step 3 is precisely that it prevents the use of the standard error-correction algorithms, since the critical correspondence between the coordinates of $c'$ (and thus of $v$) and the indices in the original $c$ is lost. The disadvantage of introducing sorting step 3 is that it modifies the Hamming distance present in $\mathcal{C}$, which is not likely to be preserved among the sorted codes and in $\mathcal{E}$. In theory it could increase the distance, but it is much more likely to decrease it. It seems like it will be difficult to obtain a lower bound on this distance (which could have then been used to determine a threshold $d$, akin to Lemma 1), since all the algebraic structure that was present in $\mathcal{C}$ has been destroyed in the transition to the sorted codes and $\mathcal{E}$.

Luckily, we are still able to make useful observations about the nature of the encoded vectors in $\mathcal{E}$. To the naked eye, the basic code space $\mathcal{C}$ will consist of integer vectors of essentially random values in the range $\{0, \dots, p-1\}$. By "random" we mean actually pseudo-random, namely that although completely determined by $x$, it will be statistically impossible to distinguish between these vectors and completely random vectors. The sorting of the vectors will make them less random, but it will still be quite difficult to distinguish between the vectors in $\mathcal{E}$ and random non-decreasing integer vectors.

Let us recall the application: We have a database of $M$ pairs of user IDs and encoded spatio-temporal values:

$$D = \{(u_i, v_i) : i = 1, \dots, M\}.$$

Given the query $v$ (a vector), we wish to find all matches of $v$, namely, find all database entries $(u', v')$ such that both $v$ and $v'$ are possible encodings of the same data value $x$, i.e. $H(v, v') \le d$ for a suitable threshold $d$. We say that $d$ is the matching threshold and $v'$ matches $v$.
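The matching criterion itself is simple to state in code. The following sketch uses a plain linear scan over a small in-memory list, which is only meant to make the definition concrete; the names `hamming` and `find_matches` and the toy database are illustrative assumptions.

```python
def hamming(v, w):
    """Number of coordinates in which two equal-length vectors differ."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(v, w))


def find_matches(query, database, threshold):
    """Linear scan: return all (user_id, code) entries within the threshold."""
    return [(uid, code) for uid, code in database
            if hamming(query, code) <= threshold]


database = [
    ("u1", [1, 4, 4, 9, 12]),
    ("u2", [1, 4, 5, 9, 12]),
    ("u3", [0, 2, 7, 8, 11]),
]
print(find_matches([1, 4, 4, 9, 12], database, threshold=1))
```

A scan like this costs time proportional to the database size per query, so it serves as a baseline for correctness rather than a practical index.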

Recall that the size of the world is $|W| = N = 10^{19}$. Assuming 1 billion $= 10^{9}$ users, each storing location data for every 30 seconds over the past month, namely, close to $10^{5}$ timestamped locations, this implies that the database could contain $M = 10^{14}$ entries. We would like to show that even though the vectors are sorted, a matching threshold of $d = 2e$ for "reasonable" values of $e$, as in Lemma 1 above, is still a good choice. This is because the size of the database ($M$) is much smaller than the size of the world ($N$), thus the probability that database vectors match a typical query vector is infinitesimally small, unless they are encodings of the same world data.

Remember that $M \ll N \ll p^n$, where $M = |D|$ and $N = |W|$. Now, if given a query $v$ for which there exists a matching database entry $v'$, then obviously $H(v, v') \le 2e$. So to avoid false negatives, namely, to avoid missing correct matches, we must take $d \ge 2e$. Can we expect a given query vector $v$ to "accidentally" match a vector $v'$ corresponding to another $x$ in the database because of the sorting and corruption of the original basic code vectors in $\mathcal{C}$? The following theorem implies that this false positive is highly unlikely.

Theorem 1: Given any $v \in \mathcal{E}$, an upper bound for the probability of a vector $v' \in \mathcal{E}$, generated by sorting the coordinates of a random vector $c \in \mathbb{Z}_p^n$, differing from $v$ in at most $d$ non-adjacent coordinates is

Proof. For the case $d = 0$, the probability of an exact match in all $n$ coordinates is at most $n!/p^n$, since all $n!$ permutations of $v$ can be taken as $c$ among all $p^n$ possible unsorted vectors in $\mathbb{Z}_p^n$, such that $H(v, v') = 0$. For every coordinate of $v$ that occurs with multiplicity $m > 1$, the probability reduces by a factor of $m!$, because the order of the repeated coordinate in $c$ does not matter.

For the case $d = 1$, let us study the number of sorted vectors $v' \in \mathcal{E}$ that differ from $v$ in exactly one coordinate. Letting $v_0 = 0$ and $v_{n+1} = p-1$, it is clear that each coordinate $v'_i$ of $v'$ for $i = 1, \dots, n$ can take any value in $v_{i-1}, \dots, v_i - 1, v_i + 1, \dots, v_{i+1}$ without compromising the correct order. Hence, there are $\sum_{i=1}^{n} (v_{i+1} - v_{i-1}) = p - 1 + v_n - v_1 \le 2p - 2 < 2p$ sorted vectors $v' \in \mathcal{E}$ at distance $H(v, v') = 1$ from $v$ and thus the number of sorted vectors $v' \in \mathcal{E}$ with $H(v, v') \le 1$ is at most $2p - 1$. Using the same permutation argument as before, this proves the upper bound for $d = 1$. For the case $d > 1$ we apply the previous argument iteratively $d$ times while using the assumption that the coordinates of $v'$ that differ from those of $v$ are non-adjacent. Then a vector at distance $j + 1$ is just a modification of a vector at distance $j$ in one additional coordinate, thus the number of modifications is at most $2p$. Note that this is an overestimate as a modification may occasionally reduce the distance by one. Since the order of modification of the $d$ modified coordinates is not important, we have counted each distinct modification $d!$ times. □

The assumption that the differing coordinates of $v$ and $v'$ are non-adjacent makes the proof of Theorem 1 easier, but we have experimentally observed that this upper bound holds also for the unrestricted case.

So the expected number of false positives for any given query is at most , , , which decreases as decreases. For the values 503, 100, we may use 10 and matching threshold 20, thus , , 10 . Since 10 , the expected number of false positives per query is infinitesimal (10 , and even the expected number of false positives when each database entry is used as a query is still only , , 10 .

Conclusion. In our encoding scheme, it suffices to take a corruption parameter $e$ which is not too small and not too large, and then use $d = 2e$ as the matching threshold. Such a threshold will completely avoid false negatives and produce a negligible number of false positives.

Retrieving Matching Data. Now that we have a suitable matching threshold for our matching algorithm, we must address the algorithmic question of how to organize the database of encoded values (which are sorted integer vectors), such that given a query vector $v$, it is possible to efficiently find all pairs $(u', v')$ in the database such that $v'$ matches $v$, namely such that $H(v, v') \le d$. This is known as the "static Hamming distance range query". Of course, exhaustive search of the database is possible, but that would cost $O(M)$ time, which is too costly in our scenario where $M = 10^{14}$. Efficient data structures have been devised for dealing with this problem, as in Manku et al. [11]. This requires storage (which is significant but not prohibitive in our application) but has very fast, $O(\log M)$, query runtime. See also Liu et al. [12] for more recent work.

Now that we have an encoding algorithm and are able to match two encoded vectors, we describe the procedure to be followed by the individual users and the central server to do the actual contact tracing and alerts.

• The user continuously transmits to the server data pairs $(u, v)$, where $v = E(x)$ and $x = (t, l)$ is her time and location, tagged as "uninfected". The user also stores the triples $(t, l, v)$ in a local database indexed by $t$ and $v$ (e.g. on her smartphone), so that it is easy to retrieve all $v$'s transmitted during a given time interval and recover $(t, l)$ from its encoding $v$.
• If the user discovers she is infected, she sends again all pairs $(u, v)$ generated by her over the past, say, two weeks (by querying her local database) back to the server, tagged as "infected".
• Upon receipt of a message $v$ tagged with "possible infection" from the server, the user recovers the infection time and location $(t, l)$ from $v$ (by querying her local database). The user self-isolates for two weeks and can possibly report $(t, l)$ separately to friends and family.

• Upon receipt of a data pair $(u, v)$ tagged "uninfected", the server stores the pair on the server database (of size $M$).
• Upon receipt of a pair $(u, v)$ tagged "infected", the server retrieves from the server database (by the matching algorithm described in Section 4) all pairs $(u', v')$ for which $v'$ matches $v$. The server then sends these $v'$ to user $u'$ tagged with "possible infection".
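The message flow between users and server can be sketched as two small classes. Everything here is an illustrative assumption: the class and method names are hypothetical, the server uses a linear scan instead of an index, the reporting user is excluded from her own alerts, and the infected user resends her whole local history rather than only the last two weeks.

```python
def hamming(v, w):
    return sum(a != b for a, b in zip(v, w))


class Server:
    """In-memory stand-in for the central server (linear scan, no index)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []  # (user_id, code) pairs tagged "uninfected"

    def receive_uninfected(self, user_id, code):
        self.entries.append((user_id, tuple(code)))

    def receive_infected(self, user_id, code):
        """Return the (user_id, stored_code) alerts to send as "possible infection"."""
        return [(other, stored) for other, stored in self.entries
                if other != user_id and hamming(stored, code) <= self.threshold]


class User:
    def __init__(self, user_id, encode):
        self.user_id = user_id
        self.encode = encode  # hypothetical callable (t, location) -> code
        self.local = {}       # local database: code -> (t, location)

    def report(self, server, t, location):
        code = tuple(self.encode(t, location))
        self.local[code] = (t, location)
        server.receive_uninfected(self.user_id, code)

    def report_infection(self, server):
        alerts = []
        for code in self.local:
            alerts.extend(server.receive_infected(self.user_id, code))
        return alerts

    def on_possible_infection(self, code):
        """Look up the time and location behind one of her own codes."""
        return self.local[tuple(code)]


toy_encode = lambda t, loc: (t, loc[0], loc[1])  # deterministic toy encoding
server = Server(threshold=0)
alice, bob = User("alice", toy_encode), User("bob", toy_encode)
alice.report(server, 7, (3, 4))
bob.report(server, 7, (3, 4))
for user_id, code in alice.report_infection(server):
    print(user_id, bob.on_possible_infection(code))
```

The server only ever handles IDs and encoded vectors; the mapping back to a time and location exists solely in each user's local dictionary.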

Recall that a critical objective is to "hide" the data by its encoding, namely render it computationally infeasible to recover the (large) integer $x \in W$ from the integer vector $v \in \mathcal{E}$, either because it would require too much computation time or too much storage space. We describe here three possible methods of attack and argue that they are infeasible.

Brute-force attack. The simplest method is just to exhaustively scan the entire world and check if the encoded version $v'$ of any world point matches the given encoding $v$ (namely, that $H(v, v') \le d$). This would require $|W| = N = 10^{19}$ encodings and comparisons, which is prohibitive in runtime.

Table attack. We could reduce the runtime of the brute-force attack by trading off space for time, employing a very large database. Simply compute some encoding for every possible ∈ in a preprocessing phase and store the pairs , in a database indexed by . Given an encoding , the matching algorithm described in Section 4 would then be able to quickly retrieve all matches of . However, this requires a database of size which is 10 times larger than the server database. For 10 , 503, 100 and 20, this is at least 10 bytes, and would be prohibitively large.

Direct attack. A direct attack occurs when an adversary tries to invert the encoding through a subset of the coordinates by applying the traditional decoding algorithms such as solving a linear Vandermonde system. This is foiled by the sorting of the coordinates of the vectors. Since inversion requires knowledge of the correspondence between coordinates and their indices for at least uncorrupted coordinates, this is what an attempt to invert must look like: Note that failure in one iteration due to one or more corrupted coordinates will not reveal which of the coordinates are corrupted, so that there is no extra information that can help to choose a "better" set of coordinates in the next iteration. In total, the expected number of solves for this attack would be

. For 100 and 503, we have 8. With 10, the expected number of solves is 10 , which would take too long.

While we have presented an encoding method based on polynomials over finite fields, it is possible to use another method which is also employed in error-correcting coding and secret-sharing. This involves so-called redundant residue number systems. Originally proposed in the 1950's for efficient arithmetic computations on large integers [13], this technique was adopted for error-correction coding soon after [14, 15] and is also used in cryptography [16, 17]. The main difference between this method and the basic coding method described above based on polynomials is that now the basic code space is $\mathbb{Z}_{p_1} \times \mathbb{Z}_{p_2} \times \cdots \times \mathbb{Z}_{p_n}$ instead of $\mathbb{Z}_p^n$ for a sequence of distinct primes $p_1, \dots, p_n$. Recall that the "world" is $W = \{0, \dots, N-1\}$. Let $p_1 < \dots < p_n$ be a sequence of increasing primes and $k$ an integer such that the product of any $k$ of these primes is at least $N$. The encoding function $E: W \to \mathcal{E}$ for a domain element $x \in W$ has parameters $(p_1, \dots, p_n, k, e)$, where the $p_i$ are primes and $e \ge 0$ is an integer. The basic coding function is simply $C(x) = (x \bmod p_1, \dots, x \bmod p_n) \in \mathcal{C}$. Similar to the case of polynomials over finite fields, the famous Chinese Remainder Theorem [15] guarantees that $x$ can be recovered from any subset of $k$ coordinates of $C(x)$ along with their indices, so this code also has Hamming distance $n-k+1$, and error-correction may be done using a variety of methods taking advantage of the algebraic structure (e.g. [18]). Our encoding proceeds as above, by sorting the coordinates of the basic code and corrupting a small subset without changing the order. Nothing else is changed.
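The residue-based basic code and its inversion by the Chinese Remainder Theorem can be sketched as follows. The primes and the value of `x` are toy values chosen for illustration, and `residue_code` and `crt` are hypothetical helper names; the only requirement used is that the product of the chosen moduli exceeds `x`.

```python
def residue_code(x, primes):
    """Basic code: the residues of x modulo each prime, in index order."""
    return [x % q for q in primes]


def crt(residues, moduli):
    """Incremental Chinese remaindering for pairwise distinct prime moduli.

    Returns the unique integer in [0, product of moduli) with the given residues.
    """
    x, m = 0, 1
    for r, q in zip(residues, moduli):
        # choose t so that x + m * t is congruent to r modulo q
        t = ((r - x) * pow(m, q - 2, q)) % q  # inverse of m via Fermat's little theorem
        x += m * t
        m *= q
    return x


primes = [101, 103, 107, 109, 113, 127, 131]  # toy primes, n = 7
x = 123456789                                 # smaller than any product of 5 of them
code = residue_code(x, primes)
subset = [0, 2, 3, 5, 6]                      # any k = 5 indices suffice here
print(crt([code[i] for i in subset], [primes[i] for i in subset]) == x)
```

As with the polynomial code, recovery needs the indices: after sorting, an attacker no longer knows which modulus belongs to which residue.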

Despite this approach actually being simpler to implement than the polynomial-based approach, it is less desirable due to being more constrained as a function of the primes used. For example, for 80 , taking to be all the consecutive primes from 877 to 1,451 (having geometric mean 1,143) yields only

7. An appropriate would be 8, thus 16. The probability of a false positive is then 10 and the complexity of the direct attack is 10 (see Table 1 ). 

Increasing the security. It is relatively easy to increase the security of the system, i.e. making a direct attack on the system more difficult. In the scenario described above, where 10 , we took 100, 503, implying 8, thus the complexity of a direct attack is 10 . If we were to take instead 100 and 101, so that ⌈log 10 ⌉ 10, the complexity would increase to 10 (although we would have to take 1 and 2 to keep the probability of a false positive at 10

, and if this were not enough, we can increase this further by increasing both and . See Table 1 for a comparison of the attack complexity resulting from different values of the system parameters. Increasing obviously increases the (bit) size ⋅ log of the code and thus the size of the server database, but the same is true for the database of the "table" attack.

Using a deterministic mapping. Our encoding method is non-deterministic, namely involves randomly corrupting a subset of $e > 0$ coordinates in the sorted basic code vector. The advantage of a large $e$ is that it increases the difficulty of a direct attack on the database, as described in Section 6. However, for certain values of the other system parameters, it may be possible to make do with a deterministic encoding method, namely $e = 0$. In this case, matching a query vector within the server database reduces to exact vector match, which may be done easily by binary search on a table (of size $M$) of the database entries $(u, v)$, sorted in lexicographic order of $v$.
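In the deterministic case the lookup can be sketched with Python's standard `bisect` module. The table layout (code first, user ID second) and the function names are assumptions made for this illustration.

```python
import bisect


def build_table(entries):
    """Sort (user_id, code) entries lexicographically by code."""
    return sorted((tuple(code), user_id) for user_id, code in entries)


def exact_matches(table, query):
    """Return the user IDs whose stored code equals the query, via binary search."""
    key = tuple(query)
    # (key,) sorts before every (key, user_id), so this finds the first candidate
    i = bisect.bisect_left(table, (key,))
    users = []
    while i < len(table) and table[i][0] == key:
        users.append(table[i][1])
        i += 1
    return users


table = build_table([
    ("u1", [1, 4, 4, 9]),
    ("u2", [0, 2, 7, 8]),
    ("u3", [1, 4, 4, 9]),
])
print(exact_matches(table, [1, 4, 4, 9]))
```

Each query then costs a logarithmic number of vector comparisons plus the number of matching entries.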

Detecting proximity. The method outlined in this article provides an easy way to determine whether $x = y$ by comparing $E(x)$ and $E(y)$. Recall that $x$ and $y$ are taken from a discrete world, which is essentially a sampling of the true continuous world at some finite grid resolution. However, sometimes in contact tracing it is necessary to also determine proximity beyond the grid resolution, either because of an increased radius of infection or simply because the accuracy of the measured location (typically taken from a GPS device) is much worse than the grid resolution, so that the chances of an exact match in measured location, even when two users are within grid resolution, are very slim.

It would seem difficult to achieve this, since the encoded vectors have a pseudo-random distribution and any spatio-temporal correlation between two data points would be "lost in encoding". The easy way to circumvent this is for the user to transmit to the central server encodings of not just her current location, but also of the neighboring grid points, effectively "dilating" the data point. This would incur some overhead in storage and transmission costs on both client-side and server-side.

"Inflating" the world. The world size, in our contact tracing application, is $N = 10^{19}$ integers, which is very large, but constrains some of the parameters in our encoding scheme. In particular, the parameter $k$, if too small, could compromise the security against the direct attack, as described in Section 6. One way to rectify this would be to "inflate" the world by means of some function $F: W \to W'$ with $|W| \ll |W'| = N'$. This function should be injective and nonpolynomial, so that it cannot be inverted easily at each individual coordinate. One possibility for such an $F$ is the following:

Let $p_i$ denote the $i$-th prime (i.e., $p_1 = 2$, $p_2 = 3$, etc.) and observe that the product of the first 16 primes is a little larger than the size of our world. Hence, the first step is to map $x \in W$ to the residue code vector w.r.t. these 16 primes, namely compute $r = (r_1, \dots, r_{16})$ with $r_i = x \bmod p_i$. For the next step, let $s_i = \sum_{j=1}^{i-1} p_j$ denote the sum of the first $i-1$ primes (i.e., $s_1 = 0$, $s_2 = 2$, $s_3 = 5$, etc.) and let us map each $r_i$ to the $(s_i + r_i + 1)$-th prime, giving the vector $r' = (r'_1, \dots, r'_{16})$ with $r'_i = p_{s_i + r_i + 1}$. Finally, we define $F(x) = \prod_{i=1}^{16} r'_i$ and note that $F(x)$ is a square-free integer with exactly 16 prime factors. Moreover, as the mapping $x \mapsto r$ is injective, it follows that $F(x)$ and $F(y)$ for $x \ne y$ have at most 15 common factors, thus guaranteeing the injectivity of $F$. The size of the inflated world is $N' = \prod_{i=1}^{16} p_{s_{i+1}} \approx 10^{39}$. We now continue to encode $x' \in W'$ instead of $x \in W$ with the polynomial-based approach outlined above, but now having the advantage of a larger $k' = 15$ instead of the previous $k = 8$.
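The inflation map can be sketched directly from this description. The sketch assumes the reading given here (each residue modulo the i-th prime selects one prime from its own disjoint block of consecutive primes); `first_primes` and `inflate` are illustrative names, and primes are generated by simple trial division since only a few hundred are needed.

```python
def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % q for q in primes if q * q <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


BASE = first_primes(16)          # moduli of the residue code
POOL = first_primes(sum(BASE))   # one block of p_i consecutive primes per modulus


def inflate(x):
    """Map x to a square-free product of 16 primes, one taken from each block."""
    result, offset = 1, 0
    for q in BASE:
        # 0-based index offset + (x mod q) is the (s_i + r_i + 1)-th prime
        result *= POOL[offset + x % q]
        offset += q
    return result


print(inflate(0), inflate(1))
```

Since the blocks are disjoint, two inputs with different residue vectors always differ in at least one prime factor, which is what makes the map injective on the original world.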

Other linear codes. The basic code based on polynomials that we use is a linear code, in the sense that the coding operation is just multiplication by a matrix: $C(x) = V \cdot x$ over $\mathbb{Z}_p$, where $x$ is the vector of base-$p$ digits. $V$ is the $n \times k$ Vandermonde matrix, which has the special property that all submatrices of size $k \times k$ have full rank. This property allows us to recover $x$ from any subset of $k$ coordinates of $C(x)$ by multiplying them by the inverse of the appropriate submatrix of $V$. Thus any $n \times k$ matrix with similar properties would serve the same purpose. Furthermore, were we to construct an $n \times k$ matrix with the property that some of the submatrices of size $k \times k$ have rank less than $k$, and that full rank is obtainable only when the submatrix is enlarged to $k' \times k$ for some $k' > k$, this, coupled with the corruption of $e$ coordinates during encoding, could further complicate the direct attack on the method described in Section 6.

Assessing disease exposure risk with location data: A proposal for cryptographic preservation of privacy

A note on blind contact tracing at scale with appli-cations to the COVID-19 pandemic

Polynomial codes over certain finite fields

How to share a secret

Polynomial-based modifiable blockchain structure for removing fraud transactions

A new algorithm for decoding Reed-Solomon codes" in Communications, Information and Network Security

Detecting near duplicates for web crawling

Large scale Hamming distance query processing

The residue number system

Self-checked computation using residue arithmetic

Error correcting properties of redundant residue number systems

How to share a secret

A modular approach to key safeguarding

Chinese remaindering with errors

Towards privacy preserving contact tracing

A survey of automatic contact tracing approaches