key: cord-0119567-7uxv2dfq
authors: Helminger, Lukas; Kales, Daniel; Rechberger, Christian; Walch, Roman
title: Aggregated Private Information Retrieval: A First Practical Implementation to Support Large-Scale Disease Analytics
date: 2020-05-05
journal: nan
DOI: nan
sha: 83fec1e348f33f23efdc912f565e5f444724aeab
doc_id: 119567
cord_uid: 7uxv2dfq

With the outbreak of the coronavirus, governments rely more and more on location data shared by European mobile network operators to monitor the advancements of the disease. In order to comply with often strict privacy requirements, this location data, however, has to be anonymized, limiting its usefulness for making statements about a filtered part of the population, like already infected people. In this research, we aim to assist with the disease tracking efforts by designing a protocol to detect coronavirus hotspots from mobile data while still maintaining compliance with privacy expectations. We use various state-of-the-art privacy-preserving cryptographic primitives to design a protocol that can best be described as aggregated private information retrieval (APIR). Our protocol is based on homomorphic encryption, with additional measures to protect against malicious requests from clients. We have implemented our APIR protocol in the SEAL library and tested it for parameters suitable to create a coronavirus hotspot map for entire nationstates. This demonstrates that it is feasible to apply our APIR protocol to support nationwide disease analysis while still preserve the privacy of infected people.

Due to the ongoing global threat of the SARS-CoV-2 virus, a number of technological approaches are currently developed to help reduce its spread and impact. A lot of focus is on automatic contact tracing, challenges include privacyfriendliness, scalability and utility, efforts include [22, 7, 15, 6, 10, 16, 50, 1, 31, 45, 51] .

These approaches crucially rely on sizable parts of the population using smartphones, enabling Bluetooth, and installing a new App on their phones. In contrast, our proposal does not help with contact tracing, but gives potentially useful epidemiological information to health authorities without requiring people to carry around smartphones, as any mobile phone will be sufficient, and Bluetooth is also not needed.

Mobile network operators increasingly share location data to manage the coronavirus. Example: Governments in Italy, Germany, and Austria are relying on this metadata to monitor how people are complying with stay-at-home orders 3 .

In addition, the EU Commission urged Europe's biggest mobile network operators to hand over location data in the coronavirus fight 4 . In order to respect Europe's strict privacy requirements, mobile network operators only share anonymous, aggregated information that they have already gathered. Other types of such aggregated data sharing are e.g. provided by Google 5 .

Although this information is useful, it has its limitations. After the process of anonymization and aggregation, it is only possible to make statements about the whole population. For example, the shared data can not be used to question whether there is a difference in behavior between people who got infected with Sars-Cov-2 and those who did not. Consequently, it is impossible with this data to detect public places where there is a significantly higher risk of getting infected with Sars-CoV-2 (hotspots). This, in turn, could be valuable knowledge in the further containment of Sars-CoV-2. The question is if there is a way to find Sars-CoV-2 hotspots without undermining privacy requirements. Landau in [39] independently also motivates increased efforts in aggregated location tracking as contact tracing may not work as well as hoped.

This research aims to construct a protocol for detecting coronavirus hotspots from mobile data in accordance with strict privacy requirements. To accomplish this, we design a specialized private information retrieval (PIR) protocol.

PIR protocols, as introduced by Chor et al. [20] , aim to realize efficient methods for a client to retrieve elements from a database managed by one or many untrusted servers. Thereby, none of the involved servers should learn which element was retrieved by the client. In this paper, we are interested in the single-server variants, namely computational PIR (CPIR), which rely on cryptographic hardness assumptions to hide the query from the server. Recent work [43, 14, 26, 38, 40, 41, 52, 3] has heavily improved on the original ideas of Chor et al. Many CPIR implementations use homomorphic encryption (HE) to hide the queries from the server while still allowing him to perform operations on the query.

HE is a cryptographic primitive that allows performing computations on encrypted data without knowing the secret decryption key. The first fully homomorphic encryption (FHE) scheme, which in theory allows computing all possible functions on encrypted data, was introduced by Gentry [29] in 2009 and has been drastically improved upon by followup publications [12, 11, 28, 18, 19] . However, the limitation of FHE schemes is that they add a substantial perfor-3 https://www.reuters.com/article/us-health-coronavirus-europe-telecom s/european-mobile-operators-share-data-for-coronavirus-fight-idUSKBN 2152C2 4 https://www.sciencemag.org/news/2020/03/cellphone-tracking-could-hel p-stem-spread-coronavirus-privacy-price 5 https://www.google.com/covid19/mobility/ mance penalty of several orders of magnitude compared to performing equivalent computations in plain. Nevertheless, recent publications have shown, that highly optimized protocols can leverage homomorphic encryption to increase privacy of complex tasks, like machine learning [23, 35, 21, 9, 8, 5, 37, 36] .

However, PIR protocols, by definition, do not hide individual database entries from clients. In this paper, we want to go one step further by using HE to additionally aggregate multiple database entries to keep the server data private as well.

We present a protocol for aggregated private information retrieval (APIR). In the use-case we will describe later, APIR would enable officials to get anonymous, aggregated information about movement records of Sars-Cov-2 patients. In the following, we first outline the main idea of our APIR protocol.

We assume without loss of generality that the first column of the database of the server consists of unique identifiers. The client's input to the protocol is a list of such ids, and the output is a vector representing the aggregated columns of the server's database with respect to the client's identifiers.

The threat model of the APIR protocol is similar to PIR protocols, i.e., the server should not know which elements were retrieved by the client. Additionally, in an APIR protocol, the client should also not be able to retrieve individual entries of the database, only receiving aggregated values is allowed. Note that the identifiers of the server are not considered to be private.

To achieve the privacy goals outlined above, we make heavy use of homomorphic encryption. In particular, the client homomorphically encrypts its identifiers before sending it to the server. Due to the nature of homomorphic encryption, the server can still perform the data aggregation. To prevent the client from learning individual entries, we make sure that the client's list of identifiers has a guaranteed minimum cardinality and that each identifier is unique. In addition, we apply differential privacy before releasing the output.

We report concrete benchmarks of our protocol for parameters suitable to create a Sars-Cov-2 hotspot map for small nations with millions of subscribers and tens of 1000s of base stations. In our concrete example this corresponds to an APIR protocol applied on a database consisting of ≈ 2 38 32-bit entries. We report multithreaded runtimes of 30 minutes for the standard APIR protocol, and 1 hour when extra steps are applied to ensure the input vector is not malicious. Our benchmarks show that it is feasible to use our APIR protocol on relatively large databases to support disease analysis of small nation states or large cities while still maintaining the privacy of infected people. We also show that costs for doing the same for larger nation states can still be expected to be very modest.

After introducing preliminaries in Section 2, we describe our APIR protocol more formally in Section 3. In Section 4, we illustrate an application of our APIR protocol, which can have a significant impact on the understanding where Sars-Cov-2 infection hotspots can be found. Afterward, in Section 5, we will have a look at the details of our implementation as well as the benchmarks.

In this section, we cover the preliminaries required for the rest of the paper. We will first introduce the notations we use in the rest of the paper, before we describe homomorphic encryption, private information retrieval and differential privacy.

We follow the widespread convention to write vectors in bold lower case letters and matrices in upper case letters. We use x i to access the i-th element of vector x. For m ∈ N and x ∈ Z, let x m be defined as the vector of powers of x: x m = (x, x 1 , ..., x m ). We denote by c • d the element-wise multiplication (Hadamard product) of the vectors c and d. For a positive integer p, we identify Z p = Z ∩ [−p/2, p/2).

The concept of homomorphic encryption (HE) has often been considered to be the holy grail in cryptography since it allows us to work on encrypted data without requiring the secret decryption key. It was first introduced by Rivest et al. [47] and partially HE schemes, i.e. schemes which allow performing a limited set of operations on encrypted data, have been known for years: The RSA [48] encryption scheme is homomorphic for multiplication and Paillier's cryptosystem [44] is homomorphic for addition. However, it was not until Gentry's groundbreaking work from 2009 [29] that we were able to construct the first fully homomorphic encryption (FHE) scheme, a scheme which in theory can evaluate an arbitrary circuit on encrypted data. His construction is based on ideal lattices and is deemed to be too impractical ever to be used, but it led the way to construct more efficient schemes in many following publications [12, 11, 28, 18, 19] .

Modern HE schemes are based on the learning with errors (LWE) [46] hardness assumption, and its variant over polynomial rings, the ring learning with error (RLWE) [42] hardness assumption. In the following, we recall one variant of the RLWE hardness assumption, the Decision-RLWE hardness assumption.

Definition 1 (Decision-RLWE). For security parameter λ, let f (x) be a cyclotomic polynomial Φ m (x) with deg(f ) = ϕ(m) depending on λ and set R = Z[x]/(f (x)). Let q = q(λ) ≥ 2 be an integer. For a random element s ∈ R q and a distribution χ = χ(λ) over R, denote with A (q) s,χ the distribution obtained by choosing a uniformly random element a ← R q and a noise term e ← χ and outputting (a, [a · s + e] q ). The Decision-RLWE d , q, χ problem is to distinguish between the distribution A (q) s,χ and the uniform distribution U (R 2 q ). The Decision-RLWE d , q, χ assumption is that the Decision-RLWE d , q, χ problem is infeasible.

During the encryption of a plaintext in RLWE based schemes, random noise is introduced into the ciphertext. This noise grows with the evaluation of homomorphic operations, negligible for addition, but significantly for homomorphic multiplication. Once this noise becomes too large and exceeds a threshold, the ciphertext cannot be decrypted correctly anymore. We call such a scheme a somewhat homomorphic encryption scheme (SHE), a scheme that allows evaluating an arbitrary circuit over encrypted data up to a certain depth. The specific depth depends on the choice of encryption parameters, and choosing parameters for larger depths comes, in general, with a considerable performance penalty.

In his work [29] , Gentry introduced the novel bootstrapping technique, a procedure that reduces the noise in a ciphertext and can turn a (bootstrappable) SHE scheme into an FHE scheme. However, this bootstrapping operation comes with high computational complexity. In many practical applications it is, therefore, faster to omit bootstrapping and choose a SHE scheme with large enough parameters to evaluate the desired circuit. In this work, we use the BFV [11, 28] SHE scheme to homomorphically encrypt the inputs of our protocol.

We rely on HE instead of other privacy-preserving protocols, such as secure multi-party computation (MPC), due to several considerations:

-Homomorphic ciphertext-ciphertext multiplications are very costly in HE schemes, however, in our protocol we mainly rely on the cheaper plaintextciphertext multiplications. Therefore, all the operations involved in our protocol can be expressed relatively cheap using HE. -HE has the advantage of outsourcing computations. After the client sends the encrypted data to the server, the server can do the computations without further data exchange with the client. MPC protocols, in contrary, have a higher number of communication rounds and all parties have to participate in the computations. -MPC protocols usually have a higher communicational complexity, i.e., the amount of exchanged data is much higher. Especially over WAN networks, the communication alone can introduce significant delays.

In 1995, Chor et al. [20] introduced the notion of private information retrieval (PIR), server-client protocols which allow clients to hide queries to a database managed by one or many server, but still receive the correct elements. Thereby, the goal is to be more efficient than the straight-forward solution of letting the client download the whole database. In the literature we distinguish between two different types of PIR: information theoretic PIR (IT-PIR) [20, 4, 24, 25, 30] protocols which rely on multiple, non-colluding servers to ensure privacy; and computational PIR (CPIR) [43, 14, 26, 38, 40, 41, 52, 3] where a single server manages the database and encryption is used hide the query. Our approach is similar to the single-server CPIR variant, in which bestperforming CPIR protocols [43, 3] use homomorphic encryption to ensure that the server is still able to compute on the query to provide the final result. One major issue with CPIR protocols is that the server must process all entries of the database to answer a single query. If this would not be the case, the server must somehow have learned, which entries the client is not interested in, and thus, leaked information about the query. The computation cost of a server is, therefore, linear in the database size, which limits the practicability of CPIR to small databases. CPIR using homomorphic encryption boils down to essentially evaluate a vast vector-matrix multiplication on a one-hot encoded input vector. State-of-the-art CPIR implementation employs many optimizations to make retrieving single elements very feasible for limited database sizes. In this work, however, we are not interested in retrieving only one item; in fact, we want to ensure that single database entries are kept hidden from the client as well. We, therefore, employ homomorphic encryption to retrieve multiple aggregations of many database entries at once and invalidate responses for malicious queries. Furthermore, we want to implement use cases involving a huge database consisting of ≈ 2 38 32-bit entries.

Another thing to consider when designing a privacy-preserving data analytics protocol is that the result which the client obtains in plain can still leak information about the underlying dataset. In our case, an aggregated PIR protocol, the data aggregation can leak information about single database entries. We can use the well-established notion of differential privacy to help protect against such kind of information leakage. Let us recall the definition of -differential privacy [27] :

Lemma 1 ( -Differential Privacy). A randomized mechanism A givesdifferential privacy if for any neighboring datasets D and D , and any S ∈

Since D and D are interchangeable, Lemma 1 implies:

An established technique to achieve -differential privacy is the Laplace mechanism, i.e., to add Laplacian noise to the final result of the computation. The noise is, thereby, calibrated with the privacy budget and the global sensitivity ∆q of the computation q: ∆q = max D,D ||q(D) − q(D )|| for all neighboring D and D . In other words, the global sensitivity represents the maximum possible value of each element in the dataset. The Laplace distribution for a scale factor b is given as Lap(x|b) = 1 2b e − |x| b , in the Laplace mechanism a scale factor of b = ∆q is used.

We can apply differential privacy to our APIR protocol by adding Laplacian noise to each element of the final result of the matrix multiplication v T · Z:

We note that in general, each column of the matrix multiplication can represent a different dataset with a different sensitivity ∆q i . However, in the use case we describe in our work, each ∆q i is equal.

In Section 3.1, we give a high-level description of our protocol below. A formal definition of the protocol is provided in Figure 1 . In addition, we illustrate our protocol in Figure 2 including an interactive proof to protect against malicious queries described later in this section. Then we discuss in Section 3.5 the concrete privacy and a generalization of our APIR protocol in Section 3.6.

We use a HE scheme HE = (HE.KGen, HE.Enc, HE.Dec, HE.Eval). Before executing the protocol, the client runs KGen to obtain an evaluation key evk and a private key sk. We assume that the server knows evk before running the protocol.

First, observe that according to our assumptions, the server is allowed to share the identifiers with the client. We can exploit this fact by server not only sending the identifiers but also the indices of the identifiers in the database. In this case, the client not only gets to know which identifiers it has in common with the server but also their exact location in the database of server. This information is stored in the vector v = v 1 · · · v N . Sending v to the server would stand in contradiction to our threat model because the server would directly learn the identifiers of the client. Therefore, the client homomorphically encrypts the vector v before sending it to the server. The server can not learn anything from the encrypted vector provided that the HE scheme is semantically secure.

Remark. In some applications the list of identifiers in the server database might also be considered private information. In such cases, one can augment this step by using a variant of Private Set Intersection (PSI) with associated data transfer. In particular, the labeled PSI protocol in [17] provides the required features while conveniently being based on Fully Homomorphic Encryption.

Public inputs: Client and server agree on parameters N, n, W, s ∈ N, ∈ R>0. Private inputs: The client has input (x1, . . . , xν ) ∈ {0, 1} n×ν with ν ∈ N, and keys (evk, sk) for a homomorphic encryption scheme. The server has input (y1, . . . , yN ) ∈ {0, 1} n×N , Z = (zij) ∈ Z N ×k with k ∈ N and knows evk. 

and sends µ = µ1 · · · µs 0 · · · 0 T ∈ Z k p to the client Else:

Server chooses r $ ← Z k p , and computes

5. If interactive == TRUE: Client decrypts ρ ← HE.Decsk(µ), and sends ρ to the server. 6. If interactive == TRUE:

If ρ = ρ = ρ1 · · · ρs 0 · · · 0 T server aborts, otherwise server sets µ ← 0. As mentioned before, we want to ensure that the client does not get the values of individual rows. Therefore, the server checks that the hamming weight of c is not too low. Depending on the specific use of the protocol, the threshold W can vary. This precaution alone is not enough against a malicious client who could also send an arbitrary non-binary vector. As an example, by sending a vector of all zeros except one coordinate equal to the minimum hamming weight W , the client would still learn the values of this particular row. To prevent such abuse, the server also checks that the vector c is binary. Only if both conditions are satisfied, the following data aggregation is valid. Technically, we developed two variants of how the server can check the correctness of the query vector v.

One can detect a malicious client before doing the costly matrix multiplication by introducing one additional round of communication, the other invalidates the result of our APIR protocol for malicious queries. For details see Section 3.3 and Section 3.4.

The data aggregation is done by computing a large matrix multiplication. The vector c contains the information, which rows should part of the sum. Since c is encrypted, according to the security guarantees of the HE scheme it does not reveal any information to the server, however, by using Eval and evk, the server can still perform the matrix multiplication. We add a small random noise to the result of the homomorphic matrix multiplication for privacy reasons (see Section 2.4). By decrypting h * with its private key sk, the client gets the final output of the protocol.

One assumption of the protocol is that it requires the input vector to be binary. If this is not the case, the client can arbitrarily modify the contribution of a single person to the overall aggregated result, which can leak private information. Since the server only receives an encryption of the input vector, simply checking for binary values is not an option. However, we can use similar techniques to the ones used in Bulletproofs [13] to provide assurance that the query vector c is made up of binary elements, i.e., ∀i : c i ∈ {0, 1}.

Lemma 2. Let c ∈ Z N be the query vector and let d = c − 1. If the following condition holds, c is a binary vector c ∈ Z N 2 :

Note in our scenario, d can be computed by the server. However, the server needs to compare a homomorphically encrypted vector to zero to check Lemma 2, which is not trivially possible. Therefore, we want to encode this comparison as a mask µ bin ∈ Z k p , which will either be zero if c ∈ Z N 2 , random otherwise. Adding µ bin to the response of the server will invalidate the result for malicious input vectors, but do not affect correct queries.

The result of c • d can be aggregated into a single value by calculating the inner product c, d , which will again be zero if c ∈ Z N 2 . The server then adds a random value y $ ← Z p to reduce the probability for the client to cheat by letting several entries of c cancel each other out during the inner product. A final randomization with a vector r $ ← Z k p , produces the final mask:

For the generic case of a vector b and a randomly chosen y, b, y N = 0 will hold for a b = 0 only with probability N/p [13] . Using a 60 bit prime, i.e. p ≈ 2 60 , for N = 2 23 , this probability is P r ≈ 2 23 2 60 = 2 −37 , i.e. a soundness error of 37 bits. 

For i = 1 to N : For i = 1 to N :

For i = 1 to N :

For j = 1 to k:

For j = 1 to k: For j = 1 to k:

Output h1 h2 · · · hk Fig. 2 : Illustration of our APIR protocol with interactive proof for an honest query vector.

Another problem of our protocol is that the client can target the values of single people by querying the server with an input vector of hamming weight W = 1. Again, since the query is encrypted, the server can not trivially check the hamming weight of the input vector. However, we can apply similar techniques as in the previous section to incorporate a hamming weight check into a masking value. Again let c be the query vector and let W be its announced hamming weight. On the server-side, calculate c, 1 N , which is equal to the hamming weight of c. Therefore, with a random vector r $ ← Z k p , the following mask µ HW ∈ Z k p is zero if c has the announced hamming weight W , random otherwise:

We then combine the two separate proofs in equation 3 and equation 4 into a combined masking value µ, adding a random value z $ ← Z p to avoid the separate masks canceling each other out:

µ is now equal to 0 k if c is a binary vector with hamming weight W , random otherwise. A vector c not fulfilling these conditions will result in a masking value of 0 k only with probability N +1 p .

The masks calculated in the previous section is finally added to the result of the matrix multiplication to invalidate the result for malicious queries. However, one could also perform an additional round of interaction between the client and the server as follows: Calculate a challenge value like the previous mask, however, instead of using a random vector r $ ← Z k p , we multiply the mask with a random scalar r $ ← Z p and also add a final random value ρ $ ← Z p :

The server then sends this ciphertext as a challenge to the client, who decrypts it and sends back the result. Only if the sent value is equal to ρ, the matrix multiplication is carried out and the result is sent back to the client. This approach has multiple advantages: First, at the cost of one more round of interaction and the associated communication, the server can detect malicious input vectors before carrying out the expensive matrix multiplication. Furthermore, this approach can also be repeated in parallel to boost the soundness.

With a soundness error of N +1 p , repeating the challenge t times reduces the probability to cheat to N +1 p t . Therefore, we require at least s challenges in parallel to achieve a soundness error of at least the security level λ (in bits), with s being:

For the privacy of APIR, the selection of parameters is of outermost importance. By using the APIR protocol in an application, the parameters n, k, and N get fixed by the specific data sets, whereas the parameters and W can be freely chosen. The concrete privacy -in particular for the items in the client's database -is highly dependent on those two parameters. An anonymization expert should choose them with knowledge of the specific data sets and the actual application.

The aim here is that the parameters are selected in a way that, from the output, no re-identification is possible but correctness and usefulness of the aggregated results is maintained.

The data aggregation of our APIR protocol is presented in a straightforward fashion. We use the simplest form of aggregations; namely, just adding all the information requested from the client, which corresponds to a single big matrix multiplication. Depending on the use-case, one could think of more involved methods. The computation on the server-side can consist of any combination of (matrix) multiplications, (matrix) additions as well as rotations of single rows or columns. In this light, APIR can be seen not only as a protocol but rather as a useful framework for privacy-preserving data analytics.

Our APIR protocol can help to learn more about the behavioral patterns of people tested positive with Sars-CoV-2. Note that the whole objective of our approach is to gather this information without violating the privacy of patients or enabling the tracking of individuals. In this section, we describe how the generic protocol in Figure 1 can be used to create a private heat map of Sars-CoV-2 positive people at the time of their infection. Such a heat map could lead to a better understanding of Sars-CoV-2 hotspots, i.e., places where there is a significantly higher risk of getting infected with Sars-CoV-2. This, in turn, could be valuable knowledge in the further containment of Sars-CoV-2.

The parties involved in creating a Sars-CoV-2 heat map are a mobile network operator (M ) and a health authority (H). An illustration of this application can be found in Figure 3 . M takes the role of the server, where the identifiers are mobile phone numbers. Let N be the number of all mobile phone numbers registered by M and n the length to represent a mobile phone number. The matrix Z contains location data. More precisely, each row is the location data for one particular mobile phone number. For simplification, we assume that the location data is represented as follows. The value z ij is the amount of time spent by person i in the cell site j. Therefore, k is the number of cell towers operated by M . On the other side, the role of the client is taken by H. The input of H is a list of ν mobile phone numbers of Sars-CoV-2 patients. The output of the protocol is a vector h 1 · · · h k , where h j is the total amount of time spent by all the Sars-CoV-2 patients in the cell site j.

Note that, as we have already mentioned in Section 3.6, our APIR protocol should be seen more as a framework. This means that with little adaptation, our APIR protocol can be used for many different analytical methods. In particular, we can exchange the data aggregation (matrix multiplication) by different computations to support a larger class of epidemiologically useful statistics. The heat map is only one concrete instantiation of this framework.

We now want to have a look at the privacy features of the application inherited from them generic protocol. Recall, the generic protocol has two checks to ensure that H does not learn the values of any row, i.e., H should not learn the movement pattern of any individual. More concretely, H is not allowed to query the location data for less than W different people. W has to be chosen in such a way that the data aggregation provides anonymity and its exact value will highly depend on the actual underlying data, which is the reason why we do not give a generic value in this paper. We urge users of our protocol to rely on an expert in the field of de-anonymization with sufficient knowledge of location data to choose a suitable value. To sum up, by using our protocol correctly, highly sensitive personal data is protected. In particular, M does not learn which people are infected by Sars-CoV-2. On the other side, H does not get the movement record of individual patients. 

Preprocessing. A preprocessing phase on the side of the mobile network provider could increase the value of the heat map as well as provide even more privacy. First, M could try to find the place of residence by assuming people sleep at home. Therefore, they can look at the location of people at night and remove those places for the heat map creation. Once M has completed this step, the heat map will additionally be more informative in the sense of H getting a better picture, which public places might bear a high risk of getting infected with Sars-CoV-2. Secondly, to further enhance the privacy of patients, M should filter out isolated profiles. Such filtering of the data obviously does not affect the meaningfulness of the heat map, since crowded public places would not be affected. Note that both preprocessing steps are done on the side of the mobile network operator. That means those steps can be done on non-encrypted data and therefore are negligible when it comes to the runtime of our protocol.

Applying Differential Privacy. An attack vector for H in our protocol can be to abuse the heatmap to track individual people by querying for isolated profiles. For example, they want to query for a person living in the westmost area of the country. To overcome the hamming weight check on the server-side, they additionally query for W − 1 people living on the east side of the country. The location data of the targeted person is clearly shown on the resulting heatmap; H just has to look at the western part of the result. Applying differential privacy with suitable parameters will protect against such an attack since the overall goal of differential privacy is to decrease the statistical dependence of the final result to a single database entry as much as possible. In our use-case, therefore, differential privacy achieves that it is infeasible to distinguish between heatmaps, in which we include a single person in the accumulation and heatmaps in which we do not.

A general issue of differential privacy is that the privacy guarantees get weaker the more queries the client is allowed to do. This, however, is no issue in our use case, since the location data changes every couple of days, and we can limit H to query a heatmap to, e.g., once a day.

Choosing proper parameters, however, highly depends on the underlying dataset. On the one hand, the chosen should be small enough to satisfy privacy concerns; on the other hand, it should be big enough not to overflow the result with noise. In our protocol, accumulations of a sufficient amount of people should not be affected by the noise, i.e., the noise on its own should not be able to create hotspots. Again, we urge users of our protocol to rely on an expert in the field of de-anonymization with sufficient knowledge of location data to choose suitable values for the application of differential privacy.

is an actively developed open-source library maintained by Microsoft Research compatible with all major operating systems, including Windows, Linux, and OS X.

The computationally most expensive phase in the protocol is the Data Aggregation phase, in which the server multiplies a huge matrix to a homomorphically encrypted input vector. Therefore, the main objective of our implementation is to perform this huge matrix multiplication as efficiently as possible.

Modern HE schemes allow for packing a vector of n plaintexts into only one ciphertext. Performing an operation on this ciphertext then is implicitly applied to each slot of the encrypted vector, similar to single-instruction-multiple-data (SIMD) instructions on modern CPU's (AVX, SSE2, etc.). However, the size of the ciphertext does not depend on the exact number (≤ n) of plaintexts encoded. The HE schemes support a variety of SIMD operations, including slotwise addition, subtraction and multiplication, and slot-rotation. However, one can not directly access a specific slot of the encoded vector. We can use the SIMD encoding to speed up the matrix multiplication of our protocol significantly.

In the BFV scheme (and its implementation in the SEAL library), the number of available SIMD slots is equal to the degree of the cyclotomic reduction polynomial (x n + 1); thus, it is always a power of two. In the ciphertexts, the n slots are arranged as matrix of dimensions (2 × n/2). A ciphertext rotation affects either all rows, or all columns of the matrix simultaneously. Therefore, we can think of the inner matrix as two rotatable vectors, which can be swapped. Figure 4 highlights the inner structure of the ciphertext and the rotation operations.

The SIMD encoding can be used to efficiently speed up matrix multiplication by using the diagonal method introduced by Halevi and Shoup in [32] . They have shown that a matrix-vector multiplication of a matrix Z ∈ Z m×m and vector v ∈ Z m can be expressed by m elementwise vector-vector multiplications, m − 1 rotations, and m − 1 additions, operations that can easily be evaluated in an HE scheme: · · · n − 1 n by applying the baby-step giant-step algorithm [33, 34] :

where m = m 1 ·m 2 and diag (Z, i) = rot (diag(Z, i), − i/m 1 · m 1 ). 6 Note, that rot(v, j) only has to be computed once for each j < m 1 , therefore, equation 9 only requires m 1 + m 2 − 2 rotations of the vector v in total.

Trivially, we can use the following equation to implement a v T · Z multiplication, like we use in our protocol:

In our protocol we want to homomorphically evaluate v T · Z, where v ∈ {0, 1} N and Z ∈ Z N ×k p , for big parameters N and k. As described in Section 5.1, the inner structure of the BFV ciphertext consists of two vectors of size n/2 each, and it does not allow a cyclic rotation over the whole input vector of size n. However, a rotation over the whole input vector is required by the baby-step giant-step algorithm. Therefore, we only can perform a baby-step giant-step multiplication with a (n/2 × n/2) matrix using this packing. Fortunately, we can use the remaining n/2 slots (i.e., the second vector in the inner structure of the BFV ciphertext) to perform a second (n/2 × n/2) matrix multiplication simultaneously. Therefore, after a homomorphic baby-step giant-step matrix multiplication, the result is a ciphertext, where each of the two inner vectors encodes the result of a (1 × n/2) × (n/2 × n/2) vector-matrix multiplication. The sum of those two vectors can easily be obtained by rotating the columns of the ciphertext and adding it to the first result:

Thus, we can use one (n/2 × n/2) baby-step giant-step matrix multiplication and equation 11 to implement a homomorphic (1 × n) × (n × n/2) = (1 × n/2) vector-matrix multiplication. Taking this into account, we split the huge (N × k) matrix into n v · n o submatrices of size (n × n/2), with n v = N n and n o = 2k n , padding the submatrices with zeros if necessary. We split the input vector v into n v vectors of size n (padding the last vector with zeros if necessary) and encrypt each of these vectors to get n v ciphertexts c i . The final result of the v T · Z matrix multiplication can be computed with the following equation:

where, SubMat(M, j, i) returns the submatrix of Z with size (n×n/2), starting at row n·j and column n 2 ·i, and MatMul(Z, c) performs the homomorphic baby-step giant-step matrix multiplication Z · c followed by equation 11. Equation 12 produces n o ciphertextsc i , with the final results being located in the first n/2 slots of the ciphertexts. Overall, our algorithm to homomorphically calculate v T · Z requires n v · n o baby-step giant-step matrix multiplications and the total multiplicative depth is 1 plaintext-ciphertext multiplication.

To calculate the masking value (equation 5), or the challenge (equation 6), we need to calculate the inner product of two homomorphically encrypted ciphertexts c and d. After an initial multiplication c · d, the inner product requires log 2 (n/2) rotations and addition, followed by equation 11 to produce a ciphertext, where the result is encoded in each of the n slots.

Our implementation uses rejection sampling and the SHAKE128 algorithm to cryptographically secure sample all the required random values in Z p . The total multiplicative depth to homomorphically evaluate the final mask (equation 5 or equation 6) is 1 ciphertext-ciphertext multiplication and 2 plaintext-ciphertext multiplications.

If we want to boost soundness in the interactive version of the challenge mask, we can encode multiple values of different µ chal into the SIMD slots of one ciphertext, filling the remaining slots with zeros: µ = (µ chal,0 , . . . , µ chal,s−1 , 0, . . . , 0)

We can evaluate equation 13 by multiplying a vector r = (r i , 0, . . . , 0) instead of a scalar r in equation 6 to get µ chal,i and add rotations of those individual masks:

In BFV, one can choose three different parameters which greatly impact the runtime and the available noise budget (i.e. how much further noise can be introduced until decryption will fail):

-Plaintext modulus p: p defines the Ring Z p to which the homomorphic operations correspond to. Every result encoded in the ciphertext vector will be an element of Z p . Therefore, one has to make sure that p is big enough, such that no computation overflows. On the other hand, a big p has a bad impact on the ciphertext noise, where the noise cost of homomorphic operations is higher for bigger p. Additionally, the size of p will also affect the runtime of homomorphic operations. In general, SEAL allows arbitrary plaintext moduli t ≥ 2 ∈ Z; however, if we want to enable SIMD-packing (Section 5.1), then the plaintext modulus has to be a prime p and congruent to 1 (mod 2n). -Ciphertext modulus q: q defines the available noise budget. Therefore, a bigger q allows for a bigger depth in homomorphic operations. However, bigger q's have an adverse effect on the security of the encryption scheme. Additionally, q also influences the runtime of homomorphic operations; more specifically, the number of primes q is composed of. The more primes, the longer the computation times.

-Degree n of the reduction polynomial: In BFV in SEAL n is always a power of two and has a direct impact on the runtime of the scheme. A bigger n drastically increases the time a homomorphic operation needs for evaluation. On the other hand, a bigger n also increases the security of the scheme and, therefore, allows for a bigger ciphertext modulus q to increase the noise budget.

We test our implementation for a security level of λ = 128 bit and λ = 80 bit. We use the LWE estimator [2] by Albrecht et al. to find suitable BFV parameters which provide 80 bit security against known attacks; for 128 bit security SEAL already provides parameters for different reduction polynomial degrees n. See Appendix A for more details on the used parameters.

Multithreading. Since in our use cases N is much bigger than k, we implemented multithreading, such that the threads split the number of rows in the matrix (more specifically, the number of submatrices in the rows n v ) equally amongst all available threads. Therefore, each thread has to perform at most nv #threads · n o MatMul evaluations, which will be combined at the end by summing up the intermediate results. In case we want to add the mask to the result, an extra thread will perform the mask-evaluation in parallel to the matrix multiplication. In the other case, the case of the interactive challenge round, we leverage multithreading, such that we distribute the computation of the masks µ bin equally amongst all available threads, i.e., at most s #threads evaluations of µ bin per thread. In parallel, an additional thread performs the evaluation of µ HW before the results are combined with new random values z and r for each slot of the final mask.

Benchmark Platform. Our implementation 7 is compatible to Linux and Windows; however, we ran our benchmarks on a Linux cluster with two Intel Xeon E5-2699 v4 CPU's (total of 44 cores @ 2.2 GHz, 88 threads) and 512 GB RAM available.

Runtime. The runtime of our protocol is O(n v n o ), i.e., it scales linearly in the number of MatMul evaluations. This can be seen in Table 1 in which we summarize the runtime of the homomorphic matrix multiplication for different matrix dimensions using only one thread. For better comparability, we evaluate the different sizes with the same BFV parameter set. For real-world matrix dimensions, some added runtime has to be expected due to thread synchronization and the accumulation of the intermediate thread results. Real World Matrix Dimension. At the time of writing, approximately 8.9 million people live in Austria 8 and the total number of cell cites is 18389 9 (June 2019). In our benchmarks, we want to capture the Austrian use case and set the matrix dimensions to N = 2 23 and k = 2 15 . In Table 2 we list the runtime for a homomorphic (1 × 2 23 ) × (2 23 × 2 15 ) matrix multiplication, for different BFV parameters, using 88 threads. We also provide the total number of MatMul evaluations and the (maximum) number of evaluations per thread. Additionally, we give performance results for parameters also capable of evaluating the masking value. As Table 2 shows, a matrix multiplication takes approximately 1 hour for a 33 bit plaintext modulus with 128 bit security and 1.5 hour for the bigger 60 bit modulus. The noise budget for n = 8192 and a 60 bit modulus, however, is not sufficient to evaluate the masking value, which has a bigger multiplicative depth than the matrix multiplication. Increasing n leads to a performance drop, more specifically, the evaluation with a 60 bit plaintext modulus takes 4.5 hours.

Reducing the security level to λ = 80 bit allows us to use a smaller n for the evaluation of the mask with a 60 bit prime, and the matrix multiplication with a 33 bit plaintext modulus, splitting the respective runtimes in half. Unfortunately, n can not be reduced for the 33 bit prime with 80 bit security when masking is applied, increasing the runtime of the matrix multiplication compared to the 128 bit security case. This is due to the fact that in the 80 bit security case q is composed of more distinct primes q i . We recommend, therefore, to always use 128 bit security parameters for the 33 bit prime when the masking value has to be evaluated.

Data Transmission. In Table 3 , we list the sizes of all the data, which has to be transmitted between the server and the client. Each row corresponds to a different parameter set from Table 2 . The sizes were obtained by storing each of the described elements on the file system on the cluster. The table lists the size of the ciphertexts (ct), Galois keys (gk), and relinearization keys (rk). Galois keys are required to perform homomorphic rotations, each rotation index requires one Galois key, plus an additional key for rotating the columns. When using the babystep giant-step algorithm, we need a key for the index 1 to calculate rot(v, j), and a key for the indices k · m 1 , ∀0 < k < m 2 . Furthermore, when masking is applied, we need the keys for the power-of-2 indices to calculate the inner product of two ciphertexts. The relinearization key is required to linearize the result of a ciphertext-ciphertext multiplication. Since we only have to perform such a multiplication when we calculate the masking values, we can omit to send the relinearization key when the mask is not applied.

In addition to the values described in Table 3 , the client has to announce the used BFV parameters and the hamming weight of the input vector. These values have a combined size of less than 300 bytes.

As Table 3 shows, client-to-server communication is significantly more extensive than the response of the server. The main parts of the communication are the initial ciphertexts; however, especially when masking has to be applied, the Galois keys have a significant size as well. The plaintext modulus p has little to no effect on the number of bytes, contrary to the reduction polynomial degree n, which influences the communication cost significantly. The response of the server is very small in comparison to the ciphertexts he receives from the client. One reason for that is the small parameter k compared to N . The other reason is, that our implementation performs a so-called modulus-switch to level 0 after the computation, reducing the ciphertext modulus q to only one of the moduli q i it is composed of. Challenge Mask Compuation Time. In Table 4 , we list the runtime of the evaluation of the interactive challenge mask for different BFV parameters with a soundness error of at least λ bits. As Table 4 shows, the evaluation of the challenge mask is significantly faster than performing the matrix multiplication, allowing the server to detect and filter malicious inputs before engaging in the costly matrix evaluation. As we already stated, calculating the masking value for a 33 bit prime is slower with parameters for 80 bit security than with parameters for 128 bit security (given multithreading with enough threads) since q is composed of more primes q i and n stays the same for both parameter sets. We note that we require additional s Galois keys for performing the challenge mask creation, increasing the communicated bytes. However, the additional keys require a comparably small amount of data, e.g., less than 25 MiB for entry nr. 3 in Table 3 .

In this section, we want to give an estimate of the costs of deploying our system to create a corona heatmap for a larger country, more specifically, for Germany.

At the time of writing, about 83.1 million people live in Germany 10 , and a total of 74280 cell sites are deployed 11 . With the BFV parameters of entry Nr. 1 in Table 2 , i.e., n = 8192, this corresponds to a total number of of n v · n o = 192755 MatMul evaluations. To get 96 MatMul evaluations per thread, we would have to acquire 21 CPU's capable of handling 96 threads each. To get a estimate for doing such computations used current market prizes 12 the cost of one CPU capable of handling 96 threads is ∼ 1.5 $ per hour. Taking an additional overhead by handling so many threads and combining intermediate results, we estimate the cost of evaluating the homomorphic matrix multiplication for the German use case using AWS to ∼ 60 $. While noting that a trivial outsourcing of such computations is not part of our proposal, this estimate still shows that it is likely very feasible to create a heatmap once a day to gain valuable insight into the spread of the disease, even for larger countries.

Our protocol shows that concrete privacy-preserving data analytics is possible even on a national scale.

For an interested audience with little security and cryptography background, we created a webpage 13 that describes our approach, and basically has the following message: Even in times of crisis where it is tempting to (temporarily) lower data protection standards for purposes of big data analytics, there are technical methods to keep data protection standards high. And those technical methods are practical and available.

An actual roll-out needs to consider a number of other aspects that are out of scope for this article. This includes legal aspects, e.g. making sure there is a consent in using collected telephone numbers for such a purpose. There are also parameters of our system that need to be chosen in view of a particular dataset, potentially in coordination with data-protection authorities such as fixing the minimum Hamming weight of the query vector c and differential privacy parameters.

In this section we list the BFV parameters used in our implementation.

In our benchmarks, we use two different plaintext moduli, one with a size of 33 bits, the other with a size of 60 bits. Table 5 lists the used moduli. 

In this section we list all the ciphertext moduli used for different security levels λ and reduction polynomial degrees n. In SEAL the ciphertext modulus q is the product several primes q i : q = i q i . n = 4096, λ = 80: The ciphertext modulus q is composed of 3 primes with a total size of 162 bit, which we list in Table 6 . Table 6 : Primes composing the ciphertext modulus for n = 4096, λ = 80 in hexadecimal notation and their size in bits.

i qi log2(qi) 1 0x3ffffffffd6001 54 2 0x3ffffffffd2001 54 3 0x3ffffffffbe001 54 n = 8192, λ = 80: The ciphertext modulus q is composed of 7 primes with a total size of 329 bit, which we list in Table 7 .

n = 8192, λ = 128: The ciphertext modulus q is composed of 5 primes with a total size of 218 bit, which we list in Table 8 . n = 16384, λ = 128: The ciphertext modulus q is composed of 9 primes with a total size of 438 bit, which we list in Table 9 . 

Pandemic contact tracing apps: Dp-3t, pepp-pt ntk, and robert from a privacy perspective

On the concrete hardness of learning with errors

PIR with compressed queries and amortized query processing

Breaking the o(n1/(2k-1)) barrier for information-theoretic private information retrieval

Homomorphic training of 30, 000 logistic regression models

Assessing disease exposure risk with location data: A proposal for cryptographic preservation of privacy

Contra corona: Contact tracing against the coronavirus by bridging the centralized-decentralized divide for stronger privacy

ngraph-he2: A highthroughput framework for neural network inference on encrypted data

Fast homomorphic evaluation of deep discretized neural networks

Decentralized contact tracing using a dht and blind signatures

Fully homomorphic encryption without modulus switching from classical gapsvp

(leveled) fully homomorphic encryption without bootstrapping

Bulletproofs: Short proofs for confidential transactions and more

Computationally private information retrieval with polylogarithmic communication

Anonymous collocation discovery: Harnessing privacy to tame the coronavirus

Pact: Privacy sensitive protocols and mechanisms for mobile contact tracing

Labeled PSI from fully homomorphic encryption with malicious security

Homomorphic encryption for arithmetic of approximate numbers

Faster fully homomorphic encryption: Bootstrapping in less than 0.1 seconds

Private information retrieval

Faster cryptonets: Leveraging sparsity for real-world encrypted inference

Applicability of mobile contact tracing in fighting pandemic (covid-19): Issues, challenges and solutions. Cryptology ePrint Archive

CHET: an optimizing compiler for fully-homomorphic neural-network inferencing

RAID-PIR: practical multi-server PIR

Optimally robust private information retrieval

A fast single server private information retrieval protocol with low communication cost

Differential privacy. In: ICALP (2)

Somewhat practical fully homomorphic encryption. IACR Cryptology ePrint Archive

Fully homomorphic encryption using ideal lattices

Improving the robustness of private information retrieval

Apple and google's exposure notification system

Algorithms in helib

Bootstrapping for helib

Faster homomorphic linear transformations in helib

Deep neural networks classification over encrypted data

Secure outsourced matrix computation and application to neural networks

GAZELLE: A low latency framework for secure neural network inference

Optimal rate private information retrieval from homomorphic encryption

Looking beyond contact tracing to stop the spread

First CPIR protocol with data-dependent computation

A simpler rate-optimal CPIR protocol

On ideal lattices and learning with errors over rings

XPIR : Private information retrieval for everyone

Public-key cryptosystems based on composite degree residuosity classes

Hashomer -a proposal for a privacy-preserving bluetooth based contact tracing scheme for hamagen

On lattices, learning with errors, random linear codes, and cryptography

On data banks and privacy homomorphisms. Foundations of Secure Computation

A method for obtaining digital signatures and public-key cryptosystems

Decentralized privacy-preserving proximity tracing

Epione: Lightweight contact tracing with strong privacy

Single-database private information retrieval from fully homomorphic encryption

This work was supported by EU's Horizon 2020 project Safe-DEED, grant agreement n • 825225, and EU's Horizon 2020 project TRUSTS, grant agreement n • 871481, and by the "DDAI" COMET Module within the COMET -Competence Centers for Excellent Technologies Programme, funded by the Austrian Federal Ministry for Transport, Innovation and Technology (bmvit), the Austrian Federal Ministry for Digital and Economic Affairs (bmdw), the Austrian Research Promotion Agency (FFG), the province of Styria (SFG) and partners from industry and academia. The COMET Programme is managed by FFG.