key: cord-143847-vtwn5mmd
authors: Ryffel, Théo; Pointcheval, David; Bach, Francis
title: ARIANN: Low-Interaction Privacy-Preserving Deep Learning via Function Secret Sharing
date: 2020-06-08
journal: nan
DOI: nan
sha: 
doc_id: 143847
cord_uid: vtwn5mmd

We propose ARIANN, a low-interaction framework to perform private training and inference of standard deep neural networks on sensitive data. This framework implements semi-honest 2-party computation and leverages function secret sharing, a recent cryptographic protocol that only uses lightweight primitives to achieve an efficient online phase with a single message of the size of the inputs, for operations like comparison and multiplication which are building blocks of neural networks. Built on top of PyTorch, it offers a wide range of functions including ReLU, MaxPool and BatchNorm, and supports models like AlexNet or ResNet18. We report experimental results for inference and training over distant servers. Last, we propose an extension to support n-party private federated learning.

The massive improvements of cryptography techniques for secure computation over sensitive data [15, 13, 28] have spurred the development of the field of privacy-preserving machine learning [45, 1]. Privacy-preserving techniques have become practical for concrete use cases, thus encouraging public authorities to use them to protect citizens' data, for example in COVID-19 apps [27, 17, 38, 39].

However, tools are lacking to provide end-to-end solutions for institutions that have little expertise in cryptography while facing critical data privacy challenges. A striking example is hospitals which handle large amounts of data while having relatively constrained technical teams. Secure multiparty computation (SMPC) is a promising technique that can efficiently be integrated into machine learning workflows to ensure data and model privacy, while allowing multiple parties or institutions to participate in a joint project. In particular, SMPC provides intrinsic shared governance: because data are shared, none of the parties can decide alone to reconstruct it. This is particularly suited for collaborations between institutions willing to share ownership on a trained model.

Use case. The main use case driving our work is the collaboration between healthcare institutions such as hospitals or clinical research laboratories. Such collaboration involves a model owner and possibly several data owners like hospitals. As the model can be a sensitive asset (in terms of intellectual property, strategic asset or regulatory and privacy issues), standard federated learning [29, 7] that does not protect against model theft or model retro-engineering [24, 18] is not suitable.

to data centers, but are likely to remain online for long periods of time. Last, parties are honest-but-curious [20, Chapter 7.2.2] and care about their reputation. Hence, they have little incentive to deviate from the original protocol, but they will use any information available in their own interest.

Contributions. By leveraging function secret sharing (FSS) [9, 10], we propose the first low-interaction framework for private deep learning which drastically reduces communication to a single round for basic machine learning operations, and achieves the first private evaluation benchmark on ResNet18.

• We build on existing work on function secret sharing to design compact and efficient algorithms for comparison and multiplication, which are building blocks of neural networks. They are highly modular and can be assembled to build complex workflows.
• We show how these blocks can be used in machine learning to implement operations for secure evaluation and training of arbitrary models on private data, including MaxPool and BatchNorm. We achieve single round communication for comparison, convolutional or linear layers.
• Last, we provide an implementation and demonstrate its practicality both in LAN (local area network) and WAN settings by running secure training and inference on CIFAR-10 and Tiny Imagenet with models such as AlexNet [31] and ResNet18 [22].

Related work. Related work in privacy-preserving machine learning encompasses SMPC and homomorphic encryption (HE) techniques.

HE only needs a single round of interaction but does not efficiently support non-linearities. For example, nGraph-HE [5] and its extensions [4] build on the SEAL library [44] and provide a framework for secure evaluation that greatly improves on the seminal CryptoNets work [19], but they resort to polynomials (like the square) for activation functions.

SMPC frameworks usually provide faster implementations using lightweight cryptography. MiniONN and DeepSecure [34, 41] use optimized garbled circuits [50] that allow very few communication rounds, but they do not support training and alter the neural network structure to speed up execution.

Other frameworks such as ShareMind [6], SecureML [36], SecureNN [47] or more recently FALCON [48] rely on additive secret sharing and allow secure model evaluation and training. They use simpler and more efficient primitives, but require a large number of rounds of communication, such as 11 in [47] or $5 + \log_2(l)$ in [48] (typically 10 with l = 32) for ReLU. ABY [16], Chameleon [40] and more recently ABY3 [35] mix garbled circuits, additive or binary secret sharing based on what is most efficient for the operations considered. However, conversion between those can be expensive and they do not support training, except ABY3. Last, works like Gazelle [26] combine HE and SMPC to make the most of both, but conversion can also be costly.

Works on trusted execution environments are left out of the scope of this article as they require access to dedicated hardware [25]. Data owners who cannot afford these secure enclaves might be reluctant to use a cloud service and to send their data.

Notations. All values are encoded on n bits and live in $\mathbb{Z}_{2^n}$. Note that for a perfect comparison, y + α should not wrap around and become negative. Because y is in practice small compared to the n-bit encoding amplitude, the failure rate is less than one comparison in a million, as detailed in Appendix C.1.

Security model. We consider security against honest-but-curious adversaries, i.e., parties following the protocol but trying to infer as much information as possible about others' input or function share. This is a standard security model in many SMPC frameworks [6, 3, 40, 47] and is aligned with our main use case: parties that would not follow the protocol would face major backlash for their reputation if they got caught. The security of our protocols relies on indistinguishability of the function shares, which informally means that the shares received by each party are computationally indistinguishable from random strings. A formal definition of the security is given in [10] .

About malicious adversaries, i.e., parties who would not follow the protocol, as all the data available are random, they cannot get any information about the inputs of the other parties, including the parameters of the evaluated functions, unless the parties reconstruct some shared values. The later and the fewer values are reconstructed, the better it is. As mentioned by [11] , our protocols could be extended to guarantee security with abort against malicious adversaries using MAC authentication [15] , which means that the protocol would abort if parties deviated from it.

Our algorithms for private equality and comparison are built on top of the work of [10] , so the security assumptions are the same as in this article. However, our protocols achieve higher efficiency by specializing on the operations needed for neural network evaluation or training.

We start by describing private equality which is slightly simpler and gives useful hints about how comparison works. The equality test consists in comparing a public input x to a private value α.

Evaluating the input using the function keys can be viewed as walking a binary tree of depth n, where n is the number of bits of the input (typically 32). Among all the possible paths, the path from the root down to α is called the special path. Figure 1 illustrates this tree and provides a compact representation which is used by our protocol, where we do not detail branches for which all leaves are 0. Evaluation goes as follows: two evaluators are each given a function key which includes a distinct initial random state $(s, t) \in \{0,1\}^{\lambda} \times \{0,1\}$. Each evaluator starts from the root, at each step i goes down one node in the tree and updates its state depending on the bit x[i] using a common correction word $CW^{(i)} \in \{0,1\}^{2(\lambda+1)}$ from the function key. At the end of the computation, each evaluator outputs t. As long as x[i] = α[i], the evaluators stay on the special path and, because the input x is public and common, they both follow the same path. If a bit x[i] ≠ α[i] is met, they leave the special path and should output 0; otherwise, they stay on it all the way down, which means that x = α and they should output 1.

The main idea is that while they are on the special path, evaluators should have states $(s_0, t_0)$ and $(s_1, t_1)$ respectively, such that $s_0$ and $s_1$ are i.i.d. and $t_0 \oplus t_1 = 1$. When they leave it, the correction word should act to have $s_0 = s_1$ (still indistinguishable from random) and $t_0 = t_1$, which ensures $t_0 \oplus t_1 = 0$. Each evaluator should output its $t_j$ and the result will be given by $t_0 \oplus t_1$. The formal description of the protocol is given below and is composed of two parts: first, in Algorithm 1, the KeyGen algorithm consists of a preprocessing step to generate the function keys, and then, in Algorithm 2, Eval is run by two evaluators to perform the equality test. It takes as input the private share held by each evaluator and the function key that they have received. They use a pseudorandom generator $G : \{0,1\}^{\lambda} \to \{0,1\}^{2(\lambda+1)}$, where the output set is $\{0,1\}^{\lambda+1} \times \{0,1\}^{\lambda+1}$, and operations modulo $2^n$ implicitly convert back and forth between n-bit strings and integers.

Intuitively, the correction words CW (i) are built from the expected state of each evaluator on the special path, i.e., the state that each should have at each node i if it is on the special path given some initial state. During evaluation, a correction word is applied by an evaluator only when it has t = 1. Hence, on the special path, the correction is applied only by one evaluator at each bit. 

Algorithm 1: KeyGen: key generation for equality to α

If at step i, the evaluator stays on the special path, the correction word compensates the current states of both evaluators by xor-ing them with themselves and re-introduces a pseudorandom value s (either $s_0^R \oplus s_1^R$ or $s_0^L \oplus s_1^L$), which means the xor of their states is now (s, 1) but those states are still indistinguishable from random. On the other hand, if x[i] ≠ α[i], the new state takes the other half of the correction word, so that the xor of the two evaluators' states is (0, 0). From there, they have the same states and both have either t = 0 or t = 1. They will continue to apply the same corrections at each step and their states will remain the same with $t_0 \oplus t_1 = 0$. A final computation is performed to obtain a sharing [[T]] modulo $2^n$ of the result bit $t = t_0 \oplus t_1 \in \{0, 1\}$, which is shared modulo 2.

From the privacy point of view, when the seed s is (truly) random, G(s) also looks like a random bit-string (this is a pseudorandom bit-string). Each half is used either in the correction word or in the next state, but not both. Therefore, the correction words $CW^{(i)}$ do not contain information about the expected states and for j = 0, 1, the output $k_j$ is independently uniformly distributed with respect to α and $k_{1-j}$, in a computational way. As a consequence, at the end of the evaluation, for j = 0, 1, $T_j$ also follows a distribution independent of α. Until the shared values are reconstructed, even a malicious adversary cannot learn anything about α nor the inputs of the other player.

Function keys should be sent to the evaluators in advance, which requires one extra communication of the size of the keys. We use the trick of [10] to reduce the size of each correction word in the keys from 2(1 + λ) to (2 + λ) bits: the pseudo-random λ-bit string dedicated to the state used when leaving the special path is reused for the state used when staying on it, since for the latter state the only constraint is the pseudo-randomness of the bitstring.
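The equality test described above can be illustrated with a minimal sketch. It follows the generic distributed point function construction of [10] rather than the exact Algorithms 1 and 2 (whose listings are not reproduced here), and it makes several assumptions that are not from the paper: SHA-512 stands in for the pseudorandom generator G, the helper names (`prg`, `keygen`, `evaluate`) are hypothetical, and each evaluator simply outputs its bit t, so the result is shared modulo 2 and the final conversion to a sharing modulo $2^n$ is omitted. Each correction word holds one λ-bit string and two bits, matching the (2 + λ)-bit size mentioned above.

```python
import hashlib
import secrets

LAMBDA_BYTES = 16  # seed length, i.e. lambda = 128 bits


def prg(seed):
    """Stand-in for G (assumption): expand a seed into (s_L, t_L, s_R, t_R)."""
    out = hashlib.sha512(seed).digest()
    return out[:16], out[32] & 1, out[16:32], out[33] & 1


def xor(a, b):
    return bytes(u ^ v for u, v in zip(a, b))


def bits(value, n):
    """Binary expansion on n bits, most significant bit first."""
    return [(value >> (n - 1 - i)) & 1 for i in range(n)]


def keygen(alpha, n):
    """Preprocessing: build one function key per evaluator for 'x == alpha'."""
    seeds = [secrets.token_bytes(LAMBDA_BYTES) for _ in range(2)]
    s, t = list(seeds), [0, 1]  # invariant on the special path: t_0 xor t_1 = 1
    cws = []
    for a in bits(alpha, n):
        exp = [prg(s[0]), prg(s[1])]
        lose = 2 if a == 0 else 0  # seed index of the branch leaving the path
        s_cw = xor(exp[0][lose], exp[1][lose])
        t_cw_l = exp[0][1] ^ exp[1][1] ^ a ^ 1
        t_cw_r = exp[0][3] ^ exp[1][3] ^ a
        cws.append((s_cw, t_cw_l, t_cw_r))
        keep_s, keep_t = (0, 1) if a == 0 else (2, 3)
        t_cw_keep = t_cw_l if a == 0 else t_cw_r
        new_s = [xor(exp[j][keep_s], s_cw) if t[j] else exp[j][keep_s]
                 for j in (0, 1)]
        new_t = [exp[j][keep_t] ^ (t[j] & t_cw_keep) for j in (0, 1)]
        s, t = new_s, new_t
    return (0, seeds[0], cws), (1, seeds[1], cws)


def evaluate(key, x, n):
    """Online phase: walk the tree along the public input x and output t."""
    j, s, cws = key
    t = j
    for x_bit, (s_cw, t_cw_l, t_cw_r) in zip(bits(x, n), cws):
        s_l, t_l, s_r, t_r = prg(s)
        if t:  # a correction word is applied only by the evaluator with t = 1
            s_l, t_l = xor(s_l, s_cw), t_l ^ t_cw_l
            s_r, t_r = xor(s_r, s_cw), t_r ^ t_cw_r
        s, t = (s_l, t_l) if x_bit == 0 else (s_r, t_r)
    return t
```

For example, with `k0, k1 = keygen(0xA7, 8)`, the value `evaluate(k0, x, 8) ^ evaluate(k1, x, 8)` is 1 only for `x = 0xA7`. Neither output bit alone is meant to reveal whether the evaluator is still on the special path.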

Our major contribution to the function secret sharing scheme concerns comparison (which makes it possible to handle non-polynomial activation functions for neural networks): we build on the idea of the equality test to provide a synthetic and efficient protocol whose structure is very close to the previous one. Instead of seeing the special path as a simple path, it can be seen as a frontier for the zone in the tree where x ≤ α. To evaluate x ≤ α, we could evaluate all the paths on the left of the special path and then sum up the results, but this is highly inefficient as it requires exponentially many evaluations.

Our key idea here is to evaluate all these paths at the same time, noting that each time one leaves the special path, it either falls on the left side (i.e., x < α) or on the right side (i.e., x > α). Hence, we only need to add an extra step at each node of the evaluation, where depending on the bit value x[i], we output a leaf label which is 1 only if x[i] < α[i] and all previous bits are identical. Only one label between the final label (which corresponds to x = α) and the leaf labels can be equal to one, because only a single path can be taken. Therefore, evaluators will return the sum of all the labels to get the final output.
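The label mechanism can be checked on plaintext values. The sketch below is only an illustration of the logic, not the secret-shared protocol of Appendix A: it computes the leaf labels and the final label in the clear, and the function names are hypothetical. It shows that at most one label is 1 and that the sum of the labels equals the result of x ≤ α.

```python
def bits(value, n):
    """Binary expansion on n bits, most significant bit first."""
    return [(value >> (n - 1 - i)) & 1 for i in range(n)]


def comparison_labels(x, alpha, n):
    """Plaintext view of the labels produced while walking the tree."""
    labels = []
    on_special_path = True
    for x_bit, a_bit in zip(bits(x, n), bits(alpha, n)):
        leaving = on_special_path and x_bit != a_bit
        # Leaving on the left side (x_bit = 0 while a_bit = 1) means x < alpha.
        labels.append(1 if leaving and x_bit < a_bit else 0)
        if leaving:
            on_special_path = False
    # Final label: still on the special path after n steps, i.e. x == alpha.
    labels.append(1 if on_special_path else 0)
    return labels


def less_or_equal(x, alpha, n):
    return sum(comparison_labels(x, alpha, n))
```

In the actual protocol each label would be held as a share by the two evaluators, so summing shares locally yields a share of the comparison result.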

The full description of the comparison protocol is detailed in Appendix A, together with a detailed explanation of how it works.

We now apply these primitives to a private deep learning setup in which a model owner interacts with a data owner. The data and the model parameters are sensitive and are secret shared to be kept private. The shape of the input and the architecture of the model are however public, which is a standard assumption in secure deep learning [34, 36] .

All our operations are modular and follow this additive sharing workflow: inputs are provided secret shared and are masked with random values before being revealed. This disclosed value is then consumed with preprocessed function keys to produce a secret shared output. Each operation is independent of all surrounding operations, which is known as circuit-independent preprocessing [11] and implies that key generation can be fully outsourced without having to know the model architecture. This results in a fast runtime execution with a very efficient online communication, with a single round of communication and a message size equal to the input size for comparison. Preprocessing is performed by a trusted third party to build the function keys. This is a valid assumption in our use case as such third party would typically be an institution concerned about its image, and it is very easy to check that preprocessed material is correct using a cut-and-choose technique [51] .

Matrix Multiplication (MatMul). As mentioned by [11], multiplication fits in this additive sharing workflow. We use Beaver triples [2]. Matrix multiplication is identical but uses matrix Beaver triples [14].
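As an illustration of how a Beaver triple fits the mask-then-reveal workflow, here is a minimal sketch of the standard scalar construction over $\mathbb{Z}_{2^{32}}$. It assumes a dealer who samples the triple, simulates both parties in a single process, and uses hypothetical helper names; it is not taken from the ARIANN code base.

```python
import secrets

MOD = 2 ** 32  # ring Z_{2^32}


def share(value):
    """Split a value into two additive shares modulo MOD."""
    r = secrets.randbelow(MOD)
    return r, (value - r) % MOD


def reconstruct(shares):
    return sum(shares) % MOD


def beaver_triple():
    """Dealer side (assumption): shares of random a, b and of c = a * b."""
    a, b = secrets.randbelow(MOD), secrets.randbelow(MOD)
    return share(a), share(b), share(a * b % MOD)


def multiply(x_sh, y_sh, triple):
    """Shares of x * y from shares of x and y, using one triple."""
    a_sh, b_sh, c_sh = triple
    # Single round: each party reveals its masked shares of x - a and y - b.
    delta = reconstruct([(x_sh[j] - a_sh[j]) % MOD for j in (0, 1)])
    eps = reconstruct([(y_sh[j] - b_sh[j]) % MOD for j in (0, 1)])
    # x * y = delta * eps + delta * b + eps * a + c
    z = [(c_sh[j] + delta * b_sh[j] + eps * a_sh[j]) % MOD for j in (0, 1)]
    z[0] = (z[0] + delta * eps) % MOD  # public term, added by one party only
    return tuple(z)
```

The revealed values `delta` and `eps` are uniformly masked by a and b, which is why they can be opened without leaking x or y. The matrix version replaces the scalar products by matrix products.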

ReLU activation function is supported as a direct application of our comparison protocol, which we combine with a point wise multiplication.

Convolution can be computed as a single matrix multiplication using an unrolling technique as described in [12] and illustrated in Figure 3 in Appendix C.2.
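The unrolling idea can be sketched in a few lines. The example below assumes a single-channel input without padding and uses hypothetical helper names; the exact layout of Figure 3 may differ. Each sliding window becomes one row, so the convolution reduces to a product between the unrolled matrix and the flattened kernel.

```python
def unroll(image, k, stride=1):
    """Turn every k x k window of a 2D image into one row (im2col style)."""
    height, width = len(image), len(image[0])
    rows = []
    for i in range(0, height - k + 1, stride):
        for j in range(0, width - k + 1, stride):
            rows.append([image[i + di][j + dj]
                         for di in range(k) for dj in range(k)])
    return rows


def conv_via_matmul(image, kernel, stride=1):
    """Convolution as (unrolled matrix) x (flattened kernel)."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    return [sum(a * b for a, b in zip(row, flat_kernel))
            for row in unroll(image, k, stride)]


def conv_direct(image, kernel, stride=1):
    """Reference sliding-window implementation, for comparison only."""
    k = len(kernel)
    out = []
    for i in range(0, len(image) - k + 1, stride):
        for j in range(0, len(image[0]) - k + 1, stride):
            out.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(k) for dj in range(k)))
    return out
```

On secret-shared data the unrolling is a local rearrangement of shares, so only the final matrix product would require a Beaver triple.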

Argmax operator used in classification to determine the predicted label can also be computed in a constant number of rounds using pairwise comparisons as shown by [21]. The main idea here is, given a vector $(x_0, \dots, x_{m-1})$, to compute the matrix $M \in \mathbb{R}^{(m-1) \times m}$ where each row $M_i = (x_{i+1 \bmod m}, \dots, x_{i+m \bmod m})$. Then, each element of column j is compared to $x_j$, which requires m(m − 1) parallel comparisons. A column j where all elements are lower than $x_j$ indicates that j is a valid result for the argmax.
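The pairwise-comparison argmax can be illustrated on plaintext values. In this sketch (function name hypothetical), row i of M is the input cyclically shifted by i + 1 positions, so column j contains every entry except $x_j$. A non-strict comparison is used, which is an assumption: with ties, several indices are flagged as valid.

```python
def argmax_pairwise(x):
    """Return a 0/1 vector flagging every index j holding a maximum of x."""
    m = len(x)
    # Row i is x shifted by i + 1, for i = 0..m-2: an (m-1) x m matrix.
    shifted = [[x[(i + 1 + j) % m] for j in range(m)] for i in range(m - 1)]
    # m(m-1) comparisons, all independent, so they can run in parallel.
    wins = [[1 if shifted[i][j] <= x[j] else 0 for j in range(m)]
            for i in range(m - 1)]
    column_sums = [sum(wins[i][j] for i in range(m - 1)) for j in range(m)]
    # Index j is valid when x_j won all of its m - 1 comparisons.
    return [1 if total == m - 1 else 0 for total in column_sums]
```

For instance, `argmax_pairwise([3, 9, 2, 7])` flags index 1 only.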

MaxPool can be implemented by combining these two methods: the matrix is first unrolled like in Figure 3 and the maximum of each row is then computed using parallel pairwise comparisons. More details and an optimization when the kernel size k equals 2 are given in Appendix C.3.

BatchNorm is implemented using an approximate division with Newton's method as in [48]: given an input $x = (x_0, \dots, x_{m-1})$ with mean µ and variance $\sigma^2$, we return $\gamma \cdot \hat{\theta} \cdot (x - \mu) + \beta$. Variables γ and β are learnable parameters and $\hat{\theta}$ is the estimated inverse of $\sqrt{\sigma^2 + \epsilon}$ with $\epsilon \ll 1$, computed iteratively using $\theta_{i+1} = \theta_i \cdot (3 - (\sigma^2 + \epsilon) \cdot \theta_i^2)/2$. More details can be found in Appendix C.4.

More generally, for more complex activation functions such as softmax, we can use polynomial approximation methods, which achieve acceptable accuracy despite involving a higher number of rounds [37, 23, 21]. Table 1 summarizes the online communication cost of each operation, and shows that basic operations such as comparison have a very efficient online communication. We also report results from [48] which achieve good experimental performance.

These operations are sufficient to evaluate real world models in a fully private way. To also support private training of these models, we need to perform a private backward pass. As we overload operations such as convolutions or activation functions, we cannot use the built-in autograd functionality of PyTorch. Therefore, we have developed a custom autograd functionality, where we specify how to compute the derivatives of the operations that we have overloaded. Backpropagation uses the same basic blocks as the forward pass.

This 2-party protocol between a model owner and a data owner can be extended to an n-party federated learning protocol where several clients contribute their data to a model owned by an orchestrator server. This approach is inspired by secure aggregation [8] but we do not consider here clients being phones which means we are less concerned with parties dropping before the end of the protocol. In addition, we do not reveal the updated model at each aggregation or at any stage, hence providing better privacy than secure aggregation.

At the beginning of the interaction, the server, which is also the model owner, initializes its model and builds n pairs of additive shares of the model parameters. For each pair i, it keeps one of the shares and sends the other one to the corresponding client i. Then, the server runs the training procedure in parallel with all the clients until the aggregation phase starts. Aggregation for the server shares is straightforward, as the n shares it holds can simply be averaged locally. But the clients have to average their shares together to get a client share of the aggregated model. One possibility is that clients broadcast their shares and compute the average locally. However, to prevent a client colluding with the server from reconstructing the model contributed by a given client, they hide their shares using masking. This can be done using correlated random masks: client i generates a seed, sends it to client i + 1 while receiving one from client i − 1. Client i then generates a random mask $M_i$ using its seed and another $M_{i-1}$ using the one of client i − 1 and publishes its share masked with $M_i - M_{i-1}$. As the masks cancel each other out, the computation will be correct.
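The cancellation of the correlated masks can be sketched as follows. All clients are simulated in one process, the function names are hypothetical, and Python's `random.Random` is used as a stand-in for a cryptographic pseudorandom generator, which a real deployment would need instead. The sketch aggregates with a modular sum; the averaging step described above is left out.

```python
import random

MOD = 2 ** 32


def mask_from_seed(seed, size):
    """Derive a mask from a seed (assumption: not a cryptographic PRG)."""
    rng = random.Random(seed)
    return [rng.randrange(MOD) for _ in range(size)]


def masked_shares(shares, seeds):
    """Client i publishes its share masked with M_i - M_{i-1}."""
    n = len(shares)
    published = []
    for i in range(n):
        own = mask_from_seed(seeds[i], len(shares[i]))             # M_i
        prev = mask_from_seed(seeds[(i - 1) % n], len(shares[i]))  # M_{i-1}
        published.append([(v + a - b) % MOD
                          for v, a, b in zip(shares[i], own, prev)])
    return published


def aggregate(published):
    """Sum the published vectors: the masks telescope to zero."""
    return [sum(column) % MOD for column in zip(*published)]
```

Summing over all clients, each mask $M_i$ appears once with a plus sign and once with a minus sign, so the aggregate equals the sum of the unmasked shares.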

We follow a setup very close to [48] and assess inference and training performance of several networks on the datasets MNIST [33] , CIFAR-10 [30] , 64×64 Tiny Imagenet and 224×224 Tiny ImageNet [49, 42] , presented in Appendix D.1. More precisely, we assess 5 networks as in [48] : a fully-connected network (Network-1), a small convolutional network with maxpool (Network-2), LeNet [32] , AlexNet [31] and VGG16 [46] . Furthermore, we also include ResNet18 [22] which to the best of our knowledge has never been studied before in private deep learning. The description of these networks is taken verbatim from [48] and is available in Appendix D.2.

Our implementation is written in Python. To use our protocols that only work in finite groups like $\mathbb{Z}_{2^{32}}$, we convert our input values and model parameters to fixed precision. To do so, we rely on the PySyft library [43] protocol. However, our inference runtimes reported in Table 2 compare favourably with existing work including [34-36, 47, 48], in the LAN setting and particularly in the WAN setting thanks to our reduced number of communication rounds. For example, our implementation of Network-1 is 2× faster than the best previous result by [35] in the LAN setting and 18× faster in the WAN setting compared to [48]. For bigger networks such as AlexNet on CIFAR-10, we are still 13× faster in the WAN setting than [48]. Results are given for a batched evaluation, which allows parallelism and hence faster execution as in [48]. For larger networks, we reduce the batch size to have the preprocessing material (including the function keys) fitting into RAM.

Test accuracy. Thanks to the flexibility of our framework, we can train each of these networks in plain text and need only one line of code to turn them into private networks, where all parameters are secret shared. We compare these private networks to their plaintext counterparts and observe that the accuracy is well preserved as shown in Table 3. If we degrade the encoding precision, which by default considers values in $\mathbb{Z}_{2^{32}}$, and the fixed precision, which is by default of 3 decimals, performance degrades as shown in Appendix B.
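To make the default encoding concrete, here is a minimal sketch of a fixed-precision encoding with 3 decimals in $\mathbb{Z}_{2^{32}}$. It assumes a base-10 scaling factor of $10^3$ and a two's-complement style mapping of negative values to the upper half of the ring; the helper names are hypothetical and this is not the PySyft API.

```python
MOD = 2 ** 32      # encoding space Z_{2^32}
SCALE = 10 ** 3    # fixed precision of 3 decimals (base-10 scaling assumed)


def encode(value):
    """Map a real number to a ring element, keeping 3 decimals."""
    return round(value * SCALE) % MOD


def decode(element):
    """Map a ring element back to a real number; upper half is negative."""
    signed = element - MOD if element >= MOD // 2 else element
    return signed / SCALE


def add_encoded(a, b):
    """Addition of encoded values is plain addition in the ring."""
    return (a + b) % MOD
```

For example, `decode(encode(0.12345))` gives `0.123`, which shows the rounding introduced by the 3-decimal precision. Multiplying two encoded values would yield a result scaled by `SCALE ** 2`, so a rescaling step is needed after each product; it is not shown here.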

Training. We can either train those networks from scratch or fine-tune pre-trained models. Training is an end-to-end private procedure, which means the loss and the gradients are never accessible in plain text. We use stochastic gradient descent (SGD), which is a simple but popular optimizer, and support both hinge loss and mean square error (MSE) loss, as other losses like cross entropy, which is used in clear text by [48], cannot be computed over secret shared data without approximations. We report runtime and accuracy obtained by training the smaller networks from scratch in Table 4. Note that because of the number of epochs, the optimizer and the loss chosen, accuracy does not match best known results. However, the training procedure is not altered and the trained model will be strictly equivalent to its plaintext counterpart. Training cannot complete in reasonable time for larger networks, which are anyway available pre-trained. Note that training time includes the time spent building the preprocessing material, as it cannot be fully processed in advance and stored in RAM.

Discussion. For larger networks, we could not use batches of size 128. This is mainly due to the size of the comparison function keys which is currently proportional to the size of the input tensor, with a multiplication factor of nλ where n = 32 and λ = 128. Optimizing the function secret sharing protocol to reduce those keys would lead to massive improvements in the protocol's efficiency.
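A back-of-the-envelope estimate shows why the key size limits the batch size. The sketch below takes the stated factor nλ at face value as the number of key bits per compared element and ignores constants; the layer shape is a hypothetical example, not a measurement from the experiments.

```python
N_BITS = 32    # n: bit length of the encoded values
LAMBDA = 128   # lambda: security parameter


def comparison_key_bytes(num_elements, n=N_BITS, lam=LAMBDA):
    """Rough key size in bytes, assuming about n * lambda bits per element."""
    return num_elements * n * lam // 8


# Hypothetical activation tensor: batch of 128, 64 channels, 32 x 32 maps.
batch, channels, height, width = 128, 64, 32, 32
elements = batch * channels * height * width
total_bytes = comparison_key_bytes(elements)
total_gib = total_bytes / 2 ** 30
```

Under these assumptions each compared element costs about 512 bytes of key material, and a single ReLU layer of this shape would need about 4 GiB per evaluator, which is consistent with the need to reduce the batch size for larger networks.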

Our implementation actually has more communication than is theoretically necessary according to Table 1 , suggesting that the experimental results could be further improved. As we build on top of PyTorch, using machines with GPUs could also potentially result in a massive speed-up, as an important fraction of the execution time is dedicated to computation.

Last, accuracies presented in Table 3 and Table 4 do not match state-of-the-art performance for the models and datasets considered. This is not due to intrinsic flaws of our protocol but to the simplified training procedure we had to use. Supporting losses such as the logistic loss, more complex optimizers like Adam and dropout layers would be an interesting follow-up.

One can observe the great similarity of structure of the comparison protocol given in Algorithms 3 and 4 with the equality protocol from Algorithms 1 and 2: the equality test is performed in parallel with an additional information $out_i$ at each node, which holds a share of either 0 when the evaluator stays on the special path or if it has already left it at a previous node, or a share of α[i] when it leaves the special path. This means that if α[i] = 1, leaving the special path implies that x[i] = 0 and hence x ≤ α, while if α[i] = 0, leaving implies x[i] = 1 so x > α and the output should be 0. The final share $out_{n+1}$ corresponds to the previous equality test.

Note that in all these computations modulo $2^n$, while the bitstrings s

We have studied the impact of lowering the encoding space of the input to our function secret sharing protocol from $\mathbb{Z}_{2^{32}}$ to $\mathbb{Z}_{2^k}$ with k < 32. Finding the lowest k guaranteeing good performance is an interesting challenge as the function keys size is directly proportional to it. This has to be done together with reducing fixed precision from 3 decimals down to 1 decimal to ensure private values aren't too big, which would result in higher failure rate in our private comparison protocol. We have reported in Table 5 our findings on Network-1, which is pre-trained and then evaluated in a private fashion.

Table 5: Accuracy (in %) of Network-1 given different precision and encoding spaces

What we observe is that 3 decimals of precision is the most appropriate setting to have an optimal precision while allowing to slightly reduce the encoding space down to $\mathbb{Z}_{2^{24}}$ or $\mathbb{Z}_{2^{28}}$. Because this is not a massive gain and in order to keep the failure rate in comparison very low, we have kept $\mathbb{Z}_{2^{32}}$ for all our experiments.

C Implementation details

Our comparison protocol can fail if y + α wraps around and becomes negative. We can't act on α because it must be completely random to act as a perfect mask and to make sure the revealed $x = y + \alpha \bmod 2^n$ does not leak any information about y, but the smaller y is, the lower the error probability will be. [11] suggests a method which uses 2 invocations of the protocol to guarantee perfect correctness but because it incurs an important runtime overhead, we rather show that the failure rate of our comparison protocol is very small and is reasonable in contexts that tolerate a few mistakes, as in machine learning. More precisely, we quantify it on real world examples, namely on Network-2 and on the 64×64 Tiny Imagenet version of VGG16, with a fixed precision of 3 decimals, and find respective failure rates of 1 in 4 million comparisons and 1 in 100 million comparisons. Such error rates do not affect the model accuracy, as Table 3 shows.

Figure 4 illustrates how MaxPool uses ideas from matrix unrolling and argmax computation. Notations present in the figure are consistent with the explanation of argmax using pairwise comparison in Section 4.3. The m × m matrix is first unrolled to a $m^2 \times k^2$ matrix. It is then expanded on $k^2$ layers, each of which is shifted by a step of 1. Next, $m^2 k^2 (k^2 - 1)$ pairwise comparisons are applied simultaneously between the first layer and the other ones, and for each $x_i$ we sum the results of its k − 1 comparisons and check if the sum equals k − 1. We multiply this boolean by $x_i$ and sum up along a line (like $x_1$ to $x_4$ in the figure). Last, we restructure the matrix back to its initial structure.

In addition, when the kernel size k is 2, rows are only of length 4 and it can be more efficient to use a binary tree approach instead, i.e. compute the maximum of columns 0 and 1, 2 and 3 and the max of the result: it requires $\log_2(k^2) = 2$ rounds of communication and only approximately $(k^2 - 1)(m/s)^2$ comparisons, compared to a fixed 3 rounds and approximately $k^4 (m/s)^2$.
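The binary tree approach for k = 2 can be illustrated on plaintext values. In this sketch each call to the hypothetical helper `pairwise_max` stands in for one private comparison followed by a multiplication by the comparison bit; the counter only tracks how many comparisons one unrolled row of length 4 needs.

```python
def pairwise_max(a, b, counter):
    """max(a, b) written as b + [a >= b] * (a - b)."""
    counter["comparisons"] += 1
    is_greater = 1 if a >= b else 0  # stands in for one private comparison
    return b + is_greater * (a - b)


def maxpool_row_tree(row, counter):
    """Maximum of an unrolled 2 x 2 window using a two-level binary tree."""
    c0, c1, c2, c3 = row
    # Level 1: the two comparisons are independent and can run in parallel.
    left = pairwise_max(c0, c1, counter)
    right = pairwise_max(c2, c3, counter)
    # Level 2: compare the two intermediate maxima.
    return pairwise_max(left, right, counter)
```

Each window uses $k^2 - 1 = 3$ comparisons over two levels, which matches the count of approximately $(k^2 - 1)(m/s)^2$ comparisons for the $(m/s)^2$ windows.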

Interestingly, average pooling can be computed locally on the shares without interaction because it only includes mean operations, but we didn't replace MaxPool operations with average pooling to avoid distorting existing neural network architectures.

The BatchNorm layer is the only one in our implementation which is a polynomial approximation. Moreover, compared to [48] , the approximation is significantly coarser as we don't make any costly initial approximation and we reduce the number of iterations of the Newton method from 4 to only 3. Typical relative error can be up to 20% but as the primary purpose of BatchNorm is to normalise data, having rough approximations here is not an issue and doesn't affect learning capabilities, as our experiments show. However, it is a limitation for using pre-trained networks: we observed on AlexNet adapted to CIFAR-10 that training the model with a standard BatchNorm and evaluating it with our approximation resulted in poor results, so we had to train it with the approximated layer.
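The Newton iteration $\theta_{i+1} = \theta_i \cdot (3 - (\sigma^2 + \epsilon) \cdot \theta_i^2)/2$ used for BatchNorm can be sketched on plaintext floats. The initial guess is an assumption, since the text does not specify one, and the function name is hypothetical. The iteration is expected to converge to $1/\sqrt{a}$ when the starting point lies strictly between 0 and $\sqrt{3/a}$.

```python
def inverse_sqrt_newton(a, theta0, iterations=3):
    """Approximate 1 / sqrt(a) with a few Newton steps.

    a stands for sigma^2 + epsilon; theta0 is a hypothetical initial guess.
    """
    theta = theta0
    for _ in range(iterations):
        theta = theta * (3 - a * theta * theta) / 2
    return theta


def batchnorm(x, gamma, beta, theta0, eps=1e-5, iterations=3):
    """Plaintext BatchNorm: gamma * theta_hat * (x - mean) + beta."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    theta_hat = inverse_sqrt_newton(variance + eps, theta0, iterations)
    return [gamma * theta_hat * (v - mean) + beta for v in x]
```

With only 3 iterations the quality of the estimate depends strongly on the initial guess, which is in line with the coarse approximation reported above. Each step only uses additions and multiplications, which is what makes it suitable for secret-shared values.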

This section is taken almost verbatim from [48] .

We select 4 datasets popularly used for training image classification models: MNIST [33] , CIFAR-10 [30] , 64×64 Tiny Imagenet and 224×224 Tiny ImageNet [49] .

MNIST MNIST [33] is a dataset of handwritten digits. It consists of 60,000 images in the training set and 10,000 in the test set. Each image is a 28×28 pixel image of a handwritten digit along with a label between 0 and 9. We evaluate Network-1, Network-2, and the LeNet network on this dataset.

CIFAR-10 CIFAR-10 [30] consists of 50,000 images in the training set and 10,000 in the test set. It is composed of 10 different classes (such as airplanes, dogs, horses etc.) and there are 6,000 images of each class with each image consisting of a colored 32×32 image. We perform private training of AlexNet and inference of VGG16 on this dataset.

Tiny ImageNet Tiny ImageNet [49] consists of two datasets of 100,000 training samples and 10,000 test samples with 200 different classes. The first dataset is composed of colored 64×64 images and we use it with AlexNet and VGG16. The second is composed of colored 224×224 images and is used with ResNet18.

We have selected 6 models for our experiments.

Network-1 A 3-layered fully-connected network with ReLU used in SecureML [36] .

Network-2 A 4-layered network selected in MiniONN [34] with 2 convolutional and 2 fully-connected layers, which uses MaxPool in addition to ReLU activation.

LeNet This network, first proposed by LeCun et al. [32] , was used in automated detection of zip codes and digit recognition. The network contains 2 convolutional layers and 2 fully connected layers.

AlexNet AlexNet is the famous winner of the ILSVRC-2012 ImageNet competition [31]. It has 5 convolutional layers and 3 fully connected layers and it can use batch normalization layers for stability and efficient training.

VGG16 VGG16 is the runner-up of the ILSVRC-2014 competition [46] . VGG16 has 16 layers and has about 138M parameters.

ResNet18 ResNet18 [22] belongs to the ResNet family of architectures, which won the ILSVRC-2015 competition. It is a convolutional neural network that is 18 layers deep, and has 11.7M parameters. It uses batch normalisation and we're the first private deep learning framework to evaluate this network.

Model architectures of Network-1 and Network-2, together with LeNet, and the adaptations for CIFAR-10 of AlexNet and VGG16 are precisely depicted in Appendix D of [48]. Note that in the CIFAR-10 version of AlexNet, the authors have used the version with BatchNorm layers, and we have kept this choice. For the 64×64 Tiny Imagenet version of AlexNet, we used the standard architecture from PyTorch to have a pretrained network. It doesn't have BatchNorm layers, and we have adapted the classifier part as illustrated in Figure 5. Note also that we permute ReLU and MaxPool where applicable like in [48], as this is strictly equivalent in terms of output for the network and reduces the number of comparisons. More generally, we don't alter the network behaviour in any way except for the approximation on BatchNorm. This improves usability of our framework as it allows one to take a pre-trained neural network from a standard deep learning library like PyTorch and to encrypt it generically with a single line of code.
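The equivalence behind the ReLU and MaxPool permutation can be checked on plaintext windows. The sketch below uses hypothetical function names and also counts the ReLU comparisons per pooling window, to show where the saving comes from: applying ReLU after pooling needs one sign comparison per window instead of one per element.

```python
def relu(value):
    return value if value > 0 else 0


def relu_then_maxpool(window):
    """Original order: ReLU on every element, then the maximum."""
    relu_comparisons = len(window)  # one sign test per element
    return max(relu(v) for v in window), relu_comparisons


def maxpool_then_relu(window):
    """Permuted order: maximum first, then a single ReLU."""
    relu_comparisons = 1  # one sign test per window
    return relu(max(window)), relu_comparisons
```

Both orders give the same output because ReLU is non-decreasing, so it commutes with the maximum. For a 2 × 2 window, the permuted order uses 1 ReLU comparison instead of 4.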

Privacy-preserving machine learning: Threats and solutions

Efficient multiparty protocols using circuit randomization

Optimizing semi-honest secure multiparty computation for the internet

nGraph-HE2: A high-throughput framework for neural network inference on encrypted data

nGraph-HE: a graph compiler for deep learning on homomorphically encrypted data

Sharemind: A framework for fast privacy-preserving computations

Towards federated learning at scale: System design

Practical secure aggregation for privacy-preserving machine learning

Function secret sharing

Function secret sharing: Improvements and extensions

Secure computation with preprocessing via function secret sharing

High performance convolutional neural networks for document processing

Faster fully homomorphic encryption: Bootstrapping in less than 0.1 seconds

Private Image Analysis with MPC. Accessed 2019-11-01

Multiparty computation from somewhat homomorphic encryption

ABY - A framework for efficient mixed-protocol secure two-party computation

A survey of secure multiparty computation protocols for privacy preserving genetic tests

Model inversion attacks that exploit confidence information and basic countermeasures

CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy

Foundations of Cryptography

Deep residual learning for image recognition

Accuracy and stability of numerical algorithms

Deep models under the GAN: Information leakage from collaborative deep learning

Chiron: Privacy-preserving machine learning as a service

GAZELLE: A low latency framework for secure neural network inference

An efficient multi-party scheme for privacy preserving collaborative filtering for healthcare recommender system

Overdrive: Making SPDZ great again

Federated learning: Strategies for improving communication efficiency

The CIFAR-10 dataset

ImageNet classification with deep convolutional neural networks

Gradient-based learning applied to document recognition

MNIST handwritten digit database

Oblivious neural network predictions via MiniONN transformations

ABY3: A mixed protocol framework for machine learning

SecureML: A system for scalable privacy-preserving machine learning

An improved newton iteration for the generalized inverse of a matrix, with applications

Information technology-based tracing strategy in response to COVID-19 in South Korea-privacy controversies

Privacy-preserving contact tracing of covid-19 patients

Chameleon: A hybrid secure computation framework for machine learning applications

DeepSecure: Scalable provably-secure deep learning

ImageNet large scale visual recognition challenge

A generic framework for privacy preserving deep learning

Privacy-preserving deep learning

Very deep convolutional networks for large-scale image recognition

SecureNN: Efficient and private neural network training

Falcon: Honest-majority maliciously secure framework for private deep learning

Tiny ImageNet challenge

How to generate and exchange secrets

The cut-and-choose game and its application to cryptographic protocols

We would like to thank Geoffroy Couteau, Chloé Hébant and Loïc Estève for helpful discussions throughout this project. We are also grateful for the long-standing support of the OpenMined community and in particular its dedicated cryptography team, including Yugandhar Tripathi, S P Sharan, George-Cristian Muraru, Muhammed Abogazia, Alan Aboudib, Ayoub Benaissa, Sukhad Joshi and many others. This work was supported in part by the European Community's Seventh Framework Programme (FP7/2007-2013 Grant Agreement no. 339563 - CryptoCloud) and by the French project FUI ANBLIC. The computing power was graciously provided by the French company ARKHN.