Efficient Representations for Privacy-Preserving Inference
Han Xuanyuan, Francisco Vargas, Stephen Cummins
2021-10-15

Deep neural networks have a wide range of applications across multiple domains such as computer vision and medicine. In many cases, the input of a model at inference time can consist of sensitive user data, which raises questions concerning the levels of privacy and trust guaranteed by such services. Much existing work has leveraged homomorphic encryption (HE) schemes that enable computation on encrypted data to achieve private inference for multi-layer perceptrons and CNNs. An early work along this direction was CryptoNets, which takes 250 seconds for one MNIST inference. The main limitation of such approaches is compute, due to the costly number theoretic transform (NTT) operations that constitute HE operations. Others have proposed the use of model pruning and efficient data representations to reduce the number of HE operations required. In this paper, we focus on improving upon existing work by proposing changes to the representations of intermediate tensors during CNN inference. We construct and evaluate private CNNs on the MNIST and CIFAR-10 datasets, and achieve over a two-fold reduction in the number of operations used for inference with the CryptoNets architecture.

In recent years, deep neural networks have achieved state-of-the-art accuracy for tasks such as image recognition. They have been deployed in a range of sectors, powering a wide variety of applications such as recommendation systems, medical diagnosis, and content filtering. Machine Learning as a Service (MLaaS) is a framework in which cloud services apply machine learning algorithms to user-supplied data to produce an inference result, which is then returned to the user. Cloud systems are an attractive platform for deploying pretrained models due to the relatively low cost and the availability of remote servers. However, data has to be decrypted before inference, which allows a server-side adversary to gain access to the user's information. Homomorphic encryption (HE) can be applied to perform inference on encrypted data, enabling the result to be delivered to the user without risk of the server accessing the original data or the inference result.

CRYPTONETS (Gilad-Bachrach et al., 2016) was the first application of HE to secure neural network inference, and leveraged the YASHE' scheme to perform MNIST classifications. CRYPTONETS suffers from a high number of homomorphic operations (HOPs), with a single MNIST inference requiring ∼290,000 homomorphic multiplications and ∼250 seconds of inference latency. Subsequent works such as FASTER CRYPTONETS (Chou et al., 2018) used neural network surgery and a faster encryption scheme to reduce the inference latency of CRYPTONETS. Later works utilised ciphertext rotations as opposed to the SIMD packing scheme, enabling convolutional and fully-connected layers to be computed using far fewer HOPs (Juvekar et al., 2018; Mishra et al., 2020). This has been shown to reduce the inference latency of MNIST models by more than an order of magnitude, bringing confidence that private inference can be practical. LOLA (Brutzkus et al., 2019) proposed novel representations for intermediate tensors, and their MNIST model requires only 2.2 seconds for one inference.
One drawback of their representations is poor scalability to harder datasets such as CIFAR-10, since the limited number of slots per ciphertext acts as a barrier to the size of tensors that are practical.

The limited set of operations supported by HE schemes prevents the secure computation of non-polynomial activation functions, which impedes model training due to the problem of exploding gradients (Chou et al., 2018). To address this, others have proposed the use of secure multi-party computation to enable secure computation of non-polynomial activations using multiple parties (Juvekar et al., 2018; Mishra et al., 2020). Despite enabling the use of popular non-polynomial activations such as ReLU, relying on multi-party computation incurs large amounts of data transfer between parties and requires the parties involved to be online and to have feasibly fast data transfer rates. For example, GAZELLE (Juvekar et al., 2018) requires ∼1 GB of data transfer per inference for their CIFAR-10 model, and DELPHI (Mishra et al., 2020) requires ∼2 GB per inference with a ResNet-32 model. Single-party approaches often choose to approximate the ReLU activation using a second-degree polynomial (Gilad-Bachrach et al., 2016; Chou et al., 2018; Brutzkus et al., 2019).

In this work, we introduce a framework for secure inference on CNNs, designed to reduce the number of HOPs required per inference whilst preserving prediction accuracy. Our contributions can be summarised as follows:

• We integrate the convolution-packing method from LOLA with the fast matrix-vector product method introduced by Halevi and Shoup (Halevi & Shoup, 2019) and utilised by Juvekar et al. (2018) in their multi-party computation framework. Intermediate convolutions are converted into fully-connected layers and computed as matrix-vector products. We show that utilising the Halevi-Shoup method allows the use of rotations and ciphertext packing to scale better than the representations in LOLA when applied to larger convolutional layers.

• We compare our framework against LOLA by constructing models for MNIST and CIFAR-10. Our main evaluation criterion is the number of HOPs required per inference for a model. With the same layer parameters as LOLA, we are able to obtain over a two-fold reduction in the number of HOPs per inference. Our CIFAR-10 model achieves similar accuracy to LOLA's but uses far fewer operations.

Our threat model concerns the machine learning as a service (MLaaS) paradigm, in which the user first sends data to a server, which then performs machine learning inference on the received data using some model. The inference result is then delivered back to the user. For example, consider an online machine learning service which claims to detect the probability of a person having COVID-19 from an audio recording of their cough. Suppose that Alice decides to send a recording of her cough to this service, in the hope of receiving a diagnosis. There are two key threats in this scenario: (i) the risk of an adversary eavesdropping on the data transmission, and (ii) the risk of the MLaaS provider gaining unauthorised access to the user's data (in this case, the recording produced by Alice). The first threat can be mitigated using standard cryptographic protocols. However, the second risk is harder to address, especially if the user data is decrypted before inference (Bae et al., 2018). The use of HE mitigates both risks: the data is encrypted using HE, which is sufficient to prevent an adversary from eavesdropping, and the provider is only able to perform computations on the encrypted data, outputting the inference result without being able to decrypt it.

Several recent HE schemes such as BFV (Brakerski & Vaikuntanathan, 2011) and CKKS (Cheon et al., 2017) are based on the RLWE problem and support SIMD ciphertext operations. At a high level, such schemes establish a mapping between real vectors and a plaintext space. The plaintext space is usually the polynomial ring R = Z[X]/(X^N + 1); in particular, X^N + 1 = Φ_M(X) is the M-th cyclotomic polynomial, where M = 2N is a power of two. The decoding operation maps an element in R to a vector that is either real or complex, depending on the scheme used; the encoding operation performs the reverse. Plaintext polynomials are encrypted into ciphertext polynomials using a public key. The operations of addition and multiplication can be performed over ciphertexts using an evaluation key. Since each ciphertext corresponds to a vector of real (or complex) values, a single homomorphic operation between two ciphertexts constitutes an element-wise operation between two vectors. In addition, such schemes support rotations of the slots within a ciphertext, through the use of Galois automorphisms.
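To make these slot semantics concrete, the following is a minimal plaintext sketch in NumPy. The vectors merely stand in for ciphertexts (no encryption takes place), and the rotate helper is a hypothetical stand-in for the Galois rotation that a library such as SEAL provides.

    import numpy as np

    N = 8                         # slot count of one ciphertext

    # Two "ciphertexts", modelled here as plain slot vectors.
    x = np.arange(N, dtype=float)
    y = np.full(N, 2.0)

    # Homomorphic addition and multiplication act element-wise across all
    # N slots at once, so each line below corresponds to a single HOP.
    print(x + y)                  # [ 2.  3.  4.  5.  6.  7.  8.  9.]
    print(x * y)                  # [ 0.  2.  4.  6.  8. 10. 12. 14.]

    def rotate(ct, r):
        # Stand-in for a Galois rotation: cyclically left-shift the slots by r.
        return np.roll(ct, -r)

    print(rotate(x, 2))           # [2. 3. 4. 5. 6. 7. 0. 1.]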
The first convolutional layer in a CNN can be represented using convolution-packing (Brutzkus et al., 2019). The convolution of an input image f with a filter g of width w, height h and depth d is

    (f ∗ g)(x, y) = Σ_{c=1..d} Σ_{j=1..h} Σ_{i=1..w} f(x + i − 1, y + j − 1, c) · g(i, j, c).

Observe that the parallelism inherent in this computation enables it to be vectorised as a sum of scalar-vector products Σ_i g_i · v_i, where g_i is the i-th filter value and v_i collects the input elements it multiplies. For an input image I ∈ R^(c_in × d_in × d_in) with c_in feature maps and a kernel of window size k × k, the input image is represented as k²·c_in vectors v_1, ..., v_{k²·c_in}, where v_i contains all elements convolved with the i-th value in the filter. Denote the corresponding ciphertexts as ct_1, ..., ct_{k²·c_in}. The process of producing the j-th output feature map is now reduced to a ciphertext-plaintext multiplication of each ct_i with the i-th value in the j-th filter. In total, the process requires k²·c_in ciphertext-scalar multiplications per output feature map, leading to a total of k²·c_in·c_out multiplications.
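As an illustration, the following NumPy sketch simulates convolution-packing in the clear (the function and its argument layout are our own, assuming 'valid' padding). Each packed vector vs[i] would be one ciphertext, so each scalar product in the final loop corresponds to one ciphertext-scalar multiplication, matching the k²·c_in·c_out count above.

    import numpy as np

    def pack_convolution(image, kernels, stride):
        # image:   (c_in, d_in, d_in) input tensor
        # kernels: (c_out, c_in, k, k) filter bank
        # Returns (c_out, d_out * d_out) feature maps, flattened row-wise.
        c_out, c_in, k, _ = kernels.shape
        d_in = image.shape[1]
        d_out = (d_in - k) // stride + 1
        # Build the k^2 * c_in packed vectors: the vector for position
        # (c, a, b) holds every input element multiplied by filter value
        # (a, b) of channel c. Each such vector would be one ciphertext.
        vs = []
        for c in range(c_in):
            for a in range(k):
                for b in range(k):
                    vs.append(np.array([image[c, y * stride + a, x * stride + b]
                                        for y in range(d_out) for x in range(d_out)]))
        # One ciphertext-scalar multiplication per packed vector and output
        # map: k^2 * c_in * c_out multiplications in total.
        maps = []
        for j in range(c_out):
            w = kernels[j].reshape(-1)   # scalars in the same (c, a, b) order
            maps.append(sum(w[i] * vs[i] for i in range(len(vs))))
        return np.stack(maps)

    # Sanity check against a direct convolution.
    rng = np.random.default_rng(0)
    img = rng.standard_normal((2, 6, 6))
    ker = rng.standard_normal((3, 2, 3, 3))
    out = pack_convolution(img, ker, stride=1)
    ref = np.array([[(img[:, y:y+3, x:x+3] * ker[j]).sum()
                     for y in range(4) for x in range(4)] for j in range(3)])
    print(np.allclose(out, ref))         # True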
Halevi & Shoup (2019) introduced a method of computing the encrypted matrix-vector product A · v, where A ∈ R^(m × n) and v ∈ R^n, using O(m) HOPs. We choose to use a variant of this method similar to the one proposed in GAZELLE (Juvekar et al., 2018). The matrix A is split into m generalised diagonals d^(1), ..., d^(m), where d^(i)_j = A_{i+j, j}; note that all row positions are taken modulo m. Then each d^(i) is left-padded with i zeros to align values belonging to the same row of A, and finally each padded diagonal is multiplied with rotations of v. The ciphertexts are summed, and the last stage is to apply a rotate-and-sum procedure on the resulting ciphertext. Overall, this procedure requires O(m) multiplications and O(m + log₂ n) rotations.

In this section, we present our method for achieving privacy-preserving CNN inference with low numbers of HOPs. In summary, we adopt the fast convolution method from LOLA but compute the intermediate convolutional layers in a network using the Halevi-Shoup (HS) matrix-vector product method instead. This enables large convolutions to be performed with far fewer ciphertext rotations than LOLA's approach of computing a rotate-and-sum procedure for each row in the weight matrix. In section 3.1 we explain the approach we use. In section 3.2, we perform an analysis to show the improvements made by our modifications compared with the approach from LOLA. In section 4, we apply our approach to models for the MNIST and CIFAR-10 datasets.

Consider the convolution of an image I with a filter f. For simplicity, assume that both the image and the filter are square and that the vertical and horizontal strides of the filter are equal. Let I ∈ R^(d_in × d_in × c_in) and f ∈ R^(k × k × c_in). Denote the stride as s and the padding as p. Now, the output feature map J is such that J ∈ R^(d_out × d_out), where

    d_out = ⌊(d_in + 2p − k) / s⌋ + 1.

A full convolutional layer that outputs c_out feature maps will require a convolution of the input image with each of the c_out filters. Consider the vector v obtained by flattening each output feature map row-wise. This can be expressed as a matrix-vector product of the form v = A · w ∈ R^(d²_out·c_out), where A ∈ R^(d²_out·c_out × d²_in·c_in) and w is the flattened representation of I.

Remark 1. Let N denote the ciphertext slot count, n = d²_in·c_in the size of the convolution input, and m = d²_out·c_out the size of the output. The basic Halevi-Shoup method, which takes m − 1 + ⌈log₂((m + n − 1)/m)⌉ rotations, requires the condition that N ≥ m + n − 1. If this does not hold, but it is the case that m = 2^l for some 0 ≤ l ≤ log₂ N, then m − 1 + log₂(N/m) rotations are required.

Proof. Applying the Halevi-Shoup technique requires m ciphertext diagonals d^(1), ..., d^(m) of length n to be extracted, rotated and summed. Note that d^(i)_j = A_{i+j, j}. Now, if N ≥ m + n − 1 then N is sufficiently large for all rotations to be performed without wrapping around the ciphertext. If N < m + n − 1, then wrap-around will occur for at least one of the diagonal ciphertexts during rotation. For any slot j, the value d^(i)_j comes from the row of A with index r = i + j (mod m), and so must be shifted into an index that is congruent to r modulo m in order for the rotate-and-sum algorithm to be used. If N is an integer multiple of m, then this congruence indeed holds. Otherwise, wrap-around will cause the diagonals to be misaligned when summed together.

Since N is a power of 2, the requirement that m divides N is satisfied whenever d_out and c_out are also powers of 2 such that 2·log₂ d_out + log₂ c_out ≤ log₂ N. Based on Remark 1, we utilise a procedure to ensure that the method can be applied for all choices of (d_in, d_out, c_in, c_out) where d²_out·c_out ≤ N: if the condition that N ≥ m + n − 1 does not hold, then we 'round' m up to the closest power of 2 not less than itself, and add corresponding rows filled with 0's to the weight matrix.
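The following plaintext sketch (our own simulation, with np.roll standing in for Galois rotations) implements the padded-diagonal variant under the assumption from Remark 1 that m and N are powers of two with m dividing N, as the rounding procedure guarantees:

    import numpy as np

    def rotate(ct, r):
        # Left-rotate the slots by r positions (stand-in for a Galois rotation).
        return np.roll(ct, -r)

    def hs_matvec(A, v, N):
        # Plaintext simulation of the Halevi-Shoup product A @ v.
        # Assumes m and N are powers of two with m dividing N (pad A with
        # zero rows via the rounding procedure above if necessary), n <= N.
        m, n = A.shape
        assert N % m == 0 and n <= N
        ct = np.zeros(N)
        ct[:n] = v                            # encode v into the slots
        acc = np.zeros(N)
        cols = np.arange(n)
        for i in range(m):                    # the m generalised diagonals
            d = np.zeros(N)
            d[:n] = A[(i + cols) % m, cols]   # d^(i)_j = A_{(i+j) mod m, j}
            # Shift the product right by i, i.e. "left-pad with i zeros".
            acc = acc + rotate(d * ct, -i)
        # Rotate-and-sum: fold every slot onto its residue class modulo m,
        # using log2(N/m) further rotations.
        shift = m
        while shift < N:
            acc = acc + rotate(acc, shift)
            shift *= 2
        return acc[:m]                        # first m slots now hold A @ v

    # Sanity check: m = 4 rows, n = 9 columns, N = 16 slots.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 9))
    v = rng.standard_normal(9)
    print(np.allclose(hs_matvec(A, v, 16), A @ v))   # True

Counting operations in the sketch: the diagonal loop uses m plaintext multiplications and m − 1 nontrivial rotations (the i = 0 shift is free), and the final fold adds log₂(N/m) rotations, matching the m − 1 + log₂(N/m) figure from Remark 1.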
LOLA uses a sparse representation for its intermediate ciphertexts. It performs a matrix-vector product A·v_1 = v_2 via one multiplication per row of the matrix, summing the elements inside each product vector. Let A ∈ R^(m × n). The output is m ciphertexts ct_1, ..., ct_m such that every element of ct_i is the i-th element of v_2. This sparse representation of a vector is then used to compute a subsequent matrix-vector product B·v_2 by element-wise multiplication of the i-th column of B with the i-th ciphertext, and summing all the product ciphertexts. The first matrix-vector product uses m multiplications and m·log₂ n rotations, whilst the second requires only m multiplications and no rotations.

In addition, LOLA proposes a stacked representation that utilises the full set of ciphertext slots available by packing k copies of v into a single ciphertext, where k = N/δ(n) and δ(n) = 2^⌈log₂ n⌉ is the smallest power of 2 greater than or equal to n. It can be shown that for any n > 1, LOLA-stacked uses fewer rotations than LOLA-sparse. However, LOLA-sparse can be used to compute two layers instead of one. For inputs and output layers that are large relative to N, however, both methods require significantly more rotations than HS. LOLA-stacked relies on m·δ(n)/N being small, whereas LOLA-sparse relies on log₂ n being small. With HS, even if n = N, log₂ n is insignificant compared to m. In general, we believe that having the number of rotations be linear in m alone is beneficial, since neural networks typically down-sample or pool the data to produce denser, higher-level representations, so we can generally expect n ≥ m. It should be noted that there are exceptions, such as bottleneck layers.

We conduct experiments with the MNIST and CIFAR-10 datasets. We first apply our approach to the CRYPTONETS architecture used in LOLA, to create a model CRYPTONETS-HS. The same architecture was shown to achieve 98.95% accuracy by Gilad-Bachrach et al. (2016), and we observe similar performance using their training parameters. Training is conducted using TensorFlow (Abadi et al., 2015). The architecture is then converted into a sequence of homomorphic operations, for which we use the SEAL library (Chen et al., 2017). For reference, the original CRYPTONETS architecture is shown in Figure 2. LOLA uses a combination of its stacked and interleaved representations for the intermediate convolutions; we opt to use the HS-based approach instead. To reduce the consumption of multiplicative depth, linear layers without activations between them are composed together. For instance, each convolution-pooling block of CRYPTONETS-HS is a single linear layer. The models are implemented using operations provided by the SEAL library, and inference is performed on a single thread of a standard desktop processor. We measure the number of homomorphic operations required per single inference, as well as model test accuracy.

Using the proposed improvements, we also construct models ME and CE, for MNIST and CIFAR-10, respectively. The proposed architectures have reduced memory requirements and layer parameters that are better suited to the HS method:

• Model ME has similar test accuracy to CRYPTONETS-HS but is designed in consideration of the way we compute the layers. We note that applying the HS method in computing layer 6 of CRYPTONETS-HS requires setting m = 100 and n = 845. Specifically, we reduce the kernel size of the first convolution from 5 × 5 to 3 × 3, and the size of the first dense layer from 100 to 32. To account for the reduced representational power of the first convolution, the stride is reduced from 2 to 1. This notably does not add any homomorphic operations to the computation. We are able to achieve 98.7% test accuracy using ME after 100 epochs of training with the Adam optimiser.

• Model CE is larger, and requires more operations to compute. We initially construct a model with a second convolutional layer whose output tensor is of size 8 × 8 with 64 channels, and then follow the approach from Lu et al. (2021) and use SVD to factorise this layer into a smaller sub-convolution followed by a 1×1 convolution, which is efficient to compute using the convolution-packing method described in LOLA (a sketch of this factorisation is given after this list).
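We do not reproduce the exact procedure of Lu et al. (2021) here; the sketch below shows the generic truncated-SVD factorisation we have in mind, with hypothetical names. The layer's weights are flattened into a (c_out, c_in·k²) matrix, and a rank-r truncation yields a k × k sub-convolution with r output channels followed by a 1×1 channel-mixing convolution.

    import numpy as np

    def factorise_conv(W, rank):
        # Factorise conv weights W of shape (c_out, c_in, k, k) into a rank-r
        # k x k sub-convolution followed by a 1x1 convolution via truncated SVD.
        c_out, c_in, k, _ = W.shape
        M = W.reshape(c_out, c_in * k * k)
        U, S, Vt = np.linalg.svd(M, full_matrices=False)
        W1 = (np.diag(S[:rank]) @ Vt[:rank]).reshape(rank, c_in, k, k)
        W2 = U[:, :rank].reshape(c_out, rank, 1, 1)   # 1x1 channel mixing
        return W1, W2

    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 32, 3, 3))
    W1, W2 = factorise_conv(W, rank=16)
    # Composing the two stages is equivalent to convolving with W_approx,
    # the best rank-16 approximation of the original layer.
    W_approx = np.einsum('or,ricd->oicd', W2[:, :, 0, 0], W1)
    print(np.linalg.norm(W - W_approx) / np.linalg.norm(W))   # relative error

The pay-off in the encrypted setting is that the 1×1 stage has k = 1, so by the convolution-packing count above it costs only r·c_out ciphertext-scalar multiplications, while the expensive k × k stage now has far fewer output channels.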
In homomorphic encryption applied to neural networks, the most expensive homomorphic operation is rotation, which in the worst case requires performing both a number theoretic transform (NTT) and an inverse NTT on vectors of length N. We observe that CRYPTONETS-HS requires a total of 122 rotations, as shown in Table 2a. LOLA-MNIST requires a total of 380 rotations for the same architecture. Additionally, we notice that at both of the convolution-pooling blocks in the CRYPTONETS architecture, the HS implementation uses fewer rotations than LOLA. The optimised ME architecture requires only 56 rotations per inference, yet still achieves a high test accuracy of 98.7%, close to the 98.95% test accuracy achieved by LOLA-MNIST, suggesting that careful selection of layer sizes can lead to great reductions in inference latency. In terms of latency, the CRYPTONETS-HS model requires 2.7 seconds, whereas ME requires only 0.97 seconds. For reference, LOLA-MNIST requires 2.2 seconds per inference; however, it utilises 8 cores of a server CPU, whereas our homomorphic operations are run on a single core of a desktop processor. With CE, we provide a model with slightly lower test accuracy than LOLA (73.1% vs. 74.1%) but requiring less than 10% of the number of rotations. This is due both to ensuring that intermediate tensor sizes fit well into the number of available slots, and to the use of the HS method. The number of operations for each layer is shown in Table 2b.

Table 2: Break-down of the types of operations performed by the low-latency models. The notation 'layer1-layer2' denotes the layers between and including layer1 and layer2.

Privacy-preserving inference using homomorphic encryption is largely constrained by the computational requirements of the operations. We propose improvements over LOLA to achieve lower latencies when computing intermediate convolutions, resulting in over a two-fold reduction in the number of rotations for the same MNIST architecture. It is clear that further improvements can be made along this direction, especially on the topic of automatically selecting suitable layer parameters to set the trade-off between inference latency and model accuracy (Lou et al., 2020).

References

Abadi et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems.
Bae et al. (2018). Security and privacy issues in deep learning.
Brakerski & Vaikuntanathan (2011). Efficient fully homomorphic encryption from (standard) LWE.
Brutzkus et al. (2019). Low latency privacy preserving inference.
Chen et al. (2017). Simple Encrypted Arithmetic Library - SEAL v2.1. In Financial Cryptography and Data Security.
Cheon et al. (2017). Homomorphic encryption for arithmetic of approximate numbers.
Chou et al. (2018). Faster CryptoNets: Leveraging sparsity for real-world encrypted inference.
Gilad-Bachrach et al. (2016). CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy.
Halevi & Shoup (2019). Algorithms in HElib.
Juvekar et al. (2018). GAZELLE: A low latency framework for secure neural network inference.
Lou et al. (2020). AutoPrivacy: Automated layer-wise parameter selection for secure neural network inference.
Lu et al. (2021). Fast factorized neural network inference on encrypted data.
Mishra et al. (2020). Delphi: A cryptographic inference service for neural networks.