title: Accelerating Encrypted Computing on Intel GPUs
authors: Zhai, Yujia; Ibrahim, Mohannad; Qiu, Yiqin; Boemer, Fabian; Chen, Zizhong; Titov, Alexey; Lyashevsky, Alexander
date: 2021-09-29

Homomorphic Encryption (HE) is an emerging encryption scheme that allows computations to be performed directly on encrypted messages. This property provides promising applications such as privacy-preserving deep learning and cloud computing. Prior works have been proposed to enable practical privacy-preserving applications with architecture-aware optimizations on CPUs, GPUs and FPGAs. However, there is no systematic optimization for the whole HE pipeline on Intel GPUs. In this paper, we present the first-ever SYCL-based GPU backend for Microsoft SEAL APIs. We perform optimizations at the instruction level, algorithmic level and application level to accelerate our HE library based on the Cheon, Kim, Kim and Song (CKKS) scheme on Intel GPUs. The performance is validated on two latest Intel GPUs. Experimental results show that our staged optimizations, together with low-level optimizations and kernel fusion, accelerate the Number Theoretic Transform (NTT), a key algorithm for HE, by up to 9.93X compared with the naïve GPU baseline. The roofline analysis confirms that our optimized NTT reaches 79.8% and 85.7% of the peak performance on the two GPU devices. Through the highly optimized NTT and the assembly-level optimization, we obtain 2.32X - 3.05X acceleration for HE evaluation routines. In addition, our all-together systematic optimizations improve the performance of the encrypted element-wise polynomial matrix multiplication application by up to 3.10X.

The COVID-19 pandemic boosts the rapidly growing demand of enterprises for cloud computing. By 2021, 50% of enterprise workloads are deployed to public clouds, and this percentage is expected to exceed 57% in the next 12 months [1]. Although outsourcing data processing to cloud resources enables enterprises to relieve the overhead of deployment and maintenance for their private servers, it raises security and privacy concerns about potential sensitive data exposure. Adopting traditional encryption schemes to address this privacy concern is less favorable because a traditional encryption scheme requires decrypting the data before the computation, which presents a vulnerability and may destroy the data privacy. In contrast, Homomorphic Encryption (HE), an emerging cryptographic encryption scheme, is considered to be one of the most promising solutions to such issues. HE allows computations to be performed directly on encrypted messages without the need for decryption.
Therefore, this encryption scheme protects private data from both internal malicious employees and external intruders, while assuming honest computations. In 1978, Rivest, Adleman, and Dertouzos [2] first introduced the idea of computing on encrypted data through the use of "privacy homomorphisms". Since then, several HE schemes have been invented, which can be categorized by the types of encrypted computation they support. Partial HE schemes enable only encrypted additions or multiplications. The famous RSA cryptosystem is, in fact, the first HE scheme, supporting encrypted modular multiplications. In contrast, the Paillier cryptosystem [3] is a partial HE scheme that supports only modular additions. Leveled HE schemes, on the other hand, support both encrypted additions and multiplications, but only up to a certain circuit depth L determined by the encryption parameters. The Brakerski/Fan-Vercauteren (BFV) [4] and Brakerski-Gentry-Vaikuntanathan (BGV) [5] schemes are two popular leveled HE schemes used today, which support exact integer computation. In [6], Cheon, Kim, Kim and Song presented the CKKS scheme, which treats the encryption noise as part of the approximation errors that occur during computations in floating-point numerical representation. This imprecision requires a refined security model [7], but provides faster runtimes than BFV/BGV in practice. Fully HE schemes enable an unlimited number of encrypted operations, typically by adding an expensive bootstrapping step to a leveled HE scheme, as first detailed by Craig Gentry [8]. TFHE [9] improves the runtime of bootstrapping, but requires evaluating circuits on binary gates, which becomes expensive for standard 32-bit or 64-bit arithmetic.

The improved capabilities and performance of these HE schemes have enabled a host of increasingly sophisticated real-world privacy-preserving applications. Early applications included basic statistics and logistic regression evaluation [10]. More recently, HE applications have expanded to a wide variety of domains, including privatized medical data analytics, privacy-preserving machine learning, homomorphic dynamic programming, private information retrieval, and sorting on encrypted data [11]-[19]. The memory and runtime overhead of HE is a major obstacle to immediate real-world deployments. Several HE libraries support efficient implementations of multiple HE schemes, including Microsoft SEAL [20] (BFV/CKKS), HElib [21] (BFV/BGV/CKKS), and PALISADE [22] (BGV/BFV/CKKS/TFHE). In [23], Intel published HEXL, accelerating HE integer arithmetic on finite fields by featuring Intel Advanced Vector Extensions 512 (Intel AVX512) instructions. Since GPUs deliver higher memory bandwidth and computing throughput with lower unit power consumption, researchers have also accelerated HE using CUDA-enabled GPUs, notably in the cuHE [24], TFHE [9] and NuFHE [25] libraries. Although HE optimizations on CPUs and CUDA-enabled GPUs have been extensively studied, an architecture-aware HE study investigating Intel GPUs has not been available. In addition, since the Number Theoretic Transform (NTT) is a crucial algorithm enabling fast HE computations, previous works on accelerating HE libraries focus mainly on optimizing the computing kernels of NTT and inverse NTT (iNTT).
However, a HE library supports basic HE primitives such as encoding, encryption and evaluation. Such a comprehensive pipeline requires systematic optimizations, rather than optimizing only the computing kernels. In this paper, we present a HE library optimized for Intel GPUs based on the CKKS scheme. We not only provide a set of highly optimized computing kernels such as NTT and iNTT, but also optimize the whole HE evaluation pipeline at both the instruction level and the application level. More specifically, our contributions include:
• To the best of our knowledge, we design and develop the first-ever SYCL-based GPU backend for Microsoft SEAL APIs, which is also the first HE library based on the CKKS scheme optimized for Intel GPUs.
• We provide a staged implementation of NTT leveraging the shared local memory of Intel GPUs. We also optimize NTT by employing strategies including a high-radix algorithm, kernel fusion, and explicit multi-tile submission.
• At the instruction level, we enable low-level optimizations for 64-bit integer modular addition and modular multiplication using inline assembly. We also provide a fused modular multiplication-addition operation to reduce the number of costly modular operations.
• At the application level, we introduce a memory cache mechanism to recycle freed memory buffers on the device and avoid run-time memory allocation overhead. We also design fully asynchronous HE operators and asynchronous end-to-end HE evaluation pipelines.
• We benchmark our HE library on two latest Intel GPUs. Experimental results show that our NTT implementations reach up to 79.8% and 85.7% of the theoretical peak performance on the two experimental GPUs, faster than the naive GPU baseline by 9.93X and 7.02X, respectively.
• Our NTT and assembly-level optimizations accelerate five HE evaluation routines under the CKKS scheme by 2.32X - 3.06X. In addition, the polynomial element-wise matrix multiplication applications are accelerated by 2.68X - 3.11X by our all-together systematic optimizations.

The rest of the paper is organized as follows: we introduce background and related works in Section II, and then detail the asynchronous design and systematic optimization approaches in Section III. Evaluation results are given in Section IV. We conclude our paper and present future work in Section V.

In this section, we briefly introduce the basics of the CKKS HE scheme. We then introduce the general architecture of Intel GPUs, and summarize prior works on NTT optimization on both CPUs and GPUs.

The CKKS scheme was first introduced in [6], enabling approximate computation on complex numbers. This approximate computation is particularly suitable for real-world floating-point data, which is already inexact. Further work improved CKKS to support a full residue number system (RNS) [26] and bootstrapping [27]. In this paper, we select CKKS as our FHE scheme, as implemented in Microsoft SEAL [20]. Many HE schemes, including CKKS, BFV, and BGV, derive their security from the hardness of the ring learning with errors (RLWE) problem, which relates polynomials in a quotient ring. Let $\Phi_M(x)$ be the $M$-th cyclotomic polynomial, with polynomial modulus degree $N = \varphi(M)$. Then, given a ciphertext modulus $q$, the ring $R_q = \mathbb{Z}_q[x]/\Phi_M(x)$ consists of degree $N-1$ polynomials with integer coefficients in $\mathbb{Z}_q = \{0, 1, \ldots, q-1\}$. For efficient implementation, $M$ is typically chosen to be a power of two, in which case $\Phi_M(x) = x^{M/2} + 1 = x^N + 1$ with $N$ also a power of two.
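To make the ring concrete, here is a toy worked example in LaTeX with parameters far too small to be secure (chosen by us purely for illustration):

% Toy example: M = 8, so N = \varphi(8) = 4 and \Phi_8(x) = x^4 + 1.
% Work in R_q = \mathbb{Z}_q[x]/(x^4 + 1) with q = 17.
\begin{aligned}
a(x) &= 3 + 5x + 2x^3, \qquad b(x) = x^2,\\
a(x) \cdot b(x) &= 3x^2 + 5x^3 + 2x^5\\
&\equiv 3x^2 + 5x^3 - 2x \pmod{x^4 + 1} \qquad (\text{since } x^4 \equiv -1)\\
&\equiv 15x + 3x^2 + 5x^3 \pmod{x^4 + 1,\; 17}.
\end{aligned}

The same reduction by $x^N + 1$ and by $q$ applies coefficient-wise at the realistic sizes used later in the paper (e.g., $N = 32\mathrm{K}$ with a 60-bit-bounded modulus chain).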
CKKS has a message space consisting of $N/2$-dimensional complex vectors $\mathbb{C}^{N/2}$, plaintext space $R_q$, and ciphertext space $R_q^2$. We note that multiplication in $R$ includes a reduction by $x^N + 1$, which ensures the product is also a degree $N-1$ polynomial. CKKS also takes in a precision argument $\Delta$ used to scale inputs. The ciphertext modulus is chosen as $q = q_L = \Delta^L \cdot q_0$, and we denote $q_l = \Delta^l q_0$. The CKKS scheme is composed of the following basic primitives: KeyGen, Encode, Decode, Encrypt, Decrypt, Add, Multiply (Mul), Relinearize (Relin) and Rescale (RS). We assume that the reader is already familiar with these logical blocks, and therefore we provide below only cursory descriptions, while deeper descriptions can be found elsewhere [6].

KeyGen$(\lambda) \rightarrow \{pk \in R_q^2,\ sk \in R_q,\ evk \in R_{P \cdot q}^2\}$ takes a security parameter as input and generates a set of keys for the CKKS scheme, namely the secret key ($sk$), public key ($pk$), and evaluation key ($evk$). $pk$ and $sk$ are used for encryption and decryption, respectively, while $evk$ is used for homomorphic evaluation operations, which are accelerated on Intel GPUs in this paper and will be elaborated on later.

Encode$(z \in \mathbb{C}^{N/2}, \Delta) \rightarrow m \in R_q$ encodes an input $z$ to a message $m$. Since the encoding scheme might destroy some significant digits, encoding includes a multiplication by $\Delta$ to keep a precision of $1/\Delta$.

Decode$(m \in R_{q_l}, \Delta) \rightarrow z \in \mathbb{C}^{N/2}$ is the inverse operation of Encode. It includes a division by the scaling factor $\Delta$.

Encrypt$(m \in R_{q_L}, pk) \rightarrow c \in R_{q_L}^2$ takes an input plaintext $m$ and a public key $pk$. By sampling a vector $v$ and two Gaussian-distributed vectors $e_0, e_1$, the encrypted message is generated as $c = (v \cdot pk + (m + e_0, e_1)) \bmod q_L$.

Decrypt$(c \in R_{q_l}^2, sk) \rightarrow m \in R_q$ requires access to the secret key $sk$.

Relin$(c \in R_{q_l}^3, evk) \rightarrow c' \in R_{q_l}^2$ decreases the size of the ciphertext back to (at least) 2 after multiplication. We have $c' = (c_0, c_1) + P^{-1} \cdot c_2 \cdot evk \bmod q_l$, where $P$ is a large auxiliary number used for basis conversion.

RS$_{l \rightarrow l'}(c \in R_{q_l}^2) \rightarrow c' \in R_{q_{l'}}^2$ keeps the scale constant and reduces the noise in a ciphertext after Mul. However, it reduces the level $l$ of the ciphertext, limiting the number of further homomorphic operations that can be performed. RS computes $c' = \frac{q_{l'}}{q_l} c \bmod q_{l'}$. Often $l' = l - 1$, in which case RS$_{l \rightarrow l-1}(c) = \Delta^{-1} c \bmod q_{l-1}$.

As noted in [28], the NTT can be exploited to accelerate multiplications in the polynomial ring $R_q = \mathbb{Z}_q[x]/(x^N + 1)$. We represent polynomials using a coefficient embedding. Let $\omega$ be a primitive $N$-th root of unity in $\mathbb{Z}_q$ such that $\omega^N \equiv 1 \pmod{q}$. In addition, let $\psi$ be a $2N$-th root of unity in $\mathbb{Z}_q$ such that $\psi^2 = \omega$. Further defining $\tilde{a} = (a_0, \psi a_1, \ldots, \psi^{N-1} a_{N-1})$ and $\tilde{b} = (b_0, \psi b_1, \ldots, \psi^{N-1} b_{N-1})$, one can quickly verify that for $c = a \cdot b \in \mathbb{Z}_q^N$ there holds the relationship $c = \Psi^{-1} \odot \mathrm{iNTT}(\mathrm{NTT}(\tilde{a}) \odot \mathrm{NTT}(\tilde{b}))$. Here $\odot$ denotes element-wise multiplication and $\Psi^{-1}$ represents the vector $(1, \psi^{-1}, \psi^{-2}, \ldots, \psi^{-(N-1)})$. Therefore, the total computational complexity of ciphertext multiplication in $R_q$ is reduced from $O(N^2)$ to $O(N \log N)$. In practice, since polynomial coefficients in the ring space are big integers under the modulus $q$, multiplying these coefficients becomes computationally expensive. The Chinese Remainder Theorem (CRT) is typically employed to reduce this cost by transforming large integers to the Residue Number System (RNS) representation.
According to the CRT, one can represent the large integer $x \bmod q$ using its remainders $(x \bmod p_1, x \bmod p_2, \ldots, x \bmod p_n)$, where the moduli $(p_1, p_2, \ldots, p_n)$ are co-prime and satisfy $\prod_i p_i = q$. We note the CKKS scheme has been improved from the initial presentation in Section II-A to take full advantage of the RNS [26]. To summarize, to multiply polynomials $a$ and $b$ represented as vectors in $\mathbb{Z}_q^N$, one first performs the NTT to transform the negatively wrapped $\tilde{a}$ and $\tilde{b}$ to the NTT domain. After finishing element-wise polynomial multiplication in the NTT domain, the iNTT is applied to convert the product back to the coefficient embedding domain. When the polynomials are in RNS form, both the NTT and iNTT are decomposed into $n$ concurrent subtasks. Finally, we obtain the final product by merging the iNTT-converted polynomial with $\Psi^{-1}$.

Due to the pervasive usage of NTT and iNTT in HE, prior researchers proposed optimized implementations of NTT on CPUs [23], GPUs [29]-[32] and FPGAs [33], [34]. To be more specific, Intel HEXL provides a single-threaded CPU implementation of the radix-2 negacyclic NTT using Intel AVX512 instructions [23] and Harvey's lazy modular reduction approach [35]. GPU-accelerated NTT implementations typically adopt the hierarchical algorithm first presented by Microsoft Research for the Discrete Fourier Transform (DFT) [36]. Among these, [31] implements the hierarchical NTT with twiddle factors cached in shared memory. Rather than caching twiddle factors, [32] computes some twiddle factors on-the-fly to reduce the cost of modular multiplication and the number of memory accesses of NTT. [30] further exploits the built-in warp shuffling mechanism of CUDA-enabled GPUs to optimize NTT. The hierarchical NTT implementation computes the NTT in three or four phases [30], [36]. An $N$-point NTT sequence is first partitioned into two dimensions $N = N_\alpha \cdot N_\beta$, and then $N_\alpha$ NTT workloads proceed simultaneously, where each workload computes an $N_\beta$-point NTT. After this column-wise NTT phase is completed, all elements are multiplied by their corresponding twiddle factors and stored to the global memory. In the next phase, $N_\beta$ simultaneous row-wise $N_\alpha$-point NTTs are computed, followed by a transpose before storing back to the global memory. $N_\alpha$ and $N_\beta$ are selected to fit the size of shared memory on GPUs. Considering that both the RNS representation of NTT and the batched processing opportunities in real-world applications can provide us with sufficient parallelism, we adopt the staged NTT implementation rather than the hierarchical NTT implementation in this paper.

We use the Intel Gen11 GPU as an example to present the architecture of Intel GPUs [37]. Intel GPUs consist of a hierarchical architecture of computing and storage units. Starting from the lowest level, an Intel GPU contains a set of execution units (EUs), where each EU supports up to seven simultaneous hardware threads, namely EU threads. In each EU, there is a pair of 128-bit SIMD ALUs, which support both floating-point and integer computations. Each of these simultaneous hardware threads has a 4KB general register file (GRF). So, an EU contains 7 × 4KB = 28KB of GRF. Meanwhile, the GRF can be viewed as a contiguous storage area holding a vector of 16-bit or 32-bit floating-point or integer elements. For most Intel Gen11 GPUs, 8 EUs are aggregated into 1 Subslice. EUs in each Subslice can share data and communicate with each other through a 64KB highly banked data structure, the shared local memory (SLM).
SLM is accessible to all EUs in a Subslice but is private to EUs outside of this Subslice. Besides providing a shared storage unit, Subslices also possess their own thread dispatchers and instruction caches. Eight Subslices further group into a Slice, which integrates additional logic such as geometry units and the L3 cache.

In the CKKS scheme, an input message is first encoded and then encrypted to generate ciphertexts using the public key provided by the key generation primitive. We evaluate (compute) directly on the encrypted messages (ciphertexts). Once all computations are completed, the results can be decrypted and decoded by the private key's owner. Figure 1 describes the control flow of our asynchronous HE library. Our HE library accelerates the HE evaluation using Intel GPUs while leaving other phases such as key generation, encoding, encryption, decryption and decoding on the CPU. Once all the inputs and static data are sent to the GPU, synchronization with the host becomes unnecessary. Since all host-device synchronizations take additional time, we developed a fully asynchronous execution pipeline to economize on synchronizations. As shown in Figure 2, the computation on the GPU starts as soon as the first kernel of the computational graph is submitted. Meanwhile, GPU buffers are allocated and managed at runtime. The GPU synchronizes with the host only after the buffers with the results are transferred back to the system memory. In the remainder of this section, we present optimizations of our library from three angles: instruction, algorithm and application.

Our HE library supports basic instructions such as addition, subtraction, multiplication and modular reduction, all of which are 64-bit integer (int64) operations. We explicitly select int64 because our goal has been to provide accelerated SEAL APIs on Intel GPUs almost transparently. This is why our current top-level software is not well suited to drive 32-bit integer (int32) calculations, although we envision supporting both int32 and int64 eventually. Among these operations, the most expensive are modulus-related operations such as modular addition and modular multiplication. Although we can accelerate modular reduction using the Barrett reduction algorithm, which transforms the division operation into less expensive multiplications, modular computations remain costly since no modern GPUs support int64 multiplication natively. Such multiplications are emulated at the software level with compiler support. Based on these observations, we propose instruction-level optimizations from two aspects: 1) fusing modular multiplication with modular addition to reduce the number of modulo operations, and 2) optimizing modular addition/multiplication at the assembly level to remedy the compiler deficiency.

1) Fused modular multiplication-addition operation: Rather than eagerly applying a modulo operation after both the multiplication and the addition, we perform only one modulo operation after a pair of consecutive multiplication and addition operations, namely a mad_mod operation. We store the output of the int64 multiplication in a 128-bit intermediate. The potential overflow introduced by omitting the modular reduction after the addition is not a concern when both operands of the addition are integers strictly less than 64 bits.
This assumption holds because, to assure a fast NTT, we follow SEAL and use David Harvey's optimizations [35] (discussed in the next section), so all of our ciphertexts live in a ring space whose integer modulus is less than 60 bits. Therefore, no ciphertext element exceeds 60 bits.

2) Optimizing HE arithmetic operations using inline assembly: Inline assembly is a feature available in some compilers that allows programmers to embed low-level assembly code into programs written in higher-level languages such as C/C++. Software engineers mainly resort to inline assembly to optimize performance-sensitive parts of their programs. This is achieved by embedding an assembly implementation that is more efficient than what the compiler auto-generates. Additionally, it gives access to processor-specific or special instructions used to implement locking, synchronization and other specialized features that are not yet supported by the compiler. The first step in this process is to study and carefully review how Intel's DPC++ compiler generates code for core operations. Investigating the compiler-generated code is key to low-level optimizations and can expose many opportunities for speedups. We saw such opportunities in two of our core arithmetic operations: unsigned modular addition and unsigned integer multiplication.

a) Unsigned Modular Addition (add_mod): The optimization for this operation is simple yet effective. The compiler-generated assembly sequence shown in Figure 3(a) is replaced with the sequence shown in Figure 3(b), eliminating one instruction from the sequence. It is important to note that even a small optimization applied to core operations at the assembly level directly benefits other, more complex HE operations, since these core operations are ubiquitous in HE libraries.

b) Unsigned Integer Multiplication (mul64): Another case in which inline assembly can be useful is when the compiler-generated code sequence does not exactly or efficiently match the required functionality and, as a result, uses more expensive operations or code sequences. An example of this case can be found in the compiler-generated code for int64 multiplication, where there is a compilation deficiency related to variable type-casting. By default, the compiler looks for ways to minimize the number of type-casting instructions. Although this would generally seem beneficial, it can be detrimental in some cases, considering that operand types play a significant role in how an instruction is implemented. For example, an integer multiplication where both operands are int32 can be executed with a single assembly instruction, compared to a lengthier, emulated multiplication when the operands are of type int64. This is an example of how knowledge of the underlying instruction set architecture can provide very specific low-level optimization opportunities. We target one of the multiplication instructions available in our hardware for this purpose; we will refer to this instruction as mul_low_high. The instruction receives two 32-bit operands and stores both the low and high 32 bits of the result in a 64-bit destination [38].
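To make the emulation cost concrete, the following portable C++ sketch of our own shows how a 64x64-bit product can be assembled from four 32x32-to-64-bit partial products, which is the kind of building block mul_low_high provides in hardware. This is an illustration of the general technique, not the compiler's actual code sequence.

#include <cstdint>

// Schoolbook 64x64 -> 128-bit multiply built from 32x32 -> 64-bit products.
// lo/hi receive the low and high 64 bits of a * b.
inline void mul64_emulated(uint64_t a, uint64_t b, uint64_t &lo, uint64_t &hi) {
    uint64_t a_lo = static_cast<uint32_t>(a), a_hi = a >> 32;
    uint64_t b_lo = static_cast<uint32_t>(b), b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;   // each of these is a 32x32 -> 64 product
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    // Accumulate the middle partial products with their carries.
    uint64_t mid = (p0 >> 32) + static_cast<uint32_t>(p1) + static_cast<uint32_t>(p2);
    lo = (mid << 32) | static_cast<uint32_t>(p0);
    hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

The four multiplications plus carry handling explain why the compiler-generated sequence in Figure 4(a) is several times longer than the single mul_low_high instruction in Figure 4(b) when the operands genuinely fit in 32 bits.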
At every multiplication where the compiler eagerly promotes int32 operands to int64, we utilize inline assembly to bypass this deficiency, forcing the compiler to keep the appropriate types and to utilize the mul_low_high instruction, as shown in Figure 4. This optimization yields a ∼60% reduction in instruction count compared with our original int64 multiplication implementation.

(a) Compiler-generated:
1: mul temp src2 src1
2: mulh temp1 src2 src1
3: mul temp2 src2 src1
4: add temp1 temp1 temp2
5: mul temp2 src2 src1
6: add temp1 temp1 temp2
7: mov dst_low temp
8: mov dst_high temp1
(b) Optimized with inline assembly:
1: mul_low_high dst_low_high src1 src2
Fig. 4: Pseudo mul64 inline optimization.

As mentioned previously, applying low-level optimizations to core operations directly impacts our algorithm and kernel performance. Optimizations aimed at our core arithmetic operations greatly impact the performance of HE, as will be discussed in more detail in Section IV.

An efficient NTT implementation is crucial for HE computations since it accounts for a substantial percentage of the total HE computation time [32], [39], [40]. Figure 5 presents the relative execution time of five HE evaluation routines, and the percentage of NTT in each routine, benchmarked on two latest Intel GPUs, Device1 and Device2. We observe that NTT accounts for 79.99% and 75.64% of the total execution time on average on these two platforms. This profiling confirms the significance of optimizing NTT for HE on Intel GPUs.

1) Naive radix-2 NTT: We start NTT optimizations from the most naive radix-2 implementation. This reference implementation of NTT, shown in Figure 6, distributes rounds of radix-2 NTT butterfly operations among work-items. In each round of NTT computation, all the work-items compute their own butterfly operation and exchange data with other work-items using global memory. More specifically, the k-th element is exchanged with the (k + gap)-th element, and the exchanging gap size is halved after each round of NTT until it becomes equal to 1. Accordingly, the m-loop in Figure 6 is executed log(N) times throughout the NTT computation, and for each iteration of the m-loop one accesses the global memory 2N times. Here we multiply by two because of both load and store operations. We ignore the twiddle-factor memory accesses in this semi-quantitative analysis. At the lowest level, the NTT butterfly computation is accelerated using Algorithm 1 [35]:

Algorithm 1 (Harvey's butterfly [35]). Input: $0 \le X, Y < 4p$, $p < \beta/4$, $0 < W < p$, $0 < W' = \lfloor W\beta/p \rfloor < \beta$. Output: $X', Y' < 4p$ with $X' \equiv X + WY$ and $Y' \equiv X - WY \pmod{p}$.

Since the outputs $X', Y'$ of Algorithm 1 are both in $[0, 4p)$, to ensure that all elements of the output NTT sequence fall inside the interval $[0, p)$, a last-round correction needs to be appended to the end of the NTT computation. Therefore, the naive implementation of an N-point NTT needs to access the global memory $2N \log(N)$ times for the NTT and $2N$ extra times for last-round processing. A three-dimensional DPC++ kernel corresponding to the three layers of the for loop is invoked when deploying this algorithm on Intel GPUs. This kernel reaches only 10% of the peak performance for a 32K-point, 8-RNS-size, 1024-instance NTT.

2) Staged radix-2 NTT with shared local memory: Due to the low bandwidth and high latency of the global memory, the operational density of the naive radix-2 NTT is far from optimal. Therefore, its performance is bounded by the global memory bandwidth.
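For readers unfamiliar with Algorithm 1, here is a scalar, host-style C++ sketch of a Harvey-type lazy butterfly in the spirit of [35], assuming the word size $\beta = 2^{64}$. It is our reconstruction for illustration, not the paper's actual DPC++ kernel code.

#include <cstdint>

// One Cooley-Tukey butterfly with Harvey's lazy reduction.
// Inputs X, Y are in [0, 4p); W is a twiddle factor in (0, p);
// Wq = floor(W * 2^64 / p) is the precomputed quotient W'.
// Outputs stay in [0, 4p): X' = X + W*Y and Y' = X - W*Y (mod p).
inline void harvey_butterfly(uint64_t &X, uint64_t &Y,
                             uint64_t W, uint64_t Wq, uint64_t p) {
    const uint64_t two_p = 2 * p;
    if (X >= two_p) X -= two_p;                          // lazy reduction of X to [0, 2p)
    uint64_t Q = (unsigned __int128)Wq * Y >> 64;        // high 64 bits of W' * Y
    uint64_t T = W * Y - Q * p;                          // W*Y mod p, kept in [0, 2p)
    uint64_t newX = X + T;                               // < 4p
    uint64_t newY = X - T + two_p;                       // < 4p
    X = newX;
    Y = newY;
}

Each butterfly costs only a handful of integer multiplications and additions, which is exactly why, as noted above, the naive kernel ends up limited by global memory traffic rather than by ALU work.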
To address this issue, we keep data close to the computing units by leveraging the shared local memory (SLM) of Intel GPUs, a memory region that is accessible to all the work-items belonging to the same work-group. Given that the data exchanging gap size is halved after each round of NTT, there exists a certain round at which the gap size becomes sufficiently small that all data to exchange can be held in SLM. We call this threshold gap size TER_SLM_GAP_SZ; after reaching it, we retain the data in SLM for communication between work-items to avoid the expensive global memory latency. For example, in a 32K-point NTT, we first compute three rounds of NTT and exchange data using global memory, by which point the data exchanging gap size has decreased to 4K. We set TER_SLM_GAP_SZ to 4K because the size of the SLM on most Intel GPUs is 64KB, which can hold 8K int64 elements. For the remaining rounds, the data are held in SLM until all computations are completed.

3) SIMD shuffling: In addition to introducing shared local memory, when the exchanging distance becomes sufficiently small that all data to exchange are held by work-items in the same subgroup, we perform SIMD shuffling directly among all the work-items in the same subgroup after the NTT butterfly computations. In Figure 7, we present the rationale of two SIMD shuffling operations across three stages. When the SIMD width equals 8, there are 8 work-items in a subgroup. For the radix-2 NTT implementation, each work-item holds two elements of the NTT sequence in registers, namely one slot; we denote the two local registers of each work-item as Register 0 and Register 1. While the exchanging gap size is larger than TER_SLM_GAP_SZ, work-items communicate through the global memory by invoking global_radix_2_ntt_kernel. After reaching the SLM threshold, one computes NTT butterfly operations and exchanges data through SLM by calling slm_radix_2_ntt_kernel, until the gap size reaches the threshold for exchanging data using SIMD shuffling inside subgroups, which is handled by the kernel simd_radix_2_ntt_kernel. It is worth mentioning that the SIMD kernel is fused with the aforementioned last-round processing to reduce all NTT elements to [0, p).

auto shift_idx = (lane_id >> log_gap);
auto tmp1 = ((shift_idx + 1) & 1);
auto tgt_idx = lane_id + (((tmp1 << 1) - 1) << log_gap);
for (int slot_idx = 0; slot_idx < LOCAL_REG_SLOTS; slot_idx++) {
    auto reg_idx = tmp1 + (slot_idx << 1);
    data[reg_idx] = sg.shuffle(data[reg_idx], tgt_idx);
}
Fig. 9: NTT SIMD shuffling pseudo code.

4) More aggressive register blocking: Intel GPUs typically provide a 4KB GRF for each EU thread. When the SIMD width equals 8, 8 work-items are bound together and execute as one EU thread in SIMD fashion. For the radix-2 NTT implementation, each work-item needs four registers: two of them hold NTT data elements and the other two hold twiddle factors. Therefore, the NTT-related computation consumes 4 · 8 · 8B = 256B of GRF for each EU thread, only 6.25% of the total GRF, indicating that the hardware is significantly underutilized at the register level. Rather than initializing only one slot of registers, one can assign more workload to each work-item. For example, we can allocate 4 slots of registers for each work-item to hold NTT element data. For a subgroup of size 8, there are then 8 · 4 = 32 NTT elements held in registers in the SIMD kernel. This is referred to as SIMD (32, 8) in the following discussions. Figure 9 shows the pseudo code of the multi-slot SIMD shuffling operation.
When LOCAL_REG_SLOTS is set to 1, this pseudo code degrades to the single-slot implementation. Compared with the single-slot implementation, the multi-slot SIMD implementation results in fewer accesses to shared local memory, but suffers from higher register pressure and in-register data exchange overhead. In practice, the efficiencies of both the 2-slot SIMD (16, 8) and the 4-slot SIMD (32, 8) implementations are worse than the 1-slot SIMD (8, 8), suggesting that these negative effects dominate the performance.

5) High-radix NTT: The prior optimization trials indicate that the cost of data exchange can offset the acceleration gained from introducing SLM. With this observation in mind, we seek optimizations that directly reduce the amount of data exchange and communication among work-items, so that we can reduce the number of SIMD shuffles and memory accesses. Therefore, we adopt high-radix NTT implementations. Compared with the radix-2 implementation, a high-radix NTT aggregates more data elements into a group and performs calculations with all the data held in registers. We use radix-8 as an example to demonstrate high-radix NTT algorithms. Each work-item allocates 8 registers to hold NTT elements and 8 more registers to hold the root powers and root power quotients serving as twiddle factors. For a specific round where the exchanging gap size is gap, one loads eight NTT elements from either global or shared local memory, indexed at {k, k + gap, k + 2·gap, ..., k + 7·gap}. There are three internal rounds of NTT for each radix-8 kernel, and in each internal round the held elements are paired so that Algorithm 1 can be leveraged. In the last internal round of the radix-8 kernel, each two consecutive elements are paired and fed into the 2-point butterfly algorithm. After all in-register computations are completed, we store the results back to either global memory or shared local memory, depending on whether it is a global memory kernel or a shared local memory kernel. Last-round processing is fused with the shared local memory high-radix NTT kernels. Figure 10 shows the parallelism of the staged NTT implementation. For a 32K-point NTT shared local memory kernel, we assign 4K NTT elements to each work-group. Therefore, 8 work-groups, which are mapped to Subslices, are activated. Meanwhile, both the RNS representation and the batched number of polynomials provide additional parallelism, since the RNS size can be up to several dozen [32] while the batch size can be up to tens of thousands in real-world deep learning tasks [42].

In addition to the instruction-level and algorithm-level optimizations, we also optimize our GPU-accelerated HE library at the application level.

1) Memory cache: To reduce the overhead introduced by runtime memory allocation, we add a memory cache mechanism to our HE library, as shown in Figure 11. Similar to Microsoft SEAL, we introduce a memory pool to reuse allocated GPU memory buffers in the HE pipeline. A request for a new GPU memory buffer is routed through the memory cache, which is searched for an existing free buffer with a capacity no smaller than the current request. If such a buffer is found, the existing buffer is reused instead of allocating a new one. Upon freeing a buffer, it is moved back to the free pool for potential reuse.

2) Inter-device scaling: At this time, DPC++ does not implicitly support multi-tile submission. As such, workloads cannot automatically be distributed over all the computing units of a multi-tile Intel GPU. Therefore, we explicitly submit workloads to the multi-tile device through multiple queues.
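The following minimal DPC++ sketch of ours illustrates the general idea of explicit per-tile submission through multiple queues. It assumes the GPU exposes its tiles as sub-devices, assumes the batch size divides evenly across tiles, and omits error handling; it is not the library's actual submission code.

#include <sycl/sycl.hpp>
#include <vector>

int main() {
  // Pick the first GPU and split it into per-tile sub-devices.
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  sycl::device root = gpus.at(0);
  auto tiles = root.create_sub_devices<
      sycl::info::partition_property::partition_by_affinity_domain>(
      sycl::info::partition_affinity_domain::next_partitionable);

  // One in-order queue per tile.
  std::vector<sycl::queue> queues;
  for (auto &tile : tiles)
    queues.emplace_back(tile, sycl::property::queue::in_order{});

  // Split a batch of independent NTT instances across the tiles.
  const size_t batch = 1024, chunk = batch / queues.size();
  for (size_t t = 0; t < queues.size(); ++t) {
    queues[t].submit([&](sycl::handler &h) {
      h.parallel_for(sycl::range<1>{chunk}, [=](sycl::id<1> i) {
        // NTT workload for instance (t * chunk + i) would run here (omitted).
      });
    });
  }
  for (auto &q : queues) q.wait();  // synchronize once, at the very end
  return 0;
}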
We refer the reader to [43] for more details of the DPC++ multi-queue implementation.

We evaluate our optimizations on two Intel GPUs with the latest microarchitecture. Due to confidentiality requirements, at this time we do not disclose the hardware specifications of these GPUs. For the same reason, we present performance data as normalized execution time rather than absolute elapsed time, and we do not report performance in units of GFLOPS. The first Intel GPU, denoted Device1 in the following discussions, is a multi-tile GPU, while the second Intel GPU, Device2, is a single-tile GPU. We utilize up to 2 tiles of the multi-tile Device1 for performance benchmarking and efficiency estimation. Both GPU devices are connected to 24-core Intel Ice Lake server CPUs, whose boost frequency is up to 4 GHz. The associated main memory systems are both 128GB at 3200 MHz. We compile programs using the Intel Data Parallel C++ (DPC++) Compiler 2021.3.0 with the optimization flag -O3.

A. Optimizing NTT on Intel GPUs

Figure 5 shows that NTT accounts for at least 70% of the total execution time of HE evaluation routines on Intel GPUs. Therefore, we start our optimizations from this dominant algorithm.

1) First Trial: optimizing NTT using SLM and SIMD: Figure 12(a) compares the speedup of our first batch of NTT trials using the staged NTT implementation over the naive GPU implementation of NTT described in Figure 6. We use SIMD(TER_SIMD_GAP_SZ, SIMD_WIDTH) to denote the different implementation variants. The number of register slots for each work-item can be computed by dividing TER_SIMD_GAP_SZ by SIMD_WIDTH. For example, each work-item holds one pair of NTT elements in registers for SIMD (8, 8), and 4 pairs of NTT elements for SIMD (32, 8). With the shared local memory as well as the SIMD shuffling for data exchange among work-items included, we observe that SIMD (8, 8) is faster than the baseline by up to 28%. Meanwhile, SIMD (16, 8) is slightly slower than SIMD (8, 8) but remains up to 19% faster than the baseline. This indicates that the non-negligible cost of SIMD shuffling leads to the unfavorable performance. Accordingly, SIMD (32, 8), which performs SIMD shuffling and in-register data exchange more aggressively than the previous two variants, becomes even slower than the baseline. Figure 12(b) compares the efficiency of each NTT variant on Device1. The efficiency is computed by dividing the performance of each NTT implementation by the computed int64 peak performance, both in units of GFLOPS. The naive NTT reaches only 10.08% of the peak performance for a 32K-point NTT with 1024 instances executed simultaneously. The best variant, SIMD (8, 8), obtains an efficiency of up to 12.93%. Since SIMD shuffling together with the SLM data communication fails to provide a high efficiency, we deduce that the cost of data communication is so high that radix-2 NTT cannot fully utilize the device.

2) Second Trial: optimizing high-radix NTT using SLM: Figure 13(a) compares the speedup of high-radix NTT implementations with shared local memory against the naive GPU baseline. High-radix NTT implementations re-use more data at the register level, reducing the communication among work-items through either global memory or SLM. With the shared local memory also included, this time we obtain an up to 4.23X acceleration over the naive baseline.
In Figure 13(b), we see that the efficiency reaches its optimum, 34.1% of the peak performance, with the radix-8 NTT when 1024 instances are instantiated for 32K-point NTT computations. The radix-16 NTT, though it brings more aggressive register-level data re-use and requires less data exchange among work-items, suffers from register spilling, so its performance is significantly lower than that of the radix-8 NTT.

3) Assembly-level optimizations (add_mod/mul_mod): We further introduce assembly-level optimizations to improve the speed of the int64 add_mod and mul_mod operations. As shown in Figure 14(a), these low-level optimizations improve the NTT performance by 35.8% - 40.7%, increasing the efficiency of our radix-8 SLM NTT to 47.1%. The inline assembly low-level optimization provides a relatively stable acceleration percentage across different NTT sizes and instance numbers. This is because the assembly-level optimization directly reduces the clock cycles of each 64-bit integer multiplication and addition operation, which is independent of the number of active EUs at runtime.

4) Explicit dual-tile submission through DPC++: With the low-level optimizations, we observe that our NTT still reaches only up to 47.1% of the peak performance, less than half. This low efficiency arises because the DPC++ runtime does not implicitly support multi-tile execution, so only half of the machine is utilized. To address this issue, we explicitly submit workloads through multiple queues to enable full utilization of our multi-tile GPU. With respect to the dual-tile peak performance, we manage to reach 79.8% of the peak through explicit dual-tile submission. Meanwhile, our most optimized NTT is 9.93-fold faster than the naive baseline for the 32K-point, 1024-instance batched NTT.

The most naive NTT needs to access the global memory in each round of the NTT computation. Therefore, its total memory access count can be computed as $2N \log_2(N)$. Here we multiply by 2 because of both load and store operations at each round of NTT. Also, we do not count the memory accesses of the last-round NTT processing, both to simplify the analysis and because they are negligible. Table I summarizes the number of ALU operations for each NTT variant. Butterfly refers to the ALU operations for the NTT butterfly computations, while Other denotes other necessary ALU operations such as index and address pointer computations. The radix-2 NTT performs 48 integer operations per work-item in a single round of NTT, indicating that the naive NTT consumes $N/2 \cdot 48 \cdot \log_2(N)$ ALU operations throughout the whole computation. Dividing the total ALU operation count by the total memory traffic, one finds that the operational density of the naive NTT equals 1.5 integer operations per byte for the 64-bit integer NTT. This low operational density, as plotted in Figure 15, suggests that the naive NTT implementation is bounded by the global memory bandwidth and can never reach the int64 peak performance. On the other hand, for a high-radix NTT such as the radix-8 staged NTT implementation, we need only two rounds of global memory access for an instance of a 32K-point NTT computation. We first perform one round of radix-8 NTT to reduce the data exchanging gap size from 16K to 2K. At this point each work-group computes a 2K×2 = 4K-point NTT with all NTT sequence elements held in shared local memory. Therefore, the total global memory access count becomes 2 (LD/ST) × 2 (rounds) × N = 4N.
Considering that each work-item performs 456 ALU operations per round, with N/8 work-items and $\log_8(N)$ rounds, the total ALU operation count is $(N/8) \cdot 456 \cdot \log_8(N)$; dividing by the 4N · 8B of global memory traffic, one can compute that the operational density of the shared local memory radix-8 NTT equals 8.9, pushing the overall performance to the limits of the int64 ALU throughput on Device1. The operational densities of the other NTT variants are computed similarly. In addition, a sound operational density with respect to global memory access does not guarantee satisfactory overall performance. Although the SLM+SIMD radix-2 NTT is no longer bounded by the global memory bandwidth, its practical efficiency remains far from the green line in Figure 15, the int64 peak performance. According to Figure 15, we conclude that the radix-8 shared local memory NTT with last-round kernel fusion enables a sufficient operational density, which allows the performance to shift from memory-bound to compute-bound. Additionally, the shared local memory utilization, together with the low-level optimization for int64 multiplication and the DPC++ multi-tile submission, pushes the performance of the radix-8 NTT to the ceiling of the int64 ALU throughput on Device1.

We next benchmark five HE evaluation routines. Here MulLin denotes a multiplication followed by a relinearization; MulLinRS denotes a multiplication followed by relinearization and rescaling. Relinearization, as introduced before, decreases the length of a ciphertext back to 2 after a multiplication. Rescaling is a necessary step after a multiplication, with the goal of keeping the scale constant and also reducing the noise present in the ciphertext. In addition, SqrLinRS refers to a ciphertext square computation followed by relinearization and rescaling. MulLinRSModSwAdd computes a ciphertext multiplication, then relinearizes and rescales it; after this, we switch the ciphertext modulus from $(q_1, q_2, \ldots, q_L)$ down to $(q_1, q_2, \ldots, q_{L-1})$ and scale down the message accordingly, and finally this scaled message is added to another ciphertext. The last benchmarked routine, Rotate, rotates a plaintext vector cyclically. We count the GPU kernel time exclusively for routine-level benchmarks. All evaluated ciphertexts are represented as tuples of vectors in $\mathbb{Z}_{q_L}^N$, where N = 32K and the RNS size is L = 8.

We present the impact of the NTT optimizations on the HE evaluation routines in four steps. We first substitute the naive NTT with our radix-8 NTT with SLM. We then employ assembly-level optimizations to reduce the clock cycles of the int64 add_mod and mul_mod. Finally, we enable implicit dual-tile submission through the OpenCL backend of DPC++ to fully utilize our wide GPU. The baseline is the naive GPU implementation, where none of the presented optimizations is adopted, for comparison purposes. The radix-8 NTT with data communication through SLM improves the routine performance by 43.5% on average. It is worth mentioning that we do not benchmark batched routines, so our wide GPU is not fully utilized and the NTT acceleration is not as dramatic as the results in the previous sections. The inline assembly optimization provides a further 27.4% average improvement compared with the previous step. Meanwhile, the non-NTT computations show less sensitivity to the inline assembly optimization than the NTT, because their computations are typically not as compute-intensive as NTT. We finally submit the kernels through multiple queues to enable full utilization of our multi-tile GPU, further improving the performance by 49.5% to 78.2% over the previous step, up to 3.05X faster than the baseline.
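For reference, the MulLinRS routine benchmarked above corresponds to the following sequence of standard Microsoft SEAL host calls, which our GPU backend aims to accelerate behind the same APIs. This is a sketch assuming a SEAL 3.6+-style host API; the parameter choices are illustrative and not the exact benchmark configuration.

#include <seal/seal.h>
#include <cmath>
#include <vector>
using namespace seal;

int main() {
  // CKKS parameters: N = 32768 with eight RNS primes, mirroring the L = 8 setting above.
  EncryptionParameters parms(scheme_type::ckks);
  const size_t poly_modulus_degree = 32768;
  parms.set_poly_modulus_degree(poly_modulus_degree);
  parms.set_coeff_modulus(CoeffModulus::Create(
      poly_modulus_degree, {60, 40, 40, 40, 40, 40, 40, 60}));
  SEALContext context(parms);

  KeyGenerator keygen(context);
  PublicKey pk;  keygen.create_public_key(pk);
  RelinKeys rk;  keygen.create_relin_keys(rk);
  Encryptor encryptor(context, pk);
  Evaluator evaluator(context);
  CKKSEncoder encoder(context);

  // Encode and encrypt two input vectors.
  const double scale = std::pow(2.0, 40);
  std::vector<double> x(encoder.slot_count(), 1.5), y(encoder.slot_count(), 2.5);
  Plaintext px, py;   encoder.encode(x, scale, px);  encoder.encode(y, scale, py);
  Ciphertext cx, cy;  encryptor.encrypt(px, cx);     encryptor.encrypt(py, cy);

  // MulLinRS: multiply, relinearize back to length 2, then rescale.
  Ciphertext prod;
  evaluator.multiply(cx, cy, prod);
  evaluator.relinearize_inplace(prod, rk);
  evaluator.rescale_to_next_inplace(prod);
  return 0;
}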
We also benchmark our optimizations on another GPU, Device2, a single-tile GPU with fewer EUs than Device1. Similar to Device1, the naive radix-2 NTT starts at a ∼15% efficiency of the peak performance, while the shared local memory SIMD implementations fail to provide significant improvement, reaching only 20.95% - 24.21% efficiency. After adopting the radix-8 shared local memory implementation, we manage to obtain up to 66.8% of the peak performance, at which point we are up to 5.47X faster than the baseline. Since Device2 is a single-tile GPU, our final optimization here is to introduce inline assembly to optimize the int64 multiplication. Further improving the performance by 28.48% on average over the previous step, we reach up to 85.75% of the peak performance, 7.02X faster than the baseline for the 32K-point, 1024-instance NTT.

Figure 18 benchmarks the normalized execution time of the HE evaluation routines on Device2. SIMD (8, 8) denotes the radix-2 NTT with data exchange through SLM and SIMD shuffling, where each work-item holds one slot of NTT elements in registers. opt-NTT refers to the optimal NTT variant, the radix-8 NTT with data exchange through SLM, shown in Figure 17. The last step is to further employ inline assembly to optimize modular addition and modular multiplication at the instruction level. When substituting the naive NTT with the SIMD (8, 8) NTT, we observe that the execution time of the NTT part is improved by 34% on average while the overall routines are accelerated by 29.6%. When switching to our optimal NTT variant, the overall performance becomes faster than the baseline by 1.92X on average. Further enabling the assembly-level optimizations, we reach a 2.32X - 2.41X acceleration for all five HE evaluation routines on this single-tile Intel GPU.

Besides the algorithm-level optimizations, we also demonstrate our instruction-level and application-level optimizations, namely the fused mad_mod operation, inline assembly for int64 multiplication, and the memory cache, using a representative HE application: encrypted element-wise polynomial matrix multiplication. In Figure 19, matMul_mxnxk denotes a matrix multiplication C += A * B, where C is m-by-n, A is m-by-k and B is k-by-n. Each matrix element is an 8K-element plaintext polynomial, so each element-wise multiplication of matMul is a polynomial multiplication. Modulo operations are always applied at the end of each multiplication or addition between polynomial elements. Before starting matMul, we allocate memory, and initialize, encode and encrypt the input sequences. Once matMul is completed, we decrypt the computing results and count the elapsed time of the whole process.

Figure 19 compares our instruction-level and application-level optimizations for matMul on both Device1 and Device2. The fused mad_mod and inline assembly accelerate the 100x100x1 and 10x9x8 polynomial matrix multiplications by 11.8% and 28.2%, respectively, on average on Device1. With the memory cache introduced, both matMul applications are further improved by ∼90%. Altogether, these optimizations accelerate the two matMul benchmarks by 2.68X and 2.79X, respectively, compared with the baseline on Device1. On Device2, we observe a similar trend: the three optimizations together provide 3.11X and 2.82X acceleration for the two matMul tests over the baseline on this smaller GPU.

In this paper, we design and develop the first-ever SYCL-based GPU backend for Microsoft SEAL APIs.
We accelerate our HE library for Intel GPUs with optimizations spanning the assembly level, algorithmic level and application level. Our optimized NTT is faster than the naive GPU implementation by 9.93X, reaching up to 85.1% of the peak performance. In addition, we obtain up to 3.1X acceleration for the HE evaluation routines and the element-wise polynomial matrix multiplication application. Future work will focus on extending our HE library to multi-GPU and heterogeneous platforms.

REFERENCES
- On data banks and privacy homomorphisms
- Public-key cryptosystems based on composite degree residuosity classes
- Somewhat practical fully homomorphic encryption
- (Leveled) fully homomorphic encryption without bootstrapping
- Homomorphic encryption for arithmetic of approximate numbers
- On the security of homomorphic encryption on approximate numbers
- A fully homomorphic encryption scheme. Stanford University
- TFHE: Fast fully homomorphic encryption library over the torus
- Can homomorphic encryption be practical?
- Private predictive analysis on encrypted medical data
- Secure DNA-sequence analysis on encrypted DNA nucleotides
- nGraph-HE: A graph compiler for deep learning on homomorphically encrypted data
- CrypTFlow2: Practical 2-party secure inference
- ML Confidential: Machine learning on encrypted data
- PIR with compressed queries and amortized query processing
- Labeled PSI from homomorphic encryption with reduced computation and communication
- Homomorphic sorting with better scalability
- Accelerating sorting of fully homomorphic encrypted data
- Simple Encrypted Arithmetic Library 2.3.1
- Design and implementation of a homomorphic-encryption library
- PALISADE lattice cryptography library user manual
- Intel HEXL (release 1.2)
- cuHE: A homomorphic encryption accelerator library
- A full RNS variant of approximate homomorphic encryption
- Bootstrapping for approximate homomorphic encryption
- Speeding up the number theoretic transform for faster ideal lattice-based cryptography
- High-performance FV somewhat homomorphic encryption on GPUs: An implementation using CUDA
- Accelerating number theoretic transform in GPU platform for fully homomorphic encryption
- Accelerating number theoretic transformations for bootstrappable homomorphic encryption on GPUs
- Hardware architecture of a number theoretic transform for a bootstrappable RNS-based homomorphic encryption scheme
- HEAX: An architecture for computing on encrypted data
- Faster arithmetic for number-theoretic transforms
- High performance discrete Fourier transforms on graphics processors
- FPGA-based high-performance parallel architecture for homomorphic computing on encrypted data
- HEAAN demystified: Accelerating fully homomorphic encryption through architecture-centric analysis and optimization
- On large-batch training for deep learning: Generalization gap and sharp minima