title: Rache: Radix-additive caching for homomorphic encryption
authors: Zhao, Dongfang
date: 2022-01-12

One of the biggest concerns for many cloud computing applications is data privacy. A potential solution to this problem is homomorphic encryption (HE), which supports certain operations directly over the ciphertexts. Conventional HE schemes, however, exhibit significant performance overhead and are hardly applicable to real-world applications. This paper presents Rache, a caching optimization for accelerating HE schemes. The key insights of Rache include (i) caching some homomorphic ciphertexts before encrypting the large volume of plaintexts; (ii) expanding the plaintexts into a summation of powers of radixes; and (iii) constructing the ciphertexts with only homomorphic addition. The extensive evaluation shows that Rache exhibits almost linear scalability and outperforms Paillier by orders of magnitude.

The first type of HE schemes, e.g., Symmetria [19], are implemented as a symmetric operation for scenarios where a secret key can be securely shared among parties, which is not always possible in cloud computing. The second type of HE schemes, e.g., Paillier [17], are implemented as an asymmetric operation that overcomes the limitation of a symmetric one, and yet introduces significant performance overhead, making it impractical to encode a large volume of sensitive data. Although a hybrid scheme can encrypt the secret key once using asymmetric encryption, this process works only for a single session, and the asymmetric encryption would have to be invoked many times in a production environment. This paper presents a new caching method, namely radix-additive caching for homomorphic encryption (Rache), for accelerating asymmetric homomorphic encryption as represented by Paillier [17].
The key insights of Rache include: (i) precomputing and caching some homomorphic ciphertexts before encrypting the large volume of plaintexts; (ii) expanding a plaintext into a summation of additive radix entries; and (iii) constructing the ciphertexts with only homomorphic addition (without invoking any homomorphic encryption). The third insight is inspired by our conjecture that a homomorphic addition is much cheaper than a homomorphic encryption, which we will justify in §4.3.1.

This paper makes the following technical contributions.
• We present Rache, a new caching method for accelerating homomorphic encryption. (§§3.1, 3.2)
• We analyze the theoretical complexity of Rache and derive the worst-case optimal radix. (§§3.3, 3.4)
• We implement a system prototype of Rache with C and OpenSSL. (§4.1)
• We evaluate Rache with three benchmarks (e.g., TPC-H [22]) and three real-world applications (e.g., Covid-19 [4]) on CloudLab [5]. Experimental results show that Rache exhibits almost linear scalability and outperforms Paillier by orders of magnitude.

The source code of Rache will be released under Apache License 2.0.

Homomorphic encryption (HE) is a specific type of encryption where certain operations between operands can be performed directly on the ciphertexts. For example, if an HE scheme $h(\cdot)$ is additive, then the $+$ operations on the plaintexts can be translated into a homomorphic addition $\oplus$ on the ciphertexts. Formally, if $a$ and $b$ are plaintexts, then the following holds:

$$h(a) \oplus h(b) = h(a + b).$$

As a concrete example, let $h(x) = 2^x$, and we temporarily relax the security requirement of $h(\cdot)$. In this case, $h(a + b) = 2^{a+b} = 2^a \times 2^b = h(a) \times h(b)$, meaning that $\oplus$ is the arithmetic multiplication $\times$.

An HE scheme that supports addition is said to be additive. Popular additive HE schemes include Symmetria [19] and Paillier [17]. The former is applied to symmetric encryption, meaning that a single secret key is used to both encrypt and decrypt the messages.
The latter is applied to asymmetric encryption, where a pair of public and private keys is used for encryption and decryption. Due to the expensive arithmetic operations performed by asymmetric encryption, Paillier is orders of magnitude slower than Symmetria. However, Paillier is particularly useful when there is no secure channel for sharing the secret key among users, which symmetric encryption schemes require. One notable extension of Symmetria is to incrementally encrypt plaintexts with the additive homomorphism property, as illustrated in [24].

An HE scheme that supports multiplication is said to be multiplicative. Symmetria [19] is also multiplicative, using a scheme distinct from the one for addition. Other well-known multiplicative HE schemes include RSA [18] and ElGamal [6]. Similarly, for plaintexts $a$ and $b$, a multiplicative HE scheme guarantees the following equality, where $\otimes$ denotes the homomorphic multiplication over the ciphertexts:

$$h(a) \otimes h(b) = h(a \times b).$$

An HE scheme that supports both addition and multiplication is called a fully HE scheme. This requirement should not be confused with schemes that support addition and multiplication under separate parameters, such as Symmetria [19] and NTRU [9]. That is, the addition and multiplication must be supported homomorphically under exactly the same scheme $h(\cdot)$:

$$h(a) \oplus h(b) = h(a + b), \qquad h(a) \otimes h(b) = h(a \times b).$$

It turned out to be extremely hard to construct fully HE schemes until Gentry [7] demonstrated such a scheme using lattice theory. The main issue with fully HE schemes is their performance; although extensive research and development have been carried out, current implementations incur impractical overhead for most real-world applications. One notable attempt to boost the performance of fully HE schemes is to distribute the computation [23]. Two popular open-source libraries of fully HE schemes are IBM HElib [8] and Microsoft SEAL [20]. The Rache scheme presented in this paper needs only the additive property and does not require a fully HE scheme.
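To make the additive property concrete, the following Python sketch implements a textbook Paillier scheme and checks that multiplying two ciphertexts modulo $n^2$ decrypts to the sum of the plaintexts. It is a minimal illustration under stated assumptions, not the C/OpenSSL implementation evaluated in this paper: the primes 293 and 433 are tiny and insecure, the generator is fixed to $g = n + 1$ (a common simplification), and the function names `keygen`, `encrypt`, `add`, and `decrypt` are hypothetical.

```python
import math
import random

def keygen(p=293, q=433):
    """Toy Paillier keys from two small primes (illustrative, insecure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; valid because g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """h(m) = (1 + n)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2

def add(n, c1, c2):
    """Homomorphic addition: multiply the ciphertexts modulo n^2."""
    return (c1 * c2) % (n * n)

def decrypt(n, priv, c):
    """m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

if __name__ == "__main__":
    n, priv = keygen()
    ca, cb = encrypt(n, 1234), encrypt(n, 4321)
    print(decrypt(n, priv, add(n, ca, cb)))  # sum of the two plaintexts
```

Note that `add` costs one modular multiplication, whereas `encrypt` costs two modular exponentiations; this asymmetry is what the caching approach in the next section exploits.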
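In a positional numeral system, a number $n$ can be written as a summation of terms, each of which is a product of two factors: one is an integral power of the radix $r$ and the other is a coefficient ranging from 0 to $r - 1$. Formally,

$$n = \sum_{i=0}^{k} d_i \cdot r^i, \qquad (1)$$

where $d_i$ ($0 \le d_i < r$) indicates the coefficient of a specific radix entry. Eq. 1 can be further expanded into an expression with only additions:

$$n = \sum_{i=0}^{k} \sum_{j=1}^{d_i} r^i, \qquad (2)$$

where $d_i < r$. This purely additive form in Eq. 2 allows us to apply additive homomorphic encryption to the radix entries rather than to the original number $n$, as we will start discussing in the next section.

Alg. 1 formalizes the encoding procedure with C-like pseudocode. Lines 1-4 initialize the cached entries of the integral powers of the radix for future construction of ciphertexts. Lines 5-10 encode the plaintexts, each of which is computed directly over the cached ciphertexts that are initialized at the beginning of the algorithm. We will discuss the algorithm's correctness, complexity, and choice of $r$ in the remainder of this section.

Algorithm 1: Rache encoding
Input: An array of plaintexts $P[]$ of length $N$; an additive homomorphic encryption scheme $h(\cdot)$; a radix $r$
Output: An array of ciphertexts $E[]$ of length $N$
 1  $M \leftarrow \max(P[])$
 2  for $i \leftarrow 0$ to $\lfloor \log_r M \rfloor$ do
 3      $C[i] \leftarrow h(r^i)$
 4  end
 5  for $t \leftarrow 0$ to $N - 1$ do
 6      for $i \leftarrow 0$ to $\lfloor \log_r M \rfloor$ do
 7          $d_i \leftarrow \lfloor P[t] / r^i \rfloor \bmod r$
 8      end
 9      $E[t] \leftarrow \bigoplus_{i} \bigoplus_{j=1}^{d_i} C[i]$
10  end

We denote by $\bigoplus$ the homomorphic summation over the ciphertexts. The correctness of Alg. 1 can be verified by direct computation as follows.

$$
\begin{aligned}
E[t] &= \bigoplus_{i} \bigoplus_{j=1}^{d_i} C[i] \\
&= \bigoplus_{i} \bigoplus_{j=1}^{d_i} h(r^i) \\
&= \bigoplus_{i} h\Big(\sum_{j=1}^{d_i} r^i\Big) \\
&= \bigoplus_{i} h(d_i \cdot r^i) \\
&= \bigoplus_{i} h\big((\lfloor P[t] / r^i \rfloor \bmod r) \cdot r^i\big) \\
&= h\Big(\sum_{i} (\lfloor P[t] / r^i \rfloor \bmod r) \cdot r^i\Big) \\
&= h(P[t])
\end{aligned}
$$

The first equality is due to Line 9 of Alg. 1. The second equality is due to Line 3 of Alg. 1. The third equality is due to the definition of homomorphic encryption. The fourth equality is due to the fact that variable $j$ does not show up in the term $r^i$. The fifth equality is due to Line 7 of Alg. 1. The sixth equality is, again, due to the definition of homomorphic encryption. The last equality is due to the definition of radix expansion.

We denote by $T_e$ the time cost of homomorphically encrypting a number and by $T_h$ the time cost of homomorphically adding two ciphertexts. We will see in §4.3.1 that in practice $T_e$ is much larger (i.e., by more than two orders of magnitude) than $T_h$.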
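The encoding procedure described above can be sketched in Python as follows. This is a hedged illustration rather than the paper's C implementation: it uses the insecure toy scheme $h(x) = 2^x$ from the background example (so that homomorphic addition is integer multiplication), and the names `build_cache`, `rache_encode`, and `hadd` are hypothetical. Caching one ciphertext of zero as the starting accumulator is an assumption of this sketch.

```python
def h(x):
    """Toy additive 'encryption' from the text: h(x) = 2^x (no security)."""
    return 1 << x

def hadd(c1, c2):
    """Homomorphic addition for the toy scheme: h(a) * h(b) = h(a + b)."""
    return c1 * c2

def build_cache(max_plain, r=2):
    """Encrypt every radix entry r^i <= max_plain once, plus h(0)."""
    entries, power = [], 1
    while power <= max_plain:
        entries.append(h(power))
        power *= r
    return {"zero": h(0), "entries": entries}

def rache_encode(m, cache, r=2):
    """Assemble a ciphertext of m (0 <= m <= max_plain) with additions only."""
    c = cache["zero"]
    i = 0
    while m > 0:
        digit = m % r                    # coefficient of the radix entry r^i
        for _ in range(digit):           # add the cached h(r^i) 'digit' times
            c = hadd(c, cache["entries"][i])
        m //= r
        i += 1
    return c

if __name__ == "__main__":
    cache = build_cache(1000)            # encryption happens only here
    print(rache_encode(37, cache) == h(37))
```

Because the toy scheme is deterministic, the result can be compared with `h(m)` directly. With a randomized scheme such as Paillier, the check would instead compare the decryption of the assembled ciphertext with `m`.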
Line 1 takes $O(1)$ if we assume the system caches the maximal plaintext $M$ when reading $P[]$ into memory. Lines 2-4 take $O(T_e \log_r M)$, where $T_e$ is the cost of one homomorphic encryption and $r$ is the radix. Lines 6-8 take $O(\log_r M)$. Line 9 takes $O(r \cdot T_h \log_r M)$, where $T_h$ is the cost of one homomorphic addition. Therefore, for $N$ plaintexts, Lines 5-10 take $O(N \cdot (\log_r M + r \cdot T_h \log_r M)) = O(N \cdot r \cdot T_h \log_r M)$. The overall time cost of Alg. 1 is thus $O(T_e \log_r M + N \cdot r \cdot T_h \log_r M)$.

In practice, $r$ is a small number (in fact, the following section will show that $r = 2$ is an optimal radix in the worst case). The cost of homomorphic addition $T_h$ is on par with that of a regular arithmetic operation and can be considered as 1. The factor $\log_r M$ is a small number as well; for example, to encrypt a number as large as 1,000,000,000, 30 radix entries suffice with radix 2. Consequently, a more practical upper bound of Alg. 1 is $O(c \cdot T_e + c^2 \cdot N)$, where $c$ denotes a constant. That is, the cost of Alg. 1 is a constant multiple of one homomorphic encryption plus a constant multiple of the overall number of ciphertexts. Recall that the time cost of Paillier is simply $O(N \cdot T_e)$, which is the product of the homomorphic encryption cost and the overall number of ciphertexts. We thus expect Alg. 1 to outperform Paillier by orders of magnitude. Before we demonstrate the performance superiority of Alg. 1, i.e., Rache, we conclude this section with a discussion on the optimal choice of radix in the worst case.

Let $M \ge 2$ denote the maximal number to be encrypted in the application. Let $r \ge 2$ denote the radix or base of the homomorphic encryption. Given an arbitrary number $n$, where $0 \le n \le M$, there are $k + 1$ radix entries: $r^0, r^1, \ldots, r^k$, where $k = \lfloor \log_r M \rfloor$. Let $0 \le i \le k$. In the worst case, each radix entry incurs $r - 2$ homomorphic additions, i.e., when computing $(r - 1) \cdot r^i$. Since one more homomorphic addition needs to be taken for the summation of each radix entry, the overall number of homomorphic additions, in the worst case when $M$ is one less than the next integral power of $r$ (i.e., $\lfloor \log_r M \rfloor = \log_r (M + 1) - 1$), is

$$f(r) = (r - 1) \cdot \log_r (M + 1) = \frac{(r - 1) \ln (M + 1)}{\ln r}.$$

We will find the optimal $r$ that minimizes $f(r)$. We take the first-order derivative of $f(r)$ as follows.

$$f'(r) = \ln (M + 1) \cdot \frac{\ln r - \frac{r - 1}{r}}{(\ln r)^2} = \ln (M + 1) \cdot (\ln r)^{-2} \cdot r^{-1} \cdot (r \ln r - r + 1).$$
The stationary point is therefore the solution to $f'(r) = 0$, which yields $r = 1$. Since we require $r \ge 2$, we need to find another qualified radix. First, we calculate the term $r \ln r - r + 1$ of $f'(r)$ at $r = 2$:

$$2 \ln 2 - 2 + 1 = 2 \ln 2 - 1 \approx 0.386 > 0.$$

Then, let $r \ge 3$, therefore $\ln r > 1$, which yields:

$$r \ln r - r + 1 > r - r + 1 = 1 > 0.$$

Note that by definition, the following equation holds:

$$f'(r) = \ln (M + 1) \cdot (\ln r)^{-2} \cdot r^{-1} \cdot (r \ln r - r + 1).$$

If we assume $M \ge 2$, then $\ln (M + 1) > 0$. Both the $(\ln r)^{-2}$ and $r^{-1}$ factors are obviously positive. Therefore, $f'(r)$ is always positive for $r \ge 2$, meaning that $f(r)$ is a monotonically increasing function. It follows that the minimal qualified radix $r = 2$ leads to the minimum number of homomorphic additions.

We implement Rache with C and three key libraries: Open-MPI [15] (C binding), OpenSSL [16], and homomorphic-c [10]. Specifically, the arbitrarily large numbers are managed with the BIGNUM structure. The baseline homomorphic encryption scheme is Paillier [17], which is also implemented with C and OpenSSL. Because memory leaks are easy to introduce in C, our implementation ensures that all allocated memory is released (see §4.6 for quantitative evaluation). At the time of writing this paper, the implementation consists of 11,839 lines of code.

All experiments were carried out on the CloudLab testbed [5]. We use the c6420 instance, which is equipped with Intel Xeon Gold 6142 CPUs at 2.6 GHz, 384 GB ECC DDR4-2666 memory, and two Seagate 1 TB 7200 RPM 6G SATA HDDs. Each node has 32 physical cores and supports 64 hyperthreads. The operating system image is Ubuntu 20.04.3 LTS. The system is installed with the following notable libraries: gcc 9.3.0, Open-MPI 4.0.3, and OpenSSL 1.1.1. The baseline is the Paillier implementation described above. Most experiments adopt a strong-scaling mechanism, meaning that the given workload is split across a varying number of cores, ranging from 1 to 32.

We have evaluated Rache with three benchmarks and three real-world applications.
• The first benchmark is a microbenchmark to quantify the cost of homomorphic encryption and homomorphic addition, respectively. For the former, a sequence of integers [0, 32,768) are homomorphically encrypted; for the latter, the ciphertexts stored at radix entries are homomorphically summed up in a round-robin fashion 32,768 times.
• The second benchmark is TPC-H ver. 3.0.0 [22], a standard relational database benchmark. TPC-H allows the user to specify the scale of the generated data; in this paper we set the scale to one, resulting in about one gigabyte of data. We will focus on the part table, which consists of 200,000 tuples.
• The third benchmark is a dynamic set of random numbers used in INCHE [24] for homomorphic encryption. This benchmark is mainly used for weak scaling, allowing for a scalability test ranging between 1,024 and 32,768 numbers.

We repeat every performance experiment multiple times and report the averages and standard errors. The radix is set to two in all experiments.

The whole idea of Rache is built upon the assumption that homomorphic addition is a much cheaper operation than homomorphic encryption. Our first experiment, therefore, tries to confirm this assumption. The microbenchmark carries out $n = 32{,}768$ operations for homomorphic encryption and homomorphic addition, respectively. Specifically, for encryption, the operation is $h(i)$, $0 \le i < n$; for addition, the operation is $h(\lfloor \log i \rfloor) \oplus h((\lfloor \log i \rfloor + 1) \,\%\, (\lfloor \log n \rfloor + 1))$, where $\oplus$ denotes the homomorphic addition and $\%$ denotes the modulo operation.

Fig. 1 shows that homomorphic addition is a much cheaper operation than homomorphic encryption. Regardless of the number of available cores, homomorphic encryption takes more than two orders of magnitude longer than homomorphic addition. Indeed, one Rache encryption typically involves multiple homomorphic additions (plus the constant initialization cost).
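As a rough illustration of how many homomorphic additions one Rache encoding can involve, the sketch below counts them for a given plaintext and radix. It assumes the counting used in the worst-case analysis of §3.4, where a radix entry with coefficient $d$ contributes $d$ additions, so the total equals the digit sum of the plaintext in that radix. The function names `digits` and `additions_needed` are hypothetical and not part of the Rache code base.

```python
def digits(m, r):
    """Coefficients of m in radix r, least significant first."""
    out = []
    while m > 0:
        out.append(m % r)
        m //= r
    return out or [0]

def additions_needed(m, r):
    """Homomorphic additions to assemble m from cached entries h(r^i).

    Assumption: a coefficient d costs d additions, as in the worst-case
    analysis, so the total is the digit sum of m in radix r.
    """
    return sum(digits(m, r))

if __name__ == "__main__":
    worst = 2**30 - 1  # one less than a power of 2, 4, and 8
    for r in (2, 4, 8):
        print(r, additions_needed(worst, r))  # 30, 45, and 70 additions
```

For this worst-case plaintext, radix 2 needs 30 additions, while radixes 4 and 8 need 45 and 70, which matches the analytical result that the smallest qualified radix minimizes the worst-case number of additions.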
The question then becomes whether the multiple homomorphic additions incurred by Rache still cost less than the original homomorphic encryption. The answer is yes, as demonstrated by the following experiments.

Fig. 2 reports the TPC-H results. We report the execution time of initializing the radixes and that of encoding with the radix cache, respectively. The former is referred to as Rache Init and the latter as Rache Exec in the figure (and also in other experiments to be discussed). Both Rache and Paillier exhibit good (strong) scalability due to the data parallelism from the message passing interface (MPI). The initialization time of Rache is roughly flat, showing a marginal increase when more cores are involved due to the inter-process communication (IPC) overhead. We observe that Rache consistently outperforms Paillier by more than four orders of magnitude at all scales. The huge performance gap (even larger than in most of the other experiments to be discussed) is partially due to the dataset itself: the part table in TPC-H has relatively small numbers (max 21), such that many of the new numbers to be encoded by Rache can be quickly (homomorphically) constructed from the cached ciphertexts. We will see how the plaintext affects Rache's performance in the following sections.

We start with a more general setup: encoding a set of random numbers instead of small numbers. For random numbers, we take the same approach used in [24]. Essentially, random numbers are generated from a uniform distribution via a modulo operation. We report the results in Fig. 3. The Rache overhead stays roughly constant for different numbers of cores, but is not as low as for TPC-H. This is because the largest number in this INCHE dataset is 1,024, which requires more radixes to be initialized. Despite the overhead, we observe that Rache's encoding time is about two orders of magnitude lower than Paillier's at all scales. It should be noted, however, that the Rache Init overhead is a one-time cost.
With a larger number of plaintexts, the overhead does not change as long as the maximal number is unchanged. We will see this in the following weak-scaling experiment, i.e., increasing the workloads.

We evaluate the scalability of Rache in this section. For generality, we focus on the INCHE dataset of random numbers rather than, say, specific benchmarks or applications, which will be the focus of subsequent sections. Fig. 4 reports the conventional weak-scaling experiment. We control the workload to be proportional to the number of cores: 1,024 plaintexts for every core. That is, the workloads range from 1,024 to 32,768 plaintexts of uniformly distributed numbers. In each workload, the maximal value is roughly the number of plaintexts minus 1 or 2 due to the uniform distribution. This explains why the Rache overhead (i.e., Rache Init) increases proportionally to the number of cores or workloads. Rache outperforms Paillier by orders of magnitude at all scales. However, Rache seems to exhibit a higher slope of encoding time. We stress that the absolute values of Rache performance are sub-second (and the $y$-axis is logarithmic); therefore, the slope can be best explained by the IPC overhead. To confirm our conjecture, we conduct the following experiment, in which we fix the number of cores but increase the workloads.

Fig. 5 shows the encoding time when we fix the number of cores at 32 but increase the number of plaintexts from 1,024 to 32,768. We observe that when the IPC overhead is fixed (for 32 cores), the encoding time increases proportionally to the workload size. Notably, the Rache initialization overhead is much less noticeable than in the previous weak-scaling experiment because the IPC overhead is gone and the only thing remaining is the larger number of radixes when working on more plaintexts.
The Covid-19 data set exhibits a large variety of numbers, from tens (e.g., number of affected states) to hundreds of millions (e.g., the total number of test results). This partially impacts the balance between the initialization (i.e., the overhead) and the encoding procedures of Rache: we observe that with few cores (e.g., 1 and 2) the overhead is smaller than the encoding cost, while with more cores (e.g., 16 and 32) the per-core encoding is very efficient and takes less time than the overhead. Some of the overhead, i.e., precomputing and caching the large radixes, is unnecessary for the small values, and yet has to exist due to the extremely large values. We stress that the overhead is a one-time cost, though: if there were, say, ten years of Covid-19 data, the overhead would look roughly the same and would be outweighed by the increased cost of encoding the data (cf. Fig. 5). As we have seen in the benchmarks, Rache outperforms Paillier by almost two orders of magnitude.

Fig. 7 reports the encoding performance of Rache and Paillier on a database of the human genome [12] (hg38) that was last updated in March 2020, under the umbrella of the Augustus gene prediction project [2]. As expected, Rache outperforms Paillier at all scales by orders of magnitude. In stark contrast to the Covid-19 dataset, the initialization overhead of Rache on hg38 is much less significant: even on 32 cores, the overhead is less than 30%. This is mainly due to the large number of plaintexts (172,120), whose encoding time greatly outweighs the initialization, which itself is not trivial either: 29 radixes in total for the largest possible value of 248,937,123.

We apply Rache and Paillier to the historical trade volume of the Bitcoin exchange since 2013 [3]. Fig. 8 shows that Rache outperforms Paillier by more than one order of magnitude, which is consistent with what we have found so far.
The notable thing here is the large overhead incurred by Rache: on a single core, the overhead is on par with Rache's encoding time; on 32 cores, the overhead is on par with the Paillier processing time and orders of magnitude larger than Rache's encoding time. This phenomenon is due to two reasons. First, the Bitcoin trade volume consists of very large numbers: most are on the order of millions, and the largest one is 4,956,849,516, requiring 34 radixes. Second, the number of plaintexts is relatively small: there are 1,086 plaintexts in total, each of which records the Bitcoin exchange for the last three days.

Fig. 9 reports the memory footprint of Rache and Paillier when encoding the U.S. Covid-19 statistics. We only plot the results for the encoding with 32 cores because other scales incur almost the same memory consumption. It should be noted that the $x$-axis is the normalized timeline: we probe the memory consumption at every tenth point of the encoding process. Rache's memory consumption is constant during the encoding process because the main allocation of memory is carried out at the beginning of the process, i.e., at the initialization of the radixes, and the subsequent computations occur directly over the ciphertexts without requiring new memory. In contrast, Paillier's memory footprint is somewhat sensitive to the plaintexts because they are encrypted on the fly and some arbitrarily large plaintexts can cause abrupt allocation of new memory space. At the end of the encoding procedure, we observe that Rache consumes about 11% less memory than Paillier, i.e., 11.04 MB vs. 12.36 MB.

This paper presents Rache, a caching optimization for accelerating the performance of homomorphic encryption. The key insights of Rache are caching some homomorphic ciphertexts before encrypting the large volume of plaintexts and constructing the ciphertexts with only homomorphic addition.
The extensive evaluation shows that Rache exhibits almost linear scalability and outperforms Paillier by orders of magnitude. Our future work will focus on integrating Rache into a blockchain framework called BAASH [1] such that sensitive scientific results can be shared and reproduced in a verifiable manner. Another follow-up along this line of research is to distribute Rache encoding into a finer granularity, e.g., topology-aware parallelism [21].

BAASH: Lightweight, efficient, and reliable blockchain-as-a-service for HPC systems
The design and operation of CloudLab
A public key cryptosystem and a signature scheme based on discrete logarithms
Fully homomorphic encryption using ideal lattices
NTRU: A ring-based public key cryptosystem
Banking in the cloud: Part 3 - contractual issues
Privacy preservation in e-health cloud: taxonomy, privacy requirements, feasibility analysis, and opportunities
Advanced encryption standard
Public-key cryptosystems based on composite degree residuosity classes
A method for obtaining digital signatures and public-key cryptosystems
Efficient confidentiality-preserving data analytics over symmetrically encrypted datasets
Topological modeling and parallelization of multidimensional data on microelectrode arrays
Data confidentiality challenges in big data applications
INCHE: high-performance encoding for relational databases through incrementally homomorphic encryption
Privacy-preserving search for a similar genomic makeup in the cloud