Submitted 2 June 2020
Accepted 30 January 2021
Published 8 March 2021

Corresponding author
Yewon Kim, fdt150@kookmin.ac.kr

Academic editor
Gang Mei

Additional Information and
Declarations can be found on
page 26

DOI 10.7717/peerj-cs.404

Copyright
2021 Kim and Yeom

Distributed under
Creative Commons CC-BY 4.0

OPEN ACCESS

Accelerated implementation for testing
IID assumption of NIST SP 800-90B
using GPU
Yewon Kim1 and Yongjin Yeom1,2

1 Department of Financial Information Security, Kookmin University, Seoul, South Korea
2 Department of Information Security, Cryptology, and Mathematics, Kookmin University, Seoul, South Korea

ABSTRACT
In cryptosystems and cryptographic modules, insufficient entropy of the noise sources
that serve as the input into random number generator (RNG) may cause serious
damage, such as compromising private keys. Therefore, it is necessary to estimate the
entropy of the noise source as precisely as possible. The National Institute of Standards
and Technology (NIST) published a standard document known as Special Publication
(SP) 800-90B, which describes the method for estimating the entropy of the noise
source that is the input into an RNG. The NIST offers two programs for running the
entropy estimation process of SP 800-90B, which are written in Python and C++. The
running time for estimating the entropy is more than one hour for each noise source.
An RNG tends to use several noise sources in each operating system supported, and the
noise sources are affected by the environment. Therefore, the NIST program should be
run several times to analyze the security of RNG. The NIST estimation runtimes are
a burden for developers as well as evaluators working for the Cryptographic Module
Validation Program. In this study, we propose a GPU-based parallel implementation
of the most time-consuming part of the entropy estimation, namely the independent
and identically distributed (IID) assumption testing process. To achieve maximal GPU
performance, we propose a scalable method that adjusts the optimal size of the global
memory allocations depending on GPU capability and balances the workload between
streaming multiprocessors. Our GPU-based implementation excluded one statistical
test, which is not suitable for GPU implementation. We propose a hybrid CPU/GPU
implementation that consists of our GPU-based program and the excluded statistical
test that runs using OpenMP. The experimental results demonstrate that our method
is about 3 to 25 times faster than that of the NIST package.

Subjects Cryptography, Distributed and Parallel Computing, Security and Privacy
Keywords Parallel processing, GPU computing, Entropy estimator, NIST SP 800-90B, Random
Number Generator

INTRODUCTION
A random number generator (RNG) generates random numbers required to construct the
cryptographic keys, nonce, salt, and sensitive security parameters used in cryptosystems
and cryptographic modules. In general, an RNG produces random numbers (output) via
a deterministic algorithm, depending on the noise sources (input). If its input is affected
by the low entropy of the noise sources, the output may be compromised. It is easy to

How to cite this article Kim Y, Yeom Y. 2021. Accelerated implementation for testing IID assumption of NIST SP 800-90B using GPU.
PeerJ Comput. Sci. 7:e404 http://doi.org/10.7717/peerj-cs.404

https://peerj.com/computer-science
mailto:fdt150@kookmin.ac.kr
https://peerj.com/academic-boards/editors/
https://peerj.com/academic-boards/editors/
http://dx.doi.org/10.7717/peerj-cs.404
http://creativecommons.org/licenses/by/4.0/
http://creativecommons.org/licenses/by/4.0/
http://doi.org/10.7717/peerj-cs.404


find examples that show the importance of entropy in operating systems. Heninger et al.
(2012) describes the RSA/DSA private keys for some TLS/SSH hosts may be obtained
due to insufficient entropy of Linux pseudo-random number generator (PRNG) during
the key generation process. Ding et al. (2014) investigated the amount of the entropy of
Linux PRNG running on Android in boot-time. Kaplan et al. (2014) demonstrated an
IPv6 denial of service attack and a stack canary bypass with the weaknesses of insufficient
entropy in boot-time of Android. Kim, Han & Lee (2013) presented a technique to recover
PreMasterSecret (PMS) of the first SSL session in Android by 258 complexity since PMS
is generated from insufficient entropy of OpenSSL PRNG at boot-time. Ristenpart & Yilek
(2010), Bernstein et al. (2013), Michaelis, Meyer & Schwenk (2013), Schneier et al. (2015),
and Yoo, Kang & Yeom (2017) describe the attacks caused by weakness of entropy collectors
or incorrect estimations of the entropy that are exaggerated or too conservative.

Insufficient entropy of the noise source that is the input into the RNG may cause serious
damage in cryptosystems and cryptographic modules. Thus, it is necessary to estimate the
entropy of the noise source as precisely as possible. The United States National Institute
of Standards and Technology (NIST) Special Publication (SP) 800-90B (Barker & Kelsey,
2012; Sönmez Turan et al., 2016; Sönmez Turan et al., 2018) is a standard document for
estimating the entropy of the noise source. The general flow of the entropy estimation
process in SP 800-90B (Sönmez Turan et al., 2018) is to determine the track, estimate the
entropy according to the track, and then apply the restart test, as summarized in Fig. 1. In
this paper, determining the track is referred to as an independent and identically distributed
(IID) test. There are two different tracks: an IID track and a non-IID track. If it is determined
as the IID track, it is assumed that the samples of the noise source are IID; otherwise, the
samples are non-IID. The estimator depending on IID or non-IID track estimates the
entropy of the noise source. The restart test evaluates the estimated entropy using different
outputs from many restarts of the noise source to check the overestimate. This document
is currently used in the Cryptographic Module Validation Program (CMVP) and has
been cited as a recommendation for entropy estimation in an ISO standard document
ISO/IEC-20543 (2019) for test and analysis methods of RNGs. The principles of entropy
estimators in SP 800-90B have been investigated and analyzed theoretically (Kang, Park &
Yeom, 2017; Zhu et al., 2017; Zhu et al., 2019). However, it is difficult to find research on
the efficient implementation of the entropy estimation process of SP 800-90B.

NIST provides two programs (NIST, 2015) on GitHub for the entropy estimation process
of SP 800-90B. The first program is for the entropy estimation process of the second draft of
SP 800-90B (Sönmez Turan et al., 2016), written in Python. The second program is for the
entropy estimation process of the final version of SP 800-90B (Sönmez Turan et al., 2018),
written in C++. Table 1 displays the execution times of two single-threaded NIST programs
on the central processing unit (CPU). The noise source used as input is GetTickCount,
with a sample size of 8 bits. GetTickCount can be collected through the GetTickCount()
function in the Windows environment. Since GetTickCount is determined as the non-IID
by the IID test, the process of the IID-track estimation entropy does not run. The entropy
estimation process of the IID track takes approximately one second for both NIST programs

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 2/29

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.404


Figure 1 Flow of the entropy estimation process of SP 800-90B.
Full-size DOI: 10.7717/peerjcs.404/fig-1

Table 1 Execution time of each single-threaded NIST program for the entropy estimation process
(noise source: GetTickCount; noise sample size: 8 bits).

NIST program
written in Python

NIST program
written in C++

IID test 17 h 1 h 10 min
[IID track] Estimation entropy − −
[Non-IID track] Estimation entropy 15 min 20 s
Restart tests 2 s 2 min
Total execution time 17 h 16 min 1 h 13 min

if it is forcibly operated. In Table 1, the IID test consumes the majority of the total execution
time in both programs.

Developers of cryptosystems or cryptographic modules should estimate the entropy of
the noise sources to analyze the security of the RNG. Since the entropy estimation process
of SP 800-90B is representative, and modules for the CMVP shall be tested for compliance
with SP 800-90B (NIST & CSE, 2020), most developers use the method of SP 800-90B.
Furthermore, since CMVP Implementation Guidance (IG) gives the link of the NIST
programs (NIST & CSE, 2020), most developers use the NIST programs to reduce the time
required for implementation. As recommended by the CMVP, the RNG should use at least
one noise source. Since the NIST program estimates the entropy for one noise source,

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 3/29

https://peerj.com
https://doi.org/10.7717/peerjcs.404/fig-1
http://dx.doi.org/10.7717/peerj-cs.404


the developer should run the NIST program k times when the RNG uses k noise sources.
Since the noise sources are different for each operating system, the developer should
run the program k×s times if the developer’s cryptosystem or cryptographic module
supports s operating systems. The distribution of the noise source may be changed due
to mechanical or environmental changes or to the timing variations in human behavior
(NIST & CSE, 2020). The physical noise source is based on a dedicated physical process
(ISO/IEC-20543, 2019); it may be affected by the environment of the device in which
the RNG operates. Therefore, to claim that the noise source has an identical distribution
in any environment, the developer should perform the IID test and entropy estimation
in several environments or devices. If the developer performs analysis on d devices, the
developer should run the program k×s×d times. If k=10, s=2, and d =5, the developer
should run the NIST program 100 times. According to Table 1, the NIST program written
in C++ requires approximately 1 h to estimate the entropy of one noise source. If the
developer cannot run multiple NIST programs simultaneously, it takes about 100 hours or
approximately four days. Moreover, to find k noise sources that can be used as inputs of the
RNG in the environment, the developer should perform entropy estimation for k or more
collectible noise sources. Therefore, it may take more than 100 hours. The developer of the
cryptographic module for the CMVP should perform similar work for re-examination or
new examination every specific period since the module will be placed on the CMVP active
list for five years. The evaluator running checks based on the documentation submitted
by the developer for the CMVP may run the NIST program multiple times as well. As this
runtime may be burdensome for developers, it can be tempting to use an RNG without
security analysis. Thus, if the developer’s RNG is vulnerable, this vulnerability is likely to
affect the overall security of the cryptosystem or cryptographic module.

Graphics processing units (GPUs) are excellent candidates to accelerate the process of
SP 800-90B, especially the IID test. GPUs were initially designed for accelerating computer
graphics and image processing, but they have become more flexible, allowing them
to be used for general computations in recent years. The use of GPUs for performing
computations handled by CPUs is known as general-purpose computing on GPUs
(GPGPUs). New parallel computing platforms and programming models, such as the
computing unified device architecture (CUDA) released by NVIDIA, enable software
developers to leverage GPGPUs for various applications. GPGPUs are used in cryptography
as well as areas including signal processing and artificial intelligence. Numerous studies
have been conducted on the parallel implementations of cryptographic algorithms such as
AES, ECC, and RSA (Neves & Araujo, 2011; Li et al., 2012; Pan et al., 2016; Ma et al., 2017;
Li et al., 2019) and on the acceleration of cryptanalysis, including hash collision attacks
using GPUs (Stevens et al., 2017).

To process the entire IID test in parallel using GPU, approximately 9 GB or more of the
global memory of the GPU are required. Since the compression test used in the IID test
requires a different technique of implementation from the other statistical tests, a CUDA
version of the compression test is needed to implement the IID test in parallel. However,
bzip2 used in the compression test is not actively under development as a CUDA version
since it is unsuitable for GPU implementation. Therefore, we propose a GPU-based parallel

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 4/29

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.404


implementation of the IID test without the compression test using multiple optimization
techniques. The adaptive size of the global memory used in the kernel function can be set
so that maximal performance improvement can be obtained from the GPU specification in
use. Moreover, we propose a hybrid CPU/GPU implementation of the IID test that includes
the compression test. Our GPU-based implementation is approximately 12 times faster
than the multi-threaded NIST program without the compression test when determining
the noise source as the IID. It is approximately 25 times faster when determining the
noise source as the non-IID. Our hybrid CPU/GPU implementation is 3 and 25 times,
respectively, faster than the multi-threaded NIST program with the compression test when
determining the noise source as the IID and the non-IID, respectively. Most noise sources
are non-IID (Kelsey, 2012). The non-IID noise sources are disk timings, interrupt timings,
jitter (Müller, 2020), GetTickCount, and so on. Since the proposed hybrid CPU/GPU
implementation has better performance for the non-IID noise sources, we expect it to be
highly practical.

The remainder of this paper is organized as follows. ‘Preliminaries’ introduces the CUDA
GPU programming model, the OpenMP programming model, and the IID test of SP 800-
90B. ‘Proposed Implementations’ outlines our GPU-based parallel implementation of the
IID test and the hybrid CPU/GPU implementation of the IID test. In ‘Experiments and
performance evaluation’, the experimental results on the optimization and performance of
our methods are presented and analyzed. Finally, ‘Conclusions’ summarizes and concludes
the paper.

PRELIMINARIES
CUDA programming model
NVIDIA CUDA (NVIDIA, 2020b) is the most widely used programming model for GPUs.
CUDA uses the single instruction multiple thread (SIMT) model. A kernel is a function
that performs the same instruction on the GPU in parallel. A thread is the smallest unit
operating the instructions of the kernel function. Multiple threads are grouped into a
CUDA block, and multiple blocks are grouped into a grid.

A CUDA-capable GPU contains numerous CUDA cores, which are fundamental
computing units and execute the threads. CUDA cores are collected into groups called
streaming multiprocessors (SMs).

A kernel is launched from the host (CPU) to run on GPU and generate a collection of
threads organized into blocks. Each CUDA block is assigned to one of the SMs on the GPU
and executes independently on GPU. The mapping between blocks and SMs is done by
a CUDA scheduler (Vaidya, 2018). An SM can concurrently execute the smaller group of
threads, which is called a warp. All threads in a warp execute the same instruction, and
there are 32 threads in a warp on most CUDA-capable GPUs. Latency can occur, such as
data required for computation have not yet been fetched from global memory that the
access is slow. To hide the latency, an SM can execute context-switching, which transfers
control to another warp while waiting for the results.

The memory of CUDA-capable GPU includes global memory, local memory, shared
memory, register, constant memory, and texture memory. Table 2 shows the memory

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 5/29

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.404


Table 2 Memory of CUDA-capable GPU (NVIDIA, 2020a).

Memory Location
on/off chip

Access Scope Lifetime

Register On R /W 1 thread Thread
Local Off R /W 1 thread Thread
Shared On R /W All threads in block Block
Global Off R /W All threads +host Host allocation
Constant Off R All threads +host Host allocation
Texture Off R All threads +host Host allocation

types listed from top to bottom by access speed from fast to slow, and their principal
characteristics.

A basic frame of the program using the CUDA programming model is as follows: allocate
memory in the device (GPU) and transfer data from the host to the device (if necessary);
launch the kernel; transfer data from the device to the host (if required).

OpenMP programming model
Open Multi-Processing (OpenMP) (OpenMP, 2018) is an application programming
interface (API) for parallel programming on the shared memory multiprocessors. It
extends C, C++, and FORTRAN on many platforms, instruction-set architectures, and
operating systems, including Linux and Windows with a set of compiler directives, library
routines, and environment variables. OpenMP facilitates the parallelization of the sequential
program. The programmer adds parallelization directives to loops or statements in the
program.

OpenMP uses the fork-join parallelism (OpenMP, 2018). OpenMP program begins as
a single thread of execution, called an initial thread. When the initial thread encounters a
parallel construct, the thread spawns a team of itself and zero or more additional threads
as needed and becomes the master of the new team. The statements and functions in the
parallel region are executed in parallel by each thread in the team. All threads replicate
the execution of the same code unless a work-sharing directive (such as for dividing the
computation among threads) is specified within the parallel region. Variables default to
shared among all threads in parallel region.

Terms
A sample is data obtained from one output of the (digitized) noise source and the sample
size is the size of the (noise) sample in bits. For example, we collect a sample of the noise
source GetTickCount in Windows by calling the GetTickCount() function once. In this
case, the sample size is 32 bits. However, as certain estimators of SP 800-90B do not
support samples larger than 8 bits, it is necessary to reduce the sample size. GetTickCount
is the elapsed time (in milliseconds) since the system was started. Thus, it is thus easy
to conclude that the low-order bits in the sample of GetTickCount contain most of the
variability. Therefore, it would be reasonable to reduce the 32-bit sample to an 8-bit sample
by using the lowest 8 bits. The entropy estimation of SP 800-90B is performed on input

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 6/29

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.404


data consisting of one million samples, where each sample size is 8 bits. Furthermore, the
maximum of the min-entropy per sample is 8.

IID test for entropy estimation
The IID test of SP 800-90B consists of permutation testing and five additional chi-square
tests. Permutation testing identifies evidence against the null hypothesis that the noise
source is IID. Since the permutation testing is the most time-consuming step in the entire
IID test, we only focus on the permutation testing in this study.

Algorithm 1 Permutation testing (Sönmez Turan et al., 2018).
Require: S=(s1,...,sL), where si is the noise sample and L=1,000,000.
Ensure: Decision on the IID assumption.
1: for statistical test i do
2: Assign the counters Ci,0 and Ci,1 to zero.
3: Calculate the test statistic TESTINi on S.
4: end for
5: for j=1 to 10,000 do
6: Permute S using the Fisher–Yates shuffle algorithm.
7: Calculate the test statistic TESTShufflei on the shuffled data.
8: if (TESTShufflei > TEST

IN
i ) then

9: Increment Ci,0.
10: else if (TESTShufflei =TEST

IN
i ) then

11: Increment Ci,1.
12: end if
13: end for
14: if ((Ci,0+Ci,1≤5)or(Ci,0≥9,995)) for any i then
15: Reject the IID assumption.
16: else
17: Assume that the noise source outputs are IID.
18: end if

Algorithm 1 presents the algorithm of the permutation testing described in SP 800-90B.
The permutation testing first performs statistical tests on one million samples of the noise
source, namely the original data. We refer to the results of the statistical tests as the original
test statistics. Thereafter, permutation testing carries out 10,000 iterations, as follows:
In each iteration, the original data are shuffled, the statistical tests are performed on the
shuffled data, and the results are compared with the original test statistics. After 10,000
iterations, the ranking of the original test statistics among the shuffled test statistics is
computed. If the rank belongs to the top 0.05% or bottom 0.05%, the permutation testing
determines that the original data (input) are not IID. That is, it concludes that the original
data are not IID if Eq. (1) is satisfied for any i that is the index of the statistical test. For
any i, the counter Ci,0 is the number of j in step 5 of alg:alg1 satisfying the shuffled test
statistic TESTShufflei > the original test statistic TEST

IN
i . The counter Ci,1 is the number of

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 7/29

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.404


Algorithm 2 Permutation testing of NIST program written in C++.
Require: S=(s1,...,sL), where si is the noise sample and L=1,000,000.
Ensure: Decision on the IID assumption.
1: for statistical test i do
2: Assign the counters Ci,0 and Ci,1 to zero.
3: Calculate the test statistic TESTINi on S.
4: end for
5: for j=1 to 10,000 do
6: Permute S using the Fisher–Yates shuffle algorithm.
7: for statistical test i do
8: if statusi= true then
9: Calculate the test statistic TESTShufflei on the shuffled data.
10: if (TESTShufflei > TEST

IN
i ) then

11: Increment Ci,0.
12: else if (TESTShufflei =TEST

IN
i ) then

13: Increment Ci,1.
14: else
15: Increment Ci,2.
16: end if
17: if ((Ci,0+Ci,1 > 5)and(Ci,1+Ci,2 > 5)) then
18: statei= false.
19: end if
20: end if
21: end for
22: end for
23: if ((Ci,0+Ci,1≤5)or(Ci,0≥9,995)) for any i then
24: Reject the IID assumption.
25: else
26: Assume that the noise source outputs are IID.
27: end if
Algorithm 3 Fisher–Yates shuffle (Sönmez Turan et al., 2018).
Require: S=(s1,...,sL), where si is the noise sample and L=1,000,000.
Ensure: Shuffled S=(s1,...,sL).
1: for i from L downto 1 do
2: Generate a random integer j such that 1≤ j≤ i.
3: Swap sj and si.
4: end for

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 8/29

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.404


j satisfying TESTShufflei =TEST
IN
i , whereas the counter Ci,2 is the number of j satisfying

TESTShufflei < TEST
IN
i .(

Ci,0+Ci,1≤5
)
or

(
Ci,0≥9,995

)
(1)

Equivalently, the permutation testing determines that the original data are IID if Eq. (2)
is satisfied for all i that is the index of the statistical test.(
Ci,0+Ci,1 > 5

)
and

(
Ci,1+Ci,2 > 5

)
(2)

The NIST optimized the permutation testing of the NIST program written in C++ using
Eq. (2). Thus, even if each statistical test is not performed 10,000 times completely, the
permutation testing can determine that the input data are IID. Algorithm 2 is the improved
version of the permutation testing optimized by the NIST.

We briefly introduce the shuffle algorithm and the tests used in the permutation testing.
The shuffle algorithm is the Fisher–Yates shuffle algorithm presented in Algorithm 3. The
permutation testing uses 11 statistical tests, the names of which are as follows:

• Excursion test
• Number of directional runs
• Length of directional runs
• Number of increases and decreases
• Number of runs based on the median
• Length of runs based on the median
• Average collision test statistic
• Maximum collision test statistic
• Periodicity test
• Covariance test
• Compression test*

The aim of the periodicity test is to measure the number of periodic structures in the
input data. The aim of the covariance test is to measure the strength of the lagged correlation.
Thus, the periodicity and covariance tests take a lag parameter as input and each test is
repeated for five different values of the lag parameter: 1, 2, 8, 16, and 32 (Sönmez Turan et
al., 2018). Therefore, a total of 19 statistical tests are used in the permutation testing.

If the input data are binary (that is, the sample size is 1 bit), one of two conversions is
applied to the input data for some of the statistical tests. The descriptions of each conversion
and the names of the statistical tests using that conversion are as follows (Sönmez Turan et
al., 2018):

Conversion I
Conversion I divides the input data into 8-bit non-overlapping blocks and counts the
number of 1s in each block. If the size of the final block is less than 8 bits, zeroes
are appended. The numbers and lengths of directional runs, numbers of increases and
decreases, periodicity test, and covariance test apply Conversion I to the input data.

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 9/29

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.404


Conversion II
Conversion II divides the input data into 8-bit non-overlapping blocks and calculates
the integer value of each block. If the size of the final block is less than 8 bits, zeroes are
appended. The average collision test statistic and maximum collision test statistic apply
Conversion II to the input data.

For example, let the binary input data be (0,1,1,0,0,1,1,0,1,0,1,1). For Conversion I,
the first 8-bit block includes four 1s and the final block, which is not complete, includes
three 1s. Thus, the output data of Conversion I are (4,3). For Conversion II, the integer
value of first block is 102 and the final block becomes (1,0,1,1,0,0,0,0) with an integer
value of 88. Thus, the output of Conversion II is (102,88).

PROPOSED IMPLEMENTATIONS
Target of GPU-based parallel processing
Steps 5 to 22 of Algorithm 2, with 10,000 iterations, consume most of the processing time
of the permutation testing. The shuffle algorithm and 19 statistical tests are performed
on the data with one million samples of the noise source in each iteration. Hence, it is
natural to consider the GPU-based parallel implementation of 10,000 iterations, which are
processed sequentially in the permutation testing.

The implementation of the compression test* differs from those of the other statistical
tests used in the permutation testing. The compression test* uses bzip2 (Seward, 2019),
which compresses the input data using the Burrows–Wheeler transform (BWT), the
move-to-front (MTF) transform, and Huffman coding. There have been studies on the
parallel implementation of bzip2 using the GPU. In Patel et al. (2012), all three main
steps, namely the BWT, the MTF transform, and Huffman coding, were implemented
in parallel using the GPU. However, the performance was 2.78 times slower than that
of the CPU implementation. In Shastry et al. (2016), only the BWT was computed on
the GPU and a performance improvement of 1.4 times that of the standard CPU-based
algorithm was achieved. However, we couldn’t apply this approach, because our parallel
test should be implemented on the GPU together with other statistical tests. Moreover,
the compression test does not play a key role in Algorithm 2. That is, it is infrequent for a
noise source to be determined as the non-IID only by the compression test results among
the 19 statistical tests used in the permutation testing. Therefore, we design the GPU-based
parallel implementation of the permutation testing consisting of the shuffle algorithm and
18 statistical tests, without the compression algorithm. Moreover, we design the hybrid
CPU/GPU implementation of the permutation testing consisting of our GPU-based parallel
implementation and a maximum of 10,000 compression tests using OpenMP.

Overview of GPU-based parallel permutation testing
Approximately 9.3 GB (= 10,000 × one million bytes of data) of the global memory of
the GPU is required for the CPU to invoke a CUDA kernel to process 10,000 iterations of
the permutation testing in parallel on the GPU. Some GPUs do not have more than 9 GB
of global memory. Therefore, we propose the GPU-based parallel implementation of the

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 10/29

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.404


Figure 2 CPU/GPU workflow of GPU-based parallel implementation of permutation testing. (A)
Code running on the host/CPU. (B) Code running on the device/GPU.

Full-size DOI: 10.7717/peerjcs.404/fig-2

permutation testing, which processes N iterations in parallel on the GPU according to the
user’s GPU specification and repeats this process R=d10,000/Ne times.

Figure 2 presents the workflow of the CPU and GPU. The host refers to a general CPU
that executes the program sequentially, whereas the device refers to a parallel processor
such as a GPU. In steps 1 to 3 of Fig. 2, the host performs 18 statistical tests on one million
bytes of the input data (without shuffling) and holds the results. In step 4, the host calls a
function that allocates the device memory required to process N iterations in parallel on
the device. The use and size of the variables are listed in Table 3. In step 5, the input data
(No. 1 in Table 3), and the results of the statistical tests in steps 1 to 3 (No. 4 in Table
3) are copied from the host to the device. In step 6, the host launches a CUDA kernel
CurandInit, which initializes the N seeds used in the curand() function. The curand()
function that generates random numbers using seeds on the device is invoked by the CUDA
kernel Shuffling. When the host receives the completion of the kernel CurandInit, the
host proceeds to steps 7 to 13.10,000 iterations are divided into R rounds and each round
processes N iterations in parallel on the device. To process N iterations, the host launches
the CUDA kernel Shuffling (step 8) and then launches the CUDA kernel Statistical
test (step 9) as soon as the host receives the completion of the kernel Shuffling. When
the host receives the completion of the kernel Statistical test, in step 10, the counters

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 11/29

https://peerj.com
https://doi.org/10.7717/peerjcs.404/fig-2
http://dx.doi.org/10.7717/peerj-cs.404


Table 3 Use and size of variables allocated to GPU.

No. Use of variable Size of variable (bytes)

1 Original data (input) 1,000,000
2 N shuffled data N ×1,000,000
3 N seeds used by curand() function N× sizeof(curandState)=N ×48
4 18 Original test statistics 18 × sizeof(double)=144
5 Counter Ci,0,Ci,1,Ci,2 for 1≤ i≤18 18× sizeof(int)×3=216
6 N shuffled data after Conversion II

(Only used if the input is binary)
N ×125,000

Ci,0, Ci,1, and Ci,2 for i∈{1,2,...,18}, which indicate the indices of the statistical tests, are
copied from the device to the host. Following the operations in steps 17 to 19 of Algorithm
2, which correspond to those in steps 12 and 13 of Fig. 2, the host moves on to step 14 if
Eq. (2) is satisfied for all i. Finally, in step 14, the host determines whether or not the input
data are IID.

When the input data are binary, two conversions should be considered when designing
the CUDA kernels. Therefore, we describe the CUDA kernels designed to process N
iterations in parallel on the GPU depending on whether the input data are binary. The
descriptions of the CUDA kernels Shuffling and Statistical test for non-binary
noise sample are as follows:

CUDA kernel Shuffling
The kernel Shuffling generates N shuffled data by permuting one million bytes of the
original data N times in parallel. Thus, each of N CUDA threads permutes the original
data using the Fisher–Yates shuffle algorithm and then stores the shuffled data in the global
memory of the device. As the shuffle algorithm uses the curand() function, each thread
uses its unique seed that is initialized by the kernel CurandInit with its index, respectively.

CUDA kernel Statistical test
The kernel Statistical test performs 18 statistical tests on each of N shuffled data,
and compares the shuffled and original test statistics. The size of each shuffled data is
one million bytes and N shuffled data are stored in the global memory of the device. In
this section, we present two methods that can easily be designed to handle this process in
parallel on the GPU and propose an optimized method.

Parallelization
method 1

One CUDA thread performs 18 statistical tests sequentially on
one shuffled dataset. This method is illustrated in Fig. 3. If this
method is applied to the kernel Statistical test, B′=(N /T)
CUDA blocks are used when the number of CUDA threads is T .
However, because each thread runs 18 tests in sequence, room for
improvement is apparent in this method.

Parallelization
method 2

In this method, each block performs its designated statistical test
out of 18 tests on one shuffled dataset shared by 18 blocks. Thus,
for one shuffled set, 18 statistical tests are run in parallel, and this

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 12/29

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.404


method is a parallelization of the serial part in method 1 above.
This method is illustrated in Fig. 4, which indicates the kernel
Statistical test with B′=((N /T)×18) CUDA blocks and T
threads in a block.

Proposed optimiza-
tion

This method optimizes parallelization method 2 through two steps.
(Step 1) To hide the latency in accessing the slow global memory
of the GPU, we analyzed the runtime of 18 statistical tests from
an algorithmic perspective. We merged several statistical tests with
similar access patterns to the global memory into a single test.
Therefore, 9 merged statistical tests replace 18 statistical tests. (Step
2) When analyzed the execution time of nine merged tests, the
execution time of one longest test was similar to the sum of the
execution times of the remaining eight tests. We configured each
thread of a block to runs the longest test and each thread of the
other block to run eight merged tests so that the workload between
SMs is balanced. This method is depicted in Fig. 5, where the kernel
Statistical test uses B′=((N /T)×2) CUDA blocks, with T
threads in each block.

With slight modifications to the kernels Shuffling and Statistical test, which are
designed for non-binary samples, as described above, we can parallelize the permutation
testing when the input data are binary. If the noise sample size is 1 bit, one of two conversions
is applied to certain statistical tests. The data after Conversion I and data after Conversion
II can be stored separately in the global memory. Since the data after Conversion I are
the result of calculating the Hamming weight of the data following Conversion II, we
designed to minimize the use of global memory as follows: In the kernel Shuffling, N
CUDA threads first generate N shuffled data in parallel. Thereafter, each thread proceeds
to Conversion II for its own shuffled data and stores the results (No. 6 in Table 3) in the
global memory of the GPU. The kernel Statistical test runs nine merged tests. The
merged tests that required Conversion I calculate the Hamming weight of the data after
Conversion II. As in the optimized method for non-binary data, the thread in the block
executes at least one test so that the execution time of each block is similar. Therefore,
B′=(N /T)× 4 CUDA blocks are used when the number of CUDA threads is T .

Overview of hybrid CPU/GPU implementation of permutation testing
We implemented the GPU-based permutation testing, which comprised 18 statistical tests
without the compression algorithm and is parallel on the GPU. This section presents a
hybrid CPU/GPU implementation of permutation testing that includes the compression
algorithm.

As shown in Fig. 6, we designed the hybrid implementation to perform 10,000
shuffling and compression tests using OpenMP according to the result of our GPU-
based permutation testing. The noise source is determined as the non-IID if at least
one test does not satisfy Eq. (2), as shown in Algorithm 2. Therefore, if our GPU-based

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 13/29

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.404


Figure 3 General parallel method 1 of kernel Statistical test.
Full-size DOI: 10.7717/peerjcs.404/fig-3

Figure 4 General parallel method 2 of kernel Statistical test.
Full-size DOI: 10.7717/peerjcs.404/fig-4

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 14/29

https://peerj.com
https://doi.org/10.7717/peerjcs.404/fig-3
https://doi.org/10.7717/peerjcs.404/fig-4
http://dx.doi.org/10.7717/peerj-cs.404


Figure 5 Proposed optimization method of kernel Statistical test.
Full-size DOI: 10.7717/peerjcs.404/fig-5

program determined that the input noise source is non-IID, our hybrid program finally
determines that the input is non-IID, without compression tests. If our GPU-based
program determined that the input is IID, the noise source might be determined to be IID
or be determined to be non-IID only by the result of the compression test. Therefore, our
hybrid program performs at most 10,000 shuffling and compression tests in parallel using
OpenMP. If the results of the compression tests satisfy Eq. (2), the noise source is finally
determined as the IID; otherwise, it is determined as the non-IID.

EXPERIMENTS AND PERFORMANCE EVALUATION
In this section, we analyze the performance of the proposed methods and compare its
performance with the NIST program written in C++. The performance was evaluated
using two hardware configurations (Table 4).
There are two noise sources used in experiments. The first noise source is truerand
provided by the NIST. The second noise source, GetTickCount, could be collected through
the GetTickCount() function in the Windows environment. The sample size of each noise
source is 1, 4, or 8 bits. As a result of confirming whether the input data are IID by the
IID test, truerand was determined as the IID noise source; however, GetTickCount was
determined as the non-IID noise source.

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 15/29

https://peerj.com
https://doi.org/10.7717/peerjcs.404/fig-5
http://dx.doi.org/10.7717/peerj-cs.404


Figure 6 Proposed hybrid CPU/GPU program of permutation testing. (A) Process on the host/CPU.
(B) Process on the device/GPU.

Full-size DOI: 10.7717/peerjcs.404/fig-6

The experimental result is the average of the results repeated 20 times. The difference
between the results of the experiments repeated 20 times was within 5%. Since the GPU
Boost technology, which controls the clock speed according to extra power availability, is
used in NIVIDA GPU, the results are with the GPU Boost applied, unless otherwise noted.

GPU optimization concepts
We conducted experiments on the optimization concepts considered while GPU-based
parallelizing the permutation testing. The experimental data used in this section consisted
of one million samples collected from the noise source GetTickCount, where the sample
size was 8 bits. In the experiments, we set T , the number of threads per block used in the
CUDA kernel, to 256, a multiple of the warp size (=32). Since T is set to 256, we set N to
2,048, which is the multiple T , and used about 2 GB (= N×1,000,000 bytes) of the global
memory of the GPU.

Coalesced memory access
We used the memory coalescing technique (Fig. 7) to transfer data from slow global memory
to the registers efficiently. Table 5 displays the performance of our parallel implementation
of the permutation testing before and after using this technique. Permutation testing used

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 16/29

https://peerj.com
https://doi.org/10.7717/peerjcs.404/fig-6
http://dx.doi.org/10.7717/peerj-cs.404


Table 4 Configurations of experimental platforms.

Name Device A Device B

CPU model Intel(R) Core (TM) i7-8086K Intel(R) Core (TM) i7-7700
CPU frequency 4.00 GHz 3.60 GHz
CPU cores 6 4
CPU threads 12 8
Accelerator type NVIDIA GPU NVIDIA GPU
Models TITAN Xp GeForce GTX 1060
Multiprocessors (SMs) 30 10
CUDA cores/SM 128 128
CUDA capability major 6.1 6.1
Global memory 12,288 MB 6,144 MB
GPU Max clock rate 1,582 MHz 1,709 MHz
Memory clock rate 5,750 MHz 4,004 MHz
Registers/block 65,536 65,536
Threads/SM 2,048 2,048
Threads/block 1,024 1,024
Warp size 32 32
CUDA driver version 10.1 10.1

Figure 7 Memory coalescing technique.
Full-size DOI: 10.7717/peerjcs.404/fig-7

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 17/29

https://peerj.com
https://doi.org/10.7717/peerjcs.404/fig-7
http://dx.doi.org/10.7717/peerj-cs.404


Table 5 Performance of proposed GPU-based parallel implementation of permutation testing de-
pending on whether memory coalescing technique was used (the number of CUDA blocks = 16, the
number of threads per block = 256).

Before using memory
coalescing technique (s)

After using memory
coalescing technique (s)

Device A 27.2 19.0
Device B 54.1 33.9

the kernel Statistical test with our optimization method. As a result, we improved
performance by 1.5 times. All experiments after this section use the memory coalescing
technology.

Merging statistical tests
Our optimization method consists of a step in which tests are merged (Step 1) and a step
in which at least one test is allocated in the CUDA block so that the working time of each
thread is similar (Step 2). Therefore, we confirmed the validity of our merged tests.

We first designed new CUDA kernels for experimentation, where each of the N threads
performed one statistical test on one shuffled data. We measured the execution time of each
test kernel. Each test kernel used eight CUDA blocks since we set the number of threads
per block T to 256. The experimental results showing the execution time of each statistical
test on the GPU are shown in Table 5.

From Table 6, it takes approximately four seconds if one thread sequentially performs
18 statistical tests. However, if one thread performs nine merged tests, it can be expected
that it will take about 2.3 seconds. We improved the performance for all 18 statistical tests
by about 1.7 times by combining the tests.

We measured the execution time of the parallelization method 2 applied Step 2, and our
method. Referring to the results of Table 6, we designed each CUDA block of method 2
which Step 2 was applied to proceed with each of tests 1∼6, test 7, test 8, and tests 9∼18;
each block can complete its work in a similar time. The kernel Statistical test applying
this method uses 32 (=(N /T)×4) blocks; however, applying our proposed method uses
16 (= (N /T)×2) blocks. Table 7 presents the execution time of a kernel Statistical
test with each method applied. As a result, our method is about 1.5 times faster than the
parallelization method 2 applied Step 2.

Parallelism methods
We experimentally verified whether the proposed optimization method is better than
other methods. We first confirmed the difference in the operation time of each CUDA
thread in the kernel Statistical test, where each parallelization method is applied by
drawing a figure. Figure 8 displays the operation times of the CUDA threads, assuming
that the GPU had three SMs and considering the results of Table 6. It is the task of the
GPU scheduler to allocate the CUDA blocks to the SMs; however, these were assigned
arbitrarily for visualization in Fig. 8. As indicated in Table 6, the statistical tests had
different execution times. Therefore, we expressed the different lengths of the threads in

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 18/29

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.404


Table 6 Left: execution time of each statistical test on GPU; right: execution time of each merged statistical test on GPU (Device A, number of
CUDA blocks = 8, number of threads per block = 256).

No. Name of
statistical test

Execution
time (ms)

No. Name of merged
statistical test

Execution
time (ms)

1 Excursion test 214 1′ Excursion test 214
2 Number of directional runs 75 2′ Directional runs and number of inc/dec 90
3 Length of directional runs 81
4 Numbers of increases and decreases 38
5 Number of runs based on median 103 3′ Runs based on median 143
6 Length of runs based on median 128
7 Average collision test statistic 1,257 4′ Collision test statistic 1,258
8 Maximum collision test statistic 1,238
9 Periodicity test (lag=1) 50 5′ Per/Cov test (lag=1) 129
10 Covariance test (lag=1) 71
11 Periodicity test (lag=2) 94 6′ Per/Cov test (lag=2) 137
12 Covariance test (lag=2) 113
13 Periodicity test (lag=8) 93 7′ Per/Cov test (lag=8) 134
14 Covariance test (lag=8) 111
15 Periodicity test (lag=16) 93 8′ Per/Cov test (lag=16) 134
16 Covariance test (lag=16) 111
17 Periodicity test (lag=32) 93 9′ Per/Cov test (lag=32) 134
18 Covariance test (lag=32) 111

Table 7 Performance of parallelization method 2 applied Step 2 and our method (Device A, the num-
ber of threads per block = 256).

Number of
CUDA blocks

Execution
time (s)

Parallelization method 2 (18 tests) +Step 2 32 2.24
Our method (9 merged tests +Step 2) 16 1.51

the CUDA blocks running each statistical test, as illustrated in Fig. 8. In the proposed
method, several statistical tests were merged for optimization. The execution time of the
merged statistical test (Table 6) was equal to or slightly longer than each execution time of
the original statistical tests prior to merging (Table 6). Suppose that Test 1&2 is a merged
function of Test 1 and Test 2. The lengths of the threads in the block running Test 1&2
were slightly longer than those of the threads in the block running Test 1 or Test 2, as
indicated in Fig. 8. As illustrated in Fig. 8, we expected that our optimization outperformed
parallelization methods 1 and 2.
We measured the execution time of a kernel Statistical test according to the parallel

method. Table 8 shows the execution times of each kernel measured on both devices. If the
occupancy of the kernel in our parallelization method is calculated, it reaches 100%. It is
the occupancy per SM. Since our method uses a small number of blocks, there may be idle
SMs on a high-performance GPU with many SMs. However, if the host calls the test kernel

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 19/29

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.404


Figure 8 Operation times of CUDA threads in kernel Statistical test when applying each method
on device.

Full-size DOI: 10.7717/peerjcs.404/fig-8

Table 8 Execution time of kernel Statistical test according to parallel method (number of threads
per block = 256).

Execution time (s)

Method Number of
CUDA blocks

Device A Device B

Parallelization method 1 8 4.53 6.39
Parallelization method 2 144 2.77 6.33
Our optimization (Step 1) 72 1.62 2.94
Our optimization (Step 1&2) 16 1.51 2.76

for each noise source simultaneously using a multi-stream technique, we can use almost
full GPU capability.

Since 18 statistical tests were running in parallel, the parallelization method 2 was
improved by 1.6 times over method 1 in Device A; however, there was no improvement
in the performance in Device B. In Device B, the number of SMs was 10, and the number
of active blocks was calculated by eight. Thus, it is analyzed as the result derived since
the number of blocks generated by the kernel (=144) is more than the number of blocks
active in the device simultaneously (=80). Our method (Step 1) is about 1.7 and 2.1
times, respectively, faster than the parallelization method 2 in Device A and Device B. It is
analyzed as the results due to the merged statistical tests that improved the performance,

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 20/29

https://peerj.com
https://doi.org/10.7717/peerjcs.404/fig-8
http://dx.doi.org/10.7717/peerj-cs.404


Figure 9 Execution time of the GPU-based parallel implementation of permutation testing according
to parallel method (number of threads per block = 256).

Full-size DOI: 10.7717/peerjcs.404/fig-9

as confirmed in the previous section. Since the work of each CUDA block was adequately
balanced, it is analyzed that our method (Step 1&2) was slightly improved over our method
(Step 1). Furthermore, our method is 3 times and about 2.3 times, respectively, faster than
the parallelization method 1 in Device A and Device B.

Next, we analyzed how each method affected the performance of GPU-based
implementation of permutation testing. As shown in Algorithm 2, the permutation testing
has 10,000 iterations. Since implemented N iterations in parallel, the kernel CurandInit
is called once, and the kernel Shuffling and Statistical test are called d10,000/Ne
times. Since we set N to 2,048 and did not use Eq. (2) in this experiment, the permutation
testing consists of one CurandInit, five Shuffling and five Statistical test. Figure 9
shows the execution time of this permutation testing according to the parallelization
method. The permutation testing applied our method shows an improvement of about
1.8 times over the permutation testing applied method 1. Thus, our optimization method
outperformed parallelization methods 1 and 2.

Performance evaluation of GPU-based permutation testing according
to the parameter
Parameter N is the number of iterations of the permutation testing to be processed in
parallel. We measured the performance of the GPU-based parallel implementation of the
permutation testing according to the value of the parameter N .

As shown in Fig. 2, the kernel CurandInit is called once. The kernel Shuffling and
Statistical test are called at most d10,000/Ne times. The calling process repeated is
as follows: After the kernel Shuffling and the kernel Statistical test are sequentially
run once, if the results do not satisfy Eq. (2), each kernel is called again. If each kernel

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 21/29

https://peerj.com
https://doi.org/10.7717/peerjcs.404/fig-9
http://dx.doi.org/10.7717/peerj-cs.404


Table 9 Execution time of the GPU-based parallel permutation testing according to the value of the
parameter N .

Parameter N 1,000 2,000 2,500 5,000 10,000

Global memory (GB) 0.93 1.86 2.33 4.66 9.31
Execution time (s)

truerand 2.69 3.78 4.53 9.20 19.76
Device A GetTickCount 26.92 18.81 18.19 18.43 19.83

truerand 3.59 6.80 8.58 − −
Device B GetTickCount 35.75 33.97 34.49 − −

has been called d10,000/Ne times or the results satisfy Eq. (2), the call to each kernel is
aborted.

If the noise source is IID, there is little evidence against the null hypothesis that the noise
source is IID in the permutation testing. The probability of satisfying Eq. (2) increases,
and the number of the calls of the kernel decreases. On the other hand, if the noise
source is Non-IID, the probability of satisfying Eq. (2) decreases, and the number of
the calls increases, contrary to the IID noise source case. Therefore, we used truerand
and GetTickCount, which were determined as the IID and the non-IID, respectively, by
permutation testing. The sample size of each noise source is 8 bits.

Permutation testing performs 10,000 iterations, so we set N to be a factor of 10,000
and T to 250. Since the size of the global memory in Device A is 12 GB, we set N to
1,000, 2,000, 2,500, 5,000, and 10,000. In Device B, the size of the global memory is 6
GB, and so we set N to 1,000, 2,000, and 2,500. Table 9 presents the execution time of
the GPU-based parallel implementation of the permutation testing and the usage of global
memory (calculated by referring to Table 3), according to the value N .

When truerand was used as input data, each of the kernel Shuffling and Statistical
test was called once, and then the noise source was determined as the IID through the
test results. Therefore, in an environment (e.g., Hardware RNG) where the noise sources
are likely to be IID, it is analyzed that it is appropriate even if the user sets N to 1,000. In
GetTickCount, each kernel was called d10,000/Ne times and then was determined as the
non-IID. The execution time multiplied by d10,000/Ne, when truerand was the input,
gives a similar result to the execution time when GetTickCount was the input. As shown
in Table 9, in the case of GetTickCount, as N increases, the execution time decreases and
then increases again. Each thread used the global memory of 1 million bytes. Therefore, we
analyzed it as a result of the latency derived by increasing access to global memory as the
number of switching by the warp unit increases. It is appropriate to select N by considering
all of the global memory usages, execution time determined as an IID noise source, and
execution time determined as a non-IID noise source in a general environment. As a result
of the experiment, it is appropriate to set N to 2,500 when using Device A and to select N
to 2,000 when using Device B.

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 22/29

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.404


Table 10 Performances of our GPU-based program and NIST program written in C++ according to noise source (without the compression
test).

Execution time (s)

Name of noise source truerand GetTickCount

Sample size (bit) 1 4 8 1 4 8

NIST program (CPU single-thread) 43.42 77.52 24.94 434.42 485.58 638.89
Device A NIST program (CPU multi-thread) 37.53 54.91 23.66 331.76 339.79 347.68

Proposed program (GPU) 3.17 4.39 4.53 12.72 17.63 18.19
NIST program (CPU multi-thread) 41.35 50.15 23.18 361.23 347.15 353.52

Device B Proposed program (GPU) 4.60 5.91 6.80 23.01 29.58 33.97

Performance evaluation of GPU-based permutation testing with NIST
program according to noise source
For each noise source, we measured the performances of our GPU-based program and
the NIST program. Two noise sources, truerand and GetTickCount, were used in the
experiment and the sample size of each noise source is one of 1, 4, and 8 bits. We set N to
2,500 and 2,000, respectively, when using Device A and Device B, reflecting the result of
the previous experiment. We set T to 250.

The NIST program, written in C++, is compatible with OpenMP and can make
10,000 iterations work in a multi-threaded environment. In this experiment, the NIST
program running on the CPU used 12 CPU threads in Device A and eight CPU threads
in Device B (Table 4). Thus, we compared our performance with permutation testing in
the single-threaded and multi-threaded NIST programs. Since our GPU-based parallel
implementation of the permutation testing was designed without the compression
algorithm, we measured the performance of the NIST program without the compression
test.

Table 10 presents the execution times of the NIST program on the CPU and the
proposed program on the GPUs, measured for each noise source. For truerand, the
performance of the proposed program was approximately 17.6 times better than that of
the single-threaded NIST program. It was about 12.5 times better than the performance
of the multi-threaded NIST program. In the case of GetTickCount, the performance of
our program was improved by approximately 35.1 times and about 26.1 times over the
single-threaded and the multi-threaded NIST programs.

In Table 10, the minimum performance improvement of the proposed program for
truerand was not higher than that of the program for GetTickCount. As shown in Algorithm
2, the number of iterations (up to 10,000) in permutation testing varies depending on
whether Eq. (2) is satisfied. The NIST program on the CPU was executed as one statistical
test unit. If the accumulated results of the statistical test satisfied Eq. (2), that test was
no longer performed in the iterations. On the other hand, our program on the GPU was
executed as an N unit of 18 statistical tests, and if the results of all tests satisfied Eq. (2), it
was not repeated. Namely, the kernel Shuffling and Statistical test were not called
again. If the noise source was likely to be determined as the IID from the permutation
testing, there is a high probability that all of the statistical tests satisfy Eq. (2). The NIST

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 23/29

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.404


Table 11 Execution time of the GPU-based parallel implementation of permutation testing with/with-
out GPU Boost (Device A).

Execution time (s)

Name of noise source With GPU
Boost

Without GPU
Boost

truerand-1bit 3.17 3.21
truerand-4bit 4.39 4.57
truerand-8bit 4.53 4.66
GetTickCount-1bit 12.72 12.87
GetTickCount-4bit 17.63 18.28

program operating as one test unit repeatedly performed each test less than N times and
then determined truerand as the IID; however, in the case of GetTickCount, both the NIST
program and our program performed 10,000 iterations and determined GetTickCount as
the non-IID. Therefore, it is analyzed that the difference in performance improvement of
our program by noise source is reasonable.

NVIDIA GPU Boost technology boosts the CUDA core frequency from 1,582 to 1,873
MHz in Device A. The execution time of our GPU-based program without GPU Boost is
presented in Table 11. Without GPU Boost, the performance decreased by up to 0.96 times
compared to the case with GPU Boost. It is analyzed that the difference in performance
with or without GPU Boost is not significant. The performance of our GPU-based program
without GPU Boost is approximately 5 to 34 times better than the single-threaded NIST
program and about 5 to 25 times better than the multi-threaded NIST program.

Performance evaluation of our hybrid CPU/GPU program
We measured the performance of the proposed hybrid CPU/GPU program and the NIST
program using truerand and GetTickCount, whose sample size is 8 bits. Both programs
included the compression test. Figure 10 presents the performance of each program. A
base-10 logarithmic scale is used for the Y -axis.
Since the NIST program performs the compression tests, it takes longer than the runtime
of the NIST program without the compression test written in Table 10. In particular, when
determining GetTickCount to be non-IID, the compression test runs almost 10,000 times,
and so the NIST program, in this case, takes much longer than the runtime written in Table
10.

Our hybrid CPU/GPU program performs the compression tests using OpenMP only
when our GPU-based program determined the noise source (e.g., truerand) as the IID. As
shown in Fig. 10, it is reasonable that the execution time of our hybrid program for truerand
is longer than that of our GPU-based program presented in Table 10. Since GetTickCount
was determined as the non-IID by our GPU-based program, the compression test does not
run in our hybrid program. Therefore, our hybrid program has the same execution time
as our GPU-based program in Table 10.

Compared to the single-threaded NIST program, the proposed hybrid CPU/GPU
program had an improved performance of approximately 4.9 to 192.9 times. Compared

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 24/29

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.404


Figure 10 Execution time of our hybrid program and NIST program.
Full-size DOI: 10.7717/peerjcs.404/fig-10

with the multi-threaded NIST program, the performance improved about 3.8 to 29.7 times.
The NIST program always performed up to 10,000 compression tests using OpenMP;
however, our hybrid program performed the compression tests using OpenMP only if
the noise source was determined as the IID by all 18 statistical tests in our GPU-based
program. Therefore, our hybrid program is efficient when determining the noise source as
the non-IID than when determining the noise source as the IID.

When the NIST program applies our implementation method, it first performs the
shuffling and 18 statistical tests (at most 10,000 times). If it determined that the noise
source was non-IID by these results, it does not run the shuffling and the compression
tests. When the input is non-IID, the NIST program (with the compression test) had the
same runtime presented in Table 10. Otherwise, the NIST program has the same runtime
as the original program. Therefore, our hybrid CPU/GPU program sped the process about
3 times over the multi-threaded NIST program applied our method for IID noise sources
(8-bit sample size). Our program had an improved performance of approximately 25 for
the non-IID input.

CONCLUSIONS
The security of modern cryptography is heavily reliant on sensitive security parameters
such as encryption keys. RNGs should provide cryptosystems with ideal random bits,
which are independent, unbiased, and, most importantly, unpredictable. To use a secure
RNG, it is necessary to estimate its input entropy as precisely as possible. The NIST offers

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 25/29

https://peerj.com
https://doi.org/10.7717/peerjcs.404/fig-10
http://dx.doi.org/10.7717/peerj-cs.404


two programs for entropy estimations, as outlined in SP 800-90B. However, much time is
required to manipulate several noise sources for an RNG.

We proposed GPU-based parallel implementation of the permutation testing, which
required the longest execution time in the IID test of SP 800-90B. Our GPU-based
implementation excluded the compression test that is unsuitable for CUDA version
implementation. Our GPU-based method was designed to use massive parallelism of the
GPU by balancing the execution time for statistical tests, as well as optimizing the use of
the global memory for data shuffling. We experimentally compared our GPU optimization
with the NIST program excluded the compression test. Our GPU-based program was
approximately 3 to 34 times faster than the single-threaded NIST program. Moreover, our
proposal improved the performance by about 3 to 25 times over the multi-threaded NIST
program. We proposed the hybrid CPU/GPU implementation of the permutation testing.
It consists of our GPU-based program and the compression tests that run using OpenMP.
Experimental results show that the performance of our hybrid program is approximately
3 to 25 times better than that of the multi-threaded NIST program (with compression
test). Most noise sources are non-IID, and our program has better performance when
determining the noise source as the non-IID. It is expected that the time required for
analyzing the RNG security will be significantly reduced for developers and evaluators
by using the proposed approach, thereby improving the validation efficiency in the
development of cryptographic modules. It is expected that our optimization techniques
might be adapted to the problems of performing several tests or processes on thousands or
more of data, each of which is large.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
This work was supported by an Institute for Information & Communications Technology
Promotion (IITP) grant funded by the Korean Government (MSIT) (No. 2014-6-00908,
Research on the Security of Random Number Generators and Embedded Devices). The
funders had no role in study design, data collection and analysis, decision to publish, or
preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
Institute for Information & Communications Technology Promotion (IITP) grant: No.
2014-6-00908.

Competing Interests
The authors declare there are no competing interests.

Author Contributions
• Yewon Kim conceived and designed the experiments, performed the experiments,
analyzed the data, performed the computation work, prepared figures and/or tables, and
approved the final draft.

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 26/29

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.404


• Yongjin Yeom conceived and designed the experiments, analyzed the data, authored or
reviewed drafts of the paper, and approved the final draft.

Data Availability
The following information was supplied regarding data availability:

Data and source code are available at GitHub: https://github.com/yeah1kim/yeah_
GPU_SP800_90B_IID.

REFERENCES
Barker E, Kelsey J. 2012. Recommendation for the entropy sources used for random bit gen-

eration. National Institute of Standards and Technology NIST Special Publication
(SP) 800-90B (Draft).

Bernstein DJ, Chang Y-A, Cheng C-M, Chou L-P, Heninger N, Lange T, Van Someren
N. 2013. Factoring RSA keys from certified smart cards: Coppersmith in the wild. In:
Sako K, Sarkar P, eds. Advances in Cryptology - ASIACRYPT 2013. ASIACRYPT 2013.
Lecture notes in computer science, vol. 8270. Berlin, Heidelberg: Springer, 341–360
DOI 10.1007/978-3-642-42045-0_18.

Ding Y, Peng Z, Zhou Y, Zhang C. 2014. Android low entropy demystified. In: 2014 IEEE
international conference on communications (ICC). Piscataway: IEEE, 659–664.

Heninger N, Durumeric Z, Wustrow E, Halderman JA. 2012. Mining your Ps and Qs:
detection of widespread weak keys in network devices. In: Presented as part of the 21st
USENIX security symposium (USENIX Security 12). 205–220.

ISO/IEC-20543. 2019. Information technology —Security techniques —Test and analysis
methods for random bit generators within ISO/IEC 19790 and ISO/IEC 15408.

Kang J-S, Park H, Yeom Y. 2017. On the additional chi-square tests for the IID assump-
tion of NIST SP 800-90B. In: 2017 15th annual conference on privacy, security and
trust (PST). Piscataway: IEEE, 375–3757.

Kaplan D, Kedmi S, Hay R, Dayan A. 2014. Attacking the Linux PRNG On Android:
weaknesses in seeding of entropic pools and low boot-time entropy. In: 8th USENIX
workshop on offensive technologies (WOOT 14).

Kelsey J. 2012. Entropy sources and you: an overview of SP 800-90B. In: Random Bit
Generation Workshop.

Kim SH, Han D, Lee DH. 2013. Predictability of Android OpenSSL’s pseudo random
number generator. In: Proceedings of the 2013 ACM SIGSAC conference on Computer
& communications security. New York: ACM, 659–668.

Li P, Zhou S, Ren B, Tang S, Li T, Xu C, Chen J. 2019. Efficient implementation of
lightweight block ciphers on volta and pascal architecture. Journal of Information
Security and Applications 47:235–245 DOI 10.1016/j.jisa.2019.04.006.

Li Q, Zhong C, Zhao K, Mei X, Chu X. 2012. Implementation and analysis of AES
encryption on GPU. In: 2012 IEEE 14th international conference on high performance
computing and communication & 2012 IEEE 9th international conference on embedded
software and systems. Piscataway: IEEE, 843–848.

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 27/29

https://peerj.com
https://github.com/yeah1kim/yeah_GPU_SP800_90B_IID
https://github.com/yeah1kim/yeah_GPU_SP800_90B_IID
http://dx.doi.org/10.1007/978-3-642-42045-0_18
http://dx.doi.org/10.1016/j.jisa.2019.04.006
http://dx.doi.org/10.7717/peerj-cs.404


Ma J, Chen X, Xu R, Shi J. 2017. Implementation and evaluation of different parallel
designs of AES using CUDA. In: 2012 IEEE 14th International Conference on
High Performance Computing and Communication & 2012 IEEE 9th International
Conference on Embedded Software and Systems. Piscataway: IEEE, 606–614.

Michaelis K, Meyer C, Schwenk J. 2013. Randomly failed! The state of randomness in
current Java implementations. In: Dawson E, ed. Topics in Cryptology – CT-RSA
2013. CT-RSA 2013. Lecture notes in computer science. vol. 7779. Berlin, Heidelberg:
Springer, 129–144 DOI 10.1007/978-3-642-36095-4_9.

Müller S. 2020. Linux random number generator - a new approach. Available at https:
//chronox.de/lrng/doc/lrng.pdf (accessed on February 2020).

Neves S, Araujo F. 2011. On the performance of GPU public-key cryptography. In: ASAP
2011-22nd IEEE international conference on application-specific systems, architectures
and processors. Piscataway: IEEE, 133–140.

NIST. 2015. EntropyAssessment. GitHub. Available at https://github.com/usnistgov/
SP800-90B_EntropyAssessment (accessed on February 2020).

NIST, CSE. 2021. Implementation guidance for FIPS PUB 140-2 and the cryptographic
module validation program. Available at http://csrc.nist.gov/groups/STM/cmvp/
documents/fips140-2/FIPS1402IG.pdf (accessed on February 2020).

NVIDIA. 2020a. CUDA C++ BEST Practices guide. In: NVIDIA, Aug. Available at
https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html (accessed on
February 2020).

NVIDIA. 2020b. CUDA C++ Programming guide. NVIDIA, Aug. Available at https://
docs.nvidia.com/cuda/cuda-c-programming-guide/index.html (accessed on February
2020).

OpenMP. 2018. OpenMP application programming interface. Available at https://www.
openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf (accessed on
February 2020).

Pan W, Zheng F, Zhao Y, Zhu W-T, Jing J. 2016. An efficient elliptic curve cryptography
signature server with GPU acceleration. IEEE Transactions on Information Forensics
and Security 12(1):111–122.

Patel RA, Zhang Y, Mak J, Davidson A, Owens JD. 2012. Parallel lossless data compression
on the GPU. Piscataway: IEEE.

Ristenpart T, Yilek S. 2010. When good randomness goes bad: virtual machine reset
vulnerabilities and hedging deployed cryptography. In: Proceedings of Network and
Distributed Security Symposium (NDSS). San Diego, CA, USA: The Internet Society,
1–18.

Schneier B, Fredrikson M, Kohno T, Ristenpart T. 2015. Surreptitiously weakening
cryptographic systems. In: IACR Cryptol. ePrint Arch. vol. 2015. 97. Available at
https://eprint.iacr.org/2015/097 (accessed on February 2020).

Seward J. 2019. bzip2 and libbzip2, version 1.0.8: a program and library for data
compression. Available at https://sourceware.org/bzip2/ .

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 28/29

https://peerj.com
http://dx.doi.org/10.1007/978-3-642-36095-4_9
https://chronox.de/lrng/doc/lrng.pdf
https://chronox.de/lrng/doc/lrng.pdf
https://github.com/usnistgov/SP800-90B_EntropyAssessment
https://github.com/usnistgov/SP800-90B_EntropyAssessment
http://csrc.nist.gov/groups/STM/cmvp/documents/fips140-2/FIPS1402IG.pdf
http://csrc.nist.gov/groups/STM/cmvp/documents/fips140-2/FIPS1402IG.pdf
https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf
https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf
https://eprint.iacr.org/2015/097
https://sourceware.org/bzip2/
http://dx.doi.org/10.7717/peerj-cs.404


Shastry K, Pandey A, Agrawal A, Sarveswara R. 2016. Compression acceleration using
GPGPU. In: 2016 IEEE 23rd international conference on high performance computing
workshops (HiPCW). Piscataway: IEEE, 70–78.

Stevens M, Bursztein E, Karpman P, Albertini A, Markov Y. 2017. The first collision
for full SHA-1. In: Annual international cryptology conference. Heidelberg: Springer,
570–596.

Sönmez Turan M, Barker E, Kelsey J, McKay K, Baish M, Boyle M. 2016. Recommenda-
tion for the entropy sources used for random bit generation. In: National Institute of
Standards and Technology. NIST Special Publication (SP) 800-90B (2nd Draft).

Sönmez Turan M, Barker E, Kelsey J, McKay K, Baish M, Boyle M. 2018. Recommen-
dation for the entropy sources used for random bit generation. In: National Institute
of Standards and Technology. NIST Special Publication (SP) 800-90B. Available at
https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-90B.pdf .

Vaidya B. 2018. Hands-On GPU-accelerated computer vision with OpenCV and CUDA:
effective techniques for processing complex image data in real time using GPUs.
Birmingham, UK: Packt Publishing Ltd.

Yoo T, Kang J-S, Yeom Y. 2017. Recoverable random numbers in an internet of things
operating system. Entropy 19(3):113 DOI 10.3390/e19030113.

Zhu S, Ma Y, Chen T, Lin J, Jing J. 2017. Analysis and improvement of entropy esti-
mators in NIST SP 800-90B for non-IID entropy sources. IACR Transactions on
Symmetric Cryptology 2017(3):151–168 DOI 10.46586/tosc.v2017.i3.151-168.

Zhu S, Ma Y, Li X, Yang J, Lin J, Jing J. 2019. On the analysis and improvement of min-
entropy estimation on time-varying data. IEEE Transactions on Information Forensics
and Security 15:1696–1708.

Kim and Yeom (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.404 29/29

https://peerj.com
https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-90B.pdf
http://dx.doi.org/10.3390/e19030113
http://dx.doi.org/10.46586/tosc.v2017.i3.151-168
http://dx.doi.org/10.7717/peerj-cs.404