key: cord-0058353-ydyl1fuu
authors: Dmitruk, Beata; Stpiczyński, Przemysław
title: High Performance Portable Solver for Tridiagonal Toeplitz Systems of Linear Equations
date: 2021-02-15
journal: Euro-Par 2020: Parallel Processing Workshops
DOI: 10.1007/978-3-030-71593-9_14
sha: 3e8fc031af0494ba7682d5c737f6129f306a09a9
doc_id: 58353
cord_uid: ydyl1fuu

We show that a recently developed divide-and-conquer parallel algorithm for solving tridiagonal Toeplitz systems of linear equations can be implemented easily and efficiently on a variety of modern multicore and GPU architectures, as well as on hybrid systems. Our new portable implementation, which uses OpenACC, can be executed on both CPU-based and GPU-accelerated systems. More sophisticated variants of the implementation are suitable for systems with multiple GPUs and can use CPU and GPU cores at the same time. We consider both column-wise and row-wise storage formats for two-dimensional double-precision arrays and show how to convert efficiently between these two formats using cache memory. Numerical experiments performed on Intel CPUs and Nvidia GPUs show that our new implementation achieves relatively good performance.

Tridiagonal Toeplitz systems of linear equations appear in many theoretical and practical applications. For example, numerical algorithms for solving boundary value problems for ordinary and partial differential equations reduce to such systems [13, 15]. They also play an important role in piecewise cubic interpolation and spline algorithms [4, 14]. There are several methods for solving such systems (a review of the literature can be found in [5]). Rojo [10] proposed a method for solving symmetric tridiagonal Toeplitz systems using an LU decomposition of a system with almost Toeplitz structure together with the Sherman-Morrison formula, and this approach was later modified to obtain new solvers amenable to parallel execution [7, 9]. In [5] we proposed a new divide-and-conquer parallel algorithm for solving such systems using the splitting T = LR + P, where L, R are bidiagonal and P has only one non-zero entry. We showed how to reduce the number of necessary synchronizations and how to exploit the SIMD extensions of modern processors. Our OpenMP (version 3.1) implementation of this method achieved very good speedup on Intel Xeon CPUs (up to 5.06) and on Intel Xeon Phi (up to 29.45). While this approach could be improved further using more sophisticated vectorization techniques such as intrinsics [1, 8, 11], doing so would result in a loss of portability between different architectures.

OpenACC is a standard for accelerated computing [2, 6]. It offers compiler directives for offloading C/C++ and Fortran programs from a host to attached accelerator devices. Such simple directives allow marking regions of source code for automatic acceleration in a portable, vendor-independent manner. Marked sources can be compiled for a variety of accelerators and for parallel computers based on multicore CPUs. However, it is sometimes desirable to apply high-level transformations of the source code in order to achieve better performance [2, 6, 12]. It is also possible to use OpenACC and OpenMP together to utilize CPU and GPU cores at the same time.

In this paper, we present a new portable OpenACC-based implementation of the algorithm that can be used on both CPUs and GPU accelerators without any changes in the source code. We also show how to use both OpenACC and OpenMP to utilize multiple GPUs, as well as to implement the algorithm for hybrid systems.
We study its performance on Intel Xeon CPUs and on three Nvidia GPU architectures: Kepler, Turing, and Volta. We consider both column-wise and row-wise storage formats for two-dimensional arrays and show how to convert arrays efficiently between these two formats using cache memory to ensure coalesced access to the device's global memory. We also discuss which format is more suitable for CPU-based and for GPU-accelerated architectures.

Let us consider a tridiagonal Toeplitz system of linear equations Tx = f of the following form:

$$
\begin{bmatrix}
t_2 & t_3 &        &        \\
t_1 & t_2 & t_3    &        \\
    & \ddots & \ddots & \ddots \\
    &        & t_1    & t_2
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n-1} \end{bmatrix}
=
\begin{bmatrix} f_0 \\ f_1 \\ \vdots \\ f_{n-1} \end{bmatrix}.
\qquad (1)
$$

For the sake of simplicity we assume that $n = 2^k$, $k \in \mathbb{N}$. The matrix T can be decomposed as

$$
T = LR + P,
\qquad (2)
$$

where L is lower bidiagonal, R is upper bidiagonal, P has only one non-zero entry, namely $t_3\alpha$ in its upper-left corner, $\alpha = \left(t_2 + \operatorname{sign}(t_2)\sqrt{t_2^2 - 4t_1t_3}\right)/(2t_3)$, and $\beta = t_2 - t_3\alpha$. Using (2) we can rewrite Eq. (1) as

$$
LR\,x = f - t_3\alpha x_0 e_1,
$$

or simply

$$
x = v - t_3\alpha x_0 u,
\qquad (3)
$$

where $e_1 = (1, 0, \ldots, 0)^T$ and v, u denote the solutions of LRv = f and LRu = e_1, respectively. Then the solution to (1) can be found using

$$
x_0 = \frac{v_0}{1 + t_3\alpha u_0}, \qquad x = v - t_3\alpha x_0 u,
\qquad (4)
$$

where the first formula follows from comparing the first components of both sides of (3).

In order to solve (1) we thus have to find the two vectors v and u by solving two systems of linear equations with the same coefficient matrix, namely LR. The solution to a system of linear equations LRy = d can be found in two stages: first we solve Lz = d, and then Ry = z. This can be done with a simple sequential algorithm based on two recurrence relations, a forward recurrence (5) that solves the lower bidiagonal system Lz = d, and a backward recurrence (6) that solves the upper bidiagonal system Ry = z. This simple sequential algorithm can be efficient only for small values of n, because it does not utilize the underlying hardware of modern processors, namely multiple cores and vector units.

In order to evaluate (5) efficiently in parallel, let us consider the following divide-and-conquer method. First, we choose two integers r, s > 1 with rs = n and rewrite L in block form: the vectors z and d are partitioned into r blocks z_0, ..., z_{r-1} and d_0, ..., d_{r-1} of length s, and L is partitioned accordingly into r identical diagonal blocks L_s (lower bidiagonal matrices of order s), consecutive blocks being coupled by a single subdiagonal entry. Then Lz = d can be rewritten (8) as a sequence of r small systems with the same coefficient matrix: L_s z_0 = d_0, and for i = 1, ..., r-1 the system for z_i has the right-hand side d_i corrected by a term that involves only the last entry of z_{i-1}. Similarly, in the case of the upper bidiagonal system Ry = z, assuming the same partitioning, we get the analogous block form of R, with identical diagonal blocks R_s coupled by single superdiagonal entries, and the corresponding formula (10) for finding y: R_s y_{r-1} = z_{r-1}, and for i = r-2, ..., 0 the right-hand side z_i is corrected by a term that involves only the first entry of y_{i+1}.

The algorithm for solving LRy = d based on (8) and (10) comprises two main stages (A and B), each consisting of three steps, and it exposes a lot of potential parallelism. In Step A1 all vectors L_s^{-1} d_i are found simultaneously: if D = [d_0, ..., d_{r-1}] and X = L_s^{-1} D, then X can be computed by a vector-recursive formula that proceeds row by row, each row X_{i,*} being obtained from D_{i,*} and the previously computed row X_{i-1,*}, where X_{i,*} and D_{i,*} denote the i-th rows of X and D, respectively. Then we find the last entry of each vector z_i, i = 1, ..., r-1, sequentially (Step A2) by applying (8). Finally, again in parallel, we calculate the s-1 remaining entries of z_1, ..., z_{r-1} (Step A3). In the case of (10) we proceed similarly: if Z = [z_0, ..., z_{r-1}], then Z <- R_s^{-1} Z can be found (Step B1) using the analogous row-wise backward recurrence; during the sequential part (Step B2) we find the first entry of each y_i, i = r-2, ..., 0; and finally (Step B3) we use (10) to calculate in parallel all remaining entries of Y = [y_0, ..., y_{r-1}].

The algorithm has to be applied twice, to find the vectors v and u, respectively. Then we use (4) to find the solution to (1). Note that (4) can be easily vectorized and parallelized.

The algorithm for solving (1) presented in Sect. 2 can be easily implemented using OpenACC. The parallel steps A1, A3, B1, and B3 can be vectorized and parallelized using parallel loop constructs. This means that the iterations of the independent loops are distributed among gangs that work in SIMD mode, utilizing the available hardware. Steps A2 and B2 are sequential and should be executed by a single gang. Figure 1 (left) shows our OpenACC implementation of steps A1, A2, and A3 using the column-wise storage format for the matrix X stored as a double-precision array; a simplified sketch of this structure is given below.
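Figure 1 itself is not reproduced in this text, so the following is only a rough sketch of how steps A1, A2, and A3 can map onto OpenACC constructs with column-wise storage. It assumes, as one consistent choice that is not necessarily the paper's, that L_s has β on the diagonal and t1 on the subdiagonal; the function name stage_A, the in-place layout, and the helper vector g = L_s^{-1} e_1 are illustrative rather than taken from the published code.

```c
#include <stdlib.h>

/* Sketch of stage A (steps A1-A3) for L z = d with column-wise storage:
 * x is an s-by-r array, column j holding the j-th block of the right-hand
 * side on entry and the j-th block z_j of the solution on exit.
 * Assumption: L_s is lower bidiagonal with beta on the diagonal and t1 on
 * the subdiagonal. */
void stage_A(double *restrict x, int s, int r, double t1, double beta)
{
  /* g = L_s^{-1} e_1, needed for the correction terms in steps A2 and A3 */
  double *g = (double *)malloc((size_t)s * sizeof(double));
  g[0] = 1.0 / beta;
  for (int i = 1; i < s; i++) g[i] = -t1 * g[i - 1] / beta;

  #pragma acc data copy(x[0:s*r]) copyin(g[0:s])
  {
    /* Step A1: apply L_s^{-1} to every column independently */
    #pragma acc parallel loop gang vector
    for (int j = 0; j < r; j++) {
      x[j * s] /= beta;
      for (int i = 1; i < s; i++)
        x[j * s + i] = (x[j * s + i] - t1 * x[j * s + i - 1]) / beta;
    }

    /* Step A2: fix the last entry of each block; sequential, single gang */
    #pragma acc parallel num_gangs(1)
    #pragma acc loop seq
    for (int j = 1; j < r; j++)
      x[j * s + s - 1] -= t1 * x[(j - 1) * s + s - 1] * g[s - 1];

    /* Step A3: correct the remaining s-1 entries of each block in parallel */
    #pragma acc parallel loop collapse(2)
    for (int j = 1; j < r; j++)
      for (int i = 0; i < s - 1; i++)
        x[j * s + i] -= t1 * x[(j - 1) * s + s - 1] * g[i];
  }
  free(g);
}
```

With row-wise storage the three steps keep exactly the same structure, but the innermost index then runs along a row, which is what makes coalesced access possible on GPUs, as discussed next.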
The column-wise format is the most natural one, because no data transfer is needed after choosing specific values of r and s. Unfortunately, this format does not allow coalesced access to the global memory of the device [3] during steps A1 and B1. Thus, one can expect that the performance of the implementation using this storage format will be much worse than that of the implementation using the row-wise storage format (Fig. 1, right), in which all references to global memory are coalesced. Figure 2 shows simple parallel loops that can be used to convert between column-wise and row-wise storage. Unfortunately, such a conversion between formats can significantly reduce the overall performance because of non-coalesced memory access.

In order to improve the performance of the implementation, let us consider a more sophisticated method for converting between the column-wise and row-wise storage formats that uses cache memory to ensure coalesced memory access (Fig. 5). Each gang of BSIZE threads is responsible for reading a block of VL × BSIZE elements from the array stored column-wise using coalesced memory accesses and writing it to the cache memory. Such a block is then moved to a new array stored row-wise. Figures 3 and 4 explain how this is done: a gang of threads operates on a sequence of VL blocks of VL × NC elements each, where VL × NC = BSIZE. Each column of such a block is loaded by VL threads, where VL is the size of a warp [3]. As soon as all blocks are in the cache memory, all threads within a gang write rows of BSIZE elements into global memory. The device coalesces the global memory accesses issued by a warp into as few transactions as possible, minimizing the consumed DRAM bandwidth [3]. A simplified sketch of such a tiled conversion kernel is given below.

Finally, it should be noted that in both cases (i.e. for the column-wise and row-wise storage formats) steps B1, B2, and B3 have been implemented similarly to the excerpts of the source code shown in Fig. 1. The computations according to (4) have been implemented using the acc parallel loop construct.

Figure 6 shows how to use OpenMP and OpenACC to utilize two GPUs. The two halves of the arrays are stored row-wise in the global memories of the GPUs. We create two OpenMP threads, each responsible for controlling one GPU. The GPUs share data via host memory: the array shared_data is allocated on the host, and each GPU has its own copy of this array. The update self construct is used to update the data on the host, while update device updates the device memory. It is necessary to synchronize the threads using the omp barrier construct; a skeleton of this scheme is also given below. This approach can likewise be applied to utilize GPU and CPU cores at the same time. Figure 7 shows the idea of the hybrid implementation using OpenMP and OpenACC: the CPU is responsible for finding the first r1 columns, while the GPU is used to find the remaining r2 columns, where r1 + r2 = r. Note that in this case OpenMP nested parallelism should be enabled [8].
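The conversion kernels of Figs. 2, 3, 4, and 5 are likewise not included in this text, so the following is only a hedged sketch of the cache-based conversion idea. It simplifies the blocking described above to square VL × VL tiles (instead of VL × NC blocks with VL × NC = BSIZE), assumes that s and r are multiples of VL, and uses a gang-private array as the cache, which compilers such as PGI/NVHPC typically place in on-chip shared memory; none of these names or parameter choices come from the paper's code.

```c
#define VL 32   /* warp size on Nvidia GPUs; the paper's kernels use VL x NC blocks */

/* Convert an s-by-r array from column-wise storage (src) to row-wise
 * storage (dst). Each gang moves one VL x VL tile through a gang-private
 * buffer so that both the reads and the writes to global memory are
 * coalesced. Simplifying assumption: s and r are multiples of VL. */
void col2row(const double *restrict src, double *restrict dst, int s, int r)
{
  #pragma acc parallel loop gang collapse(2) vector_length(VL) \
              copyin(src[0:s*r]) copyout(dst[0:s*r])
  for (int jb = 0; jb < r; jb += VL)
    for (int ib = 0; ib < s; ib += VL) {
      double tile[VL][VL + 1];             /* +1 padding avoids bank conflicts */
      #pragma acc loop vector collapse(2)
      for (int j = 0; j < VL; j++)         /* coalesced reads: i varies fastest */
        for (int i = 0; i < VL; i++)
          tile[j][i] = src[(jb + j) * s + (ib + i)];
      #pragma acc loop vector collapse(2)
      for (int i = 0; i < VL; i++)         /* coalesced writes: j varies fastest */
        for (int j = 0; j < VL; j++)
          dst[(ib + i) * r + (jb + j)] = tile[j][i];
    }
}
```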
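The code of Fig. 6 is also not reproduced here; the skeleton below only sketches the coordination pattern just described: one OpenMP thread per device, a host array for the exchanged data, and update self / update device around an omp barrier. The buffer name shared_data follows the text, while the size N, the halved partitioning, and the omitted computation are purely illustrative.

```c
#include <openacc.h>
#include <omp.h>

#define N 1024   /* illustrative size of the data exchanged via the host */

/* Skeleton of the two-GPU scheme: one OpenMP thread drives each GPU and the
 * GPUs exchange data through the host array shared_data. */
void two_gpu_skeleton(void)
{
  static double shared_data[N];                 /* host buffer shared by both threads */
  #pragma omp parallel num_threads(2)
  {
    int tid = omp_get_thread_num();
    acc_set_device_num(tid, acc_device_nvidia); /* bind this thread to its GPU */
    #pragma acc data copyin(shared_data[0:N])
    {
      /* ... each GPU works on its half of the problem ... */

      /* publish this GPU's half of the exchanged data to the host */
      #pragma acc update self(shared_data[tid*N/2 : N/2])
      #pragma omp barrier                       /* wait until both halves are on the host */
      /* fetch the other GPU's half from the host */
      #pragma acc update device(shared_data[(1-tid)*N/2 : N/2])

      /* ... continue the computation using the exchanged data ... */
    }
  }
}
```

The hybrid CPU+GPU variant of Fig. 7 follows the same pattern, except that one of the two threads opens a nested OpenMP parallel region on the CPU instead of driving a second GPU, which is why nested parallelism has to be enabled.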
We have tested three versions of our OpenACC-based implementation. The first one uses column-wise storage. The other two rely on row-wise storage, using the simple conversion between formats (Fig. 2) and the more sophisticated method that utilizes cache memory (Fig. 5), respectively. On multicore (Table 1) we have also tested the OpenMP version of the column-wise implementation compiled with the PGI compiler, the sequential Thomas algorithm, the algorithm based on the sequential formulas (5) and (6), and the simple automatic parallelization of (4). Tests have been performed for various problem sizes and values of the parameter r.

Because of global memory limitations, in the case of Kepler and Turing the biggest problem size is n = 2^28. We have observed that the best performance is achieved when r is O(√n); for such values the performance differs only slightly. Tables 1, 2, and 3 show the results obtained for the best value of r. Table 2 also contains the results obtained for our OpenMP+OpenACC implementations (i.e. for two GPUs and for the hybrid one utilizing GPU and CPU cores at the same time). Moreover, it shows the order of the relative error of the computed solution.

We can observe that column-wise storage is the best choice for multicore, and in this case there is no need to perform the conversion to row-wise storage. Thus, such a conversion should be performed only if a GPU device is present, which can easily be checked using the acc_get_num_devices() function from the OpenACC Runtime Library; a minimal sketch of this run-time check is given at the end of this section. This means that the source code will work properly on GPU-accelerated and non-accelerated systems without any changes.

Our OpenACC implementation compiled for non-accelerated multicore systems achieves good speedup over the sequential Thomas algorithm (up to 7.39), and its performance is about 65% of the performance achieved by our OpenMP-based implementation [5]. Thus, if we only consider CPUs as the target architecture, the use of OpenMP is a better choice.

In most cases, for all tested GPUs, the implementation using row-wise storage achieves much better performance than the version using column-wise storage. We can also observe that the use of cache memory speeds up the conversion between the considered storage formats, especially on Kepler; in this case, the version using row-wise storage is more than 50% faster for bigger problem sizes. In the case of Turing and Volta, the use of cache memory is profitable for bigger problem sizes. It should also be noted that the performance on all considered GPUs significantly exceeds the performance achieved on multicore. Thus, the use of our hybrid implementation is profitable for bigger problem sizes, i.e. when the GPU memory capacity is exceeded (see Table 2, the results marked with "*"). Our implementation for two GPUs scales very well (see Table 2, 2 × Kepler) without significant performance overheads (Fig. 8).

It should be noted that the timing results do not take into account the time needed to copy data between the host and the GPUs; however, solving the considered systems is, in most cases, only a part of a larger problem. The implementation achieves about 6.6 GFLOPS on multicore and 68.4 GFLOPS on Volta, which is far from the peak performance of those architectures, but this is normal for problems in which the ratio of the number of memory references to the number of arithmetic operations is O(1). Finally, let us observe that the relative error of the solution obtained by the parallel algorithm is acceptable.
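As an illustration of the run-time check mentioned above, the snippet below selects the storage format depending on whether an accelerator is visible to the OpenACC runtime. The helper names solve_rowwise and solve_colwise are placeholders for the two variants discussed in this paper, not functions from the published code.

```c
#include <openacc.h>
#include <stdio.h>

int main(void)
{
  /* Ask the OpenACC runtime how many Nvidia devices are available. */
  int ngpus = acc_get_num_devices(acc_device_nvidia);

  if (ngpus > 0) {
    printf("%d GPU(s) found: converting to row-wise storage\n", ngpus);
    /* solve_rowwise(...);  conversion + row-wise kernels (placeholder) */
  } else {
    printf("no GPU found: keeping column-wise storage\n");
    /* solve_colwise(...);  column-wise kernels (placeholder) */
  }
  return 0;
}
```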
We have presented a new portable OpenACC-based implementation of a solver for tridiagonal Toeplitz systems of linear equations. Numerical experiments show that column-wise storage is the best choice for CPU-based architectures, where the implementation achieves good speedup (up to 7.39) over the sequential Thomas algorithm. In most cases, for all tested GPUs, the implementation using row-wise storage achieves much better performance than the version using column-wise storage, and the use of cache memory improves its performance further. Moreover, all considered GPUs outperform CPU-based systems. We have also shown how to use OpenMP and OpenACC together in order to obtain implementations suitable for systems with multiple GPUs and for hybrid systems. In the future, we plan to study the numerical properties of the method, especially its stability.

References:
[1] SIMD programming using Intel vector extensions.
[2] OpenACC for Programmers: Concepts and Strategies.
[3] Professional CUDA C Programming.
[4] Parallel B-spline surface fitting on mesh-connected computers.
[5] Vectorized parallel solver for tridiagonal Toeplitz systems of linear equations.
[6] Parallel Programming with OpenACC.
[7] A parallel method for linear equations with tridiagonal Toeplitz coefficient matrices.
[8] Intel Xeon Phi Processor High-Performance Programming: Knights Landing Edition.
[9] A communication-less parallel algorithm for tridiagonal Toeplitz systems.
[10] A new method for solving symmetric circulant tridiagonal systems of linear equations.
[11] Language-based vectorization and parallelization using intrinsics, OpenMP, TBB and Cilk Plus.
[12] Algorithmic and language-based optimization of Marsa-LFIB4 pseudorandom number generator using OpenMP.
[13] Solving a kind of boundary-value problem for ordinary differential equations using Fermi, the next generation CUDA computing architecture.
[14] A highly scalable parallel algorithm for solving Toeplitz tridiagonal systems of linear equations.
[15] Solving systems of symmetric Toeplitz tridiagonal equations: Rojo's algorithm revisited.