title: A Modified Algorithm of Porting a Wave Motion
authors: Bogdanov, Alexander V.; Mareev, Vladimir V.; Storublevtcev, Nikita; Smirnov, Victor; Podsevalov, Ivan
date: 2020-08-24
journal: Computational Science and Its Applications - ICCSA 2020
DOI: 10.1007/978-3-030-58817-5_4

The purpose of our work is to identify the key features of algorithms when porting codes for calculating essentially nonlinear processes to a modern cluster of hybrid architecture. The KPI equation with a source term was numerically modeled to reveal the features of the occurrence of extreme waves. A methodology for the implementation of boundary conditions for the modified KPI equation is also proposed.

One of the main challenges of computational physics is the high computational cost of any simulation complex enough to be practically useful. With the continuous advancement of computer hardware, this problem is slowly but steadily being solved. The highly parallel architecture of General-Purpose Graphics Processing Units (GPGPUs) lends itself well to the simulation of various processes, such as fluid dynamics. Unlike previous works dedicated to the porting of applications to a different software-hardware platform [16-18], in this paper we present a modified porting algorithm, using as an example a wave motion model based on the heterogeneous generalized two-dimensional Kadomtsev-Petviashvili equation with the new boundary conditions taken into account.

Extreme waves, frequently called "killer waves" ("rogue waves" or "freak waves"), are lone waves of at least 20-30 m in height which unexpectedly arise in an otherwise little-disturbed ocean and show completely unpredictable behavior, uncharacteristic of other ocean waves [1-8]. The generation of powerful waves, in particular solitary ones, is studied using numerical simulation; for this purpose, model partial differential equations are used. Some important information is also obtained from observations and experiments.

In our previous analysis it was shown that the main problems in the numerical simulation of two-dimensional waves with the effects of an avalanche-like increase in amplitude are associated with changes in the type of the equations in the transition from the continuum problem to its discrete analogue. In particular, the loss of full integrability or a reduction in the number of conservation laws in the discrete problem is critical [14, 19]. As a result, a number of compensation mechanisms typical for nonlinear waves are destroyed, and effects called artifacts appear. Since these effects are most pronounced in completely integrable systems, the most "unpleasant" model with the maximum violation of the compensation conditions, namely KPI, was chosen for the analysis.

The Kadomtsev-Petviashvili equation

$$\left[ u_t + 0.5\,(u^2)_x + \beta u_{xxx} \right]_x = \eta u_{yy}$$

with respect to the function $u(x, y, t)$ is usually labeled the "KPI equation" and is considered at $t \ge 0$. This equation is also called the "unstable Kadomtsev-Petviashvili equation" or the "focusing effect equation". Earlier, a strategy for constructing mesh methods for the two-dimensional KPI equation with growing solutions was formulated and analyzed. When solving such problems numerically, the possibility of parallelizing the programs using graphics processors was tested.
Using a finite-difference approach for the KPI equation directly is not effective due to the presence of the mixed derivative $u_{tx}$. Instead of the original KPI equation, its integro-differential analogue is considered [16]:

$$u_t + 0.5\,(u^2)_x + \beta u_{xxx} = \eta\, S(x, y, t) + G(x, y) \qquad (1)$$

Equation (1) with respect to the function $u(x, y, t)$ is considered in the domain $t \ge 0$, $x, y \in (-\infty, \infty)$, $\beta, \eta \ge 0$, where

$$S(x, y, t) = \int_{-\infty}^{x} u_{yy}(x', y, t)\, dx'$$

and $G(x, y)$ is an external source. The solution of Eq. (1) is sought for an initial distribution $u(x, y, 0)$.

The continuous region on which the KPI equation is specified is replaced by a finite computational domain $[x_{\min}, x_{\max}] \times [y_{\min}, y_{\max}] \times [0, T]$. A uniform grid in time $t$ and the spatial coordinates $x$ and $y$ is set in the computational domain: $x_j = x_{\min} + (j - 1)\Delta x$, $j = 1, \dots, M$; $y_k = y_{\min} + (k - 1)\Delta y$, $k = 1, \dots, L$; $t_n = n \Delta t$, $n = 0, \dots, T/\Delta t - 1$, with $\Delta x$, $\Delta y$ being the spatial steps and $\Delta t$ the time step.

For (1) the approximation is performed using central-difference operators. The integral $S(x, y, t)$ is calculated by the trapezoid method, and the integrand derivative $u_{yy}$ is approximated by second-order central differences:

$$u_{yy} \approx \frac{u^n_{j,k-1} - 2u^n_{j,k} + u^n_{j,k+1}}{\Delta y^2} \qquad (3)$$

The resulting system of difference equations (2) is reduced to the form

$$a_j \Delta u^{n+1}_{j-2,k} + b_j \Delta u^{n+1}_{j-1,k} + c_j \Delta u^{n+1}_{j,k} + d_j \Delta u^{n+1}_{j+1,k} + e_j \Delta u^{n+1}_{j+2,k} = f_j \qquad (4)$$

with $\Delta u^{n+1}_{j,k} = u^{n+1}_{j,k} - u^n_{j,k}$ and $f_j$ assembled from the explicit right-hand side $F^{n+1}$. The KPI boundary conditions are used: $u_x = u_{xx} = 0$ along the boundary lines $x_1$ and $x_M$, and $u_y = 0$ along the lines $y_1$ and $y_L$; this means that in practical calculations it is impossible to choose an area $[x_{\min}, x_{\max}] \times [y_{\min}, y_{\max}]$ large enough to use uniform Dirichlet conditions. The coefficients $a_j, b_j, c_j, d_j, e_j, f_j$ in (4) are determined from the relations for the derivatives. The difference scheme is an implicit linearized finite-difference scheme. The quasi-one-dimensional system is solved by the five-point sweep method. Where necessary, a flux-corrected transport (FCT) procedure is used [9].

A series of previous papers [10-16] was devoted to the numerical simulation of nonlinear equations of KdVB and KPI type. In those papers a difference scheme was chosen and tested, and numerical results were obtained. Various methods for the numerical solution of KPI were discussed in [13]. The main features of the difference scheme are covered in [12, 13, 15].

As the initial distribution, an ellipsoid of rotation (5) is considered, where $r > 0$, with the volume $V_1 = \pi r a_1 b_1$ and $a_1, b_1$ being the semi-axes. The distribution of sources is also chosen as an ellipsoid of rotation (6), with the volume $V_2 = 2\pi a_2 b_2 c_2 / 3$, $a_2, b_2, c_2$ being the semi-axes and $(x_0, y_0)$ the center of the ellipsoid.

When calculating Eq. (1) by a finite-difference method in a bounded rectangular region, the integral $S$ introduces a significant error into the solution: because the region of calculation is finite, $S$ does not tend to zero at the boundaries of the region. To remove the resulting distortion of the numerical solution, the following procedure is proposed [19].

From the test results, several conclusions were drawn: the scheme has sufficient resolution for areas of high gradient; it describes the process of formation and propagation of solitons, with the characteristic conservation, very well; it satisfactorily handles cases with initial distributions that are not entirely integrable; and more powerful hardware is needed to model long-term cases.
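To make the discretization concrete, the following minimal C sketch computes the stencil (3) and then accumulates $S^n_{j,k}$ along $x$ with the trapezoid rule for one time level. It is an illustration only, not the authors' Fortran 95 code; the array layout, the function name, and the use of 32-bit floats (matching the GPU port described below) are our assumptions.

```c
/* Sketch: central differences in y (Eq. 3) followed by trapezoid-rule
 * accumulation of S(x,y,t), the integral of u_yy from x_min to x_j.
 * u, uyy, s_eta are M x L single-precision arrays for one time level;
 * boundary rows and columns are assumed to be set by the boundary
 * conditions elsewhere. */
void compute_s_eta(int M, int L, float dx, float dy,
                   const float u[M][L], float uyy[M][L], float s_eta[M][L])
{
    for (int j = 0; j < M; ++j)
        for (int k = 1; k < L - 1; ++k)
            uyy[j][k] = (u[j][k-1] - 2.0f*u[j][k] + u[j][k+1]) / (dy*dy);

    for (int k = 1; k < L - 1; ++k) {
        s_eta[0][k] = 0.0f;              /* the integral vanishes at x_min */
        for (int j = 1; j < M; ++j)      /* running trapezoid sum along x  */
            s_eta[j][k] = s_eta[j-1][k]
                        + 0.5f*dx*(uyy[j-1][k] + uyy[j][k]);
    }
}
```

The running sum over $j$ is exactly the dependence that later complicates parallelization: each s_eta element needs its predecessor along $x$.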
When calculating the equation system (4) with the boundary conditions (5), the flow condition along the y axis is not implemented correctly, which leads to a reflection condition. To be able to correctly set the flow-type boundary condition $u_y = 0$ at $y = y_{\min}$, $y = y_{\max}$, a small quantity, which can be called artificial convection over $y$, is introduced into the left part of Eq. (2): $-\varepsilon \cdot \mathrm{sign}(u_y)\, u_y$, where the parameter $\varepsilon > 0$ is small enough ($\varepsilon \ll 1$) not to introduce a significant error into the solution of the corresponding Eq. (1).

Instead of the integral $S(x, y, t)$, the function $P(y)$ is introduced. The functions $h^-(y)$ and $h^+(y)$ fulfil the symmetry condition $h^-(y_{\min} + y^*) = h^+(y_{\max} - y^*)$ at $y^* \in [0, d)$. When choosing the finite function $P(y)$, and in practice the function $h^+(y)$, there are two criteria. The function $h^+(y)$ should decrease monotonically in the band adjacent to the upper boundary $y_{\max}$ (and increase in the band at the lower boundary $y_{\min}$). The derivatives of the functions $h^-(y)$ and $h^+(y)$ at the boundaries $y_{\min}$ and $y_{\max}$ should be as small as possible. During the choice of the finite function $P(y)$, several different $h^+(y)$ functions were considered, in particular a sigmoid function, a third-degree spline, and parabolas.

The model in question is represented by 7 matrices of equal size, which are operated upon by 4 main functions. Each matrix holds single-precision 32-bit floating-point values representing the wave height relative to the base ocean level. Two functions also use temporary data arrays, whose size can differ depending on the implementation. In the original CPU algorithm, these arrays were equal to 1-4 rows of the main matrices. Some of this temporary data can be reduced or eliminated completely by additional on-the-spot calculations, but some of it is required for parallelization. Reduction of temporary data increases cache locality, which can have a positive influence on performance, but inhibits parallelization, for example automatic loop unrolling.

Matrices $U_0$ and $U_1$ represent the previous and current model states, respectively. Since each subsequent model state depends on the previous one, we are forced to hold both in memory at all times. Matrix $P_{j,k}$ is responsible for boundary smoothing and contains coefficients from 0 to 1, with the edge cells set to zero. Matrix $G_{j,k} = G(x_j, y_k)$ is the external perturbation matrix. Matrix $(U_{yy})_{j,k}$ holds the second derivative $u_{yy}$ (3). Matrix s_eta holds $S^n_{j,k}$. Matrix $(U_y)_{j,k}$ holds the first derivative $u_y$ used in the boundary conditions.

The original implementation was written in Fortran 95. OpenCL was chosen as the framework for the GPGPU porting process; the main reason for that choice was its high degree of portability and low dependency on the host-side code language. As mentioned above, the heterogeneous architecture of GPGPUs is a good fit for matrix operations, but not all functions of the model allow complete cell-level parallelism, as each cell's value may depend on the adjacent ones. Additionally, some functions feature accumulation calculations on one or both axes, making full parallelization very difficult. For example, the function implementing the five-point tridiagonal matrix algorithm can be roughly represented as follows, where $j^*$ denotes the neighborhood of $j$, i.e. $j - 1$, $j$, $j + 1$; $a$, $b$, $c$ and $U_0$ are temporary buffer matrices; $F$, $G$ and $H$ are simple linear functions; and $U$ is the result.
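The listing itself did not survive into this text, so the following C-style sketch is our reconstruction of the structure just described; F, G, and H are placeholders for the simple linear functions, and the exact coefficient relations belong to the five-point sweep for (4).

```c
/* Rough shape of the five-point sweep (reconstruction): for every
 * column k, a forward pass fills the temporary buffers a, b, c from
 * their values at j-1, and a backward pass assembles the result U.
 * Columns are mutually independent; rows carry the forward and
 * backward accumulation. */
for (int k = 0; k < L; ++k) {
    for (int j = 2; j < M; ++j) {            /* forward accumulation  */
        a[j][k] = F(a[j-1][k], U0[j-1][k], U0[j][k]);
        b[j][k] = G(a[j-1][k], b[j-1][k], U0[j][k]);
        c[j][k] = H(b[j-1][k], c[j-1][k], U0[j][k]);
    }
    for (int j = M - 3; j >= 0; --j)         /* backward accumulation */
        U[j][k] = a[j][k]*U[j+1][k] + b[j][k]*U[j+2][k] + c[j][k];
}
```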
From this it is easy to see that the column calculation is relatively independent, while the row calculation includes forward and backward accumulation. In all other main functions, every element's value depends on the adjacent ones. For example, an element with coordinates $(x, y)$ may rely on one at $(x \pm 1, y)$, and in the worst case requires a simple additive accumulation.

Taking all of this into account, the main focus was placed on inter-column parallelization, which ended up being surprisingly straightforward, with very few code changes required. The five-point tridiagonal matrix algorithm implementation required buffer arrays for each row, which were allocated in private memory; for datasets of this size, private memory is automatically projected onto the global device memory, and as such does not require additional synchronization. This resulted in an expected increase in performance.

The next step was to implement inter-row parallelism, at least in the trivial cases, such as the calculation of the $U_{yy}$ and $U_y$ matrices, which do not feature accumulation. This significantly improved calculation times, most likely due to the memory block read optimization.

When calculating s_eta, the $j$-th row element depends on the intermediary results from the $(j-1)$-th element. In the original algorithm this was solved by allocating a temporary row s such that s[j] = s[j − 1] + F1(U_yy[j*, k]), after which s_eta was calculated as s_eta[j] = F2(s[j]). This approach is ill-suited for a GPU, because cumulative addition is hard to parallelize and the intermediary results have to be stored in global memory. Thankfully, it can be avoided by utilizing local memory: the calculation of s is broken into blocks, each calculating F1(U_yy[j*, k]) in parallel, after which the main thread performs the additive accumulation and immediately calculates F2(s[j]), then repeats the same for the next block. Despite the fact that the F1 function is computationally very simple, this approach gave a noticeable boost to performance. This is most likely due to a detail of the GPU architecture: global memory access is much more efficient when a block of data is read in parallel, inside a single local group.

The last step was to improve the parallelization of the five-point tridiagonal matrix algorithm. In this case local memory would not help, as there is simply not enough of it to contain all the required data. The solution was to create a single buffer matrix and utilize the comparatively large 4 GB global memory space. Heavy memory usage is unpleasant, but acceptable if it solves the much larger problem of slow global memory access. Each matrix element depends only on the two previous ones, which allows us to store these two values in faster memory. This way, at the first stage of calculations, each iteration writes data into global memory but does not read it, which allows the optimizer to perform a much faster asynchronous write. For subsequent iterations the same data is stored in private memory, which is fast for both writes and reads. At the second stage of calculations, the intermediary data is only read from global memory, which avoids the need for additional synchronization. There was a risk that the intermediary data would not fit into private memory, since it is very limited, but in practice there were no problems with it, and this provided a significant boost to performance.
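As an illustration of the blocked s_eta scheme described above, here is a minimal OpenCL C kernel sketch. It is not the production kernel: the work-group mapping, the block size, and the F1/F2 placeholders (standing in for the simple linear functions of the scheme) are all our assumptions.

```c
/* Sketch (OpenCL C): blocked accumulation for s_eta. One work-group
 * per k-column. Within each block of BLOCK rows, the F1 terms are
 * computed in parallel into local memory; then a single work-item
 * serializes the running sum and applies F2 immediately, so the
 * intermediate sums never touch global memory. */
#define BLOCK 256
#define F1(x) (x)   /* placeholder for the simple linear function */
#define F2(x) (x)   /* placeholder */

__kernel void s_eta_blocked(__global const float *u_yy,
                            __global float *s_eta,
                            const int M, const int L)
{
    const int k   = get_group_id(0);   /* column handled by this group */
    const int lid = get_local_id(0);
    __local float term[BLOCK];
    float sum = 0.0f;                  /* running sum, kept by lid == 0 */

    for (int base = 0; base < M; base += BLOCK) {
        const int j = base + lid;
        if (j < M)
            term[lid] = F1(u_yy[j * L + k]);     /* parallel part */
        barrier(CLK_LOCAL_MEM_FENCE);
        if (lid == 0) {                          /* serial part   */
            for (int i = 0; i < BLOCK && base + i < M; ++i) {
                sum += term[i];
                s_eta[(base + i) * L + k] = F2(sum);
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);            /* protect term[] */
    }
}
```

Launched with a local size of BLOCK and a global size of BLOCK·L, this keeps the unavoidable serial addition next to fast local memory while the expensive global reads happen as coalesced parallel block loads.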
The benchmarking platform specifications are as follows: CPU AMD FX 8550, ~4 GHz; GPU AMD R9 380, ~970 MHz. For testing purposes, the model size was limited to 960 by 640 cells, with 200 iterations, which equates to 0.01 s of simulation time. Benchmark results of the different versions are presented in Table 1.

As seen from Table 1, the simple inter-column parallelism increased the computation speed by ~20 times despite the significantly lower GPU clock speed. Utilization of asynchronous kernel calls provided an additional small boost to performance. Inter-row parallelization of the trivial functions cut down the total computation time by ~40%, while the s_eta optimization improved this to ~50%. The last step brought it to a total of ~75% less compute time than the simple inter-column version, or about 90 times faster than the CPU version. Unfortunately, it was impossible to showcase the performance increase per core, as the profiling software produced incoherent results.

An analysis of the results calculated with various finite functions showed that the most suitable ones for the initial distribution (5) are two functions: a third-order spline $h^+(y) = 2y^3 - 3y^2 + 1$ and a parabola $h^+(y) = y^2 - 2y + 1$, with $\varepsilon = 0.01$. Figure 1 shows the visible effects of the interaction of the propagating waves with the boundaries $y = y_{\min}$ and $y = y_{\max}$. For this picture and for all further ones, the main numerical values of the parameters are the same: the mesh is $800 \times 600$ $(x \times y)$; $\Delta x = \Delta y = 0.1$; $\Delta t = 5 \cdot 10^{-5}$; $\beta = \eta = 1$. In our case $y_{\min} = -31.9$, $y_{\max} = 48$. The Gaussian distribution (5) is taken with $r = 10$. The band width is $d = 0.11 \cdot |y_{\max} - y_{\min}| = 6.6$ (11% of the width $|y_{\max} - y_{\min}|$ of the computational domain). The testing (Fig. 2) was conducted for the case with a rotation-ellipsoid source (6) (Fig. 3). Both functions give almost identical results (Fig. 4), but the parabola gives better ones.

An analysis of various numerical procedures for nonlinear equations describing the evolution of strong waves has been carried out. An equation was chosen that is a generalization of the Kadomtsev-Petviashvili-I (KPI) equation; it exhibits the main problems of the evolution of ocean waves and is at the same time the most difficult from the point of view of the stability of the numerical algorithm. The solution of such problems reveals the features of various types of discretization. Numerical analysis shows that the proposed approach solves at least those problems that are related to the complete integrability of the original model. A calculation method is proposed that introduces a smooth finite function and artificial convection; it works quite efficiently, at least for nonlinear equations with an infinite evolution domain. The proposed procedure showed acceptable accuracy. We managed not only to clearly demonstrate the effects of an avalanche-like increase in the amplitude of two-dimensional waves, but also to solve all the main problems leading to numerical errors, namely the shift of the wave crest and the appearance of artifacts. Since the transition to more realistic models of "killer waves" is associated with a decrease in the number of restrictions in the transition to a discrete model, there is every reason to hope that our method will be even more effective there.
References

1. Tsunami and Nonlinear Wave
2. Nonlinear Waves: Theory, Computer Simulation, Experiment
3. Nonlinear Waves, Solitons and Chaos, 2nd edn.
4. Theory and Applications of Nonviscous Fluid Flows
5. Nonlinear Waves in Integrable and Non-integrable Systems, 430 pp.
6. A finite difference method for the Korteweg-de Vries and the Kadomtsev-Petviashvili equations
7. Numerical implementation of two-soliton solutions to the Kadomtsev-Petviashvili equation
8. Numerical simulation of Kadomtsev-Petviashvili-Benjamin-Bona-Mahony equations using finite difference method
9. Flux corrected transport I. SHASTA, a fluid transport algorithm that works
10. High performance algorithms for multiphase and multicomponent media
11. Hybrid approach: perturbed KdVB equation
12. Numerical simulation: perturbed KdVB equation
13. Numerical simulation: KPI equation
14. Hybrid approach for nonlinear wave equation
15. Influence of external source on KPI equation
16. Porting the algorithm for calculating an Asian option to a new processing architecture
17. On porting of applications to new heterogeneous systems
18. Algorithms for the calculation of nonlinear processes on hybrid architecture clusters
19. On some problems of the numerical implementation of nonlinear systems on example of KPI equation