On the Performance of Multi-GPU-Based Expert Systems for Acoustic Localization Involving Massive Microphone Arrays

Jose A. Belloch (a,*), Alberto Gonzalez (a), Antonio M. Vidal (b), Maximo Cobos (c)

(a) Institute of Telecommunications and Multimedia Applications, Universitat Politècnica de València, Camino de Vera s/n, 46022 Valencia, Spain.
(b) DSIC Department, Universitat Politècnica de València, Camino de Vera s/n, 46022 Valencia, Spain.
(c) Computer Science Department, Universitat de València, Poligon de la Coma s/n, 46100 Valencia, Spain.

(*) Corresponding author. Phone: +34-655436190.
Email addresses: jobelrod@iteam.upv.es (Jose A. Belloch), agonzal@dcom.upv.es (Alberto Gonzalez), avidal@dsic.upv.es (Antonio M. Vidal), maximo.cobos@uv.es (Maximo Cobos)

Abstract

Sound source localization is an important topic in expert systems involving microphone arrays, such as automatic camera steering systems, human-machine interaction, video gaming or audio surveillance. The Steered Response Power with Phase Transform (SRP-PHAT) algorithm is a well-known approach for sound source localization due to its robust performance in noisy and reverberant environments. This algorithm analyzes the sound power captured by an acoustic beamformer on a defined spatial grid, estimating the source location as the point that maximizes the output power. Since localization accuracy can be improved by using high-resolution spatial grids and a high number of microphones, accurate acoustic localization systems require high computational power. Graphics Processing Units (GPUs) are highly parallel programmable co-processors that provide massive computation when the needed operations are properly parallelized. Emerging GPUs offer multiple parallelism levels; however, properly managing their computational resources becomes a very challenging task. In fact, management issues become even more difficult when multiple GPUs are involved, adding one more level of parallelism. In this paper, the performance of an acoustic source localization system using distributed microphones is analyzed over a massive multichannel processing framework in a multi-GPU system. The paper evaluates and points out the influence that the number of microphones and the available computational resources have on the overall system performance. Several acoustic environments are considered to show the impact that noise and reverberation have on the localization accuracy, and how the use of massive microphone systems combined with parallelized GPU algorithms can substantially mitigate adverse acoustic effects. In this context, the proposed implementation is able to work in real time with high-resolution spatial grids and using up to 48 microphones. These results confirm the advantages of suitable GPU architectures in the development of real-time massive acoustic signal processing systems.

Keywords: Sound Source Localization; Steered Response Power; Microphone Arrays; Graphics Processing Units

1. Introduction

Microphone arrays are commonly employed in many signal processing tasks, such as speech enhancement, acoustic echo cancellation or sound source separation (Brandstein & Ward, 2001).
The localization of broadband sound sources under high noise and reverberation is another challenging task in multichannel signal processing, being an active research topic with applications in human-computer interfaces (Kodagoda & Sehestedt, 2014), teleconferencing (Wang et al., 2011) or emergency units (Calderoni et al., 2015). Microphone arrays may follow a given geometry, such as spherical arrays (Huang & Wang, 2014), or may be distributed. Algorithms for sound source localization can be broadly divided into indirect and direct approaches (Madhu & Martin, 2008). Indirect approaches usually follow a two-step procedure: they first estimate the Time Difference Of Arrival (TDOA) (Chen et al., 2006) between microphone pairs and, afterwards, they estimate the source position based on the geometry of the array and the estimated delays. On the other hand, direct approaches perform TDOA estimation and source localization in one single step by scanning a set of candidate source locations and selecting the most likely position as an estimate of the real source location. Although the computation of TDOAs usually requires time synchronization, new approaches are being developed to avoid this limitation (Xu et al., 2013). Most localization algorithms are based on the Generalized Cross-Correlation (GCC) (Knapp & Carter, 1976), which is calculated by using the inverse Fourier transform of the weighted cross-power spectral density of the signals. The Steered Response Power - Phase Transform (SRP-PHAT) algorithm is a direct approach that has been shown to be very robust in adverse acoustic environments (DiBiase et al., 2001). The algorithm is usually interpreted as a beamforming-based approach that searches for the candidate position that maximizes the output of a steered delay-and-sum beamformer.

The CUDA platform (CUDA, 2015) provides a computing framework that enables the use of Graphics Processing Units (GPUs) in applications beyond image processing (Liu et al., 2007; Zhao & Lau, 2013). GPUs are highly parallel programmable co-processors that provide efficient computation when the needed operations are properly parallelized. Programming a GPU efficiently requires good knowledge of both the underlying architecture and the mechanisms used by GPUs to distribute their tasks among their processing units. Since the appearance of CUDA programming, many researchers in different areas have used it to achieve better performance in their respective fields. For example, well-known computational cores have been adapted to a GPU computing framework, such as LU factorization (Dazevedo & Hill, 2012), matrix multiplication (Matsumoto et al., 2011) or the Boltzmann equation (Kloss et al., 2010).

In audio and acoustics, several works demonstrate the potential of GPUs for carrying out audio processing tasks. For example, the implementation of a multichannel room impulse response reshaping algorithm was carried out in (Mazur et al., 2011), and implementations of adaptive filtering algorithms were presented in (Schneider et al., 2012; Lorente et al., 2012, 2013, 2014). GPU-based room acoustics simulation was carried out in (Savioja, 2010; Southern et al., 2010; Webb & Bilbao, 2011; Hamilton & Webb, 2013). One of the main contributions within this field was made in (Savioja et al., 2011), where improved performances in additive synthesis, Fourier transform and convolution in the frequency domain were presented.
A comparison between CPU and GPU performance for a simple crosstalk canceller is presented in (Belloch et al., 2011). Similarly, a binaural audio application with massive audio processing that was fully implemented on a GPU is presented in (Belloch et al., 2013a). GPUs are also used in (Vanek et al., 2012) and in (Bradford et al., 2011) for evaluating the likelihood function in automatic speech recognizers and for a sliding phase vocoder, respectively.

The use of GPUs for implementing sound source localization algorithms has also recently been tackled in the literature. The time performances of different localization algorithms implemented on GPU were reported in (Peruffo Minotto et al., 2012) and (Liang et al., 2012). In fact, although different time-domain and frequency-domain implementations of SRP-PHAT are analyzed in (Peruffo Minotto et al., 2012), their results mainly focus on pure computational issues and do not discuss how localization performance is affected by using different numbers of microphones or a finer spatial grid. In (Seewald et al., 2014), the SRP-PHAT algorithm is implemented over two Kinects for performing sound source localization. In that work, the algorithm only estimates the relative source direction instead of providing the absolute source position, and the implementation is evaluated on different GPUs that belong to the older Fermi architecture (CUDA, 2015).

One of our previous works (Belloch et al., 2013b) analyzed the performance of a 2-D SRP-PHAT implementation on different Nvidia GPU architectures. The present paper extends that work in various aspects. First, 3-D source localization is considered, leading to a significant increase in the required computational cost. Second, the system considered in this work makes use of multiple GPUs, facing new challenges in parallelization and resource management. Finally, this paper provides a deeper analysis of the influence of the acoustic environment and the number of microphones on the final performance. As a result, this paper aims at demonstrating how localization systems using a high number of microphones distributed within a room can perform sound source localization in real time under adverse acoustic environments by using GPU massive computation resources. Specifically, the well-known SRP-PHAT algorithm is considered here. Note that coarse-to-fine search strategies have been proposed to overcome many of the processing limitations of SRP-PHAT (Do & Silverman, 2007; Said et al., 2013; Marti et al., 2013). However, while these strategies provide more efficient ways to explore the localization search volume, they only provide better performance than the conventional SRP-PHAT when the number of operations is restricted. Thus, the performance of the conventional SRP-PHAT with fine spatial grids is usually considered as an upper bound in these cases.

Relevant parameters that affect the computational cost of the algorithm (number of microphones and spatial resolution) are analyzed, showing their influence on the localization accuracy in different situations. We also discuss the scalability of the algorithm when multi-GPU parallelization issues are considered.
This paper highlights the need for massive computation in order to achieve high-accuracy localization in adverse acoustic environments, taking advantage of GPUs to fulfill the computational demand of the system. In comparison with the implementation presented in (Seewald et al., 2014), we design our application to achieve maximum performance on GPUs based on the Kepler architecture GK110 (K20, 2014) (see Appendix A for details). This architecture can be found on the Tegra K1 (TK1) systems-on-chip (SoC) embedded in the Jetson development kit (DevKit) (Jetson, 2015), and it is becoming widespread in current mobile devices such as Google's Nexus 9 tablet (Nexus, 2015). Thus, the proposed implementation can be successfully adapted to work properly on GPUs that are currently embedded in mobile devices.

The paper is structured as follows. Section 2 briefly describes the basic SRP-PHAT localization algorithm that is used throughout this paper. Section 3 presents the implementation of the algorithm on multi-GPU systems. The proposed acoustic environments for real-time sound source localization are presented in Section 4, describing the experiments conducted for studying the performance of the method in a real application context. The computational performance of the different multi-GPU implementations is also analyzed. Finally, Section 5 provides some concluding remarks. Two appendixes are provided in order to facilitate the understanding of the parallelization techniques that are used throughout this article.

2. Sound Source Localization: SRP-PHAT Algorithm

Consider the output of microphone l, m_l(t), in an M-microphone system. The Steered Response Power (SRP) at the spatial point x = [x, y, z]^T for a time frame n of length T_L can then be defined as

    P_n(\mathbf{x}) \equiv \int_{nT_L}^{(n+1)T_L} \left| \sum_{l=1}^{M} w_l \, m_l\big(t - \tau(\mathbf{x}, l)\big) \right|^2 dt,    (1)

where w_l is a weight and \tau(\mathbf{x}, l) is the direct time of travel from location x to microphone l. DiBiase (DiBiase, 2000) showed that the SRP can be computed by summing the Generalized Cross-Correlations (GCCs) of all possible pairs of the set of microphones. The GCC for a microphone pair (k, l) is defined as

    R_{m_k m_l}(\tau) = \int_{-\infty}^{\infty} \Phi_{kl}(\omega) M_k(\omega) M_l^{*}(\omega) e^{j\omega\tau} \, d\omega,    (2)

where \tau is the time lag, * denotes complex conjugation, M_l(\omega) is the Fourier transform of the microphone signal m_l(t), and \Phi_{kl}(\omega) is a combined weighting function in the frequency domain. The phase transform (PHAT) (Knapp & Carter, 1976) has been shown to be a suitable GCC weighting for time delay estimation in reverberant environments. The PHAT weighting is expressed as

    \Phi_{kl}(\omega) \equiv \frac{1}{\left| M_k(\omega) M_l^{*}(\omega) \right|}.    (3)

Taking into account the symmetries involved in the computation of Eq. (1) and removing some fixed energy terms (DiBiase, 2000), the part of P_n(x) that changes with x can be isolated as

    P'_n(\mathbf{x}) = \sum_{k=1}^{M} \sum_{l=k+1}^{M} R_{m_k m_l}\big(\tau_{kl}(\mathbf{x})\big),    (4)

where \tau_{kl}(\mathbf{x}) is the Inter-Microphone Time-Delay Function (IMTDF). This function is very important since it represents the theoretical direct-path delay for the microphone pair (k, l) resulting from a point source located at x. The IMTDF is mathematically expressed as (Cobos et al., 2011)

    \tau_{kl}(\mathbf{x}) = \frac{\|\mathbf{x} - \mathbf{x}_k\| - \|\mathbf{x} - \mathbf{x}_l\|}{c},    (5)

where c is the speed of sound (approximately 343 m/s), and x_k and x_l are the locations of the microphone pair (k, l).

The SRP-PHAT algorithm consists in evaluating the functional P'_n(x) on a fine grid G with the aim of finding the point-source location x_s that provides the maximum value:

    \mathbf{x}_s = \arg\max_{\mathbf{x} \in G} P'_n(\mathbf{x}).    (6)
Figure 1 shows schematically the intuition behind SRP-PHAT localization. In this figure, an anechoic environment is assumed, so that the GCC for each microphone pair is a delta function located at the real TDOA. Each TDOA defines a half-hyperboloid of potential source locations. The intersection resulting from all the half-hyperboloids matches the point of the grid having the greatest accumulated value.

Figure 1: Intersecting half-hyperboloids for M = 3 microphones. Each half-hyperboloid corresponds to a TDOA peak in the GCC.

2.1. SRP-PHAT Implementation

The SRP-PHAT algorithm is usually implemented on a grid by carrying out the following steps (a host-side sketch of the search is given below):

1. A spatial grid G is defined with a given spatial resolution r. The theoretical delays from each point of the grid to each microphone pair are pre-computed using Eq. (5).
2. For each analysis frame, the GCC of each microphone pair is computed as expressed in Eq. (2).
3. For each position of the grid x in G, the contributions of the different cross-correlations are accumulated (using the delays pre-computed in step 1), as in Eq. (4).
4. Finally, the position with the maximum score is selected, as in Eq. (6).

The SRP-PHAT localization performance depends on the selected spatial resolution r. Figure 2 illustrates the algorithm performance when considering different spatial grid resolutions. The accumulated SRP-PHAT values for each spatial grid location are shown for a 2-D plane in a 4 x 6 m room with M = 6 microphones. Note how the location of the source is more easily detected when finer spatial resolutions are used, as in the case of r = 0.01 m.

Figure 2: Accumulated SRP-PHAT values for a 2-D spatial grid (4 x 6 m and M = 6 microphones) with different spatial resolutions. (a) r = 0.01 m. (b) r = 0.1 m.
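To make the four steps above concrete, the following host-side sketch (C++, compilable with the CUDA toolchain) evaluates steps 3 and 4 for a GCC matrix that has already been computed. It is only an illustrative single-threaded reference, not the GPU implementation of Section 3; the function names, the pair-to-row ordering and the circular handling of negative lags are our assumptions.

    // Illustrative reference for steps 3 and 4: accumulate the GCC values
    // selected by the IMTDF (Eq. 5) at every grid point and keep the maximum.
    #include <cmath>
    #include <vector>

    struct Point { float x, y, z; };

    static float dist(const Point& a, const Point& b) {
        return std::sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y) +
                         (a.z - b.z) * (a.z - b.z));
    }

    // Returns the index of the grid point maximizing P'_n(x) (Eq. 6).
    // gcc holds Q rows of 2L lags, one row per pair (k, l) with k < l.
    int srpGridSearch(const std::vector<Point>& grid,
                      const std::vector<Point>& mic,
                      const std::vector<std::vector<float>>& gcc,
                      int L, float fs, float c = 343.0f) {
        const int M = static_cast<int>(mic.size());
        int best = 0;
        float bestVal = -1e30f;
        for (size_t p = 0; p < grid.size(); ++p) {
            float srp = 0.0f;
            int q = 0;
            for (int k = 0; k < M; ++k)
                for (int l = k + 1; l < M; ++l, ++q) {
                    // IMTDF truncated to the closest sample lag.
                    float tau = (dist(grid[p], mic[k]) - dist(grid[p], mic[l])) / c;
                    int lag = static_cast<int>(std::lround(tau * fs));
                    srp += gcc[q][(lag + 2 * L) % (2 * L)];  // circular lag index
                }
            if (srp > bestVal) { bestVal = srp; best = static_cast<int>(p); }
        }
        return best;
    }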
2.2. Computational Cost

The SRP-PHAT algorithm is usually implemented by performing frequency-domain processing of the input microphone signals. Given M microphones, the number of microphone pairs to process is Q = M(M - 1)/2. For a DFT size of L (equal to the time-window size), an FFT takes 5L log2 L arithmetic operations, which result from (L/2) log2 L complex multiplications and L log2 L complex additions. Note that one complex multiplication is equivalent to four real multiplications and two real additions, while a complex addition is equivalent to two real additions. As a result, the signal processing cost for computing the GCC is given by:

- DFT: Compute M FFTs; thus, M x 5L log2 L operations.
- Cross-Power Spectrum: A complex multiplication for L points, resulting in 6L operations (4 real multiplications and 2 real additions). This is done for Q microphone pairs, resulting in a cost of 6QL.
- Phase Transform: Magnitude of the L points of the GCC, which costs L operations. This is also done for Q pairs, resulting in QL operations.
- IDFT: The IDFT for Q pairs must be performed, which requires Q x 5L log2 L operations.

Moreover, for each functional evaluation, the following parameters must be calculated:

- M Euclidean distances ||x - x_m||, each requiring 3 multiplications, 5 additions and 1 square root (approximately 12 operations), i.e., 20 operations per distance: 20M operations.
- Q TDOAs, requiring 2 operations (1 subtraction and 1 division by c) per microphone pair: 2Q operations.
- The SRP requires truncating the TDOA values to the closest sample according to the system sampling frequency, multiplying the cross-power spectrum to obtain the phase transform for each microphone pair and adding up all the GCC values: 5Q operations.

As a result, the cost of the SRP-PHAT algorithm is given by

    \text{Cost} = \left( M + \frac{M^2}{2} \right) 5L \log_2 L + \frac{7M(M-1)}{2} L + \nu \left( 20M + \frac{7M(M-1)}{2} \right),    (7)

where \nu is the total number of functional evaluations. In the conventional full grid-search procedure, \nu equals the total number of points of the grid G. Figure 3 shows the computational cost of the algorithm for different spatial resolutions and numbers of microphones, considering a 3-D grid search space with a uniform spatial resolution of r meters.

Figure 3: Computational cost for different numbers of microphones M and spatial resolutions r (r in {0.005, 0.01, 0.05, 0.1} m).

3. Algorithm Parallelization for Real-Time GPU Implementation

The GPU-based implementation of the SRP-PHAT algorithm targets Nvidia hardware devices with the Kepler architecture GK110 (K20, 2014). Appendix A provides a detailed description of the GPU-parallelization techniques used throughout this section.

Since the localization is carried out in three dimensions, three different resolutions r_x, r_y and r_z define the spatial grid G. Taking a shoe-box-shaped room with dimensions l_x x l_y x l_z as a model room, the size of the grid is \nu = P_x x P_y x P_z, where P_x = l_x / r_x, P_y = l_y / r_y and P_z = l_z / r_z.

The real-time implementation of the SRP-PHAT algorithm uses 50% time-window overlap, with audio sample buffers of size L. These L x M samples are transferred to the GPU first. A GPU buffer (denoted here as T_GPU) stores the audio samples in consecutive memory positions as they arrive at the GPU. One aspect that affects the performance of all audio signal processing applications on GPU is the transfer of audio samples from CPU to GPU. As mentioned in Appendix A.1, streams can be used to parallelize these transfers and overlap them with the computation. Since we use 50% overlap, the processing is carried out in blocks of size 2L, which are composed of the current audio-sample buffer and the previous one. Thus, a size of 2LM is used for T_GPU. The SRP-PHAT GPU implementation carries out the following steps:

1. M streams are created (one stream for each microphone in the system). The streams are launched consecutively in an asynchronous way. Stream l transfers the L samples captured by microphone l to the GPU and stores them in T_GPU, with l = 0, ..., M-1. Then, stream l launches Kernel A, which is responsible for grouping the 2L elements of microphone l (L samples from the previous buffer and L samples from the current buffer). These 2L elements are also weighted using a Hamming window vector. For this purpose, the stream launches a kernel composed of 128-size thread blocks in a CUDA grid of dimensions (2L/128 x 1), i.e., 2L CUDA threads in total. Each thread computes one of the 2L elements. The tasks carried out by Kernel A are simple: each thread reads one value from global memory, multiplies it by a float number (a value of the Hamming window vector) and stores it in a different position of global memory. The accesses to global memory are totally coalesced, since audio samples are stored in consecutive memory positions both when reading and when writing (see Fig. 4). Also, L is a power of 2 and is always larger than 1024; thus, each thread block reads and writes in 128 consecutive memory positions. The block size of 128 was selected experimentally among 64, 128, 256 and 512, with 128 being the one that requires the least time. A sketch of a kernel with this structure is shown below.
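The sketch below shows a Kernel A-style windowing kernel, assuming the previous and current buffers of channel l are passed as separate pointers; all identifiers are hypothetical and the bookkeeping of T_GPU is omitted.

    // Kernel A-style kernel (illustrative). Each of the 2L threads reads one
    // sample (first L from the previous buffer, next L from the current one),
    // applies the Hamming window and writes it back; reads and writes are
    // both coalesced.
    __global__ void kernelA(const float* __restrict__ prev,
                            const float* __restrict__ curr,
                            const float* __restrict__ hamming,  // 2L weights
                            float* __restrict__ out,            // 2L samples
                            int L) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. 2L - 1
        if (i < 2 * L) {
            float s = (i < L) ? prev[i] : curr[i - L];
            out[i] = s * hamming[i];
        }
    }
    // Launched per channel from its stream:
    // kernelA<<<2 * L / 128, 128, 0, stream[l]>>>(prev_l, curr_l, w, out_l, L);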
2. Once Kernel A has finished, stream l uses the CUFFT library to perform a 2L-point FFT on these 2L elements. As a result of the computation performed by all the streams, M vectors composed of 2L frequency bins (denoted as f_l, l = 0, ..., M-1) are obtained. The use of streams allows us to overlap data transfers with computation. For example, while stream 1 is transferring samples from microphone 1, stream 0 can be executing Kernel A. However, the next steps involve operations among different channels; thus, all the previous operations must finish before continuing, which implies a synchronization among all the streams.

Figure 4: Operations that are carried out by Kernel A in the case M = 4.

The following steps are computed by only one stream.

3. The GCC matrix is computed by means of another kernel (Kernel B). In this kernel, a GPU thread takes one value from each of two different f_l buffers at the same vector position. It conjugates one of the values and multiplies it by the corresponding value of the other f_l buffer. The phase of the complex number obtained by the multiplication is stored in the corresponding position of the GCC matrix. The accesses to the two f_l buffers by GPU threads are totally coalesced, since consecutive threads access consecutive memory positions (see Fig. 5). Kernel B is limited by the instruction bandwidth, since the GPU-native functions cosf, sinf and atan2f are used and all of them require various clock cycles. Kernel B computes the 2LQ values of the GCC matrix, where Q represents the number of microphone pairs. In order to define the size and the number of blocks to launch in Kernel B, different tests were executed. The best performance was achieved by using 128-size thread blocks in a CUDA grid of size 32 x 16. This implies launching 65536 threads, where each thread is responsible for computing 2LQ/65536 values of the GCC matrix. In this case, increasing the amount of work per thread block in Kernel B is more beneficial than launching more blocks with fewer operations per GPU thread. Thus, the grid configuration applied to Kernel B achieves maximum occupancy when 512 blocks are launched. This kernel does not require shared memory but preferably a large number of registers; thus, we set the L1 cache to 48 KB. As described in (K20, 2014), the L1 cache is used for register spills, local memory and stack, which are all private per-thread variables. A sketch of a Kernel B-style kernel is shown below.

Figure 5: Operations that are carried out by Kernel B.
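The following sketch shows the structure of such a kernel, assuming the M spectra are stored consecutively and that the Q pairs (k, l) have been precomputed; the identifiers are hypothetical. Storing the cosine and sine of the phase produces a unit-modulus cross-spectrum, which is the PHAT weighting of Eq. (3).

    #include <cuComplex.h>

    // Kernel B-style kernel (illustrative): each thread walks over several
    // of the 2LQ entries (grid-stride loop), multiplies one spectrum by the
    // conjugate of the other and keeps only the phase.
    __global__ void kernelB(const cuFloatComplex* __restrict__ f,  // M x 2L bins
                            const int2* __restrict__ pairs,        // Q pairs (k,l)
                            cuFloatComplex* __restrict__ gcc,      // Q x 2L output
                            int twoL, int Q) {
        int total = twoL * Q;
        int block = blockIdx.y * gridDim.x + blockIdx.x;   // 32 x 16 grid
        int stride = gridDim.x * gridDim.y * blockDim.x;   // 65536 threads
        for (int idx = block * blockDim.x + threadIdx.x; idx < total; idx += stride) {
            int q = idx / twoL, bin = idx % twoL;
            cuFloatComplex a = f[pairs[q].x * twoL + bin];
            cuFloatComplex b = f[pairs[q].y * twoL + bin];
            cuFloatComplex z = cuCmulf(a, cuConjf(b));     // cross-power spectrum
            float phi = atan2f(cuCimagf(z), cuCrealf(z));  // keep the phase only
            gcc[idx] = make_cuFloatComplex(cosf(phi), sinf(phi));
        }
    }
    // Launched as: kernelB<<<dim3(32, 16), 128>>>(f, pairs, gcc, 2 * L, Q);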
4. Q inverse FFTs of size 2L are then carried out by using the CUFFT library. The GCC matrix is now composed of temporal (time-delay) values, i.e., 2LQ real values.

5. The computation of a tridimensional matrix SRP storing the accumulated SRP values is carried out by Kernel C. This kernel also launches thread blocks of size 128, in a tridimensional CUDA grid whose dimensions depend on the number of points of the grid G (\nu). In total, \nu threads are launched. In this kernel, each GPU thread is devoted to the computation of the total value of the SRP at one point of the grid. To this end, each thread computes and accumulates Q GCC values (it takes a value from each row of the GCC matrix and accumulates it). The computation of the SRP requires Q evaluations of the IMTDF (see Eq. (5)) at each point of the grid. The IMTDF of a pair of microphones specifies the column of the GCC matrix that should be selected and then accumulated in the SRP. Figure 6 illustrates these operations. Since the value of the IMTDF can point to any column of the GCC matrix, coalesced access to global memory is not guaranteed. In fact, the most probable situation is that the accesses will be quite disordered, so the kernel spends most of its time in memory accesses. However, this limitation can be reduced if we force the compiler to use the Kepler read-only data cache for the GCC matrix, since this cache does not require aligned accesses. This read-only cache memory has also been used in recent GPU-based audio research such as (Hamilton & Webb, 2013) and (Bilbao & Webb, 2013). Furthermore, as in Kernel B, we set the L1 cache to 48 KB to favor possible register spills. In the accumulation loop of the SRP values, we have set a #pragma unroll directive to accelerate the computation. A sketch of a Kernel C-style kernel is shown after this list.

Figure 6: Operations that are carried out by Kernel C.

6. The grid position corresponding to the maximum SRP value has to be found. To this end, we launch Kernel D. This kernel closely follows the reduction example of Harris' implementation (Harris, 2014) that comes with the Nvidia GPU Computing SDK (Software Development Kit), but it replaces the sum operation with a maximum operation. However, even though this code is optimized for finding the maximum value, it does not indicate its position. Thus, after obtaining the maximum, we launch another kernel (Kernel E). This kernel launches as many threads as elements of the SRP matrix and only performs a comparison with the maximum. If the comparison matches, the thread writes the value of its index in a variable.
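The sketch below illustrates a Kernel C-style kernel. The constant-memory microphone positions, the read-only cache loads (__ldg) and the unroll directive follow the description above; the grid-point linearization, the rounding convention and the circular lag indexing are our assumptions.

    // Kernel C-style kernel (illustrative): one thread per grid point
    // evaluates the IMTDF for the Q pairs and accumulates the GCC samples.
    __constant__ float3 c_mic[48];  // microphone positions (constant memory)

    __global__ void kernelC(const float* __restrict__ gcc,  // Q x 2L, time domain
                            float* __restrict__ srp,        // nu SRP values
                            int nu, int twoL, int M, float fs, float invc,
                            float3 org, float3 res, int3 P) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= nu) return;
        // Recover the 3-D coordinates of grid point idx.
        float3 x = make_float3(org.x + (idx % P.x) * res.x,
                               org.y + ((idx / P.x) % P.y) * res.y,
                               org.z + (idx / (P.x * P.y)) * res.z);
        float acc = 0.0f;
        int q = 0;
        for (int k = 0; k < M; ++k) {
            float ax = x.x - c_mic[k].x, ay = x.y - c_mic[k].y, az = x.z - c_mic[k].z;
            float dk = sqrtf(ax * ax + ay * ay + az * az);
            #pragma unroll 4
            for (int l = k + 1; l < M; ++l, ++q) {
                float bx = x.x - c_mic[l].x, by = x.y - c_mic[l].y, bz = x.z - c_mic[l].z;
                float dl = sqrtf(bx * bx + by * by + bz * bz);
                int lag = __float2int_rn((dk - dl) * invc * fs);        // IMTDF in samples
                acc += __ldg(&gcc[q * twoL + ((lag + twoL) % twoL)]);   // read-only cache
            }
        }
        srp[idx] = acc;
    }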
3.1. Memory considerations

The computation of the IMTDF could be carried out off-line, since the grid resolutions and the microphone locations are static. However, this would imply storing a 4-dimensional data structure composed of \nu x Q elements. If we use a standard room size (such as 6.0 x 4.0 x 3.0 m), a resolution of r = r_x = r_y = r_z = 0.01 m, and M = 48 microphones, this data structure would require more than eight gigabytes of global memory. This exceeds the global-memory size of most available GPU devices. Thus, every IMTDF value is computed for each group of processed buffers.

There are also other variables that are used to compute the values of the GCC and SRP matrices, such as the room dimensions, the number of microphones and their positions. Since all of these read-only variables must be available to all of the threads, they are stored in the constant memory (of size 64 KB).

3.2. Multi-GPU Parallelization

Distributing the above processing tasks among different GPUs is not straightforward. The greatest computational load lies in step 6, which consists in computing the maximum value of the SRP matrix. Table 1 shows the elapsed time corresponding to each step for M = 48 microphones and a spatial grid resolution of r = 0.01 m.

Table 1: Elapsed time in each kernel with M = 48 and spatial grid resolution r = 0.01 m.

    Steps of the algorithm                                      Time [ms]
    Transfer + Kernel A + FFT (steps 1 and 2 in Section 3)        1.416
    Kernel B (step 3: computation of GCC)                         0.015
    IFFT of GCC (step 4)                                          0.006
    Kernel C (computation of SRP matrix)                          0.007
    Kernel D (reduction: computation of maximum SRP value)      121.267
    Kernel E (localization of the maximum)                        0.009
    Total elapsed time                                          122.720

The tasks of Kernels C, D and E can be easily distributed among N_GPU GPUs (the number of GPUs present in the system): each GPU computes \nu/N_GPU elements of the SRP matrix and locates the maximum among its computed elements. To this end, N_GPU CPU threads are created at the beginning of a parallel region by means of openMP (see Appendix A.2 for how openMP can deal with multiple GPUs); a minimal sketch of this pattern is shown below. This strategy focuses only on the multi-GPU parallelization of the SRP matrix. Appendix B describes an alternative strategy that aims at parallelizing the computation of both the SRP matrix and the GCC matrix. That strategy also uses the UVA (Unified Virtual Addressing) feature for inter-GPU communication and requires different synchronization points that significantly penalize its performance, especially when compared to the parallelization presented in this article.
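The following sketch illustrates this multi-GPU pattern; launchKernelsCDE() is a hypothetical placeholder for Kernels C, D and E applied to one sub-range of the SRP matrix, and the function names are ours.

    #include <omp.h>
    #include <cuda_runtime.h>
    #include <vector>

    // Placeholder: runs Kernels C, D and E on the current device over grid
    // points [first, first + count) and returns the local maximum and index.
    void launchKernelsCDE(int first, int count, float* maxVal, int* maxIdx);

    void multiGpuSrp(int nGpu, int nu) {           // nu = total grid points
        std::vector<float> val(nGpu);
        std::vector<int> idx(nGpu);
        #pragma omp parallel num_threads(nGpu)
        {
            int g = omp_get_thread_num();
            cudaSetDevice(g);                      // bind CPU thread g to GPU g
            int chunk = nu / nGpu;
            launchKernelsCDE(g * chunk, chunk, &val[g], &idx[g]);
            cudaDeviceSynchronize();
        }                                          // implicit openMP barrier
        int best = 0;                              // master-thread reduction
        for (int g = 1; g < nGpu; ++g)
            if (val[g] > val[best]) best = g;
        // idx[best] points to the grid position of the estimated source.
    }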
3.3. Basic Implementation using two GPUs

As shown in Section 4, the performance of the SRP-PHAT algorithm is assessed in a system composed of two GPUs. Using all the parallelization techniques previously presented, the SRP-PHAT algorithm is implemented on two GPUs as follows:

1. A parallel region is created with two CPU threads. Each CPU thread is bound to a GPU.
2. As the different audio buffers are received in the system, each CPU thread independently and asynchronously sends all audio buffers to its GPU by using stream parallelization. The Kernel A instances and the FFTs are computed for each channel inside the streams.
3. As in step 2 of Section 3, stream synchronization is addressed. Only one stream is used to compute the rows of the GCC matrix.
4. Once both GPUs have computed the GCC matrix, each GPU computes \nu/2 elements of the SRP matrix and locates a maximum value among the computed elements.
5. Each GPU transfers back to the CPU its maximum value and its location inside the SRP matrix. Then, a synchronization barrier for both CPU threads is set, followed by an openMP section that is executed only by the master thread. This thread compares the two maximum values and chooses the greater one, obtaining its location. This location indicates the sound source position. Figure 7 illustrates the computation of the SRP-PHAT when M = 12.

Figure 7: Steps of the GPU-based SRP-PHAT implementation using two GPUs and openMP.

4. Experiments and Performance

To analyze both the computing and localization performance of the above GPU implementations, a set of acoustic simulations using the image-source method (Allen & Berkley, 1979) has been considered. A shoe-box-shaped room with dimensions 4 x 6 x 3 m and wall reflection factor \rho (Kuttruff, 2000) was simulated using different numbers of microphones (M in {6, 12, 24, 48}). The microphone set-ups for the considered systems are shown in Figure 8. Note that the microphones are located on the walls of the room and are placed on eight different planes (z in {0.33, 0.66, 1.00, 1.33, 1.66, 2.00, 2.33, 2.66} m) following hexagon-like shapes. Moreover, different reflection factors (\rho in {0, 0.5, 0.9}) were used to take into account different reverberation degrees. In all cases, independent white Gaussian noise was added to each microphone signal in order to simulate different signal-to-noise ratios (SNR in {0, 5, 10, 20} dB).

Figure 8: Microphone set-ups for M = 6, M = 12, M = 24 and M = 48. The black dots denote the actual active microphones in each configuration.

The audio card used in the real-time prototype uses an ASIO (Audio Stream Input/Output) driver to communicate with the CPU and provides 2048 samples per microphone (L = 2048) every 46.43 ms at a sample frequency of 44.1 kHz. This time is denoted by t_buff. The time employed for the computation is denoted by t_proc. This time takes into account all transfer times and measures the time from the first audio sample transferred to the GPU until the final source location is estimated (at each time frame). The localization system works in real time as long as t_proc < t_buff. Otherwise, microphone samples would be lost and the localization would not be correctly performed.
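As an illustration, t_proc can be measured with CUDA events as sketched below; processFrame() is a hypothetical placeholder for the complete pipeline of Section 3.

    #include <cuda_runtime.h>

    // Returns true if one frame is processed within the buffer time t_buff.
    bool frameInRealTime(void (*processFrame)()) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);   // before the first sample is transferred
        processFrame();              // transfers + Kernels A-E + result copy
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);  // wait until the location is available
        float tproc = 0.0f;
        cudaEventElapsedTime(&tproc, start, stop);   // t_proc in milliseconds
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return tproc < 46.43f;       // t_proc < t_buff
    }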
The localization system works in real time385 as long as tproc < tbuff . Otherwise, microphone samples would be lost and the localization would not be correctly performed. The simulations were carried out in the Nvidia GPU K20c (K20, 2014), which has the characteristics shown in Table 2. Both computational and localization performances have been assessed taking into account three spatial grid resolutions (r ∈ {0.1, 0.05, 0.01}) in the390 XY plane (resolutions rx and ry are equal). The resolution rz is 0.33 m (resulting from dividing the height of the room into eight slots). 4.1. Localization Performance The source signal used in this study was a 5-second male speech signal with no speech pauses. Pauses were manually suppressed to evaluate localization395 performance only over frames where speech was present. The processing was carried out by using 50% overlap in time windows of length 4096 samples (size 2L), with sampling frequency fs = 44.1 kHz. For each frame, a source location x̂ = [x̂, ŷ, ẑ]T was estimated. A total number of 107 frames (Nf =107) per 40 different positions (Np=40) that were uniformly distributed over the room space400 were performed. Localization accuracy was computed by means of the Mean 20 Absolute Error, which is given by: MAE = 1 Nf 1 Np Nf∑ i=1 Np∑ j=1 |eij|2, (8) where eij=xij − x̂ij, with xij and x̂ij being the true and estimated source loca- tions at a given time frame i and source position j. Note that the above MAE was computed for each environmental condition (reflection factor and signal to405 noise ratio), microphone setup and spatial grid resolution. Figure 9 shows the results for different values of wall reflection factor ρ taking into account different spatial resolutions and number of microphones. It is important to point out that using a high number of microphones helps to substantially improve localization accuracy under high noise and reverberation.410 The error decreases as the SNR increases and/or reverberation decreases (lower ρ). It is important to see how the spatial resolution has an impact when there are few microphones. In this case, a coarse spatial grid is not sufficient to correctly find the minimum of the SRP search space, which is more easily detected when the SRP is enhanced by the contributions of additional microphone pairs. In415 fact, when the number of microphones is 12 or higher, the performance difference between r = 0.01 and r = 0.1 is almost negligible. Accuracy differences among different values of ρ are noticiable. It should be emphasized that, under favorable acoustic conditions (high SNR and low ρ), the experimental error is always below the maximum expectable error independently of the number of microphones.420 Note that the maximum error in anechoic conditions is given by the largest diagonal of the cuboids forming the 3D grid (≈ 0.179 m for r = 0.1 and ≈ 0.165 m for r = 0.01). In all cases, the use of a higher number of microphones significantly helps in reducing this error. 4.2. Computational Performance425 The spatial resolutions considered in this paper result in large-scale SRP matrices. Table 3 shows the processing times tproc for different combinations of r and M when using two GPUs. It can be observed that the only that does not obtain a tproc lower than 46.43 ms (tbuff ) is the configuration composed of 21 SNR 0 5 10 15 20 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Accuracy =0.01 ρ=0.0 SNR 6 Mic. 12 Mic. 24 Mic. 48 Mic. M A E [ m ] spr 0 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Accuracy =0.01 ρ=0.5 6 Mic. 
Figure 9 shows the results for different values of the wall reflection factor \rho, taking into account different spatial resolutions and numbers of microphones. It is important to point out that using a high number of microphones helps to substantially improve localization accuracy under high noise and reverberation. The error decreases as the SNR increases and/or the reverberation decreases (lower \rho). It is also important to see how the spatial resolution has an impact when there are few microphones. In this case, a coarse spatial grid is not sufficient to correctly find the maximum of the SRP search space, which is more easily detected when the SRP is enhanced by the contributions of additional microphone pairs. In fact, when the number of microphones is 12 or higher, the performance difference between r = 0.01 m and r = 0.1 m is almost negligible. Accuracy differences among different values of \rho are noticeable. It should be emphasized that, under favorable acoustic conditions (high SNR and low \rho), the experimental error is always below the maximum expectable error, independently of the number of microphones. Note that the maximum error in anechoic conditions is given by half the largest diagonal of the cuboids forming the 3-D grid (approximately 0.179 m for r = 0.1 m and 0.165 m for r = 0.01 m). In all cases, the use of a higher number of microphones significantly helps in reducing this error.

Figure 9: Localization accuracy (MAE) for different wall reflection factors (\rho in {0, 0.5, 0.9}) as a function of the SNR and the number of microphones M. Each column presents results for a different spatial resolution (r = 0.1 m and r = 0.01 m).

4.2. Computational Performance

The spatial resolutions considered in this paper result in large-scale SRP matrices. Table 3 shows the processing times t_proc for different combinations of r and M when using two GPUs. It can be observed that the only configuration that does not achieve a t_proc lower than 46.43 ms (t_buff) is the one composed of M = 48 and r = 0.01 m. Thus, real-time processing is not possible in this case. However, by looking at the results shown in Table 4, it is possible to observe that the influence of the second GPU becomes relevant. In the case of M = 48, the processing time is halved for any resolution. Real-time processing would easily be achieved for M = 48 and r = 0.01 m by adding an additional GPU. Figure 10 shows more clearly the time differences among all the configurations, varying the number of microphones and the grid resolutions r in {0.1, 0.05, 0.01} m. Note that the time t_buff is marked by a solid black line.

Table 3: Processing time t_proc using two GPUs.

    r [m]    M = 6       M = 12      M = 24       M = 48
    0.01     1.031 ms    3.578 ms    15.564 ms    60.108 ms
    0.05     0.381 ms    0.758 ms     2.238 ms     6.433 ms
    0.1      0.371 ms    0.650 ms     1.588 ms     4.588 ms

Table 4: Processing time t_proc using one GPU.

    r [m]    M = 6       M = 12      M = 24       M = 48
    0.01     1.894 ms    6.731 ms    30.145 ms    122.720 ms
    0.05     0.564 ms    1.132 ms     3.484 ms     11.203 ms
    0.1      0.546 ms    0.926 ms     2.336 ms      7.493 ms

Figure 10: Time t_proc for different resolutions and numbers of microphones, using one GPU (left) and two GPUs (right).

5. Conclusion

New emerging GPU architectures help to overcome different computational problems in acoustic signal processing algorithms involving many microphone channels. This paper has analyzed the specific case of sound source localization, where using very fine spatial resolutions or a high number of microphones has a deep impact on the performance of real-time applications. In this context, the following contributions have been presented in this paper.

Firstly, we have proposed a scalable multi-GPU implementation of the well-known SRP-PHAT algorithm for source localization in three dimensions. To this end, two parallelization levels have been considered: on the one hand, the multiple cores included in a single GPU device; on the other hand, a system composed of several GPUs. The analyzed computational performance indicates that the algorithm is scalable, so that the time employed to estimate the source location is reduced with the number of GPU devices.

Secondly, we have evaluated the relation existing among localization accuracy, number of microphones and available computational resources. While most works take into account either computational issues or localization performance only, we have approached both aspects from a practical implementation perspective. To this end, simulated experiments ranging from a noiseless anechoic case to a noisy and highly reverberant case have been analyzed, studying the overall performance with varying numbers of microphones and spatial resolutions.
Results show that, thanks to the GPU computational resources, a fine spatial search grid and a high number of microphones can be used to improve the localization accuracy and the robustness of the system.

Finally, the presented implementation exploits the resources of GPUs with Kepler architecture, which can currently be found in new-generation mobile devices such as Google's Nexus 9 tablet. Thus, the proposed implementation can be successfully run on embedded mobile devices.

Acknowledgments

This work has been partially funded by the Spanish Ministerio de Economía y Competitividad (TEC2009-13741, TEC2012-38142-C04-01, and TEC2012-37945-C02-02), Generalitat Valenciana PROMETEO 2009/2013, and Universitat Politècnica de València through Programa de Apoyo a la Investigación y Desarrollo (PAID-05-11 and PAID-05-12).

References

Allen, J. B., & Berkley, D. A. (1979). Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am., 65, 943-950.

Belloch, J. A., Ferrer, M., Gonzalez, A., Martinez-Zaldivar, F., & Vidal, A. M. (2013a). Headphone-based virtual spatialization of sound with a GPU accelerator. J. Audio Eng. Soc., 61, 546-561.

Belloch, J. A., Gonzalez, A., Martinez-Zaldivar, F. J., & Vidal, A. M. (2011). A real-time crosstalk canceller on a notebook GPU. In Proc. of IEEE International Conference on Multimedia and Expo (ICME) (pp. 1-4). Barcelona, Spain.

Belloch, J. A., Gonzalez, A., Vidal, A. M., & Cobos, M. (2013b). Real-time sound source localization on Graphics Processing Units. In Proc. of the International Conference on Computational Science, ICCS 2013. Barcelona, Spain.

Bilbao, S., & Webb, C. J. (2013). Physical modeling of timpani drums in 3D on GPGPUs. J. Audio Eng. Soc., 61, 737-748.

Bradford, R., Ffitch, J., & Dobson, R. (2011). Real-time sliding phase vocoder using a commodity GPU. In Proc. of ICMC 2011. University of Huddersfield, United Kingdom.

Brandstein, M., & Ward, D. (2001). Microphone Arrays. Springer.

Calderoni, L., Ferrara, M., Franco, A., & Maio, D. (2015). Indoor localization in a hospital environment using Random Forest classifiers. Expert Systems with Applications, 42, 125-134. doi:10.1016/j.eswa.2014.07.042.

Chen, J., Benesty, J., & Huang, Y. (2006). Time delay estimation in room acoustic environments: an overview. EURASIP Journal on Applied Signal Processing, 2006, 1-19.

Cobos, M., Marti, A., & Lopez, J. J. (2011). A modified SRP-PHAT functional for robust real-time sound source localization with scalable spatial sampling. IEEE Signal Processing Letters, 18, 71-74.

Cook, S. (2013). A Developer's Guide to Parallel Computing with GPUs. Morgan Kaufmann.

CUDA (2015). Nvidia CUDA Developer Zone. https://developer.nvidia.com/cuda-downloads (accessed 2015 February 02).

Dazevedo, E., & Hill, J. (2012). Parallel LU factorization on GPU cluster. Procedia Computer Science, 9, 67-75. doi:10.1016/j.procs.2012.04.008. Proceedings of the International Conference on Computational Science, ICCS 2012.

DiBiase, J. H. (2000). A high accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays. Ph.D. thesis, Brown University, Providence, RI.

DiBiase, J. H., Silverman, H. F., & Brandstein, M. S. (2001). Robust localization in reverberant rooms. In M. S. Brandstein & D. Ward (Eds.), Microphone Arrays: Signal Processing Techniques and Applications, chapter 8 (pp. 157-180). Berlin, Germany: Springer-Verlag.
Do, H., & Silverman, H. F. (2007). A fast microphone array SRP-PHAT source location implementation using coarse-to-fine region contraction (CFRC). In Proc. of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. New Paltz, USA.

Hamilton, B., & Webb, C. J. (2013). Room acoustics modelling using GPU-accelerated finite difference and finite volume methods on a face-centered cubic grid. In Proc. of Conference on Digital Audio Effects (DAFx-13). Maynooth, Ireland.

Harris, M. (2014). Optimizing parallel reduction in CUDA. NVIDIA. http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf (accessed 2014 August 27).

Huang, Q., & Wang, T. (2014). Acoustic source localization in mixed field using spherical microphone arrays. EURASIP Journal on Applied Signal Processing, 90, 1-16.

Jetson (2015). Mobile GPU: Jetson. http://developer.download.nvidia.com/embedded/jetson/TK1/docs/Jetson_platform_brief_May2014.pdf (accessed 2014 November 22).

K20 (2014). NVIDIA Kepler architecture. http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf (accessed 2014 August 27).

Kloss, Y., Shuvalov, P., & Tcheremissine, F. (2010). Solving Boltzmann equation on GPU. Procedia Computer Science, 1, 1083-1091. Proceedings of the International Conference on Computational Science, ICCS 2010.

Knapp, C. H., & Carter, G. C. (1976). The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-24, 320-327.

Kodagoda, S., & Sehestedt, S. (2014). Simultaneous people tracking and motion pattern learning. Expert Systems with Applications, 41, 7272-7280. doi:10.1016/j.eswa.2014.05.019.

Kuttruff, H. (2000). Room Acoustics. Abingdon, Oxford, UK: Taylor & Francis.

Liang, Y., Cui, Z., Zhao, S., Rupnow, K., Zhang, Y., Jones, D. L., & Chen, D. (2012). Real-time implementation and performance optimization of 3D sound localization on GPUs. In DATE'12 (pp. 832-835).

Liu, W., Schmidt, B., Voss, G., & Muller-Wittig, W. (2007). Streaming algorithms for biological sequence alignment on GPUs. IEEE Transactions on Parallel and Distributed Systems, 18, 1270-1281. doi:10.1109/TPDS.2007.1069.

Lorente, J., Ferrer, M., De Diego, M., & Gonzalez, A. (2014). GPU implementation of multichannel adaptive algorithms for local active noise control. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22, 1624-1635. doi:10.1109/TASLP.2014.2344852.
Lorente, J., Ferrer, M., De Diego, M., Belloch, J. A., & Gonzalez, A. (2013). GPU implementation of a frequency-domain modified filtered-X LMS algorithm for multichannel local active noise control. In Proc. of the 52nd AES Conference. Guildford, United Kingdom.

Lorente, J., Gonzalez, A., Ferrer, M., Belloch, J. A., De Diego, M., Piñero, G., & Vidal, A. M. (2012). Active noise control using Graphics Processing Units. In Proc. of the International Congress on Sound and Vibration. Vilnius, Lithuania.

Madhu, N., & Martin, R. (2008). Acoustic source localization with microphone arrays. In Advances in Digital Speech Transmission (pp. 135-166). New York, NY, USA: Wiley.

Marti, A., Cobos, M., & Lopez, J. J. (2013). A steered response power iterative method for high-accuracy acoustic source localization. Journal of the Acoustical Society of America, 134.

Matsumoto, K., Nakasato, N., Sakai, T., Yahagi, H., & Sedukhin, S. G. (2011). Multi-level optimization of matrix multiplication for GPU-equipped systems. Procedia Computer Science, 4, 342-351. Proceedings of the International Conference on Computational Science, ICCS 2011.

Mazur, R., Jungmann, J., & Mertins, A. (2011). On CUDA implementation of a multichannel room impulse response reshaping algorithm based on p-norm optimization. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (pp. 305-308). doi:10.1109/ASPAA.2011.6082310.

Nexus (2015). Google's Nexus 9. http://blogs.nvidia.com/blog/2014/10/17/nvidia-tegra-k1-google-nexus-9/ (accessed 2015 January 11).

openMP (2014). openMP API specifications. http://www.openmp.org (accessed 2014 June 05).

Peruffo Minotto, V., Rosito Jung, C., Gonzaga da Silveira, L., & Lee, B. (2012). GPU-based approaches for real-time sound source localization using the SRP-PHAT algorithm. International Journal of High Performance Computing Applications. doi:10.1177/1094342012452166.

Said, A., Lee, B., & Kalker, T. (2013). Fast steered response power computation in 3D spatial regions. Technical Report HPL-2013-40, HP Labs, Palo Alto, USA.

Savioja, L. (2010). Real-time 3D finite-difference time-domain simulation of low- and mid-frequency room acoustics. In Proc. of the Int. Conf. Digital Audio Effects. Graz, Austria.

Savioja, L., Välimäki, V., & Smith, J. O. (2011). Audio signal processing using Graphics Processing Units. J. Audio Eng. Soc., 59, 3-19.

Schneider, M., Schuh, F., & Kellermann, W. (2012). The generalized frequency-domain adaptive filtering algorithm implemented on a GPU for large-scale multichannel acoustic echo cancellation. In Speech Communication; 10. ITG Symposium; Proceedings of (pp. 1-4).
Seewald, L. A., Gonzaga, L., Veronez, M. R., Minotto, V. P., & Jung, C. R. (2014). Combining SRP-PHAT and two Kinects for 3D sound source localization. Expert Systems with Applications, 41, 7106-7113. doi:10.1016/j.eswa.2014.05.033.

Southern, A., Murphy, D., Campos, G., & Dias, P. (2010). Finite difference room acoustic modelling on a General Purpose Graphics Processing Unit. In Proc. of the 128th AES Convention. London, United Kingdom.

Vanek, J., Trmal, J., Psutka, J., & Psutka, J. (2012). Optimized acoustic likelihoods computation for NVIDIA and ATI/AMD graphics processors. IEEE Transactions on Audio, Speech, and Language Processing, 20, 1818-1828. doi:10.1109/TASL.2012.2190928.

Wang, R., Wang, X., & Kim, M. J. (2011). Motivated learning agent model for distributed collaborative systems. Expert Systems with Applications, 38, 1079-1088. doi:10.1016/j.eswa.2010.05.003.

Webb, C. J., & Bilbao, S. (2011). Computing room acoustics with CUDA - 3D FDTD schemes with boundary losses and viscosity. In Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Prague, Czech Republic.

Xu, B., Sun, G., Yu, R., & Yang, Z. (2013). High-accuracy TDOA-based localization without time synchronization. IEEE Transactions on Parallel and Distributed Systems, 24, 1567-1576. doi:10.1109/TPDS.2012.248.

Zhao, Y., & Lau, F. C. (2013). Implementation of decoders for LDPC block codes and LDPC convolutional codes based on GPUs. IEEE Transactions on Parallel and Distributed Systems, 99, 1. doi:10.1109/TPDS.2013.52.

Appendix A. GPU and CUDA

GPUs are well known for their potential in highly parallel data processing. A GPU is composed of multiple Stream Multiprocessors (SMs) where, for 3.5 capability (Kepler architecture (K20, 2014)), there are 192 pipelined cores per SM. (At the time this paper was written, the most advanced GPU device was the K20c with Kepler architecture, which is the one considered throughout this work.)

In the CUDA model, the programmer defines the kernel function where the code to be executed on the GPU is written. A grid configuration, which defines the number of threads and how they are distributed and grouped, must be built into the main code: threads are grouped into thread blocks, and thread blocks configure a grid that is organized in three dimensions (indexed by blockIdx.x, blockIdx.y, and blockIdx.z). The total number of threads launched in a kernel by means of thread blocks can exceed the number of physical cores. At runtime, the kernel distributes all the thread blocks among the SMs. Each SM can host up to 16 thread blocks. If the number of blocks exceeds the resources of the GPU, the remaining blocks wait until other blocks finish in order to be hosted later.
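As a minimal illustration of this execution model, the following kernel scales a vector of n floats using 128-thread blocks; all identifiers are ours.

    // Each thread scales one element of a vector of n floats.
    __global__ void scale(float* v, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) v[i] *= a;                           // guard the tail block
    }

    // Host side: 128-thread blocks and enough blocks to cover all n elements;
    // the total thread count may exceed the number of physical cores.
    void scaleOnGpu(float* d_v, float a, int n) {
        dim3 block(128);
        dim3 grid((n + block.x - 1) / block.x);
        scale<<<grid, block>>>(d_v, a, n);
    }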
A GPU device has a large amount of off-chip device memory (global memory) and fast on-chip memory (shared memory and registers). As its name indicates, the shared memory is normally used when multiple threads must share data. There are also read-only cached memories called constant memory and texture memory. The former is optimized for broadcast (i.e., when all the threads read the same memory location), while the latter is more oriented to graphics. Figure A.1 shows how the GPU architecture is organized. Advanced GPU devices (beyond 2.x capability) come with an L1/L2 cache hierarchy that is used to cache global memory. The L1 cache uses the same on-chip memory as the shared memory; how much of the on-chip memory is dedicated to L1 is set for each kernel call.

Figure A.1: The GPU is composed of multiple Stream Multiprocessors (SMs), each of which has 192 pipelined cores (SPs), shared memory/L1 cache and registers; the constant and texture memories, the L2 cache and the global memory are shared by all SMs.

Appendix A.1. Streams on GPU

Streams are virtual work queues on the GPU. They are used for asynchronous operation (i.e., control of the program returns to the CPU immediately). Operations assigned to the same stream are executed in order and sequentially. Multiple streams can be defined in CUDA programming; up to 32 streams can be run independently on the GPU thanks to the Hyper-Q technology present in hardware of 3.5 capability (Cook, 2013).

Different streams may execute their assigned operations out of order with respect to one another or concurrently. Thus, when a launched kernel does not require all the GPU resources, these could be used by another kernel launched from a different stream. Hence, streams allow multiple kernels to be launched concurrently. Following this idea, data transfers between CPU and GPU can also be overlapped with kernel computations and other transfers whenever they are carried out in different streams. If the data transfers are not assigned to any stream queue, they are executed synchronously and in an isolated way (i.e., the CPU waits until all the previous operations have finished). GPU kernels are always launched asynchronously by the CPU (regardless of whether or not they are scheduled on a stream queue). Thus, data transfers are usually used as a synchronization barrier.

Figure A.2 illustrates the parallelization obtained using streams when M = 4 (number of microphones) for steps 1, 2, and 3 of Section 3 in the main article. Note that, in Fig. A.2, the alternative Kernel A' would have a CUDA grid with dimensions (2L/128 x M).

Figure A.2: Parallelization obtained with streams when M = 4.
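The sketch below illustrates this per-channel stream pipeline. The buffer layout is our assumption, the Kernel A launch is elided, and the host buffers are assumed to be page-locked (allocated with cudaMallocHost) so that the copies are truly asynchronous.

    #include <cufft.h>
    #include <cuda_runtime.h>

    // One stream per channel: upload the L new samples, then run the channel
    // FFT in the same queue; copies and kernels of different streams overlap.
    void perChannelPipeline(float* const* h_in,     // M host buffers, L samples
                            float* d_frames,        // M frames of 2L samples
                            cufftComplex* d_spec,   // M output spectra
                            cufftHandle* plans, cudaStream_t* streams,
                            int M, int L) {
        for (int l = 0; l < M; ++l) {
            // New samples go to the second half of the frame; the first half
            // keeps the previous buffer (50% overlap).
            cudaMemcpyAsync(d_frames + l * 2 * L + L, h_in[l],
                            L * sizeof(float), cudaMemcpyHostToDevice, streams[l]);
            // (Kernel A, the Hamming windowing, would be enqueued here.)
            cufftSetStream(plans[l], streams[l]);   // FFT joins the same queue
            cufftExecR2C(plans[l], d_frames + l * 2 * L, d_spec + l * 2 * L);
        }
        cudaDeviceSynchronize();  // barrier before inter-channel steps (Kernel B)
    }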
Appendix A.2. Multi-GPU programming with multicore

One of the standards that enables multicore processing is OpenMP (openMP, 2014). This standard follows a fork/join pattern, that is, parallel regions are specified by the programmer: the CPU code runs sequentially until it hits a section where the work can be distributed among several processors (the CPU spawns several CPU threads), and afterwards, when all the computations are completed, the CPU threads converge again into a single thread, called the master thread.

If a machine has a multicore processor and several GPUs, parallelization across the GPUs can be achieved by defining a number of threads in the parallel region equal to the number of GPUs, so that each CPU thread deals with one GPU. This is very important since a CPU thread is bound to a GPU context: all subsequent CUDA calls issued by that thread (e.g., cudaMalloc) allocate memory only on its corresponding GPU (Cook, 2013).
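A minimal sketch of this one-thread-per-GPU pattern follows (the buffer size is illustrative, and the per-GPU work is elided; this is not the paper's production code):

#include <omp.h>
#include <cuda_runtime.h>

int main()
{
    int nGPU = 0;
    cudaGetDeviceCount(&nGPU);

    // One CPU thread per GPU: each thread binds itself to a device, and every
    // subsequent CUDA call in that thread (cudaMalloc, kernel launches,
    // transfers) targets that device only.
    #pragma omp parallel num_threads(nGPU)
    {
        int tid = omp_get_thread_num();
        cudaSetDevice(tid);              // bind this CPU thread to GPU 'tid'

        float *d_buf;
        cudaMalloc(&d_buf, 1024 * sizeof(float));  // allocated on GPU 'tid'
        // ... per-GPU work would go here ...
        cudaFree(d_buf);
    }   // implicit join: threads converge back to the master thread
    return 0;
}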
Figure A.3: The UVA feature reduces data-transfer time among GPUs by using peer-to-peer communication (bottom) instead of communication through the CPU memory space (top).

Recent CUDA releases (beyond 2.x capability and CUDA SDK 4.x) allow the time spent in data transfers among GPUs to be reduced by using the Unified Virtual Addressing (UVA) feature. This means that inter-GPU communication (peer-to-peer, P2P) can be performed without routing the data through the CPU, saving PCI-E bandwidth. Before these features appeared, communication among GPUs had to be carried out through memory space in the CPU, as shown in Figure A.3.

Appendix B. Multi-GPU Parallelization strategy involving GCC and SRP matrices

The challenge of this strategy consists in parallelizing the computation of the GCC matrix. Initially, all the GPUs must have access to this matrix, since each point of the SRP matrix requires a contribution from each pair of microphones (i.e., from each row of the GCC matrix).

The strategy that we present aims at achieving a good trade-off between the total number of operations carried out in each GPU and the number of transferred audio buffers. For example, if the number of microphones is M = 12, the number of pairs to compute in the GCC matrix is Q = 66. These pairs are distributed among the NGPU GPUs in a pseudo-triangular way. Figure B.1 shows the distribution of the computation and of the audio buffers among 1, 2, 3, and 4 GPUs. The notation 01 x 05 indicates the element-wise multiplication of vector 1 and vector 5 of all the computed vectors fl, l = 0, ..., M − 1 (see step 2 of Section 3). Note that the GPU that performs more multiplications deals with fewer audio buffers, minimizing the data transfers between CPU and GPU. This triangular structure can be applied independently of the number of microphones (a small enumeration sketch is given after the implementation steps below).

Figure B.1: Distribution of the audio buffers in order to compute the rows of the GCC matrix when NGPU is 1, 2, 3, and 4. With one GPU, GPU 0 uses 12 audio buffers and performs all 66 element-wise multiplications. With two GPUs, GPU 0 uses 11 buffers and performs 35 multiplications, while GPU 1 uses 12 buffers and performs 31. With three GPUs, GPUs 0, 1, and 2 use 8, 10, and 9 buffers and perform 28, 18, and 20 multiplications, respectively. With four GPUs, GPUs 0, 1, 2, and 3 use 7, 9, 10, and 6 buffers and perform 21, 15, 15, and 15 multiplications, respectively.

Finally, after the distributed computation of the GCC matrix, every GPU needs all of the rows of the GCC matrix in order to compute its corresponding ν/NGPU elements of the SRP matrix. The use of UVA (see Appendix A.2) allows each GPU to access the other GPUs via peer-to-peer transfers over the PCI-E bus, rather than copying data back to the host and then to another GPU. Thus, each GPU transparently accesses the memories of the other GPUs by just referencing a memory location.

Appendix B.1. Basic Implementation using two GPUs

Using all the parallelization techniques presented in Appendix A, the SRP-PHAT algorithm is implemented on two GPUs as follows:

1. A parallel region is created with two CPU threads. Each CPU thread is bound to a GPU.

2. As the audio buffers are received by the system, each CPU thread independently and asynchronously sends its corresponding buffers to its GPU by using stream parallelization. Kernel A and the FFTs are computed for each channel inside the streams.

3. As in step 2 of Section 3, stream synchronization is addressed; only one stream is used to compute the rows of the GCC matrix. According to Figure B.1, in the case of M = 12, one GPU computes 35 vectors and the other one computes 31.

4. By using UVA, each GPU has access to the whole GCC matrix in order to compute its ν/2 elements of the SRP matrix, and it locates the maximum value among the computed elements.

5. Each GPU transfers back to the CPU its maximum value and its location inside the SRP matrix. Then, a synchronization barrier for both CPU threads is set, followed by an openMP section that is executed only by the master thread. This thread compares the two maximum values and chooses the greater one, together with its location. This location indicates the sound source position.
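For reference, the pair enumeration behind Q = M(M − 1)/2 and a simplified split of the GCC rows among GPUs can be sketched as follows. The even, contiguous split used here is an assumption for illustration only; the actual pseudo-triangular assignment of Fig. B.1 additionally balances the number of audio buffers each GPU must receive:

#include <cstdio>
#include <utility>
#include <vector>

int main()
{
    const int M = 12, NGPU = 2;          // sizes from the example above

    // Enumerate the Q = M(M-1)/2 microphone pairs (rows of the GCC matrix).
    std::vector<std::pair<int,int>> pairs;
    for (int l = 0; l < M; ++l)
        for (int k = l + 1; k < M; ++k)
            pairs.push_back({l, k});     // Q = 66 pairs for M = 12

    // Simplified split: contiguous chunks of roughly Q/NGPU rows per GPU.
    const int Q = (int)pairs.size();
    for (int g = 0; g < NGPU; ++g) {
        int lo = g * Q / NGPU, hi = (g + 1) * Q / NGPU;
        std::printf("GPU %d computes rows %d..%d (%d element-wise products)\n",
                    g, lo, hi - 1, hi - lo);
    }
    return 0;
}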
Appendix B.2. Comparison between strategies

Table B.1 shows the speed-up that the implementation strategy presented in Section 3.3 achieves with respect to the strategy presented in this appendix.

Table B.1: Speed-up between strategies.

rx, ry    M = 6     M = 12    M = 24    M = 48
0.01      30.097    35.443    36.968    31.649
0.05      12.259    24.043    31.291    43.313
0.1        4.815     9.861    15.310    21.249

Two important aspects significantly penalize the performance of this strategy in comparison with the strategy in Section 3.3. First, since no single GPU contains the whole GCC matrix, each GPU must access the global-memory of the other GPU in order to compute the SRP matrix; second, after computing the corresponding elements of the GCC matrix, both GPUs must be synchronized.
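The cross-GPU accesses that penalize this strategy rely on the peer-to-peer mechanism of Appendix A.2. A minimal sketch of how such access is enabled is given below (the device IDs, the buffer size, and the commented-out srpKernel are illustrative assumptions):

#include <cuda_runtime.h>

int main()
{
    // Check and enable peer access so that GPU 0 can read GPU 1's memory.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // the flags argument must be 0
    }

    // Allocate part of the GCC matrix on GPU 1 (illustrative size).
    cudaSetDevice(1);
    float *d_gccRows;
    cudaMalloc(&d_gccRows, 4096 * sizeof(float));

    // A kernel launched on GPU 0 could now take d_gccRows as an argument and
    // read it directly over PCI-E, with no staging copy through the CPU:
    cudaSetDevice(0);
    // srpKernel<<<grid, block>>>(..., d_gccRows, ...);  // hypothetical kernel

    cudaSetDevice(1);
    cudaFree(d_gccRows);
    return 0;
}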