title: Real-Time Radar-Based Gesture Detection and Recognition Built in an Edge-Computing Platform
authors: Sun, Yuliang; Fei, Tai; Li, Xibo; Warnecke, Alexander; Warsitz, Ernst; Pohl, Nils
date: 2020-05-20 journal: IEEE Sensors Journal DOI: 10.1109/jsen.2020.2994292
A video is available at https://youtu.be/IR5NnZvZBLk

In this paper, a real-time signal processing framework based on a 60 GHz frequency-modulated continuous wave (FMCW) radar system to recognize gestures is proposed. In order to improve the robustness of the radar-based gesture recognition system, the proposed framework extracts a comprehensive hand profile, including range, Doppler, azimuth and elevation, over multiple measurement-cycles and encodes it into a feature cube. Rather than feeding the range-Doppler spectrum sequence into a deep convolutional neural network (CNN) connected with recurrent neural networks, the proposed framework takes the aforementioned feature cube as the input of a shallow CNN for gesture recognition to reduce the computational complexity. In addition, we develop a hand activity detection (HAD) algorithm to automate the detection of gestures in the real-time case. The proposed HAD can capture the time-stamp at which a gesture finishes and feeds the hand profile of all the relevant measurement-cycles before this time-stamp into the CNN with low latency. Since the proposed framework is able to detect and classify gestures at limited computational cost, it can be deployed in an edge-computing platform for real-time applications, even though the computational performance of such a platform is markedly inferior to that of a state-of-the-art personal computer. The experimental results show that the proposed framework is capable of classifying 12 gestures in real-time with a high F1-score.

Radar sensors are widely used in many long-range applications for the purpose of target surveillance, such as in aircraft, ships and vehicles [1], [2]. Thanks to the continuous development of silicon techniques, various electronic components can be integrated in a compact form at a low price [2], [3]. As radar sensors become more and more affordable to the general public, numerous emerging short-range radar applications, e.g., non-contact hand gesture recognition, are gaining tremendous importance in efforts to improve the quality of human life [4], [5]. Hand gesture recognition enables users to interact with machines in a more natural and intuitive manner than conventional touchscreen-based and button-based human-machine interfaces [6]. For example, Google has integrated a 60 GHz radar into the smartphone Pixel 4, which allows users to change songs without touching the screen [7]. What's more, viruses and bacteria that survive on surfaces for a long time can contaminate the interface and cause health problems. For instance, in 2020, tens of thousands of people have been infected with COVID-19 through contact with such contaminated surfaces [8]. Radar-based hand gesture recognition allows people to interact with the machine in a touch-less way, which may reduce the risk of being infected with a virus in a public environment. Unlike optical gesture recognition techniques, radar sensors are insensitive to ambient light conditions; the electromagnetic waves can penetrate dielectric materials, which makes it possible to embed the sensors inside devices.
In addition, for privacy-preserving reasons, radar sensors are preferable to cameras in many circumstances [9]. Furthermore, computer vision techniques applied to extract hand motion information in every frame are usually not power efficient and are therefore not suitable for wearable and mobile devices [10]. Motivated by the benefits of radar-based touch-less hand gesture recognition, numerous approaches have been developed in recent years. The authors in [9], [11], [12] extracted physical features from the micro-Doppler signature [1] in the time-Doppler-frequency (TDF) domain to classify different gestures. Li et al. [13] extracted sparsity-based features from TDF spectrums for gesture recognition using a Doppler radar. In addition to the Doppler information of hand gestures, the Google Soli project [10], [14] utilized the range-Doppler (RD) spectrums for gesture recognition via a 60 GHz frequency-modulated continuous wave (FMCW) radar sensor. Thanks to the wide available bandwidth (7 GHz), their systems could recognize fine hand motions. Similarly, the authors in [15]-[17] also extracted hand motions based on RD spectrums via an FMCW radar. In [18], [19], apart from the range and Doppler information of hand gestures, the authors also considered the incident angle information by using multiple receive antennas to enhance the classification accuracy of their gesture recognition systems. However, none of the aforementioned techniques exploited all the characteristics of a gesture simultaneously, i.e., range, Doppler, azimuth, elevation and temporal information. For example, the approaches in [9]-[16] cannot differentiate two gestures that share similar range and Doppler information. This restricts the design of the gestures to be recognized. In order to classify different hand gestures, many research works employed artificial neural networks for this multiclass classification task. For example, the authors in [12], [18]-[20] considered the TDF spectrums or range profiles as images and directly fed them into a deep convolutional neural network (CNN). Other research works [14], [15], [21] considered the radar data over multiple measurement-cycles as a time-sequential signal and utilized both CNNs and recurrent neural networks (RNNs) for gesture classification. The Soli project [14] employed a 2-dimensional (2-D) CNN with a long short-term memory (LSTM) to extract both the spatial and temporal features, while the Latern [21], [22] replaced the 2-D CNN with a 3-D CNN [23] followed by several LSTM layers. Because the 3-D CNN can extract not only the spatial but also the short-term temporal information from the RD spectrum sequence, it achieves a better classification accuracy than the 2-D CNN [24]. However, the proposed 2-D CNN, 3-D CNN and LSTM for gesture classification require huge amounts of memory in the system and are computationally inefficient. Although Choi et al.
[16] projected the range-Doppler-measurement-cycle data onto the range-time and Doppler-time planes to reduce the input dimension of the LSTM layer and achieved good classification accuracy in real-time, the proposed algorithms were implemented on a personal computer with powerful computational capability. As a result, the aforementioned radar-based gesture recognition systems in [12], [14]-[16], [18]-[21] are not applicable to most commercial embedded systems, such as wearable devices and smartphones, in which both memory and computational power are limited. In this paper, we present a real-time gesture recognition system using a 60 GHz FMCW radar in an edge-computing platform. The proposed system is intended for short-range applications (e.g., tablet, display, and smartphone) in which the radar is assumed to be stationary with respect to the user. The entire signal processing framework is depicted in Fig. 1. After applying a 2-dimensional fast Fourier transform (2-D FFT) to the raw data, we select a certain number of points from the resulting RD spectrum as an intermediate step rather than directly feeding the entire spectrum into deep neural networks. Additionally, thanks to the L-shaped receive antenna array, the angle of arrival (AoA) information of the hand, i.e., azimuth and elevation, can be calculated. For every measurement-cycle, we store this information in a feature matrix with reduced dimensions. By selecting a few points from the RD spectrum, we reduce the input dimension of the classifier and limit the computational cost. Further, we present a hand activity detection (HAD) algorithm called the short-term average/long-term average (STA/LTA)-based gesture detector. It employs the concept of STA/LTA [25] to detect when a gesture comes to an end, i.e., the tail of a gesture. After detecting the tail of a gesture, we arrange the feature matrices belonging to the measurement-cycles preceding this tail into a feature cube. This feature cube constitutes a compact and comprehensive gesture profile, which includes the features of all the dominant point scatterers of the hand. It is subsequently fed into a shallow CNN for classification. The main contributions are summarized as follows:
• The proposed signal processing framework is able to recognize more gestures (12 gestures) than those reported in other works in the literature. The framework can run in real-time in an edge-computing platform with limited memory and computational capability.
• We develop a multi-feature encoder to encode the gesture profile, including range, Doppler, azimuth, elevation and temporal information, into a feature cube with reduced dimensions for the sake of data processing efficiency.
• We develop an HAD algorithm based on the concept of STA/LTA to reliably detect the tail of a gesture.
• Since the proposed multi-feature encoder encodes all necessary information in a compact manner, it is possible to deploy a shallow CNN with the feature cube as its input and achieve a promising classification performance.
• The proposed framework is evaluated twofold: its performance is compared with benchmarks in an off-line scenario, and its recognition ability in the real-time case is assessed as well.
The remainder of this paper is organized as follows. Section II introduces the FMCW radar system. Section III describes the multi-feature encoder, including the extraction of range, Doppler and AoA information. In Section IV, we introduce the HAD algorithm based on the concept of the STA/LTA.
In Section V, we present the structure of the applied shallow CNN for gesture classification. In Section VI, we describe the experimental scenario and the collected gesture dataset. In Section VII, the performance is evaluated in both the off-line and real-time cases. Finally, conclusions are given in Section VIII. Our 60 GHz radar system adopts the linear chirp sequence frequency modulation [26] to design the waveform. After mixing, filtering and sampling, the discrete beat signal, consisting of the contributions of the $I_T$ point scatterers of the hand, in a single measurement-cycle from the z-th receive antenna can be approximated as [27]

$$s^{(z)}(u,v) \approx \sum_{i=1}^{I_T} a_i^{(z)} \exp\!\big(j 2\pi (f_{r_i} T_s u + f_{D_i} T_c v)\big), \quad u = 0,\dots,I_s-1,\; v = 0,\dots,I_c-1, \qquad (1)$$

where the range and Doppler frequencies $f_{r_i}$ and $f_{D_i}$ are given as

$$f_{r_i} = \frac{2 f_B r_i}{c\, T_c}, \qquad f_{D_i} = \frac{2 v_{r_i}}{\lambda}, \qquad (2)$$

respectively, $r_i$ and $v_{r_i}$ are the range and relative velocity of the i-th point scatterer of the hand, $f_B$ is the available bandwidth, $T_c$ is the chirp duration, $\lambda$ is the wavelength at 60 GHz, c is the speed of light, the complex amplitude $a_i^{(z)}$ contains the phase information, $I_s$ is the number of sampling points in each chirp, $I_c$ is the number of chirps in every measurement-cycle, and the sampling period is $T_s = T_c / I_s$. The 60 GHz radar system applied for gesture recognition can be seen in Fig. 2. As can also be seen, the radar system has an L-shaped receive antenna array. To calculate the AoA in the azimuth and elevation directions, the spatial distance between two receive antennas in both directions is d, where d = λ/2. A 2-D FFT is applied to the discrete beat signal in (1) to extract the range and Doppler information in every measurement-cycle [28]. The resulting complex-valued RD spectrum for the z-th receive antenna can be calculated as

$$B^{(z)}(p,q) = \sum_{u=0}^{I_s-1} \sum_{v=0}^{I_c-1} w(u,v)\, s^{(z)}(u,v)\, \exp\!\Big(\!-j 2\pi \big(\tfrac{p u}{I_s} + \tfrac{q v}{I_c}\big)\Big), \qquad (3)$$

where $w(u,v)$ is a 2-D window function, and p and q are the range and Doppler frequency indexes. The range and relative velocity resolutions can be deduced as

$$\Delta r = \frac{c\, T_c}{2 f_B}\,\Delta f_r = \frac{c}{2 f_B}, \qquad \Delta v = \frac{\lambda}{2}\,\Delta f_D = \frac{\lambda}{2 I_c T_c}, \qquad (4)$$

where the range and Doppler frequency resolutions $\Delta f_r$ and $\Delta f_D$ are $1/T_c$ and $1/(I_c T_c)$, respectively. To improve the signal-to-noise ratio (SNR), we sum the RD spectrums of the three receive antennas incoherently, i.e.,

$$RD(p,q) = \sum_{z=1}^{3} \big|B^{(z)}(p,q)\big|. \qquad (5)$$

To obtain the range, Doppler and AoA information of the hand in every measurement-cycle, we select the K points from RD(p, q) which have the largest magnitudes. The parameter K is predefined, and its choice will be discussed in Section VII-A. Then, we extract the range frequencies, Doppler frequencies and magnitudes of those K points, which are denoted as $f_{r_k}$, $f_{D_k}$ and $A_k$, respectively, where k = 1, ..., K. The AoA can be calculated from the phase difference of the extracted points at the same positions of the complex-valued RD spectrums belonging to two receive antennas. The AoA in azimuth and elevation of the k-th point can be calculated as

$$\theta_{az,k} = \arcsin\!\Big(\frac{\lambda\,\Delta\psi_{az,k}}{2\pi d}\Big), \qquad (6)$$

$$\theta_{el,k} = \arcsin\!\Big(\frac{\lambda\,\Delta\psi_{el,k}}{2\pi d}\Big), \qquad (7)$$

respectively, where $\Delta\psi_{az,k}$ and $\Delta\psi_{el,k}$ are the phase differences $\psi\big(a_k^{(z_1)} (a_k^{(z_2)})^{*}\big)$ between the antenna pairs spanning the azimuth and elevation baselines, ψ(·) stands for the phase of a complex value, and $a_k^{(z)}$ is the complex amplitude $B^{(z)}(f_{r_k}, f_{D_k})$ from the z-th receive antenna. As a consequence, in every measurement-cycle, the k-th point in RD(p, q) has five attributes, i.e., range, Doppler, azimuth, elevation and magnitude. As depicted in Fig. 3, we encode the range, Doppler, azimuth, elevation and magnitude of those K points with the largest magnitudes in RD(p, q) along $I_L$ measurement-cycles into the feature cube V with dimension $I_L \times K \times 5$. V has five channels corresponding to the five attributes, and the entries of V at the l-th measurement-cycle can be described as

$$V(l, k, :) = \big[f_{r_k}(l),\; f_{D_k}(l),\; \theta_{az,k}(l),\; \theta_{el,k}(l),\; A_k(l)\big], \qquad (8)$$

where l = 1, ..., $I_L$.
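To make the per-measurement-cycle feature extraction concrete, a minimal C++ sketch is given below. It is an illustration under stated assumptions, not the paper's implementation: the three complex RD spectra are assumed to be already computed and stored row-major with dimension P × Q, the antenna pairs (0,1) and (0,2) are assumed to span the azimuth and elevation baselines with spacing d = λ/2, and the Doppler axis is assumed to be centered at bin Q/2. The names `FeaturePoint` and `extractFeatures` are ours.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <numeric>
#include <vector>

struct FeaturePoint {
    float range;      // range of the reflection point (m)
    float doppler;    // relative velocity (m/s)
    float azimuth;    // angle of arrival in azimuth (rad)
    float elevation;  // angle of arrival in elevation (rad)
    float magnitude;  // non-coherently integrated magnitude
};

// Extracts the K strongest points of one measurement-cycle from the
// complex RD spectra of three receive antennas (row-major, P x Q).
// With d = lambda/2 the AoA reduces to arcsin(dPhi / pi).
std::vector<FeaturePoint> extractFeatures(
    const std::vector<std::complex<float>> spec[3],
    int P, int Q, int K, float rangeRes, float velRes) {
    const float kPi = 3.14159265358979f;
    const int N = P * Q;

    // Incoherent integration: sum of magnitudes over the three antennas.
    std::vector<float> rd(N);
    for (int n = 0; n < N; ++n)
        rd[n] = std::abs(spec[0][n]) + std::abs(spec[1][n]) + std::abs(spec[2][n]);

    // Indices of the K largest magnitudes.
    std::vector<int> idx(N);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + K, idx.end(),
                      [&rd](int a, int b) { return rd[a] > rd[b]; });

    std::vector<FeaturePoint> pts(static_cast<std::size_t>(K));
    for (int k = 0; k < K; ++k) {
        const int n = idx[k];
        const int p = n / Q;  // range bin
        const int q = n % Q;  // Doppler bin
        const float dPhiAz = std::arg(spec[0][n] * std::conj(spec[1][n]));
        const float dPhiEl = std::arg(spec[0][n] * std::conj(spec[2][n]));
        pts[k].range     = p * rangeRes;
        pts[k].doppler   = (q - Q / 2) * velRes;  // assumes centered Doppler axis
        pts[k].azimuth   = std::asin(dPhiAz / kPi);
        pts[k].elevation = std::asin(dPhiEl / kPi);
        pts[k].magnitude = rd[n];
    }
    return pts;
}
```

Stacking the K returned points of $I_L$ consecutive measurement-cycles then yields the $I_L \times K \times 5$ feature cube described above.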
Similar to voice activity detection in automatic speech recognition systems, our gesture recognition system also needs to detect hand activity in advance, before forwarding the data to the classifier. This helps to design a power-efficient gesture recognition system, since the classifier is only activated when a gesture is detected rather than being kept active for every measurement-cycle. State-of-the-art event detection algorithms usually detect the start time-stamp of an event. For example, the authors in [25] used the STA/LTA and power spectral density methods to detect when a micro-seismic event occurs. In the case of radar-based gesture recognition, we could also theoretically detect the start time-stamp of a gesture and consider that a gesture event occurs within the following $I_L$ measurement-cycles. However, detecting the start time-stamp and forwarding the hand data in the following $I_L$ measurement-cycles to the classifier would cause a certain time delay, since the time duration of the designed gestures is usually different. As illustrated in Fig. 4(a), due to the fact that the proposed multi-feature encoder requires $I_L$ measurement-cycles and the duration of a gesture is usually shorter than $I_L$ measurement-cycles, a delay occurs if we detect the start time-stamp of the gesture. Therefore, as depicted in Fig. 4(b), to reduce the time delay, our proposed HAD algorithm is designed to detect when a gesture finishes, i.e., the tail of a gesture, rather than detecting the start time-stamp. We propose an STA/LTA-based gesture detector to detect the tail of a gesture. The exponential moving average (EMA) is used to track the change of the magnitude signal at the l-th measurement-cycle, and it is given as

$$\mathrm{EMA}(l) = \alpha\, x(l) + (1-\alpha)\,\mathrm{EMA}(l-1), \qquad (9)$$

where α ∈ [0, 1] is the predefined smoothing factor and x(l) is the range-weighted magnitude (RWM), defined as

$$x(l) = \beta\, f_{r_{\max}}(l)\, A_{\max}(l), \qquad (10)$$

where $A_{\max}$ represents the maximal magnitude among the K points in RD(p, q) at the l-th measurement-cycle, $f_{r_{\max}}$ denotes the range corresponding to $A_{\max}$, and the predefined coefficient β denotes the compensation factor. The radar cross section (RCS) of a target is independent of the propagation path loss between the radar and the target. According to the radar equation [29], the measured magnitude of a target is a function of many arguments, such as the path loss, RCS, etc. As deduced in (10), we build a coarse estimate of the RCS by multiplying the maximal range information with its measured magnitude to partially compensate the path loss. Furthermore, we define STA(l) and LTA(l) as the mean EMA in a short and a long window at the l-th measurement-cycle:

$$\mathrm{STA}(l) = \frac{1}{L_1}\sum_{m=l-L_1+1}^{l} \mathrm{EMA}(m), \qquad \mathrm{LTA}(l) = \frac{1}{L_2}\sum_{m=l-L_2+1}^{l} \mathrm{EMA}(m), \qquad (11)$$

respectively, where $L_1$ and $L_2$ are the lengths of the short and the long window. The tail of a gesture is detected when the following conditions are fulfilled:

$$\mathrm{LTA}(l) > \gamma_1 \quad \text{and} \quad \frac{\mathrm{STA}(l)}{\mathrm{LTA}(l)} < \gamma_2, \qquad (12)$$

where $\gamma_1$ and $\gamma_2$ are the predefined detection thresholds. Fig. 5 illustrates that the tails of two gestures are detected via the proposed STA/LTA gesture detector. According to (12), one condition for detecting the tail of a gesture is that the average of the RWM in the long window exceeds the threshold $\gamma_1$, which means that a hand motion appears in the long window. The other condition is that the ratio of the mean EMA in the short window to that in the long window is lower than the threshold $\gamma_2$; in other words, it detects when the hand movement finishes. In practice, the parameters β, $\gamma_1$ and $\gamma_2$ in our HAD algorithm should be carefully chosen according to different application scenarios.
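The tail detector itself reduces to a few lines of bookkeeping. The following C++ sketch follows the EMA recursion (9) and the detection conditions (12) as reconstructed above; the class name `TailDetector` and the window handling are ours, and the computation of the range-weighted magnitude x(l) is left to the caller, since the paper only describes it in Section IV.

```cpp
#include <cstddef>
#include <deque>
#include <iterator>
#include <numeric>

// STA/LTA-based detector for the tail (end) of a gesture.
// One range-weighted magnitude x(l) is fed in per measurement-cycle;
// update() returns true when the tail of a gesture is detected.
class TailDetector {
public:
    TailDetector(float alpha, std::size_t shortLen, std::size_t longLen,
                 float gamma1, float gamma2)
        : alpha_(alpha), L1_(shortLen), L2_(longLen),
          gamma1_(gamma1), gamma2_(gamma2) {}

    bool update(float x) {
        // Exponential moving average of the range-weighted magnitude, eq. (9).
        ema_ = alpha_ * x + (1.0f - alpha_) * ema_;
        history_.push_back(ema_);
        if (history_.size() > L2_) history_.pop_front();
        if (history_.size() < L2_) return false;  // long window not yet filled

        const float lta = mean(history_.begin(), history_.end());
        const float sta = mean(history_.end() - static_cast<std::ptrdiff_t>(L1_),
                               history_.end());

        // Conditions of eq. (12):
        // LTA > gamma1: a hand motion is present in the long window.
        // STA/LTA < gamma2: the activity has died down, i.e. the gesture just finished.
        return lta > gamma1_ && sta / lta < gamma2_;
    }

private:
    template <typename It>
    static float mean(It first, It last) {
        return std::accumulate(first, last, 0.0f) /
               static_cast<float>(std::distance(first, last));
    }

    float alpha_;
    std::size_t L1_, L2_;
    float gamma1_, gamma2_;
    float ema_ = 0.0f;
    std::deque<float> history_;
};
```

When the detector fires, the feature matrices of the preceding measurement-cycles are stacked into the feature cube and handed to the classifier, as described above.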
As discussed in Section III-D, the feature cube obtained by the multi-feature encoder has a dimension of $I_L \times K \times 5$. Thus, we can directly use a CNN for classification without any reshaping operation. The structure of the CNN can be seen in Fig. 6. We employ four convolutional (Conv) layers, each of which has a kernel size of 3 × 3, and the number of kernels in each Conv layer is 64. In addition, the kernel depth in the first Conv layer is five, since the input feature cube has five channels (i.e., range, Doppler, azimuth, elevation and magnitude), while the kernel depth in the following three Conv layers is 64. We choose the rectified linear unit (ReLU) [30] as the activation function, since it alleviates the vanishing-gradient problem and accelerates the convergence of training [31]. The last Conv layer is followed by two fully-connected (FC) layers, each of which has 256 hidden units and is followed by a dropout layer to prevent the network from overfitting. A third FC layer with a softmax function is utilized as the output layer. The number of hidden units in the third FC layer matches the number of classes in the dataset. The softmax function normalizes the output of the last FC layer to a probability distribution over the classes. Through thorough network tuning (e.g., number of hidden layers, number of hidden units, kernel depth), we arrive at the CNN structure shown in Fig. 6. The designed network should (a) take the feature cube as input, (b) achieve a high classification accuracy, (c) consume few computational resources, and (d) be deployable in the edge-computing platform. In Section VII, we will show that the designed network in Fig. 6 fulfills these criteria. As illustrated in Fig. 7, we used the 60 GHz FMCW radar in Fig. 2 to recognize gestures. Our radar system has a detection range of up to 0.9 m and an approx. 120° antenna beam width in both the azimuth and elevation directions. The parameter setting used in the waveform design is presented in Table I, where the pulse repetition interval (PRI) is 34 ms. The radar is connected to an edge-computing platform, i.e., an NVIDIA Jetson Nano, which is equipped with a quad-core ARM A57 at 1.43 GHz as central processing unit (CPU), a 128-core Maxwell as graphics processing unit (GPU) and 4 GB memory. We have built the entire radar-based gesture recognition framework described in Fig. 1 on the edge-computing platform in C/C++. The proposed multi-feature encoder and HAD have been implemented in a straightforward manner without any runtime optimization, while the implementation of the CNN is supported by TensorRT developed by NVIDIA. In addition, the 12 types of gestures to be recognized are depicted in Fig. 8. We invited 20 human subjects, including both genders with various heights and ages, to perform these gestures. Among the 20 subjects, the ages range from 20 to 35 years and the heights from 160 cm to 200 cm. We divided the 20 subjects into two groups. In the first group, ten subjects were taught how to perform the gestures in a normative way. In the second group, in order to increase the diversity of the dataset, only one example of each gesture was demonstrated to the other ten subjects, and they performed the gestures according to their own interpretation. Self-evidently, their gestures were no longer as normative as the ones performed by the ten taught subjects. Furthermore, every subject repeated each gesture 30 times.
Therefore, the total number of realizations in our gesture dataset is (12 gestures) × (20 subjects) × (30 repetitions), namely 7200. We also found that the gestures performed in our dataset take less than 1.2 s. Thus, to ensure that the entire hand movement of a gesture is included in the observation time, we set $I_L$ to 40, which amounts to a duration of 1.36 s (40 measurement-cycles × 34 ms). In this section, the proposed approach is evaluated with regard to a twofold objective: first, its performance is thoroughly compared with benchmarks from the literature through off-line cross-validation, and second, its real-time capability is investigated with an on-line performance test. In Section VII-A, we discuss how the parameter K affects the classification accuracy. In Section VII-B, we compare our proposed algorithm with state-of-the-art radar-based gesture recognition algorithms in terms of classification accuracy and computational complexity based on leave-one-out cross-validation (LOOCV). This means that, in each fold, we use the gestures from one subject as the test set and the rest as the training set. In addition, Section VII-C describes the real-time evaluation results of our system. The performances for taught and untaught subjects are evaluated separately. We randomly selected eight taught and eight untaught subjects as training sets, while the remaining two taught and two untaught subjects form the test sets. In the real-time performance evaluation, we performed a hardware-in-the-loop (HIL) test and fed the raw data recorded by the radar from the four test subjects into our edge-computing platform. A. Determination of Parameter K As described in Section III, we extract the K points with the largest magnitudes from RD(p, q) to represent the hand information in a single measurement-cycle. We define the average (avg.) accuracy as the avg. classification accuracy across the 12 gestures based on LOOCV. In Fig. 9, we let K vary from 1 to 40 and compute the avg. accuracy in five trials. It can be seen that the mean avg. accuracy over five trials keeps increasing and reaches approx. 95% when K is 25. After that, increasing K can barely improve the classification accuracy. As a result, in order to keep the computational complexity of the system low and achieve a high classification accuracy, we set K to 25. Consequently, the feature cube V in our proposed multi-feature encoder has a dimension of 40 × 25 × 5. In the off-line case, we assumed that each gesture is perfectly detected by the HAD algorithm and compared our proposed multi-feature encoder + CNN with the 2-D CNN + LSTM [14], the 3-D CNN + LSTM [21], the 3-D CNN + LSTM (with AoA) and the shallow 3-D CNN + LSTM (with AoA) in terms of the avg. classification accuracy and computational complexity based on LOOCV. In our proposed multi-feature encoder + CNN, the feature cube V, which has the dimension of 40 × 25 × 5, was fed into the CNN described in Fig. 6. The input of the 2-D CNN + LSTM [14] and the 3-D CNN + LSTM [21] is the RD spectrum sequence over 40 measurement-cycles, which has the dimension of 40 × 32 × 32 × 1. Since [21] did not include any AoA information in their system for gesture classification, the comparison might not be fair. Thus, we added the AoA information according to (6) and (7) to the input of the 3-D CNN + LSTM, referred to as the 3-D CNN + LSTM (with AoA); we also evaluated a shallow variant of it, which requires fewer computations than the deep 3-D CNN but with reduced classification accuracy. To achieve a fair comparison, we optimized the structures and the hyperparameters as well as the training parameters of those models. The CNN demonstrated in Fig.
6 in the proposed approach was trained for 15000 steps using back propagation [32] and the Adam optimizer [33] with an initial learning rate of $1 \times 10^{-4}$, which decayed to $10^{-5}$, $10^{-6}$ and $10^{-7}$ after 5000, 8000 and 11000 steps, respectively. The batch size is 128. 1) Classification Accuracy and Training Loss Curve: In Table II, we present the classification accuracy of each type of gesture for the algorithms mentioned above. The avg. accuracies of the 2-D CNN + LSTM [14] and the 3-D CNN + LSTM [21] are only 78.50% and 79.76%, respectively. Since no AoA information is utilized, the Rotate CW and Rotate CCW gestures can hardly be distinguished, and similarly the four Swipe gestures can hardly be separated either. On the contrary, by considering the AoA information, the multi-feature encoder + CNN, the 3-D CNN + LSTM (with AoA) and the shallow 3-D CNN + LSTM (with AoA) are able to separate the two Rotate gestures and the four Swipe gestures. It should be mentioned that the avg. accuracy of our proposed multi-feature encoder + CNN is almost the same as that of the 3-D CNN + LSTM (with AoA). However, it will be shown in the following section that our approach requires far fewer computational resources and much less memory than the other approaches. What's more, in Fig. 10, we plot the training loss curves of the three structures of neural networks. It can be seen that the loss of the proposed CNN in Fig. 6 has the fastest rate of convergence among the three structures and approaches zero at around the 2000-th training step. Unlike the input of the 3-D CNN + LSTM (with AoA) and the shallow 3-D CNN + LSTM (with AoA), the feature cube contains sufficient gesture characteristics in spite of its compact form (40 × 25 × 5). As a result, the CNN in Fig. 6 is easier to train than the other neural networks and achieves a high classification accuracy. 2) Confusion Matrix: In Fig. 11, we plot two confusion matrices for the ten taught and ten untaught subjects based on our proposed multi-feature encoder + CNN. It can be observed that, for the normative gestures performed by the ten taught subjects, we reach approx. 98.47% avg. accuracy. Although we observe an approx. 5% degradation in avg. accuracy in Fig. 11(b), where the gestures to be classified are performed by the ten untaught subjects, the avg. accuracy still reaches 93.11%. 3) Computational Complexity and Memory: The structures of the 3-D CNN + LSTM (with AoA), the shallow 3-D CNN + LSTM (with AoA) and the proposed multi-feature encoder + CNN are presented in Table III. We evaluated their computational complexity and required memory in terms of giga floating-point operations (GFLOPs) and model size. The GFLOPs of the different models were calculated by the built-in function in TensorFlow, and the model size was observed through TensorBoard [34]. Although the 3-D CNN + LSTM (with AoA) offers almost the same classification accuracy as the proposed multi-feature encoder + CNN, it requires many more GFLOPs than the multi-feature encoder + CNN (2.89 GFLOPs vs. 0.26 GFLOPs). Its model size is also much larger than that of the proposed approach (109 MB vs. 4.18 MB). Although we could reduce its GFLOPs using a shallow network structure, such as the shallow 3-D CNN + LSTM (with AoA) in Table III, this results in a degradation of the classification accuracy (94.36%), as can be seen in Table II.
We also found that the CNN used in our approach has the smallest model size, since its input dimension is much smaller than that of the other approaches. In contrast, the input of the 3-D CNN + LSTM (with AoA) contains many zeros due to the sparsity of the RD spectrums. Such large input volumes usually require large numbers of coefficients in neural networks. We, on the other hand, exploit the hand information in every measurement-cycle using only 25 points, so the input dimension of the CNN is only 40 × 25 × 5, which requires much less computation than the other approaches. As mentioned above, the subjects are divided into a taught and an untaught group, each with ten subjects. In each group, eight subjects are randomly selected as the training set, and the remaining two subjects constitute the test set, so that either group has 720 true gestures in its test set. In the HIL context, we directly fed the recorded raw data from the four test subjects into the edge-computing platform. In the real-time case, the system should be robust enough to distinguish true gestures from random motions (RMs). Thus, we also included a certain amount of RMs as negative samples during the training phase. The ratio of RMs to true gestures is around 1:3. 1) Precision, Recall and F1-score: To quantitatively analyze the real-time performance of our system, we introduce the precision, recall and F1-score, which are calculated as

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$

where TP, FP and FN denote the numbers of true positive, false positive and false negative estimates. For the two subjects in each test set, we have 60 realizations of each gesture, which means that TP + FN = 60. As presented in Table IV, the avg. precision and recall over the 12 types of gestures using the two taught subjects as the test set are 93.90% and 94.44%, respectively, while those using the two untaught subjects as the test set are 91.20% and 86.11%. It should be mentioned that the off-line avg. accuracies in Fig. 11, namely 98.47% and 93.11%, can also be regarded as the recall in the taught and untaught cases. Compared with the recall in the off-line case, we observe an approx. 4% and 7% degradation in recall in the real-time case for the taught and untaught subjects, respectively. The reason is that, in the off-line performance evaluation, we assumed that each gesture is detected perfectly. In the real-time case, however, the recall reduction is caused by the fact that our HAD miss-detects some gestures or triggers the classifier even though the gesture is not yet completely finished. For example, due to the small movement of the hand, the HAD sometimes fails to detect the gesture "Pinch index". Similarly, the recall of the gesture "Cross" is also impaired, since the gesture "Cross" has a turning point, which leads to a short pause. In some cases where the subject performs the gesture "Cross" with low velocity, the HAD incorrectly considers the turning point as the end of "Cross", resulting in a wrong classification. Overall, in the taught and untaught cases, the F1-score of our radar-based gesture recognition system reaches 94.17% and 88.58%, respectively.
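As a quick cross-check of these metrics, the short C++ helper below computes precision, recall and F1-score from detection counts; its main() reproduces the taught-subject F1-score of approx. 94.17% from the reported avg. precision (93.90%) and recall (94.44%). The helper is only an illustrative sketch, not part of the evaluation code used in the paper.

```cpp
#include <cstdio>

struct Scores {
    double precision;
    double recall;
    double f1;
};

// Precision, recall and F1-score from true-positive, false-positive and
// false-negative counts, as used in the real-time evaluation.
Scores evaluate(int tp, int fp, int fn) {
    Scores s{};
    s.precision = static_cast<double>(tp) / (tp + fp);
    s.recall    = static_cast<double>(tp) / (tp + fn);
    s.f1        = 2.0 * s.precision * s.recall / (s.precision + s.recall);
    return s;
}

int main() {
    // Consistency check with the averaged values reported for the two
    // taught test subjects: precision 93.90%, recall 94.44%.
    const double p = 0.9390, r = 0.9444;
    std::printf("F1 = %.4f\n", 2.0 * p * r / (p + r));  // prints F1 = 0.9417
    return 0;
}
```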
2) Detection Matrix: We summarize the gesture detection results of our real-time system. Since we do not aim to evaluate the classification performance here, we depict the detection results in Table V considering all four test subjects. Our system correctly detected 1388 true positive gestures and provoked 25 false alarms among a total of 1864 test samples, which comprise 1440 true gestures and 424 true negative RMs. Furthermore, we define two different types of miss-detections (MDs): MDs from the HAD mean that our HAD miss-detects a gesture, while MDs from the classifier mean that the HAD detects the gesture, but the gesture is incorrectly rejected by our classifier as an RM. The false alarm rate (FAR) and miss-detection rate (MDR) of our system are 5.90% and 3.61%, respectively. 3) Runtime: As depicted in Table VI, in the HIL context, we also noted the avg. runtime of the multi-feature encoder, the HAD and the CNN based on all 1838 classifications, which include 1388 true positives, 399 true negatives, 25 false alarms and 26 MDs from the classifier. The multi-feature encoder comprises the 2-D FFT, the selection of the 25 points, and the RD and AoA estimation. It should be mentioned that the multi-feature encoder and the HAD were executed on the CPU using unoptimized C/C++ code, while the CNN ran on the GPU based on TensorRT. The multi-feature encoder and the HAD took only approx. 7.12 ms and 0.38 ms, respectively, without using any FFT acceleration engine, while the CNN took 25.84 ms on average. The overall runtime of our proposed radar-based gesture recognition system is therefore only approx. 33 ms. We developed a real-time radar-based gesture recognition system built in an edge-computing platform. The proposed multi-feature encoder effectively encodes the gesture profile, i.e., range, Doppler, azimuth, elevation and temporal information, into a feature cube, which is then fed into a shallow CNN for gesture classification. Furthermore, to reduce the latency caused by the fixed number of required measurement-cycles in our system, we proposed the STA/LTA-based gesture detector, which detects the tail of a gesture. In the off-line case, based on LOOCV, our proposed gesture recognition approach achieves 98.47% and 93.11% avg. accuracy using gestures from taught and untaught subjects, respectively. In addition, the trained shallow CNN has a small model size and requires few GFLOPs. In the HIL context, our approach achieves 94.17% and 88.58% F1-scores with two taught and two untaught subjects as test sets, respectively. Finally, our system runs on the edge-computing platform and requires only approx. 33 ms to recognize a gesture. Thanks to the promising recognition performance and low computational complexity, our proposed radar-based gesture recognition system has the potential to be utilized in numerous applications, such as mobile and wearable devices. In future work, gesture datasets with larger diversity need to be constructed according to specific use cases. What's more, in some use cases where the radar is not stationary with respect to the user, the classification accuracy of the proposed system might decrease; accordingly, algorithms such as ego-motion compensation could be considered.
[1] Micro-Doppler effect in radar: Phenomenon, model, and simulation study
[2] Millimeter-wave technology for automotive radar sensors in the 77 GHz frequency band
[3] An ultra-wideband 80 GHz FMCW radar system using a SiGe bipolar transceiver chip stabilized by a fractional-N PLL synthesizer
[4] Radar-based human-motion recognition with deep learning: Promising applications for indoor monitoring
[5] Radar signal processing for sensing in assisted living: The challenges associated with real-time implementation of emerging algorithms
[6] Motion sensing using radar: Gesture interaction and beyond
[7] Google Pixel 4 and 4 XL hands-on: This time, it's not about the camera
[8] Persistence of coronaviruses on inanimate surfaces and its inactivation with biocidal agents
[9] Gesture classification with handcrafted micro-Doppler features using a FMCW radar
[10] Soli: Ubiquitous gesture sensing with millimeter wave radar
[11] Hand gesture recognition based on radar micro-Doppler signature envelopes
[12] Hand gesture recognition using micro-Doppler signatures with convolutional neural network
[13] Sparsity-driven micro-Doppler feature extraction for dynamic hand gesture recognition
[14] Interacting with Soli: Exploring fine-grained dynamic gesture recognition in the radio-frequency spectrum
[15] TS-I3D based hand gesture recognition method with radar sensor
[16] Short-range radar based real-time hand gesture recognition using LSTM encoder
[17] Short-range radar-based gesture recognition system using 3D CNN with triplet loss
[18] Hand-gesture recognition using two-antenna Doppler radar with deep convolutional neural networks
[19] Automatic radar-based gesture detection and classification via a region-based deep convolutional neural network
[20] u-DeepHand: FMCW radar-based unsupervised hand gesture feature learning using deep convolutional auto-encoder network
[21] Latern: Dynamic continuous hand gesture recognition using FMCW radar sensor
[22] Riddle: Real-time interacting with hand description via millimeter-wave sensor
[23] 3D convolutional neural networks for human action recognition
[24] Multimodal gesture recognition using 3-D convolution and convolutional LSTM
[25] Comparison of the STA/LTA and power spectral density methods for microseismic event detection
[26] New chirp sequence radar waveform
[27] A high-resolution framework for range-Doppler frequency estimation in automotive radar systems
[28] Two-dimensional subspace-based model order selection methods for FMCW automotive radar systems
[29] Radar handbook
[30] Rectified linear units improve restricted Boltzmann machines
[31] Empirical evaluation of rectified activations in convolutional network
[32] Backpropagation applied to handwritten zip code recognition
[33] Adam: A method for stochastic optimization
[34] TensorFlow: A system for large-scale machine learning

and the Research Institute for Automotive Electronics (E-LAB) in collaboration with HELLA GmbH & Co. KGaA, Lippstadt, Germany. His research interests are automotive radar signal processing, radar-based human motion recognition and machine learning collaboration with the Signal Processing Group at TUD, Darmstadt, Germany, where his research interest was the detection and classification of underwater mines in sonar imagery Lippstadt, Germany, where he is mainly responsible for the development of reliable signal processing algorithms for automotive radar systems Xibo Li received the B.Sc.
degree in mechanical engineering from Beijing Institute of Technology His current research interests include automotive radar signal processing, machine learning and sensor fusion As a Research Associate at the Institute for Power Electronic and Electrical Drives (ISEA), he was involved in several projects related to ageing of lithiumion batteries at the chair for electrochemical energy conversion and storage systems He joined the Department of Communications Engineering of the University of Paderborn in 2001 as a Research Staff Member, where he was involved in several projects related to single-and multi-channel speech processing and automated speech recognition He is currently the head of the Radar Signal Processing and Signal Validation Department at HELLA GmbH & Co. KGaA, Lippstadt, Germany. Nils Pohl (GSM'07-M'11-SM'14) received the Dipl.-Ing. and Dr.-Ing. degrees in electrical engineering from He has authored or coauthored more than 100 scientific papers and has issued several patents. His current research interests include ultra-wideband mm-wave radar, design, and optimization of mm-wave integrated SiGe circuits and system concepts with frequencies up to 300 GHz and above, as well as frequency synthesis and antennas. Prof. Pohl is a member of VDE, ITG, EUMA, and URSI. He was a corecipient of the The authors would like to thank the editor and anonymous reviewers for giving us fruitful suggestions, which significantly improve the quality of this paper. Many thanks to the students for helping us collect the gesture dataset in this interesting work.