key: cord-0058853-p8xvsv41
authors: Augello, Agnese; Ciulla, Angelo; Cuzzocrea, Alfredo; Gaglio, Salvatore; Pilato, Giovanni; Vella, Filippo
title: A Kinect-Based Gesture Acquisition and Reproduction System for Humanoid Robots
date: 2020-08-24
journal: Computational Science and Its Applications - ICCSA 2020
DOI: 10.1007/978-3-030-58799-4_69
sha: 702506e584c91e2f81e6a94a3b1443cca6c300ab
doc_id: 58853
cord_uid: p8xvsv41

The paper illustrates a system that endows a humanoid robot with the capability to mimic the motion of a human user in real time, serving as a basis for further gesture-based human-robot interactions. The described approach uses the Microsoft Kinect as a low-cost alternative to expensive motion capture devices.

Nowadays, the interaction between human beings and robots has become a very relevant issue in a wide range of applications (e.g., [14, 18, 20]). It is commonly agreed that communication between humans is based on both verbal and non-verbal cues. A humanoid robot capable of interacting with people by combining speech and gestures would dramatically increase the naturalness of social interactions. On the other hand, other studies, such as [7, 13, 17], consider knowledge management techniques (e.g., [7]) to improve this phase. Furthermore, the Microsoft Kinect is a popular choice for research involving body motion capture: it is an affordable, low-cost device that can be used for non-invasive, marker-less tracking of body gestures. As an example, Baron et al. [4] controlled a Mindstorm NXT artificial arm with a Kinect sensor, employing gesture recognition to regulate the arm movement. Chang et al. [5] developed a Kinect-based gesture command control method for driving a humanoid robot to learn human actions, using a Kinect sensor and three different recognition mechanisms: dynamic time warping (DTW), hidden Markov models (HMM), and principal component analysis (PCA). Meanwhile, Sylvain Filiatrault and Ana-Maria Cretu [11] used a Kinect sensor to make a NAO humanoid robot mimic the motion of a human arm. In their case, the software architecture is based on three modules: the Kinect Manager, the Interaction Manager, and the NAO Manager. The Kinect Manager deals with the events and data captured by the Kinect; its Kinect Transformer class is used to obtain the Euler angles of the desired joints. The Interaction Manager is the intermediary between the Kinect and the robot and contains the repository of the joints used by the other two modules. The use of a joint repository of all articulations reduces the data to be processed, as some joints are not needed. Finally, the NAO Manager contains the static and dynamic constraints to apply to each of the articulations, as well as the methods that control the robot movements. To make sure that the robot has enough time to execute each gesture, a delay of 200 ms between one cycle and the next was introduced. Itauma et al. [12] used a Kinect to teach a NAO robot some basic Sign Language gestures; the aim was teaching Sign Language to impaired children by employing different machine learning techniques in the process. Shohin et al. [15] used three different methods to make a NAO robot imitate human motion: direct angle mapping, inverse kinematics using fuzzy logic, and an iterative Jacobian method. In some cases, neural networks were used: Miguel et al.
[16] used a Kinect sensor and a Convolutional Neural Network (CNN), trained with the MSRC-12 dataset [1], to capture and classify the gestures of a user and send the related commands to a mobile robot. The dataset, created by Microsoft, contains 6244 gesture instances of 12 actions. To obtain gestures of the same length without losing relevant information, the system used the Fast Dynamic Time Warping algorithm (FastDTW) to find the optimal match between sequences by non-linearly warping them along the time axis. As a result, all gestures were normalized to sequences of 667 frames, with each frame holding 80 variables, corresponding to the x, y, z values of each of the 20 joints plus a separation value for each joint. The resulting 667 × 80 matrix is used as the input of the CNN, which classifies it as one of the 12 possible gestures. The CNN was trained using two strategies: combined training, with a single CNN recognizing all 12 gestures, and individual training, with 12 different CNNs, each capable of recognizing only one gesture. The accuracy rates were 72.08% for combined training and 81.25% for individual training. Moreover, Unai et al. [19] developed a natural talking-gesture generation behavior for Pepper by feeding a Generative Adversarial Network (GAN) with human talking gestures recorded by a Kinect. Their approach to mapping the movements detected by the Kinect onto the robot is very similar to ours, but while they feed the resulting values to a neural network (a GAN), we use the (filtered) values directly.

This paper reports the implementation of a system able to acquire and reproduce the gestures performed by a human during an interactive session. In our approach, we exploited a Microsoft Kinect sensor to capture the motion data of a user and then defined a mapping algorithm that allows a SoftBank Pepper robot to reproduce the tracked gestures as closely as possible to the original ones. In particular, we used the OpenNI driver for the Kinect, the NiTE 2.2 libraries for detecting the user skeleton, and the Kinetic version of ROS with the pepper_dcm module to provide package exchange and bridging between the computer and the robot, all running on Ubuntu 16.04. We focused on the movements of the arms and the head, laying the basis for the extension of the same approach to the remaining body parts.

The developed system is structured as a set of modules, to increase versatility for future projects and to simplify possible extensions of the current one. Besides the Kinect itself, the first module is named Viewer: it extracts data frames from the Kinect (consisting of nine float values per joint: three values for the joint position in 3D space, four values for the quaternion from the origin, and reliability values for both) and sends them to a pipe. The module also provides the feed of the Kinect camera with an overlay of the tracked user's skeleton. The pipe, 8640 characters long (64 characters for each joint value, 9 values for each joint, 15 joints in total), is read by the second module, Gesture Brain. The Gesture Brain module works both as a gateway to the ROS system [2] and as the module where the actual data processing takes place. The gathered data cannot be used directly: a mapping is required to correctly associate each user joint position with the equivalent one on the Pepper robot. For this reason, the data is parsed and structured into a 15 × 9 float matrix, which is then separated into three matrices: one for coordinates, one for quaternions, and one for reliability values.
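As an illustration of this parsing step, the sketch below shows how one frame read from the Viewer pipe could be turned into the three matrices just described. It is a minimal sketch under stated assumptions: the fixed-width layout of 64 characters per value and the 15 × 9 frame shape come from the description above, while the column ordering (position, quaternion, reliabilities) and all function and constant names are hypothetical and not taken from the project's code.

```python
import numpy as np

NUM_JOINTS = 15       # joints tracked by the skeleton middleware
VALUES_PER_JOINT = 9  # x, y, z, quaternion (4), reliability of position and orientation
FIELD_WIDTH = 64      # each value is serialized as a 64-character field
FRAME_SIZE = NUM_JOINTS * VALUES_PER_JOINT * FIELD_WIDTH  # 8640 characters

def parse_frame(raw: str):
    """Parse one 8640-character frame from the Viewer pipe into three matrices.

    Returns (coords, quats, reliab):
      coords : 15 x 3 joint positions
      quats  : 15 x 4 joint orientations as quaternions
      reliab : 15 x 2 reliability values (position, orientation)
    """
    assert len(raw) == FRAME_SIZE
    # Split the fixed-width string into 135 float fields and reshape to 15 x 9.
    values = [float(raw[i:i + FIELD_WIDTH]) for i in range(0, FRAME_SIZE, FIELD_WIDTH)]
    frame = np.array(values).reshape(NUM_JOINTS, VALUES_PER_JOINT)
    coords = frame[:, 0:3]   # joint positions in 3D space
    quats = frame[:, 3:7]    # orientation quaternions
    reliab = frame[:, 7:9]   # reliability values
    return coords, quats, reliab
```

Keeping the three matrices separate makes it straightforward to ignore the quaternion and reliability data, as the mapping algorithm described below does.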
In our algorithm, we decided to use only the first matrix for simplicity reasons, neglecting the quaternion matrix and not performing any reliability check on the joints. We assume that the joint data is accurate enough for our purpose, as the Kinect already discards joints whose reliability values are too low. The joint position data is used to estimate the Pepper joint angles, specifically shoulder pitch, shoulder roll, elbow roll and elbow yaw for both arms, and head yaw for the head (three more joint angles could be estimated, namely left and right wrist yaw and head pitch, but the Kinect is too imprecise to allow a good estimate, so they have been fixed to a value of 0). The details of this estimation are discussed in the next section. After all required values are collected, we use the ROS threads provided by the pepper_dcm bridge to send the joint angles to the robot. These threads consist of multiple joint angles divided into groups, each group representing a body part. As we are interested only in the movement of the arms and the head, we use three groups: head, left arm, and right arm. The bridge reads the sent values and the time between each capture to dynamically calculate the gesture trajectory in real time. This means that, to keep the system as accurate as possible, the gesture should be executed quite slowly. The bridge itself was modified to activate the built-in Self Collision Avoidance (part of the NaoQi library) and to deactivate the wait and breathe animations, as they interfere with the commands sent by pepper_dcm (Figs. 1 and 2).

The Pepper robot has five degrees of freedom for each arm (each one associated with a joint), unlike human beings, who have seven; a mapping is thus required. From the Kinect we extract the Cartesian coordinates of each joint, the quaternion of each segment (both referenced globally), and a reliability value for both. The pepper_dcm bridge uses Euler angles to communicate the new positions of the joint angles to the robot. The 3D space coordinates are used because the quaternions have proven unsuitable: the extracted quaternions do not represent the rotation from the previous frame, but rather the rotation from a reference quaternion, which leads to excessive inaccuracies once converted into Euler angles.

Let x, y and z be the unit vectors of the axes, that is: x = (1, 0, 0), y = (0, 1, 0), z = (0, 0, 1). Let S_L, E_L and W_L be the coordinates of the shoulder, the elbow and the wrist of the left arm, respectively; the segment vectors S_L E_L and E_L W_L are defined as the differences E_L - S_L and W_L - E_L. SR_L (left shoulder roll) is the supplement of the angle between S_L E_L and the -x axis. SP_L (left shoulder pitch) is the angle between the projection of S_L E_L on the zy plane and the z axis, shifted in range to avoid the jump discontinuity at 180 and -180 degrees. For values of SR_L close to π/2, SP_L becomes unstable; as such, the algorithm assigns it a value of 0 when SR_L > 1.3. ER_L (left elbow roll) is the angle between E_L W_L and S_L E_L, shifted by π/2. EY_L (left elbow yaw) is the angle between the projection of E_L W_L on the zy plane and the z axis, shifted in range for stability reasons, plus -SP_L. The right arm is handled almost exactly like the left arm; the only difference is that some angles have the opposite sign. Let HN be the difference between the coordinates of the joints H (head) and N; the head yaw HY is then the angle between the projection of HN on the xz plane and the z axis.

Kinect joint detection is based on the shape of the user, which is redrawn at every frame.
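The following sketch gives a rough idea of how these angles can be computed from the Kinect joint coordinates. It is only an illustration under assumptions: the exact range shifts and sign conventions used by the system are not fully specified in the text, so the atan2-based projections and the signs of the shifts below are simplifications, and all function names are hypothetical.

```python
import numpy as np

def angle_between(u, v):
    """Angle in radians between two 3D vectors."""
    cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def left_arm_angles(S_L, E_L, W_L):
    """Rough estimate of the left-arm angles from Kinect joint coordinates.

    S_L, E_L, W_L: 3D positions of the left shoulder, elbow and wrist.
    Returns (shoulder roll, shoulder pitch, elbow roll, elbow yaw).
    """
    x = np.array([1.0, 0.0, 0.0])
    SE = E_L - S_L  # shoulder-to-elbow vector
    EW = W_L - E_L  # elbow-to-wrist vector

    # Shoulder roll: supplement of the angle between SE and the -x axis.
    SR = np.pi - angle_between(SE, -x)

    # Shoulder pitch: angle of the projection of SE on the zy plane with respect
    # to the z axis; atan2 avoids the jump discontinuity at +/-180 degrees.
    SP = np.arctan2(SE[1], SE[2])

    # Near SR = pi/2 the projection degenerates, so SP is forced to 0.
    if SR > 1.3:
        SP = 0.0

    # Elbow roll: angle between forearm and upper arm, shifted by pi/2
    # (the sign of the shift is an assumption).
    ER = angle_between(EW, SE) - np.pi / 2

    # Elbow yaw: angle of the forearm projection on the zy plane, compensated by SP.
    EY = np.arctan2(EW[1], EW[2]) - SP

    return SR, SP, ER, EY

def head_yaw(H, N):
    """Head yaw: angle of the projection of H - N on the xz plane w.r.t. the z axis."""
    HN = H - N
    return np.arctan2(HN[0], HN[2])
```

The right-arm computation mirrors this one, with some angles taking the opposite sign as noted above.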
While calibrating the sensor helps to reduce the resulting jerkiness, a significant amount of noise is still left. This noise can be roughly classified into two categories: a constant Gaussian noise caused by small alterations of the detected shape, and large "spikes" that occur when the Kinect fails to guess the position of one or more joints (especially common when part of a limb is outside the frame or when two or more joints overlap). A simple way to compensate for part of this noise is to use a line of best fit. Given k points in an (x, y) coordinate system, we must find the values c_0 and c_1 in the equation y = c_0 + c_1 x that define the straight line minimizing the squared error E = Σ_j |c_0 + c_1 x_j - y_j|^2, which yields the standard least-squares solution c_1 = (k Σ_j x_j y_j - Σ_j x_j Σ_j y_j) / (k Σ_j x_j^2 - (Σ_j x_j)^2) and c_0 = (Σ_j y_j - c_1 Σ_j x_j) / k. The result is a smoother movement, especially when the Kinect is not able to detect the precise coordinates of a given joint. This is because, given a noisy signal, the line of best fit can be seen as an approximation of the tangent that the signal would have at that point if the noise were removed. This is not always true, especially when the signal changes rapidly, but it is close enough in most cases to give a generally cleaner movement (a minimal sketch of this smoothing step is given after the list of operating modes below).

Besides mimicking the user movement, the Gesture Brain module also implements some additional features that increase the breadth of experiments that can be performed with the system and help with future projects. The behavior of the program is managed by the input arguments. These are, in order: mode, mirror flag, JSON file name, LAjpos, RAjpos, Hjpos. The first determines which of the three modes of operation will be used (default 0); the second determines whether the mirror mode is activated (default false); the third defines the name of the text file used to record (in modes 0 and 1) or read (mode 2) the gestures and sets a flag (the record flag) to 1 (the default value is NULL, that is, no recording); the fourth, fifth and sixth are used to determine the pose to use in mode 1 (by default, the robot will spread its arms parallel to the ground, in a pose that in animation is known as a "T-pose"). In more detail, the modes of operation of the main program are:

- Mode 0, or "Mimic Mode", is the default mode and makes the robot mimic the movements of the user. The record flag makes it so that the output is not just sent to the ROS publishers, but also recorded in a JSON (JavaScript Object Notation) file, to be reproduced later. If the mirror flag is active, every movement is mirrored. In case both the record and mirror flags are active, the mirrored movement will be recorded and saved in the specified text file.
- Mode 1, or "Pose Mode", makes the robot execute a pose (defined at the beginning by the values of the given arguments) that the user must try to emulate. A distance algorithm calculates how close the user pose is to that of the robot, evaluated separately for the head, right upper arm, left upper arm, right forearm, and left forearm. If the user pose keeps all body parts below their respective thresholds (defined separately for each body part), the program will communicate the success and shut down. The record flag makes it so that the returned distance values are written to a file, while the mirror flag makes it so that the user must try to mirror the pose shown.
- Mode 2, or "Playback Mode", consists of reproducing a previously recorded gesture. The mirror flag, even if selected, has no effect on the algorithm. The file name that would normally activate the record flag is used here as the name of the file with the gesture to execute.
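As promised above, here is a minimal sketch of how the line-of-best-fit smoothing could be applied to a stream of joint-angle values. The sliding-window length and the class and method names are illustrative assumptions and do not come from the actual implementation.

```python
import numpy as np
from collections import deque

class JointSmoother:
    """Smooths a stream of joint-angle samples with a sliding-window line of best fit."""

    def __init__(self, window: int = 5):
        self.window = window                 # number of recent samples to fit (assumed value)
        self.samples = deque(maxlen=window)  # (frame_index, value) pairs

    def update(self, frame_index: int, value: float) -> float:
        """Add a new raw sample and return the smoothed estimate for this frame."""
        self.samples.append((frame_index, value))
        if len(self.samples) < 2:
            return value  # not enough points to fit a line yet
        x = np.array([s[0] for s in self.samples], dtype=float)
        y = np.array([s[1] for s in self.samples], dtype=float)
        # Least-squares fit of y = c0 + c1 * x over the window.
        c1, c0 = np.polyfit(x, y, 1)
        # Evaluate the fitted line at the current frame: this damps both the
        # Gaussian jitter and the occasional large spikes.
        return c0 + c1 * frame_index
```

In Mimic Mode, one such smoother per joint angle could be applied to each value before it is published to the pepper_dcm bridge.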
As an example, one experiment that was conceptualized consists in using the Pepper robot to show a specific pose that the user must replicate as closely as possible. The experiment envisages the use of both the normal mode and the mirror mode (Fig. 3). To measure the distance between the robot pose and the detected pose, we make use of a distance measurement algorithm. For each arm, the related joint angle group (consisting of elbow roll, elbow yaw, shoulder pitch, shoulder roll, and wrist yaw) is split into two vectors, representing the upper arm (shoulder pitch and shoulder roll) and the forearm (elbow roll, elbow yaw, and wrist yaw). Including the joint angle group for the head (head yaw and head pitch), we have five vectors. To avoid that joint angles with a larger range of values than others affect the final result excessively, each angle is divided by its respective maximum value. This process is applied both to the robot pose and to the detected pose of the user, giving us five pairs of vectors. Finally, a Mean Squared Error algorithm is applied to each pair of vectors, resulting in five distances defined in the interval [0, 1]. Evaluating each vector separately is more precise and reliable, and allows defining separate thresholds if a pass/fail system is implemented, as in our case (Fig. 4); a sketch of this comparison is given further below.

The activation of the mirror flag makes the algorithm mirror any movement detected. This means that in mode 0 the robot will mirror the user (and, if the program is recording, it will record the mirrored movement), while in mode 1 the user will have to mirror the pose shown. The mirroring consists in swapping the detected angles between the left and the right arm and changing the sign of every angle except the shoulder pitch; for the head yaw angle, a simple change of sign is enough. If the recording flag is active, the program writes a text file in JSON syntax with the joint angles sent to the robot (mode 0) or the distance from the given pose (mode 1) for each frame. In mode 2, the records created in mode 0 can be reproduced by the robot. In mode 0, the fields for each frame are:

The system illustrated in this paper is capable of detecting the user poses with the Kinect with sufficient accuracy. The first experiments show that the reproduced movements are precise and smooth, the mirroring is accurate, and the pose evaluation is coherent; furthermore, the recording and execution of the gestures are very close to the real-time movements. However, certain positions sometimes cannot be reliably detected, due to the imprecise behavior of the Kinect output when joints overlap each other, the excessive reliance on the silhouette to detect the human body, and the lack of joints at key points of the detected skeleton (like the hands). There are also environmental factors, like lighting and positioning, that can make accurate user detection problematic. Currently, we are setting up two experiments: the first is to make the robot autonomously capable of acting both as an instructor and as a learner of the Semaphore Flag Signalling System [3], exploiting the gesture mirroring features; the second is to make the robot capable of both encoding and decoding simple sentences from natural language to the flag semaphore system and vice versa. In future work, a neural network will also be trained to recognize and classify the gestures in order to give a proper answer, creating a more realistic verbal communication between humans and robots.
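Returning to the Mode 1 comparison referenced above, the sketch below applies the normalized mean-squared-error distance to one pair of angle vectors and to the pass/fail check. The per-angle maximum values, thresholds, and group names are illustrative assumptions, not the values or identifiers used in the actual system.

```python
import numpy as np

# Illustrative per-angle maximum magnitudes used for normalization
# (assumed values, not necessarily Pepper's actual joint limits).
MAX_ANGLE = {
    "ShoulderPitch": 2.09, "ShoulderRoll": 1.56,
    "ElbowRoll": 1.56, "ElbowYaw": 2.09, "WristYaw": 1.82,
    "HeadYaw": 2.09, "HeadPitch": 0.64,
}

# Body-part groups compared separately, each against its own threshold
# (the right arm uses the same groups evaluated on its own angles).
GROUPS = {
    "head": ["HeadYaw", "HeadPitch"],
    "upper_arm": ["ShoulderPitch", "ShoulderRoll"],
    "forearm": ["ElbowRoll", "ElbowYaw", "WristYaw"],
}

def group_distance(robot_pose: dict, user_pose: dict, names: list) -> float:
    """Mean squared error between two poses over the given angle names.

    Each angle is divided by its maximum magnitude so that joints with wider
    ranges do not dominate the result.
    """
    r = np.array([robot_pose[n] / MAX_ANGLE[n] for n in names])
    u = np.array([user_pose[n] / MAX_ANGLE[n] for n in names])
    return float(np.mean((r - u) ** 2))

def pose_matches(robot_pose: dict, user_pose: dict, thresholds: dict) -> bool:
    """Pass/fail check: every body-part distance must stay below its own threshold."""
    return all(
        group_distance(robot_pose, user_pose, names) < thresholds[part]
        for part, names in GROUPS.items()
    )
```

A mirrored variant of the check would simply swap the left and right arm angles and flip the signs as described above before computing the distances.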
Possible improvements include a more effective detection algorithm for the Kinect, more efficient ways to execute the mapping, the use of all detected points (not just limbs and head), the use of the official Microsoft SDK to obtain even more detected points (provided it is possible to retrieve all the necessary libraries), and a generally more user-friendly experience (like the ability to set a timer for recording). Furthermore, we plan to extend our framework to deal with novel and emerging big data trends, including performance (e.g., [6, 9]) and privacy and security (e.g., [8, 10]).

Remote control of the artificial arm model using 3D hand tracking
A kinect-based gesture command control method for human action imitations of humanoid robots
A novel distributed framework for optimizing query routing trees in wireless sensor networks via optimal operator placement
Combining multidimensional user models and knowledge representation and management techniques for making web services knowledge-aware
Privacy preserving OLAP over distributed XML data: a theoretically-sound secure-multiparty-computation approach
OLAP*: effectively and efficiently supporting parallel OLAP over big data
Privacy preserving OLAP and OLAP security
Human arm motion imitation by a humanoid robot
Gesture imitation using machine learning techniques
A sketch drawing humanoid robot using image-based visual servoing
Modeling and control of humanoid robots
Inverse kinematics of a NAO humanoid robot using kinect to track and imitate human motion
Activity gesture recognition on kinect sensor using convolutional neural networks and FastDTW for the MSRC-12 dataset
Classifying obstacles and exploiting knowledge about classes for efficient humanoid navigation
Robust multi-objective multi-humanoid robots task allocation based on novel hybrid metaheuristic algorithm
Learning to gesticulate by observation using a deep generative approach
Humanoid walking pattern generation based on model predictive control approximated with basis functions