key: cord-0164482-y4hn3rzb authors: Shahbandeh, Mobina; Ghaffarpour, Fatemeh; Vali, Sina; Haghpanah, Mohammad Amin; Torkamani, Amin Mousavi; Masouleh, Mehdi Tale; Kalhor, Ahmad title: A Deep Learning Based Automated Hand Hygiene Training System date: 2021-12-10 journal: nan DOI: nan sha: 79770a40255961569cab3a9b7c208c140b03e57b doc_id: 164482 cord_uid: y4hn3rzb

Hand hygiene is crucial for preventing viruses and infections. Due to the pervasive outbreak of COVID-19, wearing a mask and hand hygiene appear to be the most effective ways for the public to curb the spread of such viruses. The World Health Organization (WHO) recommends a guideline for alcohol-based hand rub in eight steps to ensure that all surfaces of the hands are entirely clean. As these steps involve complex gestures, human assessment of them lacks sufficient accuracy. However, Deep Neural Networks (DNNs) and machine vision make it possible to accurately evaluate hand rubbing quality for the purposes of training and feedback. In this paper, an automated deep learning based hand rub assessment system with real-time feedback is presented. The system evaluates compliance with the 8-step guideline using a DNN architecture trained on a dataset of videos collected from volunteers with various skin tones and hand characteristics following the hand rubbing guideline. Various DNN architectures were tested, and an Inception-ResNet model led to the best results with 97% test accuracy. In the proposed system, an NVIDIA Jetson AGX Xavier embedded board runs the software. The efficacy of the system is evaluated in a real usage scenario with various users, and challenging steps are identified. In this experiment, the average time taken by the hand rubbing steps among volunteers is 27.2 seconds, which conforms to the WHO guidelines.

The authors would like to acknowledge the financial support of Tavan Ressan Company.

Contaminated hands of people, especially healthcare professionals, are the most common vehicle for the transmission of healthcare-associated pathogens. Hand hygiene has been proven to be effective in preventing healthcare-associated infections and diseases [1]-[3], such as COVID-19 [4], which is now widespread in the world and whose prevention is of utmost importance. In this regard, alcohol-based hand rub has been demonstrated to have advantages over hand washing with soap [5], and the World Health Organization (WHO) has provided a hand rub guideline in eight steps [6]. Compliance with this guideline should be monitored to ensure hand cleanliness. As human observation is not accurate enough, automated quality control of hand rub is necessary to provide feedback and training. For this purpose, some existing studies have utilized classic computer vision approaches. In [7], a commercial product built by Surewash uses these traditional approaches for hand hygiene training, and the paper provides data on how user feedback improves medical staff's adherence to the WHO guideline in a hand washing task. On the other hand, Convolutional Neural Network (CNN) architectures are being used in a broad range of applications, including image classification. Therefore, CNNs can also be exploited in detecting hand hygiene steps. In [8], hand hygiene dispenser usage is detected using a CNN and depth sensor data. In [9], hand washing compliance detection is also done using depth sensor data and a CNN.
In [10], the actions of a hand washing task are recognized at a coarse, non-detailed level using two CNNs. In [11] and [12], compliance of hand washing with the WHO guidelines is assessed using data collected from a wearable sensor and machine learning techniques. Most existing approaches in the literature rely on sensor data or classic machine vision techniques. In this paper, a state-of-the-art CNN architecture is employed for the real-time classification of user actions from camera images. Detecting hand rub poses can be regarded as a gesture recognition task, and gesture recognition has been applied in various contexts. In [13], a CNN is used to detect American Sign Language gestures. In [14], the rudimentary task of detecting the presence of one hand, two hands, or no hand is carried out using a pre-trained neural network.

In this work, an automated system called DeepHARTS (short for Deep learning based HAnd Rub Training System) is presented, which evaluates users' conformance to the hand rubbing guideline and provides real-time feedback. The markerless gesture recognition of hand rubbing steps is carried out using a CNN architecture. The CNN is built on top of TensorFlow and is trained using a dataset containing videos of each hand rub step performed by 22 volunteers following the guideline provided by WHO. In the fabricated system, a camera is placed above the surface where the user washes her hands. The frames of user activities are captured by the camera and then fed to the CNN, which performs a frame-based classification of the user action. The system guides the user through the hand rub steps and informs the user whether the steps have been taken correctly using visual indicators on the screen. The system's efficiency is evaluated using data gathered from a real-life scenario, and the challenges in building such a system, as well as the difficult hand rub poses, are identified in this paper.

This paper includes the following contributions:
• A deep learning based system for guiding users through the whole hand rubbing process using real-time feedback and an interactive Graphical User Interface (GUI);
• A dataset including videos of various volunteers following the WHO guideline on hand rub;
• An evaluation of real-world usage to assess the perceived efficacy of the proposed system;
• Identification of challenging steps in hand rubbing using data gathered from the application of the proposed system in a real-life scenario.

The remainder of this paper is organized as follows. Section II provides the hardware and software description of the system, including the details of the CNN model. The evaluation of the system is discussed in Section III. Finally, Section IV concludes the paper with a summary and future work of this research.

This section discusses the structure of DeepHARTS, including its mechanical, electrical, and software components. Fig. 2 depicts DeepHARTS, an automated hand hygiene system developed by the Human and Robot Interaction Laboratory, University of Tehran, under a joint project with Tavan Ressan Company. The structure is made of steel, and its dimensions are illustrated in Fig. 3. It is equipped with four wheels for portability and a 15.5-inch touch screen for user convenience. From an ergonomic standpoint, prior to designing the final structure, a prototype made of aluminum profiles was fabricated and several tests were performed to find appropriate dimensions.
All system components are accessible from the back of the structure, and the alcohol tank can be refilled from the top. The background of the video stream of the user's hands is homogeneous and white, which alleviates challenges involved in machine vision procedures, such as hand segmentation. By the same token, a light-emitting diode (LED) illuminates the background, leading to a uniform and unvaried color histogram for the captured frames.

A Deep Neural Network (DNN) is used in this system, which requires computation power and thus appropriate computer hardware. NVIDIA provides a wide variety of artificial intelligence computers, two of which were available and tested for the purposes of this project. At an earlier stage, an NVIDIA Jetson Nano (4GB memory) was used to run the DNN model for detecting the hand rubbing process. However, as the Jetson Nano lacks enough computation power to run such a deep network alongside the GUI, another board, such as a Raspberry Pi, would have been required to run the GUI, which will be discussed in Subsection C. Hence, the computer has been replaced by an NVIDIA Jetson AGX Xavier (512-core Volta GPU, 8-core CPU, 32GB memory).

The procedure for detecting hand rubbing consists of three main steps, as follows.

a) Hand detection step: First, the user initiates a hand rubbing process by bringing her hands under the camera. The camera points down at the surface where the user washes her hands, captures the hand pose, and passes the frames to the Jetson AGX Xavier. The user should place her hands in the region shown on the screen. The correctness of this step is detected by segmenting the hands from the background using the Open Source Computer Vision library (OpenCV), a library for image processing.

b) Sanitizer step: Fig. 4 explains the procedure of getting the sanitizer and updating the GUI. The automatic alcohol dispenser is controlled by a board with an infrared (IR) sensor that senses the user's hands. The system asks the user to bring her hands under the alcohol dispenser; when the IR sensor detects the hands, it powers the pump, which dispenses alcohol from the tank. Simultaneously, an ultrasonic sensor next to the alcohol dispenser measures the distance to the object in front of it. This sensor is connected to an Arduino ATmega board, which is programmed to compute the distance and constantly report it to the Jetson AGX Xavier. If the measured distance is within a defined range, the system concludes that the object under the sensor is the user's hands; the screen is then updated, and the step of getting the sanitizer is marked as passed.

c) Hand rubbing steps: After the two aforementioned steps, the system guides the user through the hand rubbing steps. The user starts hand rubbing based on steps 2 to 7 of the guideline depicted in Fig. 1, and Fig. 5 shows the steps of classifying the user's hand gesture. First, the camera captures several frames of the hand pose and passes them to the Jetson AGX Xavier. The frames are then fed to the CNN, which performs the classification. If the predicted step has a sufficiently high probability, the step is considered passed, and the GUI shows corresponding feedback to the user. Otherwise, the user should repeat the step. After every three steps, as illustrated in Fig. 1, the system asks the user to repeat the sanitizer step.

The GUI is designed using the Qt Creator Integrated Development Environment (IDE). Fig. 6 displays the main screen of the GUI. An animated GIF file is played on the right side of the screen to guide the user through each step, and the instructions for each stage are shown sequentially below the GIF. The live stream of the camera is shown on the left side of the screen, where the user can see her actions. A guideline is displayed at the bottom of the screen, helping users follow all stages. A progress indicator bar responds to the user's activity; when a step is performed correctly within a pre-specified time, the current step is marked as passed. If there are unmarked steps, the user should repeat them at the end of the cycle.
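The paper does not give the exact segmentation routine for step (a), but against the uniform, LED-lit white background a simple OpenCV threshold is enough to decide whether hands are present in the on-screen region. The following Python sketch illustrates the idea; the threshold value, region coordinates, and coverage ratio are illustrative assumptions, not the authors' settings.

```python
# Sketch of hand-presence detection against the uniform white background (step a).
# Threshold, region of interest, and coverage ratio are illustrative assumptions.
import cv2
import numpy as np

def hands_present(frame_bgr, roi=(100, 100, 540, 380), min_coverage=0.15):
    """Return True if enough non-background pixels appear inside the ROI."""
    x0, y0, x1, y1 = roi
    region = frame_bgr[y0:y1, x0:x1]

    gray = cv2.cvtColor(region, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)

    # The white background is bright; darker pixels are treated as hand candidates.
    _, mask = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY_INV)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))

    coverage = cv2.countNonZero(mask) / mask.size
    return coverage >= min_coverage
```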
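For step (b), the Jetson side only needs to read the distance the Arduino reports from the ultrasonic sensor and check it against the defined range. A minimal sketch using pyserial is shown below; the port name, baud rate, one-reading-per-line message format, and range limits are assumptions, as the paper does not specify the protocol.

```python
# Sketch of the sanitizer-step check (step b): read the distance the Arduino reports
# from the ultrasonic sensor and decide whether hands are under the dispenser.
# Port, baud rate, message format, and range limits are assumptions.
import serial  # pyserial

MIN_CM, MAX_CM = 5.0, 20.0  # assumed valid range for "hands under the dispenser"

def wait_for_hands(port="/dev/ttyUSB0", baud=9600, timeout_s=1.0):
    with serial.Serial(port, baud, timeout=timeout_s) as link:
        while True:
            line = link.readline().decode("ascii", errors="ignore").strip()
            if not line:
                continue  # read timed out, keep polling
            try:
                distance_cm = float(line)  # assumed "one distance per line" format
            except ValueError:
                continue
            if MIN_CM <= distance_cm <= MAX_CM:
                return distance_cm  # sanitizer step passed
```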
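For step (c), the frame-based classification and threshold check could look roughly like the following sketch, assuming a trained Keras model has been saved by the training stage; the model file name, number of frames per decision, preprocessing, and threshold are illustrative assumptions.

```python
# Sketch of the hand-rubbing classification loop (step c): grab a few frames,
# average the model's per-class scores, and pass the step if the expected class
# exceeds a threshold. Model path, frame count, and threshold are assumptions.
import cv2
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("hand_rub_model.h5")  # hypothetical file name
THRESHOLD = 0.8
FRAMES_PER_DECISION = 8

def step_passed(expected_step, camera_index=0):
    cap = cv2.VideoCapture(camera_index)
    probs = []
    try:
        while len(probs) < FRAMES_PER_DECISION:
            ok, frame = cap.read()
            if not ok:
                continue
            x = cv2.resize(frame, (299, 299))                       # network input size
            x = cv2.cvtColor(x, cv2.COLOR_BGR2RGB).astype("float32")
            # Preprocessing must match training; the saved model is assumed to
            # contain its own preprocessing layer, so raw RGB pixels are passed in.
            probs.append(model.predict(x[np.newaxis], verbose=0)[0])
    finally:
        cap.release()
    mean_probs = np.mean(probs, axis=0)
    return mean_probs[expected_step] >= THRESHOLD
```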
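The actual GUI was built with Qt Creator; purely as an illustration of the layout just described (live camera view on the left, animated GIF and instructions on the right, and a progress indicator), a rough PyQt5 sketch might look like the following. Widget names, file names, and geometry are hypothetical.

```python
# Rough PyQt5 sketch of the described screen layout; not the authors' Qt Creator GUI.
import sys

from PyQt5 import QtGui, QtWidgets

class TrainingScreen(QtWidgets.QWidget):
    def __init__(self, gif_path="step.gif"):
        super().__init__()
        layout = QtWidgets.QGridLayout(self)

        self.camera_view = QtWidgets.QLabel("live camera stream")  # updated with frame pixmaps
        self.gif_view = QtWidgets.QLabel()
        self.movie = QtGui.QMovie(gif_path)                        # animated guide for the step
        self.gif_view.setMovie(self.movie)
        self.movie.start()

        self.instruction = QtWidgets.QLabel("Rub hands palm to palm")
        self.progress = QtWidgets.QProgressBar()

        layout.addWidget(self.camera_view, 0, 0, 2, 1)
        layout.addWidget(self.gif_view, 0, 1)
        layout.addWidget(self.instruction, 1, 1)
        layout.addWidget(self.progress, 2, 0, 1, 2)

if __name__ == "__main__":
    app = QtWidgets.QApplication(sys.argv)
    screen = TrainingScreen()
    screen.show()
    sys.exit(app.exec_())
```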
The deep neural network is a fundamental part of the system; it constantly classifies the user's hand rubbing steps and provides a corresponding output. The network is an Inception-ResNet-v2 [15] model, pre-trained on the ImageNet [16] database and fine-tuned on our dataset according to the application requirements. The input of the model is an RGB image, and the output is the probability of each of 9 classes (the various hand rub steps). Since the expected next step is known at each stage, a sigmoid activation function is used for the output layer, and a threshold on the sigmoid output must be set to consider a step passed. Therefore, different threshold values were tested to find and exploit the most effective one.

Several related networks were tested, and the results are summarized in Table I. These models include two MobileNet [17] models (MobileNet-Small and MobileNet-Large), Inception [18], ResNet152 [19], and Inception-ResNet models. The practical performance metric measures the model's generalization and assesses its performance in a new environment that is relatively different from the dataset's environment, including different lighting, background, and image quality. This metric is empirical and is evaluated subjectively and manually: volunteers tested the system with each model individually and reported their experience, and conclusions are drawn from this data since defining an exact metric is challenging.

As shown in Table I, the MobileNet models have lower accuracies than the other models. This is expected, as these networks are relatively small and designed for hardware with limited performance. MobileNet-Large has higher accuracy than MobileNet-Small, but it is a deeper network that performs more computation and is thus slower. The Inception network reached an accuracy of 100% and appears to have overfitted: when tested in new environments, it showed lower performance than ResNet and Inception-ResNet. The ResNet model performs well, but its loss is higher than that of the Inception and Inception-ResNet models, and it is slow to train (its training time is almost 3.5 times that of Inception and 1.5 times that of Inception-ResNet). Finally, the Inception-ResNet model, which shares characteristics of both the Inception and ResNet networks, has low loss and high accuracy (its accuracy is 0.11% lower than Inception and ResNet, which might be a result of the randomness of the learning process). Its training time is also fairly low, and it performs remarkably well in new environments.

It is worth mentioning that in the actual setting of the built system, the background differs from the dataset, the image quality is lower, and the lighting conditions are different. Taking these differences into account, the Inception-ResNet model generalizes considerably well and correctly recognizes the actions of users who were not present in the dataset videos. Additionally, this model's learning process is more stable: its loss and accuracy curves contain no spikes and are far smoother than those of the other models. Training and validation accuracy and loss values are close, indicating that the model has not overfitted the training data.

Figs. 9 and 10 depict the validation and test confusion matrices, respectively. From the obtained results, it can be inferred that the proposed model correctly recognizes all of the steps. However, some confusion can be observed between steps 1 and 4, which is due to the physical similarity of these two steps.
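A minimal Keras sketch of this kind of fine-tuning (ImageNet-pretrained Inception-ResNet-v2 backbone with a 9-way sigmoid head) is given below; the frozen backbone, dropout rate, optimizer, learning rate, and loss are assumptions rather than the authors' exact training recipe.

```python
# Sketch of fine-tuning an ImageNet-pretrained Inception-ResNet-v2 for the 9 hand-rub
# classes with a sigmoid output head. Hyperparameters are illustrative assumptions.
import tensorflow as tf

NUM_CLASSES = 9

def build_model():
    base = tf.keras.applications.InceptionResNetV2(
        weights="imagenet", include_top=False, input_shape=(299, 299, 3))
    base.trainable = False  # freeze the backbone for the first fine-tuning stage

    inputs = tf.keras.Input(shape=(299, 299, 3))
    x = tf.keras.applications.inception_resnet_v2.preprocess_input(inputs)
    x = base(x, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-4),
        loss="binary_crossentropy",  # sigmoid head: one independent score per step
        metrics=["accuracy"])
    return model

if __name__ == "__main__":
    build_model().summary()
```

At run time, only the score of the currently expected step needs to be compared against the tuned threshold to decide whether the step is marked as passed.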
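Confusion matrices like those in Figs. 9 and 10 could be produced from a labeled frame set along these lines; the directory layout, batch size, and model file name are assumptions.

```python
# Sketch of generating a confusion matrix for the 9 hand-rub classes from a labeled
# frame set; dataset path, batch size, and model file name are assumptions.
import numpy as np
import tensorflow as tf
from sklearn.metrics import confusion_matrix

def evaluate_split(model, split_dir="frames/test", batch_size=32):
    ds = tf.keras.utils.image_dataset_from_directory(
        split_dir, image_size=(299, 299), batch_size=batch_size, shuffle=False)
    y_true, y_pred = [], []
    for images, labels in ds:
        scores = model.predict(images, verbose=0)
        y_true.extend(labels.numpy())
        y_pred.extend(np.argmax(scores, axis=1))
    return confusion_matrix(y_true, y_pred)

if __name__ == "__main__":
    model = tf.keras.models.load_model("hand_rub_model.h5")  # hypothetical file name
    print(evaluate_split(model))
```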
As a suitable dataset was required for this application and no such dataset was available, videos of 22 volunteers, 18 men and 4 women, carrying out the hand rubbing steps as recommended by the guideline were recorded. For each volunteer, each step was recorded in two different environments: the first set of videos has a wooden background, and the second a green background. Samples of the dataset are illustrated in Figs. 11 and 12.

A study was carried out with ten participants, three women and seven men, to measure the average time of each step; five of these volunteers used the system for the first time. Fig. 13 shows the average time required for each hand rubbing step. It indicates that the most challenging step is step 8, taking 5.3 seconds on average. Other challenging steps are steps 2 and 5, taking 3.9 and 3.3 seconds on average, respectively. As suggested by WHO, the duration of the entire hand rubbing procedure should be between 20 and 30 seconds [6]. In this experiment, the average time of the hand rubbing procedure, excluding the sanitizer dispensing steps, is 27.2 seconds, which conforms to the standard timing in the WHO guidelines.

In conclusion, an automated hand hygiene training system was proposed in this paper. The approach was based on markerless hand gesture recognition using an Inception-ResNet DNN, which yielded the best results among the evaluated models. In the system, user hand presence is first detected, and then alcohol is dispensed. Afterward, the hand rubbing steps are shown on the screen, and as the user follows the steps correctly, the GUI is updated. The system was fabricated with an NVIDIA Jetson AGX Xavier as its main computing unit and was evaluated in a real-life scenario of being used by volunteers. The average overall time taken by the hand rubbing steps in this empirical testing was 27.2 seconds, which is within the interval suggested by WHO. Furthermore, a dataset was collected, including videos of volunteers carrying out the hand rubbing steps as suggested by the WHO guidelines.

Future work includes the detection of fake gestures, since an expert user might try to trick the system by making similar but incorrect gestures. In addition, an embedded system with lower performance than the NVIDIA Jetson AGX Xavier, e.g., the NVIDIA Jetson Nano, might be used instead if further optimizations are carried out.
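Converting the per-step videos of the collected dataset into labeled training frames could be done along the lines of the following sketch; the directory layout (dataset/step_1/*.mp4, ...), sampling stride, and output size are assumptions about how the data might be organized, not the authors' actual pipeline.

```python
# Minimal sketch: turn per-step hand-rub videos into labeled training frames.
# Directory layout (hypothetical): dataset/step_1/*.mp4, dataset/step_2/*.mp4, ...
import glob
import os

import cv2

def extract_frames(dataset_dir="dataset", out_dir="frames", stride=5, size=(299, 299)):
    """Save every `stride`-th frame of every video, resized to the network input size."""
    for video_path in glob.glob(os.path.join(dataset_dir, "step_*", "*.mp4")):
        label = os.path.basename(os.path.dirname(video_path))  # e.g. "step_3"
        target_dir = os.path.join(out_dir, label)
        os.makedirs(target_dir, exist_ok=True)

        cap = cv2.VideoCapture(video_path)
        index = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % stride == 0:
                frame = cv2.resize(frame, size)
                stem = os.path.splitext(os.path.basename(video_path))[0]
                cv2.imwrite(os.path.join(target_dir, f"{stem}_{index:05d}.jpg"), frame)
            index += 1
        cap.release()

if __name__ == "__main__":
    extract_frames()
```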
References
A randomized, controlled trial of a multifaceted intervention including alcohol-based hand sanitizer and hand-hygiene education to reduce illness transmission in the home
Physical interventions to interrupt or reduce the spread of respiratory viruses
Effect of hand hygiene on infectious disease risk in the community setting: A meta-analysis
Modeling the effects of intervention strategies on COVID-19 transmission dynamics
A Frequently Missed Lifesaving Opportunity during Patient Care
WHO guidelines on hand hygiene in health care. World Health Organization
A vision-based system for hand washing quality assessment with real-time feedback
Automatic detection of hand hygiene using computer vision technology
Hand Washing Detection using Wrist Wearable Inertial Sensors
Hand Pose Classification Based on Neural Networks
Identification of free and who-compliant handwashing moments using low cost wrist-worn wearables
Automated Hand Hygiene Compliance Monitoring. PervasiveHealth: Pervasive Computing Technologies For Healthcare
Hand gesture feature extraction using deep convolutional neural network for recognizing American sign language
Multi-View Hand-Hygiene Recognition for Food Safety
Inception-v4, inception-ResNet and the impact of residual connections on learning
Efficient Convolutional Neural Networks for Mobile Vision Applications
Going deeper with convolutions
Deep residual learning for image recognition