Efficient Deep Learning-based Estimation of the Vital Signs on Smartphones
Taha Samavati, Mahdi Farvardin (* Equal Contribution)
2022-04-13

Due to the widespread use of smartphones in everyday life and the steady improvement of their computational capabilities, many complex tasks can now be deployed on them. Given the need for continuous monitoring of vital signs, especially for the elderly or those with certain diseases, algorithms that estimate vital signs using smartphones have attracted researchers worldwide. Such algorithms estimate vital signs (heart rate and oxygen saturation level) by processing an input PPG signal. These methods often apply multiple pre-processing steps to the input signal before the prediction step. This increases their computational complexity, meaning only a limited number of mobile devices can run them, and the pre-processing stages must be hand-crafted to obtain optimal results. This research proposes a novel end-to-end deep learning solution to mobile-based vital sign estimation that requires no pre-processing. Thanks to its fully convolutional architecture, the parameter count of our proposed model is, on average, a quarter of that of ordinary architectures that use fully connected layers as prediction heads. As a result, the proposed model is less prone to over-fitting and has lower computational complexity. A public dataset for vital sign estimation, including 62 videos collected from 35 men and 27 women, is also provided. The experimental results demonstrate state-of-the-art estimation accuracy.

Vital signs need to be monitored regularly, especially in the elderly or individuals with certain medical disorders. Nowadays, almost everyone has a smartphone and uses it for a variety of daily tasks, and thanks to the rapid development of both hardware and software, smartphones can perform increasingly complex tasks. One can therefore take advantage of smartphones for estimating and monitoring vital signs with near-clinical accuracy. Heart rate (HR) and oxygen saturation level (SpO2) are two of the most significant vital signs of the human body. Heart rate is an important indicator of a person's physiological state and needs to be measured in many circumstances, especially for healthcare and medical purposes. The oxygen saturation level in the bloodstream, known as SpO2, shows how much oxygen is carried by the blood. Measuring SpO2 is essential because a major deviation from normal levels (typically above 95% at sea level) [1] indicates a dangerous health condition, and serious diseases (including asthma, interstitial lung diseases, sequelae of tuberculosis, lung cancer, chronic obstructive pulmonary disease (COPD), and COVID-19) [2], [3] can cause significant drops in SpO2 levels. To estimate vital signs with optical measurement methods, one must first obtain the photoplethysmography (PPG) signal. PPG is an uncomplicated, inexpensive, and non-invasive optical measurement technique, often used for heart rate monitoring, that uses a light source and a photodetector at the surface of the skin to measure the volumetric variations of blood circulation [4].
The PPG signal can also be obtained with a smartphone that has both a camera and a flashlight. By covering the camera with a finger while the flashlight is on, one can capture frames and further process them to obtain the PPG signal. Various methods have been proposed for heart rate and SpO2 estimation. Some rely only on signal processing algorithms to extract the vital signs from the input PPG signal [5], [6], [7], while others combine signal processing with deep learning [8], [9]. The rest focus on developing end-to-end deep learning methods [10], [11], [12]. These methods either need multiple pre-processing stages or are too computationally demanding to run on mid-range or low-end smartphones. In this article, a set of architectures for real-time heart rate and SpO2 estimation on mobile devices is proposed. These architectures estimate heart rate and SpO2 in an end-to-end manner without any pre-processing steps, which makes the models easy to deploy on mobile devices. Compared to prior architectures that use fully connected layers after the feature extraction stage, fully convolutional architectures do not use dense connections after extracting features; hence they have fewer parameters while achieving better accuracy. As another contribution of this research, a public dataset of smartphone videos named MTHS is provided, containing PPG signals extracted from 62 distinct patients with their corresponding ground-truth HR and SpO2 values. More detailed information about the dataset can be found in Section IV.

Heart rate estimation: Research [8] utilizes a convolutional neural network to estimate the heart rate from PPG signals acquired from smartphone-captured videos. The input videos are first converted to a set of three-channel 1D signals, which are the mean values of the frames' red, green, and blue channels. Several hand-crafted pre-processing steps, such as denoising, moving-average filtering, and PCA, are applied to the signals before feeding them to the network. Another drawback of this method is the use of fully connected layers after the convolutional layers, which increases the parameter count and the risk of over-fitting compared to newer fully convolutional architectures. Research [9] estimates heart rate by taking the FFT of a single-channel time-series PPG signal along with 3-axis accelerometer motion signals and feeding the resulting four-channel input to the proposed neural network. The input signal is clipped to the 0-4 Hz band to remove unwanted frequencies. The model consists of 8-channel 1D convolution and max-pooling layers followed by a fully connected network, and could be further improved by using a fully convolutional architecture. The authors also note that utilizing all the PPG channels could improve the estimation accuracy, which is left for future work. Research [10] proposes an end-to-end deep learning model to estimate heart rate from PPG signals acquired by a wrist-worn device. The method does not require any pre-processing steps or motion data, yet achieves competitive results compared to methods that use motion signals. In this work, eight consecutive one-second PPG segments sampled at 125 Hz are fed to a set of convolutional layers and an LSTM layer in parallel. The produced feature vectors are then concatenated and fed to another LSTM layer.
Finally, a linear layer predicts the heart rate. Although this eliminates the need for pre-processing steps, the LSTM layers leave the model with a relatively high computational complexity.

SpO2 estimation: Research [13] estimates SpO2 from smartphone videos captured from fingertips. After a pre-processing step that includes motion removal, a convolutional neural network with only two 1D convolutional layers of large filter length, each followed by max pooling and dropout, performs the SpO2 estimation. A pulse oximeter records the ground-truth heart rate and SpO2, while an iPhone 7 Plus is used to record the fingertip videos. The researchers conclude that the performance of their method does not improve when the frame rate is increased above 30 fps. They also suggest using raw PPG instead of pre-processed (band-pass filtered) signals, as it achieves the best performance. Research [14] proposes a convolutional neural network followed by a fully connected layer. The model takes in the mean values of the gain-adjusted RGB channels and predicts the SpO2 values. The captured video, from which the mean signals are extracted, is three seconds long with a sampling rate of 30 fps. However, the collected dataset covers a limited number of patients, and the model uses fully connected layers after the CNN layers, resulting in a higher parameter count and a greater risk of over-fitting compared to fully convolutional architectures.

We propose a highly efficient real-time algorithm to estimate heart rate and SpO2 on smartphones and mobile devices. The proposed deep learning method has fewer parameters than previously proposed architectures for vital sign estimation and does not require any pre-processing of the input PPG signal. Given an image sequence or a video taken from fingertips, MTVital estimates both HR and SpO2 in real time. Our network estimates vital signs in a fully convolutional manner, resulting in roughly 4x fewer parameters than conventional methods that use a fully connected network after the convolutional layers. Having fewer parameters also decreases the chance of over-fitting, which is beneficial when the dataset is small. The commonly used batch normalization layers are omitted to further decrease the computational complexity. The proposed architecture eliminates the need for hand-crafted stages such as designing and applying bandpass filters. We propose three types of architectures, shown in Fig. 1. The first is a stack of 1D convolutional layers followed by a global average pooling (GAP) layer that produces the vital sign of interest (heart rate or SpO2). The second, Residual FCN, is deeper and adds residual connections to the plain convolutional layers. The last model, named DCT, first applies a Discrete Cosine Transform (DCT) to the input PPG signal and removes the unwanted coefficients corresponding to unrelated frequencies; afterward, a series of convolutional layers learns the mapping to the desired vital sign. The transformation and coefficient-filtering procedures are implemented inside the model itself and can therefore be deployed directly to other platforms without re-implementing these steps in platform-specific programming languages. In order to compare these methods against conventional architectures that use fully connected layers as prediction heads, a base model is also implemented; the fully convolutional design is sketched below, and the base model is described afterward.
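The following is a minimal sketch of the first, plain fully convolutional variant, written in Keras under assumed hyper-parameters (the layer count, filter sizes, and 300-sample input corresponding to 10 seconds at 30 fps are illustrative choices, not the exact MTVital configuration). It is meant only to show how a GAP head replaces the dense prediction layers.

```python
# Minimal sketch (not the authors' exact MTVital configuration): a 1D fully
# convolutional regressor whose prediction head is a global average pooling
# layer rather than fully connected layers. Filter counts, kernel sizes, and
# the 300-sample (10 s at 30 Hz) input length are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_fcn_regressor(input_length=300, channels=1):
    inputs = layers.Input(shape=(input_length, channels))  # e.g. mean red channel for HR
    x = inputs
    for filters in (16, 32, 64):
        x = layers.Conv1D(filters, kernel_size=7, padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
    # A final 1x1 convolution maps the features to a single channel; GAP then
    # collapses the time axis, yielding one scalar (HR or SpO2) with no dense layers.
    x = layers.Conv1D(1, kernel_size=1)(x)
    outputs = layers.GlobalAveragePooling1D()(x)
    return models.Model(inputs, outputs)

model = build_fcn_regressor()
model.compile(optimizer="adam", loss="huber")  # Huber/log-cosh losses worked well in our experiments
model.summary()
```

Replacing the dense prediction head with a 1x1 convolution followed by GAP is what keeps the parameter count at roughly a quarter of an equivalent model with fully connected layers.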
The base model has a series of 1D convolutional layers followed by batch normalization, with a fully connected network after the feature-extraction phase. The experiments clearly show that the best-performing architectures are the fully convolutional ones; more detailed results are discussed in Section VI.

At inference time, the network estimates the vital signs from a fingertip recording. It first needs an image sequence or video, which we obtain by asking the patient to cover the smartphone's camera with a fingertip while the flashlight is on. Depending on the task, a single channel (the red one) is used for heart rate estimation, while all three RGB channels are used as input for SpO2 estimation. The image sequence is captured and processed simultaneously: for each frame, the mean values of the RGB channels are computed and stored in three separate arrays. After 10 seconds, the resulting mean RGB signals are fed into MTVital to estimate the vital signs. As depicted in Figure 1, the input PPG signal is first transformed to have zero mean and unit standard deviation; this normalization is performed by the model itself.

This section briefly discusses the datasets used to benchmark the proposed methods. BIDMC [15]: This dataset contains PPG, impedance respiratory, and electrocardiogram (ECG) signals for 53 patients. Each recording is eight minutes long and includes ground-truth heart rate (HR), respiratory rate (RR), and blood oxygen saturation level (SpO2) sampled at 1 Hz. MTHS: This research provides a dataset from 62 patients (35 men and 27 women) that contains both HR and SpO2 labels sampled at 1 Hz. The PPG signal is acquired at 30 fps using a smartphone camera. The data collection procedure is explained in Section V.

Since this use case involves estimating vital signs with a smartphone, the PPG is acquired using the phone's RGB camera. As no publicly available dataset with smartphone-acquired PPG signals was found, we collected such a dataset, named MTHS. As described in the previous section, MTHS contains 30 Hz PPG signals obtained from 62 patients, including 35 men and 27 women. The ground-truth data includes heart rate and oxygen saturation levels sampled at 1 Hz, measured with a pulse oximeter (M70). An iPhone 5s was used to obtain the PPG recordings at 30 fps. The flashlight is kept on during the recording, and the patients were asked to fully cover the camera and the flashlight with their fingertips. Figure 2 shows the data collection procedure. The dataset is openly available on GitHub.

In this section, the proposed methods are evaluated on the benchmark datasets, and the results are reported. The base model follows common, older regression architectures that use a CNN feature extractor followed by a feed-forward network. The FCN model has a fully convolutional architecture with 4x fewer parameters than the base model, which makes over-fitting less likely and the model highly efficient for deployment on mobile devices. As explained before, the DCT model has the same fully convolutional architecture but operates in the frequency domain: after taking the DCT of the input signal and filtering out the DC coefficient and the high-frequency coefficients that fall outside the frequency range of interest for HR, a stack of 1D convolutions is applied to regress the HR and SpO2 values.
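As a concrete illustration of the input preparation described above, the sketch below reduces a 10-second fingertip clip to per-frame mean RGB signals and normalizes them to zero mean and unit standard deviation. It is a minimal sketch under stated assumptions, not the authors' released code: the use of OpenCV as the video reader, the file name, and performing the normalization in Python (the paper performs it inside the model) are all illustrative choices.

```python
# Illustrative sketch of the described input pipeline (not the authors' code):
# read a fingertip video, compute the per-frame mean of each RGB channel, and
# normalize the resulting 10-second signals to zero mean and unit variance.
import cv2          # opencv-python, used here only as an example video reader
import numpy as np

def video_to_mean_rgb(path, seconds=10, fps=30):
    cap = cv2.VideoCapture(path)
    means = []
    for _ in range(seconds * fps):
        ok, frame = cap.read()            # frame is H x W x 3 in BGR order
        if not ok:
            break
        b, g, r = frame.reshape(-1, 3).mean(axis=0)
        means.append((r, g, b))           # store as R, G, B
    cap.release()
    return np.asarray(means, dtype=np.float32)   # shape: (num_frames, 3)

def normalize(signal):
    # Zero mean, unit standard deviation per channel (done inside the model
    # in the paper; shown here explicitly for clarity).
    return (signal - signal.mean(axis=0)) / (signal.std(axis=0) + 1e-8)

ppg = normalize(video_to_mean_rgb("fingertip_clip.mp4"))   # hypothetical file name
hr_input = ppg[None, :, 0:1]    # red channel only, with a batch dimension (HR model)
spo2_input = ppg[None, :, :]    # all three channels (SpO2 model)
```

The red-channel slice would then feed the HR model and the full three-channel window the SpO2 model, matching the per-task inputs described above.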
In the following, the performance of the proposed methods on each dataset is reported. The BIDMC dataset is split into training, validation, and test sets with fractions of 0.8, 0.04, and 0.16, respectively, using a random seed of 1400. Table I lists the test MAE obtained by each of our models trained with different loss functions. According to these results, our proposed fully convolutional architecture with residual connections outperforms the other architectures, achieving an MAE of 1.33 on the BIDMC heart rate test set. Our experiments show that the Huber and log-cosh losses are good choices, while the MSE loss performs the worst.

The SpO2 estimation results on MTHS are shown in Table II. They demonstrate that the best-performing architecture for SpO2 estimation is the Residual FCN supervised with the log-cosh loss, which supports the effectiveness of incorporating residual blocks in the model architecture, providing more robust and accurate feature extraction and representation. The experiments also show that the loss function that works best for one architecture does not necessarily work best for the others.

MTHS is split into training, validation, and test sets with fractions of 0.68, 0.12, and 0.2, respectively, using a random seed of 1400. Table III lists the test MAE obtained by each of our models trained with different loss functions. According to these results, and as before, our fully convolutional architecture with residual connections outperforms the other architectures, achieving an MAE of 6.96 on the HR test subset. Here, the Huber and MAE losses are good choices.

VII. ALGORITHM DEPLOYMENT ON SMARTPHONE

Figure 9 shows some screenshots of the developed application. The application targets the Android platform and is compatible with Android 5 (Lollipop) and above. The deep learning models for heart rate and SpO2 estimation have been converted to TFLite [16] and integrated into the application. At run time, the user is asked to cover the camera while the flashlight is on. Once the finger is detected, the estimation process starts. The application captures video frames at 30 fps; every 10 seconds, the mean signals of the red, green, and blue channels are computed and given to the deep learning model to infer the vital signs.

Due to the necessity of regular monitoring of vital signs, especially in the elderly, smartphones can be used to estimate and monitor vital signs with acceptable accuracy. This research proposed an end-to-end, real-time deep learning model for smartphone-based vital sign estimation. Our method has fewer parameters than previous methods and does not require any pre-processing steps, so the model can be deployed even on low-end devices thanks to its low computational complexity. We also provide a public dataset of smartphone recordings named MTHS, which contains PPG signals extracted from 62 distinct patients with their corresponding ground-truth HR and SpO2 values. A series of experiments on unsupervised learning methods was conducted during this research, but the performance was not satisfactory. Given the lack of data of sufficient quality and quantity for smartphone-based vital sign estimation, the authors believe there is strong potential for unsupervised learning methods in this field.
Another possible direction for future work is to collect SpO2 and HR data from people with respiratory diseases, since our dataset only contains data from healthy subjects.

REFERENCES
[1] American Thoracic Society/American College of Chest Physicians statement on cardiopulmonary exercise testing.
[2] Low oxygen saturation and mortality in an adult cohort: the Tromsø study.
[3] Room to breathe: the impact of oxygen rationing on health outcomes in SARS-CoV2.
[4] A review on wearable photoplethysmography sensors and their potential future applications in health care.
[5] A pulse rate estimation algorithm using PPG and smartphone camera.
[6] Measuring oxygen saturation with smartphone cameras using convolutional neural networks.
[7] Monitoring of heart rate, blood oxygen saturation, and blood pressure using a smartphone.
[8] Heart Rate Monitoring Using PPG With Smartphone Camera.
[9] Large-Scale Heart Rate Estimation with Convolutional Neural Networks. Sensors (Basel).
[10] PPGnet: Deep network for device independent heart rate estimation from photoplethysmogram.
[11] TransPPG: Two-stream Transformer for Remote Heart Rate Estimate.
[12] Assessment of Deep Learning-based Heart Rate Estimation using Remote Photoplethysmography under Different Illuminations.
[13] Measuring Oxygen Saturation With Smartphone Cameras Using Convolutional Neural Networks.
[14] Smartphone camera oximetry in an induced hypoxemia study.
[15] Toward a robust estimation of respiratory rate from pulse oximeters.
[16] TensorFlow Lite - ML for Mobile and Edge Devices.

ACKNOWLEDGMENT
Special thanks to Vahid Bastani, Mahdis Habibpour, Ali Farvardin, and Changiz Javadian for their kind support and help during this project.