key: cord-0984671-zv5kkj7b
authors: Flotho, P.; Bhamborae, M.; Grün, T.; Trenado, C.; Thinnes, D.; Limbach, D.; Strauss, D. J.
title: Multimodal Data Acquisition at SARS-CoV-2 Drive Through Screening Centers: Setup Description and Experiences in Saarland, Germany
date: 2020-12-11
journal: nan
DOI: 10.1101/2020.12.08.20240382
sha: 74ceecd1b888b050dc0c9b515fa37686a36f7883
doc_id: 984671
cord_uid: zv5kkj7b

SARS-CoV-2 drive-through screening centers (DTSCs) have been implemented worldwide as a fast and secure way of mass screening. We use DTSCs as a platform for the acquisition of multimodal datasets that are needed for the development of remote screening methods. Methods: Our acquisition setup consists of an array of thermal, infrared, and RGB cameras as well as microphones, and we apply methods from computer vision and computer audition for the contactless estimation of physiological parameters. Results: We have recorded a multimodal dataset of DTSC participants in Germany for the development of remote screening methods and symptom identification. Conclusions: Acquisition in the early stages of a pandemic and in regions with high infection rates can facilitate and speed up the identification of infection-specific symptoms, and large-scale data acquisition at DTSCs is possible without disturbing the flow of operation.

Research on digital technologies to combat the COVID-19 pandemic includes the computational analysis of video and audio data [1], [2], [3]. Due to their contactless nature, such methods are particularly promising and needed for mass screening purposes [3]. Besides fever, there are other atypical and non-severe symptoms (e.g., [4], [5], [6], [7]) that allow for non-contact medical assessment. The value of such screening systems would of course be directly related to the achievable sensitivity and specificity for detecting SARS-CoV-2 infections. However, due to the pressure of the ongoing pandemic, it is challenging to acquire homogeneous datasets for the development and assessment of such remote systems without interfering with medical services. For the rapid collection of samples for polymerase chain reaction (PCR) based screening, drive-through screening centers (DTSCs), e.g., Kwon et al. [8], are already in use in several countries. DTSCs offer advantages for the acquisition of contactless recordings of patients: They resemble a scenario in which contactless screening for infectious diseases might be implemented one day, such as the entrance to an employee parking area. The exposure of equipment and personnel to patients is minimized; at the same time, the exposure of healthy participants to contaminated air or equipment is fully controlled and can be completely avoided, as the patients stay seated in their own cars.

1 Systems Neuroscience and Neurotechnology Unit, Neurocenter, Saarland University Hospital, Homburg/Saar, Germany. (daniel.strauss@uni-saarland.de)
* Equal author contribution.

We propose an acquisition system along with a processing pipeline for rapidly acquiring such data at DTSCs without disturbing their flow of medical operation, and we present a multimodal dataset of DTSC users as well as preliminary evaluations. We recorded our dataset between May and July 2020 at the SARS-CoV-2 DTSC located at the former fairground area in Saarbrücken, State of Saarland, Germany.
The study was approved by the responsible ethics committee (ethics commission at the Ärztekammer des Saarlandes, ID No 90/20), and all included participants signed a consent form after a detailed explanation of the procedure. Admission to and recommendation for the tests were given by the participants' general practitioners if a potential SARS-CoV-2 infection was suspected based on the Robert Koch Institute's guidelines [9]. The PCR test result for SARS-CoV-2 from the individual nose and throat swab was made accessible to us by the responsible public health office.

The recordings were made through an open window, with the participants sitting in their cars. Our multimodal setup consisted of RGB, NIR, depth, and thermal cameras as well as microphones (see Fig. 1). We recorded at 120 fps (face close-ups, RGB), 50 fps (thermal camera), 30 fps (NIR), and 10 fps (high-resolution RGB, stereo) and used custom acquisition routines and frame grabbers to minimize user interaction with the recording systems. The investigators followed the same guidelines for personal protective equipment (PPE) as the physicians taking the swab samples.

Participants waiting for the experiment were regularly informed about the estimated waiting time and had the option to quit the experiment between two recordings; this also reduced the contamination of the audio recordings with ambient sound. The experiment began with a set of yes/no/unknown questions with the goal of generating uniform voice samples that can be compared between subjects; these also provided additional medical history. The second segment was a free speech sample, in which the participants were asked to describe the circumstances that led them to visit the DTSC. Subsequently, we asked the participants to briefly present both sides of their hands to the camera, to capture potential cues of skin rashes. In the final segment, participants were asked to take 10 deep breaths and to breathe normally for 30 s afterwards. The entire data acquisition took around 6 minutes per participant.

A total of 436 participants with signed consent forms took part in our study, aged 19-86 (mean age 45.6 ± 15.2; 215 male, 221 female; 7 participants did not provide their age). 34 subjects reported chronic or acute respiratory diseases or symptoms (see Fig. 3). Despite a relatively high participation rate of 36% of the DTSC users, our dataset contained only two subjects with SARS-CoV-2 positive PCR results from the swab tests at the DTSC; see the discussion.

We have recorded a dataset of up to 6 minutes per subject with high frame rate (HFR) or high-resolution multimodal cameras and microphones. We applied already available procedures from computer vision and computer audition to assess the data quality. The evaluations and the respective methodology used as proof of concept are described below. Fig. 2 summarizes the results for each of these modalities, in particular how the contactlessly estimated physiological parameters compare to gender- and age-specific norm values, grouped by the presence/absence of symptoms.

We applied the method of Wang et al. [10] for the extraction of remote photoplethysmography (rPPG) signals from skin-segmented super-pixels of the HFR and 4k recordings and used custom scripts for peak detection and analysis.
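For illustration, the following is a minimal sketch of the plane-orthogonal-to-skin (POS) core of [10] together with a simple spectral heart rate estimate. It assumes a precomputed per-frame mean RGB trace over skin pixels (e.g., averaged over skin-segmented super-pixels) and is not our exact pipeline, which uses custom time-domain peak detection.

```python
# Minimal POS-style rPPG sketch (cf. Wang et al. [10]) with an FFT-based
# mean heart rate estimate. `rgb` is an (N, 3) array of per-frame mean
# skin-pixel RGB values sampled at `fps`; all parameters are illustrative.
import numpy as np

def pos_rppg(rgb, fps, win_sec=1.6):
    n = rgb.shape[0]
    l = int(win_sec * fps)                  # sliding window length in frames
    P = np.array([[0.0, 1.0, -1.0],         # projection axes orthogonal to
                  [-2.0, 1.0, 1.0]])        # the normalized skin tone vector
    h = np.zeros(n)
    for t in range(n - l + 1):
        c = rgb[t:t + l]
        c_n = c / (c.mean(axis=0) + 1e-9)   # temporal normalization
        s = c_n @ P.T                       # project onto the POS plane
        # Alpha tuning: combine both projections into one pulse estimate.
        pulse = s[:, 0] + (s[:, 0].std() / (s[:, 1].std() + 1e-9)) * s[:, 1]
        h[t:t + l] += pulse - pulse.mean()  # overlap-add into the output
    return h

def mean_heart_rate(signal, fps, lo_hz=0.7, hi_hz=4.0):
    # Dominant spectral peak within a plausible heart rate band (42-240 bpm).
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    spec = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    band = (freqs >= lo_hz) & (freqs <= hi_hz)
    return 60.0 * freqs[band][np.argmax(spec[band])]
```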
The analysis shows that the mean heart rate (MHR) decreases with age (see Fig. 2(A)). Female participants show slightly higher MHR values and lower heart rate variability values than male participants. This agrees with results from experiments with gold-standard, contact-based sensors on large sample sizes [11], [12].

We precomputed landmarks for the HFR and 4k recordings using dlib [13]. For the eyes, we analyzed the average blinking rate, which can be a marker for drowsiness [14], and the redness of the sclera as an indicator of follicular conjunctivitis [15] (see Fig. 2(B)). We calculated the eye-aspect-ratio (Eye-AR) [16] from the precomputed landmarks,

$$\mathrm{EAR} = \frac{\lVert p_2 - p_6 \rVert + \lVert p_3 - p_5 \rVert}{2\,\lVert p_1 - p_4 \rVert},$$

where p_1, ..., p_6 denote the six eye landmarks of [16], and applied custom algorithms to detect and count peaks for blink detection. We found a significant difference in the blinking rate, which was higher for participants reporting itchy eyes than for those reporting fever. We also extracted a region of interest around the eye and applied grayscale-based segmentation to calculate the redness index (RI) of the sclera [15]. We found no significant difference in the RI between participants with fever vs. itchy eyes.

From the thermal recordings, we analyzed static temperatures in the orbital, periorbital, maxillary, and nose regions (see Fig. 2(C)). Vanilla landmark detectors did not perform consistently for all participants, so we applied a stack of preprocessing methods, such as image inversion and unsharp masking, to a set of 5 randomly sampled frames per subject, applied a facial landmark detector [17] to each of the differently preprocessed images, and averaged the results per frame. Failed detections were manually annotated. Our results showed a significant temperature difference between subjects reporting fever vs. no fever in the maxillary, periorbital, and nose regions.

Schuller et al. argue that 'in-the-wild' audio recordings under unconstrained conditions with various signal degradations can already be of value for COVID-19 computer audition [2]. With our platform, we obtain audio recordings of reproducible quality from uniform hardware (see Fig. 2(D)). For instance, using established speech feature extraction schemes [18], the voice quality of our data is sufficient to solve a gender classification task with above 90% cross-validated accuracy using a support vector machine. Minimal sketches of the blink counting, the thermal landmarking ensemble, and the audio classification follow below.
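First, a sketch of the Eye-AR based blink counting; the threshold and the SciPy peak detector stand in for our custom peak detection and are illustrative assumptions.

```python
# Blink-rate estimation from precomputed eye landmarks via the Eye-AR [16].
# `eye` is an (N, 6, 2) array holding the six eye landmarks p1..p6 per frame
# in the ordering of [16]; thresholds are illustrative, not our exact values.
import numpy as np
from scipy.signal import find_peaks

def eye_aspect_ratio(eye):
    # EAR = (||p2 - p6|| + ||p3 - p5||) / (2 ||p1 - p4||)
    v1 = np.linalg.norm(eye[:, 1] - eye[:, 5], axis=1)
    v2 = np.linalg.norm(eye[:, 2] - eye[:, 4], axis=1)
    h = np.linalg.norm(eye[:, 0] - eye[:, 3], axis=1)
    return (v1 + v2) / (2.0 * h)

def blink_rate(eye, fps, ear_thresh=0.21, min_gap_sec=0.2):
    ear = eye_aspect_ratio(eye)
    # Blinks appear as short dips in EAR; find peaks in the inverted signal
    # that fall below the threshold, separated by at least `min_gap_sec`.
    peaks, _ = find_peaks(-ear, height=-ear_thresh,
                          distance=max(1, int(min_gap_sec * fps)))
    return 60.0 * len(peaks) / (len(ear) / fps)  # blinks per minute
```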
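Second, a sketch of the preprocessing-and-averaging scheme for thermal landmarking. The OpenCV unsharp-masking parameters, the `detect` callable (e.g., wrapping the detector of [17]), and the final aggregation across frames are placeholders; our exact stack may differ.

```python
# Ensemble landmarking on thermal frames: run a detector on several
# preprocessed variants (inversion, unsharp masking) of each sampled frame
# and average the detections per frame. `detect` is a placeholder callable
# returning an (n_landmarks, 2) array or None on failure.
import numpy as np
import cv2

def preprocess_variants(frame):
    # Normalize the thermal frame to 8-bit grayscale for the detector.
    g = cv2.normalize(frame, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    inverted = 255 - g                                  # image inversion
    blur = cv2.GaussianBlur(g, (0, 0), sigmaX=3)
    unsharp = cv2.addWeighted(g, 1.5, blur, -0.5, 0)    # unsharp masking
    return [g, inverted, unsharp]

def ensemble_landmarks(frames, detect):
    """frames: randomly sampled thermal frames of one subject."""
    per_frame = []
    for f in frames:
        hits = [lm for v in preprocess_variants(f)
                if (lm := detect(v)) is not None]
        if hits:
            per_frame.append(np.mean(hits, axis=0))     # average per frame
    # Frames where all variants fail would be annotated manually.
    return np.mean(per_frame, axis=0) if per_frame else None
```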
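Third, a sketch of the gender classification check on the voice samples. The eGeMAPS functionals (via the opensmile Python package) and the RBF-SVM are assumptions; [18] names the feature extractor family, but our exact configuration is not given here.

```python
# Cross-validated gender classification from speech: openSMILE [18]
# functionals + SVM. Feature set, kernel, and fold count are assumptions.
import numpy as np
import opensmile
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def gender_cv_accuracy(wav_paths, labels, folds=5):
    """wav_paths: one recording per subject; labels: 0 = male, 1 = female."""
    # One functionals vector per recording.
    X = np.vstack([smile.process_file(p).to_numpy() for p in wav_paths])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    return cross_val_score(clf, X, np.asarray(labels), cv=folds).mean()
```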
A major limitation of our study is the marginal number of participants with a positive PCR test result for SARS-CoV-2. The reason for this was the generally very low incidence rate during the study period in the region where the DTSC was located; in fact, the positive rate at the DTSC Saarbrücken was below 0.4% in the respective period. Thus, the specificity and sensitivity of the described approach with respect to SARS-CoV-2 infections cannot be assessed. However, our proof of concept shows that remote data acquisition of SARS-CoV-2 infection-related symptoms at DTSCs is possible. Our experiences and results enable the installation of similar approaches in regions that do massive DTSC testing. Additionally, we have recorded an unprecedented multimodal dataset with a high number of subjects that can be used for the development and refinement of computer vision algorithms [19], [20]. Our dataset contains recordings of 436 participants at 120 fps (RGB), 50 fps (thermal), and 30 fps (NIR) over 6 minutes.

While some subjects moved out of frame during parts of the experiment in the HFR face close-up recordings, we expect a percentage of successful recordings suitable for micro-expression annotation similar to that of our heart rate evaluations (see Fig. 2), and we can additionally report HFR thermal recordings with a wider field of view. For facial landmarking in the context of functional thermal imaging, datasets of around 2935 fully manually annotated frames from 90 subjects can be considered among the state-of-the-art [21]. Our 436 recordings at 50 Hz of up to 6 minutes each potentially enable the generation of a dataset with 7-8 million frames (436 × 50 fps × 360 s ≈ 7.8 million), which would be multiple orders of magnitude larger. Our preliminary results suggest the possibility of partial automatic annotation with an appropriate stack of preprocessing methods, which, together with tracking approaches, could be used for a full annotation of the dataset in the future. On top of that, the setup potentially allows for multimodal mapping between the NIR and thermal domains using the depth channel of the Kinect, and between the RGB and thermal domains with the stereo setup (e.g., compare Palmero et al. [22]); a sketch of such a depth-based mapping follows below.
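To illustrate the idea, the following pinhole-model sketch maps a pixel with known depth from the depth/NIR camera into the thermal image plane. The intrinsics and extrinsics are hypothetical placeholders; the calibration of our rig is not part of this illustration.

```python
# Depth-based cross-modal pixel mapping (pinhole model): back-project a
# depth pixel to 3D, transform it into the thermal camera frame, and
# project it onto the thermal image plane. K_depth/K_thermal (3x3
# intrinsics) and R, t (extrinsics) are hypothetical calibration data.
import numpy as np

def map_depth_pixel_to_thermal(u, v, z, K_depth, K_thermal, R, t):
    """u, v: pixel coordinates; z: metric depth at (u, v)."""
    # Back-project to a 3D point in the depth camera's coordinate frame.
    p = z * (np.linalg.inv(K_depth) @ np.array([u, v, 1.0]))
    # Rigid transform into the thermal camera's coordinate frame.
    q = R @ p + t
    # Perspective projection onto the thermal image plane.
    uvw = K_thermal @ q
    return uvw[:2] / uvw[2]
```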
Computer vision algorithms place different requirements on the environmental conditions. In the context of rPPG methods, Wang et al. require constant illumination to reconstruct PPG signals from videos of talking and stationary subjects across various skin tones and to achieve a high signal-to-noise ratio in the spectrograms [10]. The studio illumination employed in our setup at the DTSC fulfills these requirements and allowed us to record high-quality data for rPPG measurements.

We have proposed a setup to record multimodal data and have recorded a unique dataset of DTSC users across all age groups. To our knowledge, this is the first time that a multimodal video and audio dataset has been recorded at a SARS-CoV-2 DTSC. Our preliminary evaluations have shown that the data quality is sufficient for the application of already available procedures from computer vision and computer audition. This could allow for the assessment of SARS-CoV-2 infection-related symptoms from such data in the future. While we captured only a marginal number of PCR-positive subjects, our dataset can help with the development and refinement of computer vision algorithms beyond the COVID-19 pandemic: after full annotation for various computer vision tasks such as landmarking or micro-expression analysis, it has the potential to rank among the state-of-the-art in terms of the number of participants and age statistics. It has been recorded in an out-of-lab setting with realistic participant interaction, as could be encountered, for example, at the entrance of an employee parking area.

[1] COVID-19 infection and cardiac arrhythmias
[2] COVID-19 and Computer Audition: An Overview on What Speech & Sound Analysis Could Contribute in the SARS-CoV-2 Corona Crisis
[3] Effectiveness of airport screening at detecting travellers infected with novel coronavirus (2019-nCoV)
[4] A review of SARS-CoV-2 and the ongoing clinical trials
[5] Virological assessment of hospitalized patients with COVID-2019
[6] Clinical characteristics of coronavirus disease 2019 in China
[7] Novel Coronavirus disease 2019 (COVID-19): The importance of recognising possible early ocular manifestation and using protective eyewear
[8] Drive-through screening center for COVID-19: A safe and efficient screening system against massive community outbreak
[9] COVID-19-Verdacht: Maßnahmen und Testkriterien - Orientierungshilfe für Ärzte (Suspected COVID-19: measures and test criteria - guidance for physicians)
[10] Algorithmic Principles of Remote PPG
[11] Influences of Age, Gender, and Circadian Rhythm on Deceleration Capacity in Subjects without Evident Heart Diseases
[12] Sex differences in healthy human heart rate variability: A meta-analysis
[13] Dlib-ml: A machine learning toolkit
[14] Eye blinking as an indicator of fatigue and mental load - a systematic review
[15] Quantitative conjunctival provocation test for controlled clinical trials
[16] Real-Time Eye Blink Detection using Facial Landmarks
[17] How far are we from solving the 2D & 3D face alignment problem? (And a dataset of 230,000 3D facial landmarks)
[18] openSMILE - The Munich versatile and fast open-source audio feature extractor
[19] A spontaneous micro-expression database: Inducement, collection and baseline
[20] SAMM: A spontaneous micro-facial movement dataset
[21] A thermal infrared face database with facial landmarks and emotion labels
[22] Multi-modal RGB-depth-thermal human body segmentation

We would like to thank the Association of Statutory Health Insurance Physicians Saarland (KVSaar), in particular Dr. Joachim Meiser and Michael Schneider, for supporting the coordination with the local medical offices as well as for equipping us with PPE at the DTSC Saarbrücken. We thank the Federal Defense Forces of Germany, in particular Oberstleutnant Christoph Schacht, for their steady support during the study regarding the logistics and the installation of our data acquisition station at the DTSC. We acknowledge the support of ZF Friedrichshafen AG, in particular Florian Dauth, Volker Wagner, and Dr. Peter Reitz, for providing us with the research vehicle used for the data acquisition. A special note of thanks goes to the responsible authorities at the Regionalverband (regional association of towns) Saarbrücken, in particular Peter Gillo and Alexander Birk, and the responsible ethics commission at the medical council Saarland for helping us resolve all data security and ethical issues in an incredibly short amount of time. Furthermore, the authors would like to thank Benedikt Buchheit, Maximilian Becker, Dr. Farah I. Corona-Strauss, Adrian Mai, Richard Morsch, Patrick Schäfer, and Elena Schneider from our research unit for supporting the administration as well as the data processing. Finally, we would like to thank Professor Alexander L. Francis from Purdue University for providing valuable feedback on the first version of this manuscript.