key: cord-0150478-rmusq74z
authors: Guhan, Pooja; Awasthi, Naman; Das, Ritwika; Agarwal, Manas; McDonald, Kathryn; Bussell, Kristin; Manocha, Dinesh; Reeves, Gloria; Bera, Aniket
title: DeepTMH: Multimodal Semi-supervised framework leveraging Affective and Cognitive engagement for Telemental Health
date: 2020-11-17
journal: nan
DOI: nan
sha: 29017c5503913e2724d72bd75f5a0cc159ab493a
doc_id: 150478
cord_uid: rmusq74z

To aid existing telemental health services, we propose DeepTMH, a novel framework that models telemental health session videos by extracting latent vectors corresponding to Affective and Cognitive features frequently used in the psychology literature. Our approach leverages advances in semi-supervised learning to tackle data scarcity in the telemental health session video domain and consists of a multimodal semi-supervised GAN that detects important mental health indicators during telemental health sessions. We demonstrate the usefulness of our framework and contrast it against existing works on two tasks, Engagement regression and Valence-Arousal regression, both of which are important to psychologists during a telemental health session. Our framework reports a 40% improvement in RMSE over the SOTA method in Engagement regression and a 50% improvement in RMSE over the SOTA method in Valence-Arousal regression. To tackle the scarcity of publicly available datasets in the telemental health space, we release a new dataset, MEDICA, for mental health patient engagement detection. MEDICA consists of 1299 videos, each 3 seconds long. To the best of our knowledge, our approach is the first method to model telemental health session data based on psychology-driven Affective and Cognitive features while also accounting for data sparsity through a semi-supervised setup.

Figure 1. DeepTMH: We present a novel framework utilizing a semi-supervised multimodal GAN to detect mental health indicators for telemental health based on the psychology literature. We use two constructs, affective and cognitive engagement, to extract features that can effectively detect task-related patterns. The task can be engagement or valence-arousal estimation.

The World Health Organization defines mental health as "a state of well-being" that allows a person to lead a fulfilling and productive life and contribute to society [74]. It has been estimated that around 15.5% of the population suffers from mental illness globally, and these numbers are rising continuously. There is, however, a worldwide shortage of mental health providers. This, combined with issues related to affordability and reachability, has resulted in more than 50% [66] of mental health patients remaining untreated. The mental health landscape became even bleaker during the COVID-19 pandemic. However, the rapid expansion of telemental health services, especially during the pandemic, has increased access to clinical care options and introduced the opportunity to use artificial intelligence (AI) based strategies to improve the quality of human-delivered mental health services. Telemental health is the process of providing psychotherapy remotely, typically using HIPAA-compliant video conferencing [1]. Because it relies heavily on technology, even experienced human therapists face challenges engaging with patients, due to unfamiliarity with the setup as well as other factors.
For instance, in a telemental health session, the therapist has limited visual data (e.g., the therapist can only view the patient's face rather than their full body language), so the therapist has fewer non-verbal cues to guide their responses. It is also more difficult for the therapist to estimate attentiveness, since the eye contact expected during in-person sessions is replaced by the patient looking at a camera or screen. Video conferencing discussions may also appear more stilted, especially if there is an inadequate internet connection or other technological challenges. Such challenges make it difficult for a therapist to perceive several of the patient's mental health indicators, such as engagement level, valence, and arousal. In this paper, we develop a multimodal framework for modeling these three indicators: valence, arousal, and engagement. Engagement refers to the connection between a therapist and patient, including a sense of basic trust and a willingness or interest to collaborate, that is essential for the therapeutic process. It is critical both for retention in care and for the accuracy of diagnosis. While engagement is considered one of the key standards for mental health care [48], valence and arousal values have also been shown to be extremely useful for a therapist to understand and improve early detection of patient safety issues [35], including medication side effects (e.g., sedation), illicit substance use [14, 21, 32], and the risk of violence (self-harm, aggression) [43]. Valence refers to positive (versus negative) affectivity, and arousal captures a person's response to exciting information. Valence and arousal are two of the three dimensions of the Valence-Arousal-Dominance model. Therefore, developing a system that can provide feedback on engagement using multimodal data has the potential to improve therapeutic outcomes in telemental health. Since the patient population consists of individuals with mental illness, we ground our algorithm in the psychology and psychiatry literature so that its recognition and understanding of engagement are as close as possible to what a therapist would do. Taking inspiration from the existing psychology literature on engagement, we take a multicomponential approach and operationalize engagement in terms of affective and cognitive states.
1. Cognitive Engagement involves comprehending complex concepts and issues and acquiring difficult skills. It conveys deep (rather than surface-level) processing of information, whereby the person gains a critical or higher-order understanding of the subject matter and solves challenging problems [60].
2. Affective Engagement encompasses affective reactions such as excitement, boredom, curiosity, and anger [30]. The range of affective expressions will vary with individual demographic factors (e.g., age), cultural backgrounds and norms, and mental health symptoms.
These components categorize the different cues used by a mental health therapist to assess someone's engagement and valence-arousal levels. Additionally, since the state of these psychological indicators is temporal in nature, we are interested in analyzing them across micro-level time scales spanning a few seconds. These characteristics of our approach align with the person-oriented analysis discussed in [70].
One of our biggest motivations for incorporating these constructs is not only to lay the foundation for an analysis that clinicians can trust, but also to give users a way to understand the reasons behind a specific assessment. Different machine learning techniques could be explored to solve this problem, but obtaining a large amount of high-quality labeled data to train a robust model for predicting patient engagement or valence-arousal levels is laborious and requires expert medical knowledge. Considering that unlabeled data is relatively easier to collect, we propose a semi-supervised learning-based solution.
Main Contributions: The novel components and main contributions of our work include:
1. We propose DeepTMH, a novel framework that models telemental health session videos. Our algorithm takes into account the different components of engagement defined in the psychology literature, namely affective and cognitive engagement. These components are incorporated as the modalities in our multimodal framework.
2. We propose a novel regression-based framework that captures psychology-inspired cues capable of perceiving the important psychological indicators useful to a psychotherapist, namely patient engagement, valence, and arousal. Our focus in this work is solely on understanding the patient's mental health state, not the therapist's. The input to the proposed framework is the patient's visual, audio, and text data, and the output is the desired psychological indicator (engagement or valence-arousal).
3. We release a new dataset, MEDICA (Multimodal Engagement Detection In Clinical Analysis), to enhance mental health research, specifically towards understanding the engagement levels of patients attending therapy sessions. To the best of our knowledge, there is no other multimodal dataset that caters specifically to the needs of mental health research. Additionally, while there are some image-based or sensor-based datasets, there is no dataset that addresses engagement detection using visual, audio, and text modalities. MEDICA is a collection of around 1299 short video clips obtained from mock mental health therapy sessions, conducted between an actor (playing a patient) and a real therapist, of the kind used by medical schools in their psychiatry curricula.
We compare the performance of our proposed framework against prior methods on MEDICA and RECOLA. Using MEDICA, we evaluate the usefulness of our approach for engagement detection and report an RMSE of 0.10 on the engagement detection task. We train our framework separately on the RECOLA dataset for estimating valence-arousal in video clips and report an RMSE of 0.064 for valence estimation and 0.062 for arousal estimation.
In this section, we summarize prior work in related domains. We first review the literature on unimodal and multimodal frameworks for engagement detection in Section 2.1. In Section 2.2, we give a brief overview of previous work on the valence-arousal prediction task, and finally, in Section 2.3, we discuss prior semi-supervised learning-driven approaches. Facial expressions [46], speech [75], body posture [67], gaze direction [47], and head pose [68] have been used as single modalities for detecting engagement. Combining different modalities has been observed to improve engagement detection accuracy [24, 61].
[20] proposed a multimodal framework to detect the level of engagement of participants during project meetings in a work environment. The authors expanded the eRing work of Stanford's PBL Lab [37] by including information streams such as facial expressions, voice, and other biometric data. [44] proposed an approach to detect engagement levels in students during a writing task by using not only facial features but also features obtained from remote video-based detection of heart rate. The dataset used was generated by the authors, and they used self-reports instead of external annotation for classification purposes. [13] make use of facial expressions as well as body posture for detecting engagement in learners. [18] proposes the use of audio, facial, and body pose features to detect engagement and disengagement on an imbalanced in-the-wild dataset. Our objective in this work is to leverage the knowledge available from the psychology literature and understand the different nuances observed in the patient as they converse with their therapist. In telemental health sessions, it is challenging to obtain biometric data such as heart rate or to observe the patient's body posture. We overcome these dependencies by proposing a framework that relies on modalities extractable from videos, i.e., image frames, audio, and text (of the patient's speech). Additionally, our framework uses two constructs of engagement (cognitive and affective) to evaluate a person's level of engagement or valence-arousal more accurately. Due to the complexity of human interactions and affective expression, it is difficult to encapsulate all variations into a set of discrete labels. Therefore, there has been increasing research interest in estimating the two critical psychological dimensions, valence and arousal. While some works use valence-arousal values to make other kinds of predictions [80], there have also been works that describe methods to predict the values of valence and arousal individually [54, 63]. However, these methods ignore the inherent dependency between the two dimensions [51], and therefore a lot of crucial information can go missing. [55] captures and exploits this dependency by proposing a multi-task framework to predict the values of valence, arousal, and dominance. Multimodal approaches [6, 8, 10, 26, 41, 42, 78] using visual-audio data or audio-linguistic data have also been proposed to predict the values of valence and arousal. In contrast to these methods, our approach focuses specifically on deducing features inspired by psychology, which therefore serve as strong indicators of valence and arousal values. Recently, semi-supervised learning has gained much importance, as it has enabled the deployment of machine learning systems in real-life applications despite a lack of labeled data. Its ability to improve model performance when there are few labeled samples and a lot of unlabeled data has led to its wide adoption in applications like image search [36], speech analysis [33], and natural language processing. [79] proposed a novel multimodal SSL architecture to detect emotions on the RECOLA dataset using audio- and video-based modalities; the authors also describe a method to handle mislabeled data. To perform emotion recognition in speech, [56] proposes training ladder networks in a semi-supervised fashion. There has also been some exploration of SSL for engagement detection.
One of the earliest works in this direction is [2], which develops an engagement detection system, specifically for a student's emotional or affective engagement, in a semi-supervised fashion to personalize systems like Intelligent Tutoring Systems according to the student's needs. [49] conduct experiments to detect user engagement using a facial-feature-based semi-supervised model. Many state-of-the-art semi-supervised learning methods in the literature use Generative Adversarial Nets (GANs) [22], with the discriminator of the GAN serving as the classifier. One of the earliest works applying GANs to semi-supervised learning is [65]. [31] proposed an SSL-based Wasserstein GAN to perform multimodal emotion recognition using separate generators and discriminators for each of the modalities being explored, namely audio and visual. The conceptual model of our engagement regression framework is based on Bordin's 1979 theoretical work [11]. According to this theory, the therapist-patient alliance is driven by three factors: bond, agreement on goals, and agreement on tasks, which fit nicely with the features identified in this work. Bond corresponds to the affective component, while agreement on goals and tasks corresponds to the cognitive component. The merit of Bordin's approach is that it has been used for both child and adult therapy, and it is one of the more widely studied therapeutic alliance measures. We provide an overview of our proposed framework in the following section. We present an overview of the semi-supervised multimodal GAN engagement detection model in Fig. 2. Given video, audio, and text input corresponding to a subject, the first objective is to extract useful psychology-derived features and then predict the different psychological metrics of the patient under consideration. The affective state h_A requires all three modes, i.e., video frames, audio, and text, while the cognitive state h_C extracts useful information from the patient's speech. The concatenated vector h_T = (h_C, h_A) is fed to a GAN network based on [17] to perform semi-supervised regression. Cognitive engagement is usually measured and evaluated using neuropsychological exams that are typically conducted via in-person interviews or self-evaluations to gauge memory, thinking, and the extent of understanding of the topic of discussion [25, 71]. There has been a lot of work on determining biomarkers for detecting the presence or absence of cognitive engagement [40]. However, these methods are either offline or fail to take into account various essential perceptual indicators. We take inspiration from the psychology and medicine literature to understand the possible signs of cognitive engagement. Studies such as [58] state that people with poor cognitive control have difficulty engaging actively. People who lack cognitive engagement at a given moment show symptoms that resemble those seen in someone with early, mild signs of cognitive impairment (for example, they may be unable to interpret instructions easily or remember events) [72]. Detecting signs of cognitive impairment can therefore give some indication of a lack of cognitive engagement. Recently, there has been a lot of work on using speech as a potential biomarker for detecting cognitive impairment [19, 73]. Apart from viewing cognitive engagement through the lens of cognitive impairment, one can also relate it to stress.
It has also been found that stress negatively affects a person's cognitive functions, and this too can be detected using speech signals. Moreover, speech-based methods are attractive because they are non-intrusive, inexpensive, and can potentially run in real time. Four major families of speech-based features, namely glottal (f_g) [3], phonation (f_ph), articulation (f_ar), and prosody (f_pr) [62], have been found to be extremely useful for checking for signs of cognitive impairment and are widely used to detect early signs of severe cognitive impairment conditions such as Parkinson's and Alzheimer's disease [4, 9]. Therefore, the feature obtained from this mode, h_C, can be written as h_C = concat(f_g, f_ph, f_ar, f_pr). To understand affective engagement, we aim to check whether there exists any inconsistency between the emotions perceived from what the person said, the tone with which the person expressed it, and the facial expressions that the person made. Often, when a person is disengaged, the emotions perceived through the person's facial expressions may not match the emotions perceived from the statement the person made. [7, 59] suggest that when different modalities are modeled and projected onto a common space, they should point to similar affective cues; otherwise, the incongruity suggests distraction, deception, etc. Therefore, motivated by this, we adopt pre-trained emotion recognition models to extract affective features from each video sample separately. Let f_t, f_a, and f_v correspond to the affective features obtained from the text (features from the caption of the video), audio (features of the audio in the video), and video (video frame features), respectively. Therefore, h_A = concat(f_t, f_a, f_v). Existing engagement detection methods are supervised learning frameworks that are highly data-dependent. Additionally, it is challenging to obtain datasets that capture the different possible variations in the variables that define engagement and valence-arousal. Including a semi-supervised GAN in the framework not only allows us to work with very few labeled data points; in the process of trying to generate fake samples closer to the real sample distribution, the GAN generator also manages to capture the data manifold well [28]. This improves our model's generalizability and makes it more robust than previous approaches. There are two tasks of interest to us, engagement detection and valence-arousal estimation. We use the MEDICA dataset for engagement detection and RECOLA for valence-arousal estimation. Engagement is a fairly overloaded term, and its definition varies with the application, making it hard and expensive to collect, annotate, and analyze such data. As a result, few multimodal engagement detection datasets are currently available for us to use. The only data available to therapists for patient evaluations in telemedicine are video recordings of the patients. These recordings capture a patient sitting in front of a camera, talking in English to another person via the screen. CMU-MOSI [76], CMU-MOSEI [77], and SEND [53] are publicly available video datasets consisting of videos in similar settings, but they are not designed for engagement detection. To encourage research on engagement detection from multimodal data, we release a new dataset, MEDICA. We demonstrate the use of our proposed model and prior models on MEDICA.
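To make the feature construction above concrete, the following is a minimal sketch of how the cognitive vector h_C, the affective vector h_A, and the fused vector h_T could be assembled before being passed to the semi-supervised GAN. The helper names and most feature dimensions are placeholders (the 150-dimensional audio and 100-dimensional video affective features follow the sizes reported later in the paper); this is an illustration, not the authors' actual implementation.

```python
import numpy as np

def build_cognitive_vector(f_g, f_ph, f_ar, f_pr):
    """h_C = concat(f_g, f_ph, f_ar, f_pr): speech-based cognitive features."""
    return np.concatenate([f_g, f_ph, f_ar, f_pr])

def build_affective_vector(f_t, f_a, f_v):
    """h_A = concat(f_t, f_a, f_v): text, audio, and video affective features."""
    return np.concatenate([f_t, f_a, f_v])

def build_fused_vector(h_C, h_A):
    """h_T = (h_C, h_A): joint vector fed to the semi-supervised GAN regressor."""
    return np.concatenate([h_C, h_A])

# Example with placeholder dimensions for the speech and text features.
f_g, f_ph, f_ar, f_pr = (np.random.randn(d) for d in (20, 10, 8, 12))
f_t, f_a, f_v = np.random.randn(64), np.random.randn(150), np.random.randn(100)

h_T = build_fused_vector(build_cognitive_vector(f_g, f_ph, f_ar, f_pr),
                         build_affective_vector(f_t, f_a, f_v))
print(h_T.shape)  # (364,) with these placeholder sizes
```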
Given the lack of a dataset that allows researchers to use multimodal features (video, text, and audio) for engagement, we propose MEDICA, a novel dataset developed specifically for engagement detection using telemental health session videos. To allow this data to address a broader range of issues related to mental health, we also include labels pertaining to stress and emotions. To the authors' knowledge, this dataset is one of the first publicly available datasets that caters specifically to multimodal research on patient engagement in mental health. Table 1 presents a comparison between MEDICA and other related datasets.
1. Data Acquisition: We download publicly available mock mental health therapy session videos. These videos are primarily used by clinical therapists to teach their students how to address patients' grievances. The patients in the videos are being counseled for depression, social anxiety, and PTSD. We collected 13 videos, each around 20-30 minutes long. Our primary objective is to collect videos in which only one participant (therapist or patient) is visible in the camera frame. Additionally, we only take videos with a single patient who is conversing in English.
2. Annotation: The subjects display multiple emotions of varying intensities while expressing their thoughts and feelings. Therefore, the videos have been labeled for multiple emotions, which is intended to give the system the ability to understand the various interacting emotions of the users. The data has also been annotated for 8 emotions related to mental health, namely happy, sad, irritated, neutral, anxious, embarrassed, scared, and surprised. The annotation was carried out by a group of 20 psychotherapists.
RECOLA [64] (License: EULA) is a multimodal data corpus consisting of video recordings of spontaneous interactions between two people in French. The dataset has 23 videos in total, and each video shows one person (who is visible) conversing in French with another person (not visible in the video). The audio of the person not visible in the video is inaudible. Each video has been annotated by six people (three male, three female).
Annotation Processing: We clip the videos in the RECOLA dataset into 1-second segments and aggregate the valence-arousal values within each segment. Originally, the valence-arousal values were provided by annotators for every 0.04 seconds of video. Voice utterances carry little meaning over a span as short as 0.04 seconds, and video frame information changes little over such a span. Therefore, we remodeled the dataset by dividing the videos into clips of 1 second each and extracting the corresponding frames, audio, and text. For simplicity, we rescale the valence and arousal values to lie between 0 and 1 instead of -1 and 1. As there were six annotators, we weighted each annotator's reported valence/arousal equally and averaged the six values to arrive at the final valence and arousal values for every 1-second interval of the video. All videos in the dataset are 5 minutes long, and valence and arousal annotations are provided every 0.04 seconds; the net valence and arousal values for each second are taken to be the mean of the 25 sample points available in the original dataset for that second.
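The annotation preprocessing described above can be summarized in a short sketch. This is a minimal illustration under the stated assumptions: per-annotator ratings are available as an array sampled every 0.04 s (25 samples per second) in the original [-1, 1] range, and the array shapes and the random example data are hypothetical.

```python
import numpy as np

def aggregate_annotations(ratings, samples_per_sec=25):
    """Aggregate frame-level valence/arousal ratings into 1-second labels.

    ratings: array of shape (num_annotators, num_samples), values in [-1, 1],
             sampled every 0.04 s (25 samples per second).
    Returns: array of shape (num_seconds,) with values rescaled to [0, 1].
    """
    # Average the annotators first (each annotator weighted equally).
    mean_rating = ratings.mean(axis=0)
    # Drop any trailing samples that do not fill a whole second.
    num_seconds = mean_rating.shape[0] // samples_per_sec
    mean_rating = mean_rating[: num_seconds * samples_per_sec]
    # Average the 25 samples within each 1-second window.
    per_second = mean_rating.reshape(num_seconds, samples_per_sec).mean(axis=1)
    # Rescale from [-1, 1] to [0, 1].
    return (per_second + 1.0) / 2.0

# Hypothetical example: 6 annotators, a 5-minute video (300 s * 25 samples/s).
valence_raw = np.random.uniform(-1, 1, size=(6, 300 * 25))
valence_labels = aggregate_annotations(valence_raw)
print(valence_labels.shape)  # (300,)
```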
In this mode, given the input audio, we extract glottal, prosody, articulation, and phonation features using the librosa [39] and Praat [57] libraries. Glottal features (f_g) help characterize speech under stress [15]. During periods of stress, there is an aberration in the amount of tension applied in the opening (abduction) and closing (adduction) of the vocal cords [45]. Prosody features (f_pr) characterize the speaker's intonation and speaking style; here, we analyze variables such as timing, intonation, and loudness during the production of speech. Phonation (f_ph) in cognitively impaired people is characterized by bowing and inadequate closure of the vocal cords, which produces problems in the stability and periodicity of the vibration. These are analyzed in terms of perturbation measures such as jitter (temporal perturbation of the fundamental frequency), shimmer (temporal perturbation of the amplitude of the signal), the amplitude perturbation quotient (APQ), and the pitch perturbation quotient (PPQ); the degree of unvoiced segments is also included. Articulation (f_ar) issues in cognitively impaired patients are mainly related to reduced amplitude and velocity of lip, tongue, and jaw movements; the analysis is based primarily on the computation of the first two vocal formants, F_1 and F_2. All of these features are extracted from 1-second audio clips for each sample point in the dataset D.
In this mode, we extract affective features from the audio, video, and text inputs.
1. Audio (f_a): Mel-frequency cepstral coefficient (MFCC) features were extracted from the audio clips available in RECOLA and MEDICA. The affective features were then extracted using an MLP network trained for speech emotion recognition on the RAVDESS [34] and CREMA-D [12] datasets. A feature vector of length 150 was obtained for each audio clip.
2. Video (f_v): The VGG-B architecture used in [5] was used to extract affective features from the video frames. The output dimension of the second-to-last layer was modified to give a feature vector of length 100.
3. Text (f_t): For MEDICA, we extract affective features from the text using a BERT-based model pretrained on the GoEmotions dataset. For the RECOLA dataset, we extract affective features from the text using the CamemBERT model.
We adopt a multimodal semi-supervised GAN-based network architecture for regressing the values of engagement or valence-arousal corresponding to each feature tuple h_T. The network builds on the semi-supervised regression framework SR-GAN proposed by Olmschenk in [52]. Our model training pipeline is described in Fig. 2. The five losses used to train these components are L_lab, L_un, L_fake, L_gen, and L_grad.
Labeled loss: the mean squared error between the output and the ground truth.
Unlabeled loss: minimizes the distance between the feature spaces of the unlabeled and labeled data.
Fake loss: maximizes the distance between the features of the unlabeled data and those of the fake samples.
Generator loss: minimizes the distance between the feature spaces of the fake and unlabeled data.
Gradient penalty: as described in [52], a gradient penalty is used to keep the gradient of the discriminator in check, which helps convergence. The gradient penalty is calculated with respect to a randomly chosen point on the convex manifold connecting the unlabeled samples to the fake samples.
We use the standard evaluation metric of RMSE to evaluate all our approaches.
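As a rough illustration of the five losses listed above, the sketch below follows the SR-GAN-style recipe in spirit. The discriminator D is assumed to return both its regression output and an intermediate feature vector, the feature distance (squared difference of batch-mean features), the unit loss weights, and the helper names (feature_distance, srgan_style_losses) are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def feature_distance(feat_a, feat_b):
    # Distance between two batches in the discriminator's feature space,
    # computed here as the squared L2 distance of the batch means (assumption).
    return (feat_a.mean(dim=0) - feat_b.mean(dim=0)).pow(2).sum()

def srgan_style_losses(D, G, x_lab, y_lab, x_unlab, z):
    """D maps an h_T-like vector to (prediction, features); G maps noise z to fake vectors."""
    pred_lab, feat_lab = D(x_lab)
    _, feat_unlab = D(x_unlab)
    x_fake = G(z)
    _, feat_fake = D(x_fake.detach())

    l_lab = F.mse_loss(pred_lab.squeeze(-1), y_lab)       # labeled loss
    l_un = feature_distance(feat_unlab, feat_lab)          # unlabeled loss
    l_fake = -feature_distance(feat_unlab, feat_fake)      # fake loss (maximize distance)

    # Gradient penalty at points interpolated between unlabeled and fake samples.
    eps = torch.rand(x_unlab.size(0), 1, device=x_unlab.device)
    x_hat = (eps * x_unlab + (1 - eps) * x_fake.detach()).requires_grad_(True)
    pred_hat, _ = D(x_hat)
    grads = torch.autograd.grad(pred_hat.sum(), x_hat, create_graph=True)[0]
    l_grad = ((grads.norm(2, dim=1) - 1) ** 2).mean()

    d_loss = l_lab + l_un + l_fake + l_grad                # discriminator objective

    _, feat_fake_for_g = D(G(z))                           # generator loss
    g_loss = feature_distance(feat_fake_for_g, feat_unlab.detach())
    return d_loss, g_loss
```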
The semi-supervised models were trained on an NVIDIA GeForce GTX 1080 Ti GPU with a batch size of 512 and a learning rate of 0.0001 for 50 epochs. We present two sets of experiments to understand the usefulness of our proposed method. The first experiment demonstrates the ability of our model to predict a score for the state of engagement exhibited by the person in the video. This experiment was performed on the MEDICA dataset. As our proposed methodology leverages a semi-supervised approach, we extract labeled samples from MEDICA and unlabeled samples from the MOSEI dataset. After preprocessing, we extract 12854 unlabeled data points from MOSEI. We split the 1299 labeled samples from MEDICA 70:10:20 into training, validation, and test sets, respectively. The ratio of labeled to unlabeled training data points is therefore 909:12854. We compare our model with the following SOTA methods for engagement detection: we use the publicly available implementation of LBP-TOP [27] and train the entire model on MEDICA; S3VM [49] does not have a publicly available implementation, so we reproduce the method to the best of our understanding. Table 2 summarizes the RMSE values obtained for all the methods described above. We observe an improvement of at least 40%. Our approach is one of the first to perform engagement detection specifically for mental health patients in a telemental health session setting. The modules used, specifically cognitive and affective engagement, help the overall framework effectively mimic the way a psychotherapist perceives the patient's level of engagement. Like a psychotherapist, these modules look for specific engagement-related cues exhibited by the patient in the video.
The second experiment demonstrates DeepTMH's capability to estimate the valence-arousal values of a person in a telemental health video. These experiments were performed on the RECOLA dataset. We split the RECOLA dataset into train and test sets with a 90:10 ratio and ensure that the two sets are mutually exclusive with respect to the participants appearing in each set. This prevents data leakage between our training and testing. The training set is further divided into labeled and unlabeled sets with a 40:60 ratio. We compare our model with the following SOTA methods for valence-arousal estimation:
1. Deng, Didan, et al. [16] proposed a multitask CNN-RNN model for jointly learning three tasks, namely facial AU detection, expression classification, and valence-arousal estimation. We use their publicly available implementation and train the entire model on RECOLA.
2. Nguyen, Dung, et al. [50] proposed a network consisting of a two-stream autoencoder and an LSTM for integrating visual and audio signals to estimate valence-arousal values. The authors report results of their framework on RECOLA.
3. Lee, Jiyoung, et al. [29] extract color, depth, and thermal information from videos and pass this as multimodal input to a spatiotemporal attention network to predict valence-arousal values. The authors report results of their framework on RECOLA.
Table 3 summarizes the results obtained. We notice that our framework outperforms the SOTA methods discussed by almost 50%. Contrary to other methods, DeepTMH understands and effectively models the different facets of arousal and valence. While the affective engagement module relates to the kind of emotions experienced by the person, the cognitive module enables the system to understand the degree of the emotion being experienced (extreme, normal, or something in between). This aligns with the empirical and theoretical literature [23, 38, 69] on the multifaceted nature of valence and arousal.
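The participant-exclusive split described above can be sketched as follows. This is a minimal illustration using scikit-learn's GroupShuffleSplit; the clip metadata, random participant assignments, and variable names are hypothetical, while the 90:10 and 40:60 ratios follow the description in the text.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical clip-level metadata: one participant id per 1-second clip.
rng = np.random.default_rng(0)
clip_ids = np.arange(2000)
participants = rng.integers(0, 23, size=clip_ids.shape[0])  # 23 RECOLA subjects

# 90:10 train/test split, keeping each participant in exactly one set.
gss = GroupShuffleSplit(n_splits=1, test_size=0.10, random_state=0)
train_idx, test_idx = next(gss.split(clip_ids, groups=participants))
assert set(participants[train_idx]).isdisjoint(participants[test_idx])

# Within the training set, a 40:60 labeled/unlabeled split of the clips.
train_clips = rng.permutation(train_idx)
n_labeled = int(0.40 * len(train_clips))
labeled_idx, unlabeled_idx = train_clips[:n_labeled], train_clips[n_labeled:]
print(len(train_idx), len(test_idx), len(labeled_idx), len(unlabeled_idx))
```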
Table 3. Valence-arousal estimation on RECOLA (RMSE).
Method                      Valence RMSE    Arousal RMSE
Nguyen, Dung, et al. [50]   0.187           0.474
Deng, Didan, et al. [16]    0.14            0.080
Lee, Jiyoung, et al. [29]   0.102           -
DeepTMH                     0.064           0.062

To demonstrate the importance of the different components (affective and cognitive) used in our approach, we run DeepTMH on MEDICA and RECOLA after removing either the affective or the cognitive engagement mode and report our findings. We summarize the ablation results in Table 4. We observe that the ablated frameworks (i.e., using only mode A or mode C) do not perform as well as the full DeepTMH. Therefore, to understand and verify the contribution of these modes further, we leveraged the other labels (stress, hesitation, and attention) available in MEDICA and performed regression tasks on all of them using DeepTMH. We observed that mode C performs better at predicting stress and hesitation values, while mode A performed better at estimating a patient's level of attentiveness. These results agree with our understanding of cognitive and affective engagement (Sections 3.2 and 3.3). Therefore, the combination of the affective and cognitive engagement modes helps efficiently predict the engagement level of the patient. It is interesting to observe from Table 5 that while valence depends more on mode A, arousal seems to be slightly more dependent on mode C. Modes A and C, both individually and combined, outperform the SOTA methods in arousal estimation, but the combination of A and C drastically decreases the RMSE for valence.
One of the primary limitations of the proposed framework is that predictions for the various tasks discussed (i.e., engagement, valence, and arousal prediction) may not be optimal in cases of occlusion or missing modalities, data corruption due to low internet bandwidth, or when integrating wearable devices with our model. We recognize that such situations are very possible in a telemental health session and hope to incorporate solutions for them in future work. We would also like to explore making the predictions more explainable, so that psychotherapists can receive evidence-guided suggestions to inform their final decisions. MEDICA has been created using publicly available videos of mock therapy sessions. These videos are often used by medical schools for teaching purposes. The results discussed in this paper are based on the data we have collected so far. There is an ongoing effort to expand the dataset further to include the many kinds of variation arising from cultural and geographical differences among patients and, therefore, make it more inclusive. Due to privacy concerns, it is difficult to obtain and release actual therapy session videos. Therefore, to further research on the development of tools useful for psychotherapists, we resort to building a collection of high-quality annotated mock therapy session videos approved by a group of experienced psychotherapists.
We propose DeepTMH, a framework that leverages affective and cognitive features from the psychology literature to estimate the perceived engagement of a person. As labeled datasets for telemental health are scarce, we create a GAN-based semi-supervised training methodology to train for perceived engagement detection using cues from psychology research.
We also perform ablation studies to capture the importance of the different modes and their respective features. Even though the paper focuses on telemental health, we believe that our method can be easily adapted for other mental health support tools such as virtual therapists or social robots. Although mental health is an issue that needs immediate attention, the lack of adequate datasets for leveraging the benefits of AI remains a significant challenge. Therefore, to promote better research opportunities, we release a new dataset called MEDICA. As part of future work, we hope to grow this dataset further to accommodate other related tasks that can substantially improve existing telemental health services.

References
Sinem Emine Mete, Bert Arnrich, and Asli Arslan Esme. Semi-supervised model personalization for improved detection of learner's emotional engagement.
Automatic speech analysis to early detect functional cognitive decline in elderly population.
Parkinson's disease and aging: analysis of their effect in phonation and articulation of speech.
Heather Adair and Peter Carruthers. Attitude-Scenario-Emotion (ASE) sentiments are superficial. Behavioral and Brain Sciences, vol. 40, 2017.
Predicting valence and arousal by aggregating acoustic features for acoustic-linguistic information fusion.
Emotion analysis in man-machine interaction systems.
Learning unseen emotions from gestures via semantically-conditioned zero-shot perception with adversarial autoencoders.
Glottal flow patterns analyses for Parkinson's disease detection: acoustic and nonlinear approaches.
Take an emotion walk: Perceiving emotions from gaits using hierarchical attention pooling and affective mapping.
The generalizability of the psychoanalytic concept of the working alliance.
CREMA-D: Crowd-sourced emotional multimodal actors dataset.
An ensemble model using face and body tracking for engagement detection.
Trauma and substance cue reactivity in individuals with comorbid posttraumatic stress disorder and cocaine or alcohol dependence. Drug and Alcohol Dependence.
Analysis of glottal waveforms across stress styles.
Multitask emotion recognition with incomplete labels.
MarginGAN: Adversarial training in semi-supervised learning.
Multimodal approach to engagement and disengagement detection with highly imbalanced in-the-wild data.
Engagement detection in meetings.
Prescription opioid misusers exhibit blunted parasympathetic regulation during inhibitory control challenge.
Generative adversarial nets.
Psychological skills for enhancing performance: Arousal regulation strategies.
Embodied affect in tutorial dialogue: student gesture and posture.
Measuring cognitive engagement with self-report scales: Reflections from over 20 years of research.
Affective video content analysis based on multimodal data fusion in heterogeneous networks.
Prediction and localization of student engagement in the wild.
Semi-supervised learning with GANs: Manifold invariance with improved inference.
Multi-modal recurrent attention networks for facial expression recognition.
Interrelations of behavioral, emotional, and cognitive school engagement in high school students.
Semi-supervised multimodal emotion recognition with improved Wasserstein GANs.
Effects of substance cues in negative public service announcements on cognitive processing.
Graph-based semi-supervised learning for phone and segment classification.
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English.
Looking at the other side of the coin: a meta-analysis of self-reported emotional arousal in people with schizophrenia.
Semi-supervised learning for image retrieval using support vector machines.
eRing: Body motion engagement detection and feedback in global teams.
Positive emotion enhances association-memory.
Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python.
Biosensors show promise as a measure of student engagement in a large introductory biology course.
M3ER: Multiplicative multimodal emotion recognition using facial, textual, and speech cues.
Emotions don't lie: An audio-visual deepfake detection method using affective cues.
Exposure to violence and neglect images differentially influences fear learning and extinction.
Automated detection of engagement using video-based estimation of facial expressions and heart rate.
Evaluating objective feature statistics of speech as indicators of vocal affect and depression.
Engagement detection in e-learning environments using convolutional neural networks.
Estimating user's engagement from eye-gaze behaviors in human-agent conversations.
Engagement: A New Standard for Mental Health Care.
Semi-supervised detection of student engagement.
Thin Khac Nguyen, Sridha Sridharan, and Clinton Fookes. Deep auto-encoders with sequential learning for multimodal dimensional emotion recognition.
Joint model-parameter validation of self-estimates of valence and arousal: Probing a differential-weighting model of affective intensity.
Generalizing semi-supervised generative adversarial networks to regression.
Modeling emotion in complex stories: the Stanford Emotional Narratives Dataset.
Defining emotionally salient regions using qualitative agreement method.
Jointly predicting arousal, valence and dominance with multi-task learning.
Semi-supervised speech emotion recognition with ladder networks.
Automatic differentiation in PyTorch.
The disengagement of visual attention: an eye-tracking study of cognitive impairment, ethnicity and age.
Reading between the lies: Identifying concealed and falsified emotions in universal facial expressions.
Orienting of attention.
Multimodal student engagement recognition in prosocial games.
Dagmar Berankova, Tereza Necasova, Zdenek Smekal, and Radek Marecek. Speech prosody impairment predicts cognitive decline in Parkinson's disease. Parkinsonism & Related Disorders.
Prediction of asynchronous dimensional emotion ratings from audiovisual and physiological data.
Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions.
Improved techniques for training GANs.
Mental health.
Automatic analysis of affective postures and body motion to detect engagement with a game companion.
Student engagement detection using emotion analysis, eye tracking and head movement with machine learning.
Levels of valence.
The challenges of defining and measuring student engagement in science.
Anita Liberalesso Neri, and Mônica Sanches Yassuda. Cognitive performance and engagement in physical, social and intellectual activities in older adults: The FIBRA study.
Identification of mild cognitive impairment from speech in Swedish using deep sequential neural networks.
WHO urges more investments, services for mental health.
Detecting user engagement in everyday conversations.
MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos.
Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph.
M3T: Multi-modal continuous valence-arousal estimation in the wild.
Enhanced semi-supervised learning for multimodal emotion recognition.
Emotion-based end-to-end matching between image and music in valence-arousal space.