key: cord-0196864-zxlmwzwy
authors: Malhotra, Ganeshan; Waheed, Abdul; Srivastava, Aseem; Akhtar, Md Shad; Chakraborty, Tanmoy
title: Speaker and Time-aware Joint Contextual Learning for Dialogue-act Classification in Counselling Conversations
date: 2021-11-12
journal: nan
DOI: nan
sha: 46af47efcc2b11c0b68cab3f59326115835ad0ab
doc_id: 196864
cord_uid: zxlmwzwy

The onset of the COVID-19 pandemic has put the mental health of people at risk. Social counselling has gained remarkable significance in this environment. Unlike general goal-oriented dialogues, a conversation between a patient and a therapist is considerably implicit, though the objective of the conversation is quite apparent. In such a case, understanding the intent of the patient is imperative in providing effective counselling in therapy sessions, and the same applies to a dialogue system as well. In this work, we take a small but important step in the development of an automated dialogue system for mental-health counselling. We develop a novel dataset, named HOPE, to provide a platform for dialogue-act classification in counselling conversations. We identify the requirements of such conversations and propose twelve domain-specific dialogue-act (DAC) labels. We collect 12.9K utterances from publicly-available counselling session videos on YouTube, extract their transcripts, clean them, and annotate them with DAC labels. Further, we propose SPARTA, a transformer-based architecture with novel speaker- and time-aware contextual learning for dialogue-act classification. Our evaluation shows convincing performance over several baselines, achieving state-of-the-art on HOPE. We also supplement our experiments with extensive empirical and qualitative analyses of SPARTA. Mental illness remains an alarming global health issue today.
Due to the COVID-19 pandemic, there has been a significant growth in mental health disorders such as depression, attention deficit hyperactivity disorder (ADHD), and hypertension [27]. A recent study shows an unprecedented 20% increase in patients with mental health illness 1 . A similar study discusses the adverse impact of the pandemic on the mental health of US college students [43]. Counselling therapy can benefit many people at risk by providing them emotional support. Amidst the surge in the number of patients, it has become a challenge for therapists to diagnose so many patients. On the other hand, patients have found it difficult to access the services of therapists amid lockdown. Counseling therapy is a sophisticated procedure that deals with the expression of emotion and intent of patients with different personalities. To build a strong therapeutic relationship with the patient, it is essential for a therapist to develop a better understanding of the patient's implicit intents. The nature of conversations in a social counselling setting is particularly distinct as compared to conventional chit-chat or goal-oriented conversations. It follows

1 https://cutt.ly/WbEziBF/

Table 1: Example of a sample conversation session between a patient and a therapist. Each utterance has an associated dialogue-act classification (DAC) label.

Therapist: Jackie, how are you? [Greeting]
Patient: Okay, how are you? [Greeting]
Therapist: Thanks for asking. I see that you have signed a release so I could talk to your mother and that she brought you in today. What's going on there?
Patient: They think I have a drinking problem. My family..
Therapist: Your family thinks you have a drinking problem? [Request]
Patient: Yeah. So we really started this was this past weekend. They came to pick me up for my bridal shower. And I was I was drunk when they came to get me so I couldn't go and now everybody's pretty pissed at me.
Therapist: So they asked you to come into the agency? [Clarification Request]
Patient: Yeah, you know, I don't want them to hate me or anything. So I agreed to come. [Clarification Delivery]

a pattern which is different from both goal-oriented and general chit-chat based conversations. Usually, these conversations begin with greetings, followed by the therapist inquiring about the problems faced by the patient. The therapist usually delves deeper into a particular problem, acquiring as much context and fine-grained information as possible before advising a remedy. These conversations also heavily utilise the contextual information of the entire conversation history. Moreover, the prime objective of the conversation is to understand the explicit and implicit requirements of the patients and suggest potential solutions accordingly. In comparison, a traditional goal-oriented dialogue system does not regard any implicit requirements, whereas a chit-chat based system lacks a target and does not care about the final solution. Another major difference is the length of utterances and conversations in a counseling session. These are particularly lengthy, as patients describe their difficulties and issues, while the therapists list out possible causes and preventive solutions. The task of dialogue-act classification (DAC) is cardinal in a dialogue system, and even more so in counselling based conversations. It deals with understanding the intended requirements of the utterances, which essentially act as one of the precursors for dialogue response generation. For instance, we present an example of a therapy session in Table 1. For each utterance in Table 1, a corresponding label defines its dialogue-act. As we can observe from the first two utterances, they are a part of the complementary greetings that usually occur at the beginning of a natural conversation. Subsequently, in the third utterance, the therapist leads the conversation and requests information.
In response, the patient delivers the requested information. Earlier studies like [1, 33] tackle the task of dialogue-act classification on chit-chat based conversation datasets such as the Switchboard corpus [11]. Their proposed architectures take into account the contextual dependency of an utterance, which aids in efficient dialogue-act classification. For example, an utterance tagged with the dialogue-act 'question' has a high probability of being followed by an utterance tagged 'answer'. In another work, Shang et al. [39] argued that the information of speaker change is a critical feature in the dialogue-act classification task. Considering the severity of the issue and the complexity of the task, designing an automated system can facilitate the counselling sessions or assist the therapist, thus allowing them to cater to more patients. Literature in the natural language processing domain suggests a significant effort in understanding and building models for conversational dialogue [3, 21]. However, there are hardly any models that support mental-health counseling as a dialogue system; this is primarily due to the lack of data. In this paper, we aim to address these limitations by creating the HOPE 2 dataset, which consists of therapy conversations covering cognitive-behavioral therapy (CBT), child therapy, family therapy, etc. The HOPE dataset contains ∼12.9K utterances across 212 mental-health counseling sessions. Each utterance in the dataset is tagged with one of the 12 counseling-aligned dialogue-act labels (c.f. Section 3). We also propose SPARTA 3 , a novel speaker- and time-aware contextual transformer model for dialogue-act classification. SPARTA exploits both the local and global contexts along with the speaker dynamics in the dialogue. We model the problem as a dialogue-level sequence classification task, where the aim is to predict an appropriate dialogue-act for each utterance in a dialogue.
To incorporate the global context, we employ a Gated Recurrent Unit (GRU) [5] that takes an utterance representation at each step of the dialogue. In addition, we introduce a novel time-aware attention mechanism to capture the local context: a sliding-window based memory unit is maintained, and subsequently, a cross-attention between the current utterance and the memory unit is computed. Our evaluation shows substantial improvement in performance in comparison to the recent state-of-the-art systems. Furthermore, we provide empirical evidence for each module of SPARTA using an extensive ablation study and detailed analyses.

Major contributions: We summarize the main contributions of our current work as follows:
• We present HOPE, a novel and large-scale manually annotated, counselling-based conversation dataset for dialogue-act classification. To the best of our knowledge, the current study is one of the first efforts in compiling a dataset related to a mental-health counseling dialogue system.
• To cater to the requirements of counseling conversations, we define a novel hierarchical annotation scheme for dialogue-act annotation. We propose twelve dialogue-act labels that are aligned with mental-health counseling sessions.
• We propose SPARTA, a novel dialogue-act classification system that combines speaker dynamics and local context through a time-aware attention mechanism, along with long-term global context.
• We perform an extensive ablation study to establish the efficacy of each module of SPARTA. Furthermore, a comparative analysis shows that it attains state-of-the-art performance on HOPE.

2 Mental Health cOunselling of PatiEnts
3 SPeaker and time-AwaRe conTextual trAnsformer

Societal Impact: A significant increase in the number of mental health issues has been observed in the last few years. The lack of therapists is a stumbling block to the mental health of society.
Therapist-bots (mental health chatbots) could bridge the gap by effectively interacting with patients and understanding them. However, end-to-end chatbots in the mental health domain are delicate, as every aspect of the therapy needs to be perceived precisely. Our research aims at the dialogue-understanding module in the mental health conversational system. Ongoing research in the mental health domain could exploit this work to help chatbots understand therapy conversations in a better way. We have made the code and (a subset of) the dataset available on GitHub 4 . The current work is connected to the existing literature in at least two dimensions: first, dialogue-act classification, and second, text processing for mental-health counseling. We present our literature survey for both dimensions. Mental Health and the role of text processing. The impact of natural language processing on the study of mental health is substantial. Though the field of therapeutic discourse analysis has been around since the 1960s [41], research on dialogue systems in the mental-health domain is at a nascent stage. Previous research on mental health intervention systems primarily focused on problems related to suicide ideation prevention [25] or generating empathetic responses to users [28]. Levis et al. [20] studied mental-health notes to detect suicide ideation, whereas Fine et al. [10] employed text processing to detect symptoms of anxiety and depression using social media text. In another work, Pennebaker et al. [31] showed the importance of several keywords in revealing users' social and psychological behaviours. Wei et al. [44] proposed an emotion-aware model for human-like emotional conversations. Gratch et al. [12] presented the Distress Analysis Interview Corpus (DAIC) to identify verbal and non-verbal cues of distress during an interview. Among other methods, the data collection was done using the automated agent Ellie, based on the work of [26].
Recently, Kretzschmar et al. [16] presented a survey of chatbots in the mental-health domain. The authors compared the strengths and weaknesses of three existing conversational agents, namely Wysa 5 , Woebot 6 , and Joy 7 . The drawback of these systems is that some of them are rule-based, while others are primarily data collection modules for offline counselling. In comparison, ours is an effort toward the development of an online counselling system. Dialogue-act Classification. Studies on dialogue systems have fascinated researchers ever since ELIZA [45], the first rule-based system, was developed. The dialogue-act classification module is one of the most critical components of a dialogue agent, serving at the natural language understanding helm of the dialogue system. Previous research treats the problem of dialogue-act classification either as a standalone text classification task or as a sequence labelling task. Recently, Colombo et al. [7] suggested a sequence-to-sequence text generation approach for dialogue-act classification. Earlier studies like Reithinger and Klesen [36] and Grau et al. [13] focused on lexical, syntactical, and prosodic features for classification. In another work, Ortega et al. [29] employed CNNs [18] and CRFs [17]. Lee and Dernoncourt [19] proposed a method based on CNNs and RNNs [37] that used the previous contextual utterances to predict the dialogue-act of the current utterance. Ahmadvand et al. [1] proposed a contextual dialogue-act classifier (CDAC) and used transfer learning to train their model on human-machine conversations. The model proposed by Raheja and Tetreault [34] uses a self-attention mechanism on RNNs to achieve impressive results on benchmark datasets. However, these works do not take into account the speaker-level information which is imperative in social-counseling based conversations. Hua et al.
[14] proposed a method to detect relevant context in retrieval-based dialogue systems. Chen et al. [4] proposed a CRF-attentive structured network to capture long-range contextual dependencies using a structured attention mechanism. Yu et al. [46] proposed to classify concurrent dialogue-acts of an utterance by modelling the contextual features. Recently, Qin et al. [32] used co-interactive relation networks to jointly capture the sentiment and the dialogue-act associated with an utterance. Their model achieved significant results on the Mastodon [3] and DailyDialog [21] datasets. Similarly, Saha et al. [38] jointly learn the dialogue-act classification and emotion recognition tasks in a multi-modal setup. To the best of our knowledge, the model by Shang et al. [40] is the first one which takes speaker transitions into account for DAC. It uses a modified version of CRFs to capture the speaker change and achieves state-of-the-art results on the SwitchBoard dataset [11]. Similar to earlier works, we also treat DAC as a dialogue-level sequence labelling task. We jointly take the global and local contexts of the conversation and the speaker of the utterance as the key factors for the classification. We hypothesize that such information offers crucial clues at different stages of the model. In contrast, the existing systems incorporated the role of global context and speaker dynamics independently. In this section, we present our dialogue-act classification dataset, called HOPE. In total, we annotated ∼12.9K utterances with 12 dialogue-act labels carefully designed to cater to the requirements of a counseling session. The remainder of this section furnishes the details of data collection, annotation schemes, dialogue-act labels, and necessary statistics. Data Collection. One of the major hurdles we faced in the process of data collection was the unavailability of public counseling sessions, mainly because they usually contain sensitive personal information.
To curate this data, we carefully explored the web and collected publicly-available pre-recorded counselling videos on YouTube. To ensure confidentiality, we randomly assign synthetic names to all patients and therapists in all examples. In the next step, we extract the transcriptions of each video using OTTER (https://www.otter.ai/), an automatic speech recognition tool. Subsequently, we correct transcription errors to remove any noise (i.e., spelling or grammatical mistakes). The data collection process provides us 12.9K utterances from 212 counseling therapy sessions - all of them are dyadic conversations only. Dataset Annotation Scheme. Since counseling conversations have inherent differences with standard conversations (such as the SwitchBoard dataset conversations), they demand a carefully designed set of dialogue-act labels capable of catering to the requirements of counselling conversations. Hence, in consultation with therapists and counselling experts, we design a set of 12 dialogue-act labels that are arranged in a hierarchy. These labels are designed to capture the intents of both the patient and the therapist, and also to be easily comprehensible to assist in the development of a conversational dialogue system. A high-level annotation hierarchy is shown in Figure 1. Each utterance in the dialogue belongs to one of the three categories 8 - speaker initiative, speaker responsive, and general or mixed initiative. Our annotation scheme assigns three distinct dialogue-act labels to the first two categories, while the remaining four labels belong to the general category. • Speaker initiative labels: When the speaker drives the conversation for the next few utterances. -Information Request (IRQ): This label is used as a request for some information, e.g., 'Tell me your name.'. -Yes/No Question (YNQ): The YNQ label is similar to IRQ; however, the expected response is a trivial yes or no answer. For example, the utterance, 'Did you complete your work?'
shows how a query is raised with an expected answer of yes or no. -Clarification Request (CRQ): This label is assigned to those utterances in which a speaker asks for further clarification about the topic that is currently being discussed. The distinction between IRQ and CRQ is the continuation of topics - IRQ is used whenever a discussion about a new topic or entity is started, and CRQ is used when the speaker wants to gather more information and delves deeper into the current topic at hand. For instance, if the therapist asks 'You're in a situation where there is alcohol?' and follows it with another utterance 'And what sort of situations are you in?', the latter utterance is an example of a clarification request, as the therapist delves deeper to seek causes of distress for the patient. -Positive Answer (PA): This label is used when the utterance is an answer in the form of a simple yes to a question that was previously uttered, e.g., 'Yes' in response to 'Are you alright?'. -Greeting (GT): Each session usually starts with a greeting by one speaker and an appropriate response from the other, e.g., 'Hello, how are you?' and 'I am fine, thank you.'. We tag each of these utterances as GT. -Acknowledgment (ACK): In normal conversation, very often, we utter phrases (e.g., 'Yeah! You are right.') to acknowledge the other speaker or to show our agreement without an explicit information request, question, or command. We also observe such cases in our collected dataset; hence, we tag them as ACK. -General Chit-Chat (GC): Other utterances that do not belong to any of the above labels are tagged as GC, possibly because of the vagueness and the lack of sense in the context of the conversation. For example, the utterance 'It's a beautiful day today!' is an example of GC. Annotation Process. We employed three annotators 9 who are experts in linguistics.
To ensure a common understanding of the task and the annotation scheme, we took a sample of the dataset and asked each annotator to annotate it as per the prepared set of guidelines. Following this, every annotation was discussed in the presence of the annotators and an expert therapist as moderator to ensure consistency. After a couple of annotation and discussion rounds, the whole dataset was made available for annotation. After the annotation process, we compute Cohen's Kappa score [6] to measure the agreement among annotators. We obtain an inter-rater agreement score of 0.7234, which falls under the 'substantial' category [6]. Given two consecutive utterances u_t and u_{t+1} with corresponding dialogue-act labels d_t and d_{t+1}, each directed link (d_t → d_{t+1}) between the two labels reflects their co-occurrence, and the strength of the link signifies their co-occurrence count. Though a significant number of IRQ utterances are followed by ID utterances, in a few cases, they are followed by utterances with other dialogue-acts as well (e.g., ACK, GC, etc.). Similarly, ID utterances are not always preceded by IRQ utterances. We observe similar behaviour for the YNQ & PA, NA, and CRQ & CD dialogue-act pairs as well. Table 2 provides the statistics of the HOPE dataset. In total, HOPE has transcripts for 12.9K utterances which are annotated with 12 dialogue-act labels. These utterances are evenly distributed between the patients and therapists with 6.38K and 6.47K utterances, respectively. We split the dataset into a 70:20:10 ratio as the train, test, and validation sets, respectively. In contrast to regular patient-doctor conversations (e.g., SOAP), the dialogue sessions in HOPE are usually lengthy (∼59 utterances per session). Moreover, the utterances in these sessions are themselves significantly longer as compared to other conversational datasets, with the average length of a patient utterance being 103 words, whereas for the therapist, it is ∼124 words.
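As a concrete illustration of the agreement computation, Cohen's kappa can be implemented in a few lines; the label sequences below are toy examples for two annotators, not the actual HOPE annotations:

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa between two annotators' label sequences."""
    assert len(ann_a) == len(ann_b)
    n = len(ann_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Chance agreement from each annotator's label marginals.
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy annotations with hypothetical DAC labels (not the real HOPE data).
a = ["GT", "IRQ", "ID", "ACK", "IRQ", "ID"]
b = ["GT", "IRQ", "ID", "GC", "IRQ", "CD"]
print(round(cohens_kappa(a, b), 4))  # → 0.5862
```

For three annotators, pairwise kappas are typically averaged; Fleiss' kappa is the common alternative for larger rater pools.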
We represent a therapy session as a conversation dialogue consisting of a sequence of utterances (u_1, u_2, ..., u_N), where N is the number of utterances in the dialogue. These utterances are uttered alternately by the therapist and the patient in a session. The objective of SPARTA is to assign a correct dialogue-act label to every utterance in the dialogue. SPARTA is a transformer-based architecture that incorporates speaker-aware contextual information for dialogue-act classification. In our analysis of the HOPE dataset, we observed that a few of the dialogue-act labels are majorly associated with the patient, while a few others are related to the therapist. To model the speaker dynamics within a conversation dialogue, we consolidate a speaker-aware (SA) module in addition to a speaker-invariant (SI) module. The latter does not consider the distinction between a therapist and a patient utterance, while the former distinguishes between the two through a pre-trained speaker identification module. Moreover, we also propose a novel time-aware attention (TAA) mechanism that considers the positions of contextual utterances during the attention computation. We hypothesize that the recent past contextual utterances have higher significance than the distant past utterances; hence, as opposed to the standard attention mechanism, TAA focuses more on the nearby (local) utterances. For each utterance in a conversation dialogue, we extract the semantic representation through a pre-trained RoBERTa model [23], which is subsequently utilized to leverage the local (c^L) and global (c^G) contextual information within the dialogue. We incorporate a sliding-window based dynamic memory unit to compute the local context through a time-aware attention mechanism. In parallel, we employ a GRU layer to capture the dialogue history as the global context.
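A minimal sketch of the two context streams described above, with random vectors standing in for the RoBERTa utterance representations (the dimensions, weights, and window size are toy values, not the trained SPARTA parameters):

```python
import numpy as np

def gru_step(h, x, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update: fold the current utterance vector x into the dialogue state h."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde

d = hdim = 8  # toy dimensions; RoBERTa representations would be 768-dimensional
rng = np.random.default_rng(0)
params = [rng.standard_normal((hdim, d if i % 2 == 0 else hdim)) * 0.1 for i in range(6)]

dialogue = [rng.standard_normal(d) for _ in range(5)]  # stand-ins for utterance vectors
h = np.zeros(hdim)      # global context, updated utterance by utterance
window, w = [], 3       # sliding-window memory feeding the local context
for u in dialogue:
    h = gru_step(h, u, *params)       # dialogue-history state after utterance u
    window = (window + [u])[-w:]      # keep only the w most recent utterances
print(h.shape, len(window))  # → (8,) 3
```

In practice one would use torch.nn.GRU over the batched RoBERTa outputs; the loop above only makes the per-utterance state update and the bounded memory explicit.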
Our analysis reveals that a few dialogues contain utterances which discuss a topic (or an entity) that has occurred at the initial stage of the dialogue, and to correctly exploit the semantics of such an utterance, the global information is desirable. We repeat the process of local and global context extraction for both the speaker-aware and speaker-invariant setups. Finally, we combine these two representations with residual connections for the classification. Figure 3 shows a high-level architecture diagram of SPARTA. Utterance Representations. As mentioned above, SPARTA maintains two separate modules for capturing the speaker-aware and speaker-invariant information. For the speaker-invariant representations, we employ a pre-trained RoBERTa language model which is further fine-tuned on the DAC task. The speaker-aware module is a RoBERTa model fine-tuned on the speaker classification task: r^{SI}_t = RoBERTa_{SI}(u_t); r^{SA}_t = RoBERTa_{SA}(u_t). Local Context and Time-Aware Attention. At every point in a dialogue, the nearby utterances provide important clues for the prediction of a dialogue-act label for the current utterance. For example, if the previous utterance is an information request (IRQ), then there is a good chance that the next label should be either information delivery (ID) or yes-no-answer (YNA). Therefore, we exploit the local context maintained in a memory M[t−w : t] for each utterance u_t in the dialogue, where w is the fixed local-context size. We utilize TAA to learn the importance of contextual utterances based on their distance from the current position. At first, we pass the utterance representation (r_t), as computed by RoBERTa in the previous step, through a tanh activation layer to obtain the pooler output, and subsequently project it as the query (Q ∈ R^{1×d}) vector in the attention computation. On the other hand, the contextual memory M[t−w : t] is projected as the key (K ∈ R^{w×d}) and value (V ∈ R^{w×d}) matrices.
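The time-aware cross-attention over this memory can be sketched as follows. The key/value projections are omitted (the memory rows are used directly), so this illustrates the inverse-time scaling rather than the exact SPARTA computation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def time_aware_attention(query, memory):
    """Local context via time-aware attention: the score of the utterance
    j steps in the past is scaled by 1/j before the softmax."""
    w, d = memory.shape                # w past utterances of dimension d
    K = V = memory                     # key/value projections omitted in this sketch
    scores = (K @ query) / np.sqrt(d)  # standard scaled dot-product scores
    decay = 1.0 / np.arange(w, 0, -1)  # oldest row -> 1/w, newest row -> 1/1
    alpha = softmax(scores * decay)    # attention weights over the window
    return alpha @ V                   # weighted sum = local context c^L_t

rng = np.random.default_rng(1)
memory = rng.standard_normal((4, 8))  # sliding-window memory M[t-4:t], oldest first
query = rng.standard_normal(8)        # pooled representation of the current utterance
c_local = time_aware_attention(query, memory)
print(c_local.shape)  # → (8,)
```

Because the decay multiplies the scores before the softmax, the scores of older utterances are pulled toward zero, flattening their attention weights toward uniform, while the most recent utterance keeps its full contrast.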
Next, we encode the local context as follows: c^L_t = softmax((Q K^T) ⊙ τ) V; c^L_t ∈ R^d, where τ_j = 1/j, ∀ j : 1 ≤ j ≤ w, scales the score of the utterance j steps in the past. To extract the time-aware feature, we scale down the dot product between the query and the key by a monotonically decreasing function of time. The inverse function was chosen based on its empirical advantage, as shown in [2]. The hypothesis behind scaling down the dot product is that as we move deeper into the dialogue history, the influence of utterances on the dialogue-act reduces accordingly. Similar interaction dynamics were used in [47]; but to our knowledge, we introduce an inverse function for the first time to compute the fixed-window attention for the local context. Following the above procedure, we compute the local contexts for both the speaker-aware and speaker-invariant modules. Global Context. As the dialogue progresses, we maintain the global context of the dialogue through a GRU layer on top of the RoBERTa hidden representations. Fusion and Final Classification. Finally, we fuse the local and global contexts of the speaker-aware and speaker-invariant modules for the final classification. We also add residual connections for better gradient flow during backpropagation. Our validation results support the choice of concatenation as the fusion operation, which performs better than other operations such as global max-pooling, global mean-pooling, etc.

Figure 3: Architecture of SPARTA. For each utterance u_t, SPARTA computes the local context c^L_t through a time-aware attention (TAA) mechanism on the sliding-window memory unit and the current utterance. The dialogue-level global context is maintained using a GRU. Finally, the speaker-aware and speaker-invariant local and global contexts are fused for the task.

In this section, we report our experimental results, comparative study, and other analyses. Baselines. We choose the following existing systems as baselines. ▶ CASA [33]: It is a context-aware attention-based system for dialogue-act classification.
It uses RNNs at the dialogue and utterance levels and computes context-aware self-attention before the final classification. ▶ SA-CRF [39]: This recent baseline incorporates a CRF layer for the classification. Moreover, it consolidates the speaker-change information using a Bi-LSTM encoder. In addition to these recent baselines on dialogue-act classification, we also include other sequence-labelling classification systems. ▶ DRNN [42]: It is a novel Disconnected RNN architecture which incorporates position-invariant features for modelling. ▶ ProSeqo [15]: It efficiently handles short and long texts using dynamic recurrent projections. ProSeqo avoids the store-and-lookup of pre-trained word embeddings through context and locality-sensitive projections [35]. ▶ TextVDCNN [8]: This is a deep convolutional network with residual connections for text classification. The convolutional layers work at the character level, and k-max pooling is used to down-sample the output of the convolutional layers for classification. ▶ TextRNN [22]: This was the first work to integrate RNNs into the multi-task learning framework. We use the uniform-layer architecture as described in the paper. ▶ RoBERTa [23]: We use RoBERTa as a baseline in this work due to its superiority on various benchmarks. RoBERTa is similar to [9]: it is an encoder-only language model trained with a masked language modelling objective on vast amounts of unlabelled data in an unsupervised manner. Experimental Results. For the experiments, we randomly split the HOPE dataset in a 70:20:10 ratio for the train, test, and validation sets. To measure the performance of SPARTA and the other baseline systems, we compute macro-F1, weighted-F1, and accuracy scores. We implemented our system in PyTorch [30] and utilized the pre-trained models from the Huggingface Transformers library. The hyperparameters are listed in Table 6.
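The three reported metrics can be computed directly (equivalent to scikit-learn's accuracy_score and f1_score with macro/weighted averaging); the labels below are toy values, not actual HOPE predictions:

```python
from collections import Counter

def dac_metrics(y_true, y_pred):
    """Accuracy, macro-F1, and weighted-F1 for multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of gold instances per label
    f1 = {}
    for label in labels:
        tp = sum(t == p == label for t, p in zip(y_true, y_pred))
        fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1[label] = 2 * tp / denom if denom else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    macro_f1 = sum(f1.values()) / len(labels)            # unweighted label mean
    weighted_f1 = sum(f1[l] * support[l] for l in labels) / len(y_true)
    return accuracy, macro_f1, weighted_f1

# Toy labels only; not the actual HOPE predictions.
acc, macro, weighted = dac_metrics(["ID", "ID", "IRQ", "GT"], ["ID", "IRQ", "IRQ", "GT"])
print(acc, round(macro, 4), weighted)  # → 0.75 0.7778 0.75
```

The macro score treats every label equally, which is why the under-represented labels discussed below can pull it well under the weighted-F1 and accuracy figures.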
The experimental results of SPARTA are presented in Table 7 in the appendix. SPARTA incorporates three major components: the local context, the global context, and the speaker-aware module. The SPARTA-TAA system obtains the best scores of 60.29 macro-F1, 64.53 weighted-F1, and 64.75% accuracy, as reported in the last row of Table 2 (appendix). We also report our observations from the ablation study on all three key factors (contextual information, time-aware attention, and speaker dynamics) in the ablation section of the appendix. We also present the label-wise performance of SPARTA in Table 4. We can observe that SPARTA consistently yields good scores for the majority of the dialogue-acts, except for Acknowledgement (ACK), where it records an F1-score of merely 47.86%. Even for the under-represented labels (ORQ, NA, and PA) in HOPE, SPARTA reports good F1-scores of 59.09%, 64.38%, and 55.84%, respectively. Comparative Analysis: We compare SPARTA with various existing systems and other baselines. The comparative analysis is reported in Table 3. Based on the type of modelling, we categorize the baselines into three groups - utterance-driven (U), utterance + global context driven (U + GC), and utterance + global context + speaker-aware driven (U + GC + SA). Comparatively, SPARTA incorporates the local context in addition to the utterance, global context, and speaker dynamics (U + LC + GC + SA). In the first category, the standard RoBERTa model attains the best macro-F1, weighted-F1, and accuracy of 43.97, 49.13, and 52.97%, respectively. In comparison, CASA [33] yields improved weighted-F1 and accuracy scores of 55.95% (+6.82%) and 58.46% (+5.49%), respectively, with the global context as additional information. Finally, we experiment with SA-CRF [39], which also includes the speaker dynamics for dialogue-act classification; however, its performance on HOPE is not at par with CASA [33].
It reports 35.97, 24.20, and 45.07% macro-F1, weighted-F1, and accuracy, respectively. In comparison, SPARTA-TAA obtains significant improvements over all baselines. It reports improvements of +8.64%, +8.58%, and +6.29% in macro-F1 (60.29), weighted-F1 (64.53), and accuracy (64.75%), respectively, as compared to CASA, suggesting that the incorporation of local context is extremely effective. Note that ProSeqo [15] and CASA [33] are currently the state-of-the-art on the Switchboard dialogue-act corpus benchmark 10 ; yet they report inferior scores on HOPE compared to SPARTA. Moreover, we also report the mean of the 3-fold cross-validation results for both SPARTA-MHA and SPARTA-TAA, and the results are consistent with the train-val-test split case. We also perform a statistical significance t-test comparing SPARTA-TAA and the best performing baseline (CASA). We observe that our results are significant with > 95% confidence across macro-F1 and weighted-F1 scores. Error Analysis: In this section, we present a two-way error analysis of SPARTA in terms of quantitative and qualitative evaluations. Quantitative analysis: We report the confusion matrix for SPARTA-TAA in Figure 4. We observe three pairs with significant error rates (≥ 25%): YNQ:IRQ (26%), OD:ID (43%), and ID:ACK (28%). For the prediction of information delivery (ID), SPARTA is confused most of the time with other classes: 19% with PA, 20% with NA, 43% with OD, 13% with GT, 17% with GC, 14% with CRQ, 28% with ACK, and 22% with CD. We can relate this behaviour to the diversity of the utterances with the ID tag, i.e., these utterances generally contain fair segments resembling other dialogue-acts (e.g., 'Yeah, that's something I always do.' could easily be confused with PA). The other prominent error case is found in the IRQ:YNQ pair; we observe a confusion of 26% between IRQ and YNQ utterances because of the versatile questioning behaviour. For the remaining cases, the error rates are nominal.
Thus, we posit that SPARTA can be further improved with a more balanced dataset. Qualitative analysis: Table 5 shows a sample session along with the actual and predicted dialogue-act labels for SPARTA and the best baseline model (i.e., CASA). Due to the length of the conversation, we truncate some of the utterances in between; the gist of the conversation is that the patient is stressed about losing her job and her drinking issues, and the therapist is trying to understand the core problem. The conversation mostly contains information-request and information-delivery utterances, with a few other dialogue-acts in between (e.g., CRQ, GT, etc.). We observe that for the first three utterances, SPARTA and the baseline are consistent with the actual labels. For the fourth utterance, the SPARTA-MHA model misclassifies the utterance as CD, although the patient is clearly providing objective information that is necessary for further conversation. In the seventh utterance, the therapist asks for more clarification about the drinking habits, and the patient provides it; however, the CASA and SPARTA-TAA models wrongly classify this utterance as ID, even though no objective information is being provided. Next, we notice that when the patient talks about stopping and offers her opinion that she is not being unrealistic, the SPARTA-MHA and CASA models wrongly label this utterance as ID. In the next utterance, the therapist acknowledges this opinion, but the CASA model predicts the wrong label. Thus, not only is the SPARTA-TAA model able to capture the semantics of the utterances better, it also utilizes contextual information more effectively by relating past information about the speaker to the current utterance.
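A session-level comparison like the one in Table 5 can be generated mechanically once per-utterance gold and predicted labels are available. The sketch below uses made-up label sequences (not actual model outputs) to show the idea:

```python
# Hypothetical per-utterance labels for one session: gold annotations and
# predictions from two models (names mirror the paper's comparison).
gold = ["GT", "GT", "IRQ", "ID", "CRQ", "ID"]
sparta_taa = ["GT", "GT", "IRQ", "ID", "CRQ", "ID"]
casa = ["GT", "GT", "IRQ", "CD", "ID", "ID"]

def disagreements(gold, pred):
    """Return (utterance index, gold label, predicted label) for every error."""
    return [(i, g, p) for i, (g, p) in enumerate(zip(gold, pred)) if g != p]

print("SPARTA-TAA errors:", disagreements(gold, sparta_taa))
print("CASA errors:", disagreements(gold, casa))
```

Listing only the disagreements makes it easy to spot the kind of ID/CD and CRQ/ID confusions discussed in the qualitative analysis.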
General Discussion: The work presented in this paper is motivated solely by the dire need to understand conversations that occur in counselling sessions and to design solutions that help therapists better understand the intents of their patients. However, the proposed model can easily be adapted to other domains (such as regular chit-chat conversations) as well. To ensure that we do not deviate from the prime objective of this work, we restrict ourselves to exploring dialogue-act classification in counselling conversations only. We aim to tackle a very sensitive and pervasive public-health crisis. We transcribe the data from publicly available counselling videos. The automatic transfer of utterances from the speech modality to text causes some information loss, though we tried our best to recover it through manual intervention. Moreover, we consulted mental-health professionals and linguists in preparing the annotation guidelines. However, annotator bias cannot be ruled out completely. The names of the patients and therapists involved in these sessions have been systematically masked. Another important aspect of the current work is that the majority of the sessions in HOPE involve mental-health professionals and patients based in the United States; hence, the effectiveness of SPARTA on data from other geographical or demographic regions may vary. We understand that building computational models in mental-health avenues has high stakes associated with it, and ethical considerations therefore become necessary. No technology will work perfectly in solving problems related to mental health [24]. It is important to note that we do not make any diagnostic claims. Further, the deployment of any such technology must be done keeping in mind the safety risks and mitigating any sources of bias that may arise.
Paying heed to the consequences of the COVID-19 pandemic on mental health, in this paper, we drew attention to the much-deserved research on dialogue systems for mental-health counselling. To this end, we collected and developed the HOPE dataset for dialogue-act classification in dyadic counselling conversations. We defined twelve dialogue-act labels to cater to the requirements of counselling sessions. In total, we annotated ∼12.9K utterances across 212 sessions. We also proposed SPARTA, a novel transformer-based speaker- and time-aware joint contextual learning model for dialogue-act classification. SPARTA utilizes the global and local context in speaker-aware and speaker-invariant setups, while also using a novel memory-driven time-aware attention mechanism to leverage the local context. Our extensive ablation study and comparative analysis established the superiority of SPARTA over several existing models. In the future, we would like to extend our effort in the development of dialogue systems for mental-health counselling by including other crucial tasks such as emotion recognition. Furthermore, in the quest for explanation, we visualize the speaker-aware and speaker-invariant utterance representations in Figure 5. We employ principal component analysis to project the representations into 2D. We observe that the speaker-aware representation distinguishes between the patient's and therapist's utterances quite well; in contrast, the patient and therapist utterances are mixed in the speaker-invariant representation. Considering the underlying task, where the therapist and the patient make the major contributions to speaker-initiative and speaker-responsive dialogue-acts, the speaker-aware representation provides crucial assistance to SPARTA. We use the PyTorch and PyTorch Lightning frameworks for all our experiments, and extensively use the Hugging Face library to implement transformer-based NLP models.
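The 2D projection described for Figure 5 follows the standard PCA recipe. A minimal sketch with scikit-learn, where the representation matrix and speaker tags are random stand-ins for the model's actual utterance encodings (all variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for utterance representations: in the paper these would come from
# the speaker-aware (or speaker-invariant) encoder, one vector per utterance.
reps = rng.normal(size=(200, 768))
speakers = np.array(["therapist", "patient"] * 100)

# Project the high-dimensional representations to 2D for visualization.
coords = PCA(n_components=2).fit_transform(reps)

# Group the 2D points by speaker, ready to feed into a scatter plot.
for spk in ("therapist", "patient"):
    pts = coords[speakers == spk]
    print(spk, pts.shape)
```

Plotting the two groups in different colours is what reveals whether the representation separates therapist and patient utterances, as observed for the speaker-aware setup.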
B.0.1 Hardware Utilization: The complete model training was done on a Linux machine with the following specifications: • GPU: Tesla V100. To verify the effectiveness of time-aware attention in computing the local context, we also report results with the standard multi-head attention (MHA) mechanism. The first set of results (i.e., SPARTA-BS, the baseline version of SPARTA) in Table 7 represents the absence of local context in SPARTA. With the global context only, it yields macro and weighted F1-scores of 51.83 and 54.98, respectively, and 57.70% accuracy. On the other hand, incorporating speaker-aware information into the system improves the trio of accuracy, macro, and weighted F1-scores by 1-2 points, suggesting a positive impact of speaker dynamics in SPARTA. Subsequently, we introduce the local context with multi-head attention in SPARTA-MHA. In comparison with SPARTA-BS, SPARTA-MHA reports 52.24 macro-F1 and 55.49% weighted-F1 with both the local and global contexts. Furthermore, with the inclusion of the speaker-aware module, we obtain 58.16 (+5.92%) and 63.26 (+7.77%) macro and weighted F1-scores, respectively, as compared to SPARTA-MHA without the speaker-aware module; it also yields better accuracy at 63.45% (+5.22%). These results clearly indicate the importance of the local-context and speaker-aware modules for dialogue-act classification. In the final stage of our experiments, we replace the multi-head attention with the proposed time-aware attention (TAA) mechanism. We summarize our observations as follows: • Contextual information: Dialogue-act labels in a counselling conversation depend not only on the abstract information of the dialogue but also on the utterances in the immediate vicinity of the current utterance. As can be observed from Table 7, the presence of both the global and local contextual information plays an important role in SPARTA.
Moreover, the absence of either component degrades the overall performance. This corroborates that they carry distinct and diverse contextual information. • Time-aware attention: The comparison between the standard multi-head attention and the novel time-aware attention highlights the importance of attending over the relative positions of utterances. As stated earlier, extensive experimental results show that TAA yields better performance compared to MHA for all possible configurations. • Speaker dynamics: For all combinations, we observe a performance drop of 3-4% without the speaker information. This is particularly apparent as a major portion of the counselling conversation is driven by the therapist; thus, the speaker-initiative dialogue-acts have higher relevance to the therapist's utterances. Therefore, our intuition of incorporating the speaker-aware module as a critical component in SPARTA is also corroborated by the empirical evidence. In earlier phases of our experiments, we also considered the role of emotion in dialogue-act classification. We annotated a subset of our dataset with three emotion classes: 'positive', 'negative', and 'neutral'. We found that around ∼70% of the patients' utterances belonged to the 'negative' class, whereas ∼90% of the therapists' utterances belonged to the 'neutral' class. Due to such imbalance, this data was not used in the final version of our proposed architecture.
References:
Contextual dialogue act classification for open-domain conversational agents
Patient Subtyping via Time-Aware LSTM Networks
In SIGKDD.
Multi-task dialog act and sentiment recognition on Mastodon
Dialogue act recognition via CRF-attentive structured network
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
A Coefficient of Agreement for Nominal Scales
Guiding attention in Sequence-to-sequence models for Dialogue Act prediction
Very Deep Convolutional Networks for Text Classification
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Assessing population-level symptoms of anxiety, depression, and suicide risk in real time using NLP applied to social media data
SWITCHBOARD: Telephone speech corpus for research and development
The distress analysis interview corpus of human and computer interviews
Dialogue act classification using a Bayesian approach
Learning to Detect Relevant Contexts and Knowledge for Response Selection in Retrieval-Based Dialogue Systems
ProSeqo: Projection Sequence Networks for On-Device Text Classification
Can Your Phone Be Your Therapist? Young People's Ethical Perspectives on the Use of Fully Automated Conversational Agents (Chatbots) in Mental Health Support
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
Backpropagation Applied to Handwritten Zip Code Recognition
Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks
Natural language processing of clinical mental health notes may add predictive value to existing suicide risk models
DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset
Recurrent Neural Network for Text Classification with Multi-Task Learning
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Assessing the accuracy of automatic speech recognition for psychotherapy
SNAP-BATNET: Cascading author profiling and social network graphs for suicide ideation detection on social media
A Demonstration of Dialogue Processing in SimSensei Kiosk
Eduard Vieta, Antonio Vita, and Celso Arango. 2020.
Position Paper: How mental health care should change as a consequence of the COVID-19 pandemic
Empathy-driven Arabic Conversational Chatbot
Context-aware neural-based dialog act classification on automatically generated transcriptions
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Psychological aspects of natural language use: our words, our selves
DCR-Net: A Deep Co-Interactive Relation Network for Joint Dialog Act Recognition and Sentiment Classification
Dialogue Act Classification with Context-Aware Self-Attention
ProjectionNet: Learning Efficient On-Device Deep Networks Using Neural Projections
Dialogue act classification using language models
Learning Internal Representations by Error Propagation
Emotion Aided Dialogue Act Classification for Task-Independent Conversations in a Multimodal Framework
Speaker-change Aware CRF for Dialogue Act Classification
Discourse studies: A multidisciplinary introduction
Disconnected Recurrent Neural Networks for Text Categorization
Investigating Mental Health of US College Students During the COVID-19 Pandemic: Cross-Sectional Survey Study
Emotion-Aware Chat Machine: Automatic Emotional Response Generation for Human-like Emotional Interaction
ELIZA - A Computer Program For the Study of Natural Language Communication Between Man and Machine
Modeling Long-Range Context for Concurrent Dialogue Acts Recognition
Learning interaction dynamics with an interactive