key: cord-0485716-ym57j54w authors: Fu, Ziwang; Liu, Feng; Wang, Hanyang; Shen, Siyuan; Zhang, Jiahao; Qi, Jiayin; Fu, Xiangling; Zhou, Aimin title: LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences date: 2021-12-03 journal: nan DOI: nan sha: 31b3ccf9d96e0aafa1e24877fa1cab79b04e477c doc_id: 485716 cord_uid: ym57j54w

Learning modality-fused representations and processing unaligned multimodal sequences are meaningful and challenging in multimodal emotion recognition. Existing approaches use directional pairwise attention or a message hub to fuse the language, visual, and audio modalities. However, those approaches introduce information redundancy when fusing features and are inefficient because they do not consider the complementarity of the modalities. In this paper, we propose an efficient neural network to learn modality-fused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition from unaligned multimodal sequences. Specifically, we first perform feature extraction for the three modalities respectively to obtain the local structure of the sequences. Then, we design a novel transformer with cross-modal blocks (CB-Transformer) that enables complementary learning of different modalities, mainly divided into local temporal learning, cross-modal feature fusion, and global self-attention representations. In addition, we concatenate the fused features with the original features to classify the emotions of the sequences. Finally, we conduct word-aligned and unaligned experiments on three challenging datasets: IEMOCAP, CMU-MOSI, and CMU-MOSEI. The experimental results show the superiority and efficiency of our proposed method in both settings. Compared with the mainstream methods, our approach reaches the state of the art with a minimal number of parameters.

Multimodal emotion recognition has attracted increasing attention due to its robustness and remarkable performance (Nguyen et al. 2018; Dai et al. 2021b). The goal of this task is to recognize human emotions from video clips, which involves three main modalities: natural language, facial expressions, and audio signals. Emotion recognition is applied in areas such as social robotics, educational quality assessment, and healthcare, where the analysis of emotion has been particularly important during COVID-19 (Chandra and Krishna 2021). Multimodality provides a wealth of information compared to a single modality and can more fully reflect emotional states. However, because sequences from different modalities are sampled at different rates, the collected multimodal sequences are often unaligned. Manually aligning different modalities is labor-intensive and requires domain knowledge (Tsai et al. 2019b; Pham et al. 2019). In addition, most high-performing networks cannot achieve a balance between the number of parameters and performance. To this end, we focus on learning representations of fused modalities and efficiently performing multimodal emotion recognition on unaligned sequences. In previous works (Sahay et al. 2020; Rahman et al. 2020; Hazarika, Zimmermann, and Poria 2020; Yu et al. 2021; Dai et al. 2021a), Transformers (Vaswani et al. 2017) are mostly used for unaligned multimodal emotion recognition. Typically, Tsai et al.
(2019a) proposed the Multimodal Transformer (MulT) to fuse information from different modalities in unaligned sequences without explicitly aligning the data. The approach learns the interactions between pairs of elements through a cross-modal attention module that iteratively reinforces the features of one modality with the features of other modalities. Recently, Lv et al. (2021) proposed Progressive Modality Reinforcement (PMR), which introduces a message hub to exchange information with each modality. The approach uses a progressive strategy to exploit high-level source-modality information for fusing unaligned multimodal sequences.

However, MulT only considers the fusion of features between modality pairs, ignoring the coordination of the three modalities. Besides, fusing the modal features in a pairwise manner produces redundant information. For example, the visual representations are repeated twice in the concatenation of visual-language features and visual-audio features. PMR considers the association among the three modalities, but fusing the modal features through a centralized message hub sacrifices efficiency. To be more specific, the information of the three modalities needs to interact closely and recursively with the message hub to ensure the integrity of the features, and such an operation requires a huge number of parameters. Meanwhile, this approach does not take into account the complementarity of the modal information, even though feature fusion can be accomplished simply through the interactions between modalities, without introducing a third party. Moreover, because they rely on pre-trained models, recent methods have too many parameters to be applicable in realistic scenarios.

Therefore, to address the above limitations, we propose a neural network to learn modality-fused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition from unaligned multimodal sequences. Figure 2 shows the overall architecture of LMR-CBT. Specifically, we first perform feature extraction for the three modalities respectively to obtain the local structure of the sequences. For the audio and visual modalities, we obtain information about adjacent elements through 1D temporal convolution. For the language modality, we use Bi-directional Long Short-Term Memory (BiLSTM) to capture the long-term dependencies and the contextual information in the text. After obtaining feature representations of the three modalities, we design a novel transformer with cross-modal blocks (CB-Transformer) to achieve complementary learning of the different modalities, which is mainly divided into local temporal learning, cross-modal feature fusion, and global self-attention representations. In the local temporal learning part, the audio and visual features are passed through the transformer to obtain representations that capture dependencies among adjacent elements. In the cross-modal feature fusion part, a residual-based modal interaction approach is used to obtain the fused features of the three modalities. In the global self-attention representations part, the transformer learns high-level representations within the fused modality. The CB-Transformer can adequately represent the fused features without losing the original features and can efficiently handle unaligned multimodal sequences. Finally, we concatenate the modal fusion features with the original features to obtain the emotion categories.
We perform word-aligned and unaligned experiments on three mainstream public datasets for multimodal emotion recognition: IEMOCAP (Busso et al. 2008), CMU-MOSI (Zadeh et al. 2016b), and CMU-MOSEI (Zadeh et al. 2018). The experimental results demonstrate the superiority of our proposed method. Moreover, we achieve a better trade-off between performance and efficiency. Compared with the mainstream methods, our approach reaches the state of the art with a minimal number of parameters. We summarize our three main contributions as follows:

• We propose an efficient neural network to learn modality-fused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition from unaligned multimodal sequences (with only 0.41M parameters), which can effectively fuse the interactive information of the three modalities.

• We design a novel transformer with cross-modal blocks (CB-Transformer) to achieve complementary learning of different modalities, which is mainly divided into local temporal learning, cross-modal feature fusion, and global self-attention representations. The CB-Transformer can adequately represent the fused features without losing the original features and can efficiently handle unaligned multimodal sequences.

• We obtain a better trade-off between performance and efficiency on three challenging datasets. Compared with the existing state-of-the-art methods, LMR-CBT achieves comparable or even higher performance with a minimal number of parameters.

Multimodal emotion recognition has attracted a lot of attention in recent years. This task requires the fusion of cross-modal information from temporal signals. According to the approach to feature fusion, existing methods can be divided into early fusion (Morency, Mihalcea, and Doshi 2011; Pérez-Rosas, Mihalcea, and Morency 2013), late fusion (Zadeh et al. 2016a; Wang et al. 2017), and model fusion. Previous works have focused on early or late fusion strategies. Early fusion strategies fuse shallow inter-modal features and focus on mixed-modal feature processing, while late fusion strategies estimate the confidence of each modality and then coordinate the modalities to make joint decisions. Although these fusion strategies obtain better performance than single-modality learning, they do not explicitly consider the intrinsic connections between sequence elements from different modalities, which are essential for effective multimodal fusion. Subsequently, model fusion has gradually been adopted and more sophisticated models have been proposed. Wang et al. (2019) used visual and auditory features to shift words in text with attention. Rahman et al. (2020) introduced a multimodal adaptive gate that integrates visual and acoustic information into a large pre-trained language model. Hazarika, Zimmermann, and Poria (2020) incorporated a combination of losses, including distributional similarity, orthogonal loss, reconstruction loss, and task prediction loss, to learn modality-invariant and modality-specific representations. Dai et al. (2021b) introduced sparse cross-attention to achieve end-to-end emotion recognition. Dai et al. (2021a) proposed a multi-task learning approach using weak supervision for multimodal emotion recognition. Yu et al. (2021) proposed a way to fuse features from different modalities by combining self-supervised and multi-task learning.
Although self-supervised and multi-task learning can effectively alleviate the problem of small sample sizes, performing efficient cross-modal interactions remains a tremendously challenging issue. Therefore, the main motivation of this work is to perform unaligned multimodal emotion recognition with a minimalist design, excluding tricks such as self-supervision or multi-task learning.

In order to fuse the information of unaligned multimodal sequences, early works explored the dependencies between modal elements based on the maximum modal information criterion (Zeng et al. 2005). However, the performance of those early approaches is far from satisfactory due to their shallow model structures. Tsai et al. (2019a) proposed the Multimodal Transformer (MulT) to learn inter-modal correlations using a cross-modal attention mechanism. Sahay et al. (2020) proposed Low Rank Fusion based Transformers (LMF-MulT), which design LMF units for efficient modal feature fusion on top of previous work. Lv et al. (2021) proposed the Progressive Modality Reinforcement (PMR) method, which uses a message hub to interact with the three modalities and adopts a progressive strategy to fuse unaligned multimodal temporal sequences by exploiting high-level source-modality information. Although these previous attempts have improved performance on unaligned multimodal emotion recognition, they still face the problems of fusing cross-modal features effectively and of ensuring that information is not lost. In this paper, we mainly focus on reaching an accuracy-parameter balance through a novel, information-redundancy-free modal fusion strategy.

The multimodal emotion recognition task mainly involves three modalities: language (L), visual (V), and audio (A). We denote the feature sequences obtained through feature extraction as $X_{\{L,V,A\}} \in \mathbb{R}^{T_{\{l,v,a\}} \times d_{\{l,v,a\}}}$, where $T_{(\cdot)}$ represents the length of the sequence and $d_{(\cdot)}$ represents the dimension of the extracted features. Our goal is to efficiently extract features of the different modalities from the unaligned multimodal sequences and to obtain a fused representation across modalities. We expect the multimodal representation to accurately predict the emotion category of the sequence.

We propose a neural network to learn modality-fused representations with CB-Transformer (LMR-CBT); the overall architecture of the network is shown in Figure 2, where the cross-modal blocks and the transformer encoder of CB-Transformer are located on the two sides of the figure, respectively. Next, we describe the network in detail.

Feature Preprocessing. The feature preprocessing is performed separately according to the temporal structure of each modality. For the audio and visual modalities, to ensure that each element in the input sequence has sufficient perception of its neighboring elements, we pass the two modalities through separate 1D temporal convolutions with different kernel sizes:

$\hat{X}_{\{V,A\}} = \text{Conv1D}(X_{\{V,A\}}, k_{\{V,A\}}) \in \mathbb{R}^{T_{\{v,a\}} \times d_f}$ (1)

where $k_{\{V,A\}}$ denotes the convolution kernel sizes and $d_f$ is the common feature dimension. For the language modality, we consider that language is characterized by long-range dependencies and contextual associations. BiLSTM can better capture bidirectional long-range semantic dependencies and identify the emotional representations of language. We use a two-layer BiLSTM for feature extraction:

$\hat{X}_L = \text{LN}(\text{BiLSTM}(X_L)) \in \mathbb{R}^{T_l \times d_f}$ (2)

where LN represents layer normalization. The purpose of layer normalization is to stabilize the distribution of each layer so that subsequent layers can learn the content of the previous layer in a stable manner.
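A minimal PyTorch sketch of this preprocessing step is given below, assuming the raw feature dimensions used in our experiments (74-dimensional COVAREP audio, 35-dimensional Facet visual, and 300-dimensional GloVe embeddings) and illustrative values for $d_f$ and the kernel sizes; the actual implementation may differ in these choices.

```python
# A minimal sketch of the feature preprocessing stage (illustrative choices of d_f and kernels).
import torch
import torch.nn as nn

class FeaturePreprocessing(nn.Module):
    def __init__(self, d_a=74, d_v=35, d_l=300, d_f=40, k_a=3, k_v=3):
        super().__init__()
        # 1D temporal convolutions give each audio/visual element a view of its neighbors (Eq. 1).
        self.conv_a = nn.Conv1d(d_a, d_f, kernel_size=k_a, padding=k_a // 2)
        self.conv_v = nn.Conv1d(d_v, d_f, kernel_size=k_v, padding=k_v // 2)
        # A two-layer BiLSTM captures bidirectional long-range dependencies in the text (Eq. 2).
        self.bilstm = nn.LSTM(d_l, d_f // 2, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.norm_l = nn.LayerNorm(d_f)  # layer normalization stabilizes the language features

    def forward(self, x_a, x_v, x_l):
        # x_a: (B, T_a, d_a), x_v: (B, T_v, d_v), x_l: (B, T_l, d_l); T_a, T_v, T_l are unaligned.
        h_a = self.conv_a(x_a.transpose(1, 2)).transpose(1, 2)  # (B, T_a, d_f)
        h_v = self.conv_v(x_v.transpose(1, 2)).transpose(1, 2)  # (B, T_v, d_f)
        h_l = self.norm_l(self.bilstm(x_l)[0])                  # (B, T_l, d_f)
        return h_a, h_v, h_l

# Sequence lengths stay unaligned; only the feature dimension is pre-aligned to d_f.
pre = FeaturePreprocessing()
a, v, l = pre(torch.randn(2, 400, 74), torch.randn(2, 500, 35), torch.randn(2, 50, 300))
print(a.shape, v.shape, l.shape)  # (2, 400, 40) (2, 500, 40) (2, 50, 40)
```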
Through the above operations, we can, on the one hand, aggregate the features of adjacent elements and, on the other hand, pre-align the feature dimensions of the unaligned multimodal data to a common dimension $d_f$.

Transformer with Cross-modal Blocks. We design a novel transformer with cross-modal blocks (CB-Transformer). CB-Transformer is divided into three parts: local temporal learning, cross-modal feature fusion, and global self-attention representations. This module has two important components, the transformer encoder and the residual-based cross-modal fusion, denoted TransEncoder and CrossModal, respectively. We discuss both components in detail in Sections 3.3 and 3.4.

In local temporal learning, we use the transformer encoder, which has become increasingly popular in areas such as computer vision and natural language processing due to its remarkable performance. We use this component to obtain temporal representations of the audio and visual modality features produced by the 1D temporal convolutions:

$Z^{[0]}_{\{V,A\}} = \hat{X}_{\{V,A\}} + \text{PE}(T_{\{v,a\}}, d_f)$ (3)
$F_{\{V,A\}} = \text{TransEncoder}(Z^{[0]}_{\{V,A\}})$ (4)

where $\text{PE}(T_{\{v,a\}}, d_f) \in \mathbb{R}^{T_{\{v,a\}} \times d_f}$ computes the embedding for each position index, $Z^{[0]}_{\{V,A\}}$ represents the position-embedded input, and TransEncoder represents the transformer encoder, which we discuss in detail in Section 3.3. We use $F_{\{V,A\}}$ to denote the result of local temporal learning.

In the cross-modal feature fusion part, we design a residual-based cross-modal fusion method, which takes $F_{\{V,A\}}$ and $\hat{X}_L$ as inputs and outputs the fused representation of the three modalities. The residual structure ensures that information is not lost:

$\hat{X}_F = \text{CrossModal}(F_{\{V,A\}}, \hat{X}_L)$ (5)

where CrossModal represents the residual-based cross-modal fusion, which we discuss in detail in Section 3.4, and $\hat{X}_F$ denotes the fused features. We believe that the fused representation not only carries information from the language modality but also integrates information from all three modalities, ensuring effective interaction of information.

Algorithm 1: Cross-modal fusion. Input: the audio and visual representation after local temporal learning, $F_{\{V,A\}} \in \mathbb{R}^{d_f}$; the language representation after BiLSTM processing, $\hat{X}_L \in \mathbb{R}^{T_l \times d_f}$; the batch size $bs$. Output: the features $\hat{X}_F \in \mathbb{R}^{T_l \times d_f}$ that fuse the representations of the three modalities with the original text representation. For each sample in the batch, the softmax-weighted fused features are computed, collected, and concatenated to form $\hat{X}_F$.

Similarly, in the global self-attention representations part, a transformer encoder is used to extract the representation of the fused features. Through global self-attention, we obtain high-level complementary representations of the fused modalities:

$F_F = \text{TransEncoder}(\hat{X}_F)$ (6)

where $F_F$ represents the result of global self-attention learning on the fused representation.

Prediction. Finally, we carry out emotion category prediction. Specifically, we concatenate the fused representation with the original audio and visual representations to obtain $I = [F_F, F_A, F_V]$. After that, we obtain the final emotion category output through a two-layer fully connected network:

$\hat{y} = W_2\,\sigma(W_1 I + b_1) + b_2 \in \mathbb{R}^{d_{out}}$ (7)

where $d_{out}$ is the output dimension (the number of emotion categories), $W_1$ and $W_2$ are the weight matrices of the two layers, $b_1$ and $b_2$ are the biases, and $\sigma$ denotes the ReLU activation function.
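To make the fusion and prediction steps concrete, the following is a minimal PyTorch sketch of CrossModal (Eq. 5; see also Eq. 11 in Section 3.4) and of the prediction head (Eq. 7), assuming that the softmax output of Eq. (11) gates $\hat{X}_L$ multiplicatively before the residual connection, that $F_{\{V,A\}}$ is pooled into a single $d_f$-dimensional vector, and that sequences are mean-pooled over time before classification; the actual implementation may differ in these details.

```python
# Sketch of the residual-based cross-modal fusion and the prediction head
# (the pooling choices and the gate-plus-residual combination are assumptions).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_f=40):
        super().__init__()
        self.proj_l = nn.Linear(d_f, d_f)   # linear projection of the language features
        self.proj_av = nn.Linear(d_f, d_f)  # linear projection of the audio/visual features

    def forward(self, x_l, f_av):
        # x_l: (B, T_l, d_f) language features; f_av: (B, d_f) pooled audio/visual features.
        attn = torch.softmax(
            torch.tanh(self.proj_l(x_l) + self.proj_av(f_av).unsqueeze(1)), dim=-1)  # Eq. (11)
        return attn * x_l + x_l  # residual connection back to the language representation

class PredictionHead(nn.Module):
    def __init__(self, d_f=40, n_classes=4):
        super().__init__()
        self.fc1 = nn.Linear(3 * d_f, d_f)    # first layer of Eq. (7), on I = [F_F, F_A, F_V]
        self.fc2 = nn.Linear(d_f, n_classes)  # second layer maps to the emotion categories

    def forward(self, f_f, f_a, f_v):
        # Mean-pool each sequence over time before concatenation (a simplifying assumption).
        i = torch.cat([f_f.mean(dim=1), f_a.mean(dim=1), f_v.mean(dim=1)], dim=-1)
        return self.fc2(torch.relu(self.fc1(i)))

# Usage with unaligned sequence lengths (the local temporal learning and global
# self-attention transformer encoders are omitted here for brevity):
fuse, head = CrossModalFusion(), PredictionHead()
x_l = torch.randn(2, 50, 40)                                  # language features
f_a, f_v = torch.randn(2, 400, 40), torch.randn(2, 500, 40)   # audio/visual features
f_av = torch.cat([f_a, f_v], dim=1).mean(dim=1)               # pooled F_{V,A} (assumption)
x_f = fuse(x_l, f_av)                                         # fused features: (2, 50, 40)
logits = head(x_f, f_a, f_v)                                  # (2, n_classes)
```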
We now introduce the details of the transformer encoder used for both local temporal learning and global self-attention representations, as shown on the right side of Figure 2. Firstly, following (Vaswani et al. 2017), we inject the temporal order of the sequences using sinusoidal position embeddings (PE). We encode the positional information of a sequence of length T via sin and cos functions with frequencies dictated by the feature indices:

$\text{PE}_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_f}}\right), \quad \text{PE}_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_f}}\right)$ (8)

where $pos = 1, \ldots, T$ indexes the positions and $i$ indexes the feature dimensions.

Next, the transformer encoder is mainly composed of self-attention, feed-forward, and Add&Norm layers. Self-attention is the core of the transformer encoder:

$\text{Attention}(Q^{[i]}_m, K^{[i]}_m, V^{[i]}_m) = \text{softmax}\!\left(\frac{Q^{[i]}_m (K^{[i]}_m)^{\top}}{\sqrt{d_k}}\right) V^{[i]}_m, \quad m \in \{F, V, A\}$ (9)

where the queries, keys, and values for each modality $m \in \{F, V, A\}$ are obtained from different projection spaces with different parameter matrices, and $i = 1, \ldots, D$ indexes the layers of transformer attention. The feed-forward layer is a two-layer fully connected layer whose first layer uses the ReLU activation function:

$\text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$ (10)

Our residual-based cross-modal fusion method (shown on the left side of Figure 2) can effectively fuse the information of the three modalities with little information loss. Specifically, the method accepts two inputs, $\hat{X}_L$ and $F_{\{V,A\}}$. We obtain mapped representations of the two inputs through a linear projection, then add them and apply a tanh activation. Finally, the fused representation $\hat{X}_F$ is obtained through a softmax. We believe that the final fused information contains not only the complementary information of the three modalities but also the features of the language modality:

$\hat{X}_F = \text{softmax}\!\left(\tanh\!\left(\mathcal{L}(\hat{X}_L) + \mathcal{L}(F_{\{V,A\}})\right)\right) \in \mathbb{R}^{T_l \times d_f}$ (11)

where $\mathcal{L}$ stands for a linear projection. In this process, in order to alleviate the information loss of the language features, we use a residual connection between the fused representation and the original language representation. Algorithm 1 summarizes the entire process.

In this paper, we use three mainstream multimodal emotion recognition datasets: IEMOCAP, CMU-MOSI, and CMU-MOSEI. The experiments are conducted under both the word-aligned and unaligned settings. The code will be publicly available after the paper is accepted.

IEMOCAP. IEMOCAP (Busso et al. 2008) is a multimodal emotion recognition dataset that contains 151 videos along with the corresponding transcripts and audio. In each video, two professional actors conduct dyadic conversations in English. Its pre-determined data segmentation contains 2,717 training samples. Following (Wang et al. 2019; Dai et al. 2020), we take four emotion categories: neutral, happy, sad, and angry. Moreover, this is a multi-label task (e.g., a person can feel sad and angry at the same time). We report the binary classification accuracy and F1 score for each emotion category, following (Lv et al. 2021).

CMU-MOSI. CMU-MOSI (Zadeh et al. 2016b) is a dataset for multimodal emotion recognition and sentiment analysis, which comprises 2,199 short monologue video clips from 93 YouTube movie review videos. It contains 1,284 training samples, 229 validation samples, and 686 test samples. The audio and visual features are extracted at sampling frequencies of 12.5 Hz and 15 Hz, respectively. Human annotators label each sample with a sentiment score from -3 (strongly negative) to 3 (strongly positive). We use various metrics to evaluate the performance of the model, consistent with those used in previous work (Tsai et al. 2019a): 7-class accuracy (Acc7), binary accuracy (Acc2), and F1 score.

CMU-MOSEI. CMU-MOSEI (Zadeh et al.
2018) is also a dataset for multimodal emotion recognition and sentiment analysis, which contains 3,837 videos from 1,000 diverse speakers. Its pre-determined data segmentation includes training, validation, and test splits, and the audio and visual features are extracted at sampling frequencies of 20 Hz and 15 Hz, respectively. In addition, each data sample is annotated with a sentiment score on a Likert scale from -3 to 3. We use the same performance metrics as above.

Table 3: Comparison on the IEMOCAP dataset under both the word-aligned and unaligned settings. The performance is evaluated by the binary classification accuracy and the F1 score for each emotion class. *: reproduced from open-source code; †: from (Lv et al. 2021). LMR-CBT achieves comparable or superior performance with only 0.34M parameters.

For feature extraction of the language modality, we convert the video transcripts into 300-dimensional word embeddings using the pre-trained GloVe (Pennington, Socher, and Manning 2014) model. For the visual modality, we use Facet (Baltrušaitis, Robinson, and Morency 2016) to extract 35 facial action units, which record the facial muscle movements used to express basic and advanced emotions in each frame. For the audio modality, we use COVAREP (Degottex et al. 2014) to extract acoustic signals and obtain 74-dimensional vectors.

Table 1 shows the hyperparameters used in training and testing for each dataset. The kernel size is used to process the input sequences of the audio and visual modalities; since BiLSTM is used for the language modality, no kernel size is involved. We train our model on a single RTX 2080Ti. Further details are provided in the supplementary file.

We compare LMR-CBT with several baselines, including MISA (Hazarika, Zimmermann, and Poria 2020) and Progressive Modality Reinforcement (PMR) (Lv et al. 2021). Among these methods, LF-LSTM, MulT, LMF-MulT, and PMR can be directly applied to the unaligned setting. For the other methods, we introduce a connectionist temporal classification (CTC) (Graves et al. 2006) module to make them applicable to the unaligned setting.

Word-aligned setting. This setting requires manual alignment of the language words with the visual and audio streams. We show the comparison of our method with the other benchmarks in the upper part of Tables 3-5. The experimental results show that the proposed method achieves a performance level comparable to PMR (Lv et al. 2021) on the different metrics of the three datasets. Compared with LMF-MulT (Sahay et al. 2020), which uses six transformer encoders, we achieve better performance on the different datasets using half as many transformer encoders.

Unaligned setting. This setting is more challenging than the word-aligned setting: cross-modal information is extracted directly from unaligned multimodal sequences to classify emotions. We show the comparison of our approach with the other benchmarks in the lower part of Tables 3-5. Moreover, Figure 1 demonstrates that our proposed model reaches the state of the art with a minimal number of parameters (only 0.41M) on the CMU-MOSEI dataset. Compared with other approaches, our proposed light-weight network is more applicable to real scenarios. We can draw the following conclusions from the experimental results:

• With the exception of MulT (Tsai et al. 2019a), LMF-MulT (Sahay et al. 2020), and PMR (Lv et al. 2021), most of the models perform poorly in the unaligned setting because they do not take into account the interactions between the modalities.
• In addition, the outstanding performance of MISA (Hazarika, Zimmermann, and Poria 2020) is largely due to its pre-trained language model, which contains a large number of parameters.

Effectiveness of BiLSTM. For the language modality, we adopt BiLSTM to capture the long-term dependencies and the contextual associations in the text. We replace BiLSTM with Conv1D for comparison, and the experimental results (in the upper part of Table 2) show that BiLSTM is the more effective choice for the language modality. We also try fusing other combinations of modalities, e.g., the language and audio modalities into the visual modality, to obtain the fused features. From the experimental results, shown in the lower part of Table 2, [V, A]->L achieves the best performance compared with the remaining two settings under the same number of parameters. Meanwhile, we note that the results are the worst when we obtain the fused features through the audio modality, which indicates that we do not obtain a high-level feature representation of the audio. Moreover, we attribute the strong performance of [V, A]->L to the fact that BiLSTM already provides a good representation of the language modality in the feature preprocessing stage, which benefits the final performance.

In this paper, we propose a neural network to learn modality-fused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition from unaligned multimodal sequences. First of all, we perform feature preprocessing on each modality respectively. Unlike previous work, we use BiLSTM for the language modality to handle long-term dependencies and contextual information. Furthermore, we design a novel transformer with cross-modal blocks (CB-Transformer) that enables complementary learning of different modalities, which is mainly divided into local temporal learning, cross-modal feature fusion, and global self-attention representations. The CB-Transformer can represent the fused features without losing the original features, and can process unaligned multimodal sequences efficiently. Finally, we apply the proposed method to IEMOCAP, CMU-MOSI, and CMU-MOSEI, respectively, and the experimental results show that our proposed method achieves comparable or better results than the existing state-of-the-art methods with the minimum number of parameters. We also find that the initial features of the three modalities are highly important but limited by the preprocessing. In future work, we will build an end-to-end multimodal learning network and introduce the learning of more modalities, such as body postures, to explore the relationships between different modalities.

References
OpenFace: An open source facial behavior analysis toolkit
IEMOCAP: Interactive emotional dyadic motion capture database
COVID-19 sentiment analysis via deep learning during the rise of novel cases
Weakly-supervised Multi-task Learning for Multimodal Affect Recognition
Multimodal End-to-End Sparse Model for Emotion Recognition
Modality-Transferable Emotion Embeddings for Low-Resource Multimodal Emotion Recognition
COVAREP - A collaborative voice analysis repository for speech technologies
Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
Progressive Modality Reinforcement for Human Multimodal Emotion Recognition From Unaligned Multimodal Sequences
Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web
Deep Spatio-Temporal Feature Fusion with Compact Bilinear Pooling For Multimodal Emotion Recognition. Computer Vision and Image Understanding
Glove: Global Vectors for Word Representation
Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities
Utterance-Level Multimodal Sentiment Analysis
Low Rank Fusion based Transformers for Multimodal Sequences
Multimodal Transformer for Unaligned Multimodal Language Sequences
Learning Factorized Multimodal Representations
Attention is All You Need
Select-additive learning: Improving generalization in multimodal sentiment analysis
Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors. Proceedings of the AAAI Conference on Artificial Intelligence
Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis
Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph
MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos
Multimodal Sentiment Intensity Analysis in Videos: Facial Gestures and Verbal Messages
Audio-visual affect recognition through multi-stream fused HMM for HCI