title: Transformer-based approach towards music emotion recognition from lyrics
authors: Agrawal, Yudhik; Shanker, Ramaguru Guru Ravi; Alluri, Vinoo
date: 2021-01-06
doi: 10.1007/978-3-030-72240-1_12

The task of identifying emotions from a given music track has been an active pursuit in the Music Information Retrieval (MIR) community for years. Music emotion recognition has typically relied on acoustic features, social tags, and other metadata to identify and classify music emotions. The role of lyrics in music emotion recognition remains under-appreciated in spite of several studies reporting superior performance of music emotion classifiers based on features extracted from lyrics. In this study, we use a transformer-based model with XLNet as the base architecture, which, to date, has not been used to identify emotional connotations of music based on lyrics. Our proposed approach outperforms existing methods on multiple datasets. We also use a robust methodology to enhance the accuracy of web crawlers for extracting lyrics. This study has important implications for applications involved in emotion-based playlist generation as well as for improving music recommendation systems.

Information retrieval and recommendation, be it related to news, music, products, or images, amongst others, is crucial in e-commerce and on-demand content streaming applications. With the staggering increase in paid subscribers to music streaming platforms over the years, and especially during the Covid pandemic [1], MIR systems have become increasingly needed and relevant. Music emotion recognition has gained prominence in recent years in the field of MIR, albeit relying on acoustic features [11, 29] and social tags [6] to identify and classify music emotions. Lyrics have been largely neglected despite the crucial role they play in eliciting emotions [14], a vital factor contributing to musical reward [25], in addition to reflecting user traits and tendencies [34], which in turn are related to musical preferences [26]. Despite a handful of studies reporting the superior performance of music emotion classifiers based on features extracted from lyrics rather than audio [16, 38], the role of lyrics in music emotion recognition remains under-appreciated. Analyzing lyrics and their emotional connotations using advanced Natural Language Processing (NLP) techniques would make for a natural choice. However, NLP in MIR has so far been used mainly for topic modelling [20], identifying song structure via lyrics [13], and mood classification [16]. In the context of music emotion recognition [23, 38], traditional NLP approaches have typically been used, which are limited to word-level representations and embeddings, as opposed to more modern NLP techniques that are based on context and long-term dependencies, such as transformers [10, 40]. Lyrics can be treated as narratives rather than independent words or sentences, which renders the use of transformers a natural choice for mining affective connotations. In this study, we use a transformer model which, to date, has not been used for identifying emotional connotations of music based on lyrics. Analyzing affective connotations from text, that is, sentiment analysis, has been actively attempted in short contexts such as reviews [4, 30], tweets [3, 7], and news articles [35], amongst others, with limited application to lyrics.
Sentiment analysis has come a long way from its inception based on surveys and public opinions [21], through the use of linguistic features like character n-grams [15], bag-of-words [4], and lexicons like SentiWordNet [27], to state-of-the-art context-based approaches [10, 33] for capturing the polarity of a text. The task of sentiment analysis has been approached using several deep learning techniques like RNNs [7, 31], CNNs [7], and transformers [10, 18], which have been shown to perform remarkably better than traditional machine-learning methods [19]. Music emotion classification using lyrics has been performed based on traditional lexicons [16, 17]. Such lexicons not only have a very limited vocabulary, but their values also have to be aggregated without using any contextual information. In recent years, pre-trained models like GloVe [32], ELMo [33], and transformers [10, 37], trained on large text corpora, have been fast gaining importance and have shown impressive results in several downstream NLP tasks. The authors of [9, 2] perform emotion classification using lyrics by applying an RNN model on top of word-level embeddings. The MoodyLyrics dataset [5] was used by [2], who report an impressive F1-score of 91.00%. Recurrent models like LSTMs work on the Markov principle, where information from past steps goes through a sequence of computations to predict a future state. The transformer architecture, in contrast, eschews recurrence and introduces self-attention, which establishes direct dependencies between each step and all other steps. Since every step has direct access to all other steps, self-attention ensures negligible information loss. In this study, we employ a multi-task setup, using XLNet as the base architecture for the classification of emotions, and evaluate the performance of our model on several datasets that have been organized by emotional connotations solely based on lyrics. We demonstrate superior performance of our transformer-based approach compared to RNN-based approaches [9, 2]. In addition, we propose a robust methodology for extracting lyrics for a song.

MoodyLyrics [5]: This dataset comprises 2595 songs uniformly distributed across the 4 quadrants of Russell's Valence-Arousal (V-A) circumplex model of affect [36], where emotion is a point in a two-dimensional continuous space, which has been reported to sufficiently capture musical emotions [12]. Valence describes pleasantness and Arousal represents the energy content. The authors used a combination of existing lexicons such as ANEW, WordNet, and WordNet-Affect to assign V-A values at the word level, followed by song-level averaging of these values. These were further validated using subjective human judgment of the mood tags from the AllMusic Dataset [24]. Finally, the authors retained songs in each quadrant only if their Valence and Arousal values were above specific thresholds, thereby rendering them highly representative of those categories.

MER Dataset [24]: This dataset contains 180 songs distributed uniformly among the 4 emotion quadrants of the 2-D Russell's circumplex model. Several annotators assigned V-A values for each song solely based on the lyrics displayed, without the audio. The Valence and Arousal of each song were computed as the average of their subjective ratings. This dataset was also reported to demonstrate high internal consistency, making it highly perceptually relevant.
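For orientation, the sketch below shows how continuous Valence-Arousal values map onto the four quadrants of Russell's circumplex model used by both datasets. The midpoint thresholds and quadrant labels are hypothetical placeholders for illustration; the dataset authors applied their own, stricter cut-offs before retaining a song.

```python
# Hypothetical sketch: mapping continuous Valence-Arousal values onto the four
# quadrants of Russell's circumplex model, as used to label both datasets.
# The midpoint thresholds below are placeholders, not the datasets' actual cut-offs.
def to_quadrant(valence: float, arousal: float,
                v_mid: float = 0.0, a_mid: float = 0.0) -> str:
    if valence >= v_mid and arousal >= a_mid:
        return "Q1"  # high valence, high arousal (e.g., happy, excited)
    if valence < v_mid and arousal >= a_mid:
        return "Q2"  # low valence, high arousal (e.g., angry, tense)
    if valence < v_mid and arousal < a_mid:
        return "Q3"  # low valence, low arousal (e.g., sad, depressed)
    return "Q4"      # high valence, low arousal (e.g., calm, relaxed)


# Example: a pleasant, low-energy song falls in Q4.
print(to_quadrant(valence=0.6, arousal=-0.4))  # -> "Q4"
```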
Due to copyright issues, the datasets do not provide lyrics; however, URLs from different lyric websites are provided in each of the datasets. In order to mine the lyrics, one approach is to write a crawler for each of the websites present in the datasets. However, some of those URLs were broken. Hence, to address this concern, we provide a robust approach for extracting lyrics using the Genius website. All existing APIs, including the Genius API, require the correct artist and track name for extracting the lyrics. However, if the artist or track name is misspelled in the dataset, the API fails to extract the lyrics. We handled this issue by introducing a web crawler that obtains the Genius website URL for the lyrics of a song instead of hard-coding the artist and track name in the Genius API. Using the web crawler, we were able to considerably improve the proportion of songs extracted from 60-80% for the different datasets to ~99% for each dataset.

We describe a deep neural network architecture that, given the lyrics, outputs the classification of Emotion Quadrants, in addition to Valence and Arousal Hemispheres. The entire network is trained jointly on all these tasks using weight sharing, an instance of multi-task learning. Multi-task learning acts as a regularizer by introducing an inductive bias that prefers hypotheses explaining all the tasks. It reduces the risk of overfitting and the model's ability to accommodate random noise during training, while achieving faster convergence [41]. We use XLNet [40] as the base network, which is a large bidirectional transformer that uses an improved training methodology, more data, and more computational power. XLNet improves upon BERT [10] by using Transformer-XL [8] as its base architecture. The recurrence added to the transformer enables the network to have a deeper understanding of contextual information. The XLNet transformer model outputs raw hidden states, which are passed to a SequenceSummary block that computes a single-vector summary of the sequence of hidden states, followed by one more hidden fully-connected (FC) layer which encodes the information into a vector of length 8. This layer finally branches out into three complementary tasks via a single FC layer on top for the classification of Quadrant, Valence, and Arousal separately (a minimal illustrative sketch of this setup is given below). As we feed input data, the entire pre-trained XLNet model and the additional untrained classification layers are trained on all three tasks. We use the following loss function to train our network:

L = L_Q + L_V + L_A,    (1)

where L_Q, L_V, and L_A represent the classification losses on Quadrants, Valence, and Arousal, respectively. We use the AdamW optimizer [22] with an initial learning rate of 2e-5 and dropout regularization with a 0.1 discard probability for the layers. We use cross-entropy loss for each task. A batch size of 8 was used. We also restrict the length of the lyrics to 1024 words; the lyrics of more than 99% of the songs had fewer than 1024 words. We leverage the rich information of the pre-trained (XLNet-base-cased) model, as it was trained on large corpora. Since the pre-trained model layers already encode a rich amount of information about language, training the classifier is relatively inexpensive [37]. We also run our network on single-task classification and compare the results as part of our ablation study in a later section. For evaluating the effectiveness of our proposed model, we use the standard recall, precision, and F1 measures.
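The following is a minimal PyTorch sketch of the multi-task architecture and loss function described above, assuming the HuggingFace transformers library. The class name, the last-token pooling used in place of the SequenceSummary block, the tanh activation, and the unweighted sum of the three loss terms are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the shared-encoder, three-head multi-task setup (assumptions noted above).
import torch
import torch.nn as nn
from transformers import XLNetModel


class LyricsEmotionModel(nn.Module):
    def __init__(self, model_name: str = "xlnet-base-cased"):
        super().__init__()
        self.encoder = XLNetModel.from_pretrained(model_name)  # shared pre-trained backbone
        d_model = self.encoder.config.d_model                   # 768 for xlnet-base-cased
        self.dropout = nn.Dropout(0.1)                          # 0.1 discard probability
        self.summary_fc = nn.Linear(d_model, 8)                 # hidden FC layer -> vector of length 8
        # Three complementary task heads sharing the same 8-dim representation.
        self.quadrant_head = nn.Linear(8, 4)                    # quadrants Q1-Q4
        self.valence_head = nn.Linear(8, 2)                     # positive / negative valence
        self.arousal_head = nn.Linear(8, 2)                     # high / low arousal

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        summary = hidden[:, -1, :]                  # single-vector summary (last-token pooling)
        z = torch.tanh(self.summary_fc(self.dropout(summary)))
        return self.quadrant_head(z), self.valence_head(z), self.arousal_head(z)


def multitask_loss(logits, labels):
    """L = L_Q + L_V + L_A, one cross-entropy term per task."""
    ce = nn.CrossEntropyLoss()
    (q_logits, v_logits, a_logits), (q_y, v_y, a_y) = logits, labels
    return ce(q_logits, q_y) + ce(v_logits, v_y) + ce(a_logits, a_y)
```

In training, torch.optim.AdamW with a 2e-5 learning rate would be applied to all parameters, fine-tuning the pre-trained encoder jointly with the untrained heads, matching the optimizer settings reported above.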
We provide results for both macro-averaged F1 and micro-averaged F1. The micro-averaged F1 is also the classifier's overall accuracy. We use the macro-averaged F1 (F1-score) [39] as given in Equation 2: the scores are first computed for the binary decisions on each individual category and then averaged over categories (a minimal computational sketch of this metric is given at the end of the section),

F1_macro = (1/|C|) * sum over x in C of F1_x,  with  F1_x = (2 * P_x * R_x) / (P_x + R_x),    (2)

where F1_x, P_x, and R_x denote the F1-score, precision, and recall with respect to class x, and C is the set of classes. This metric is significantly more robust to the error-type distribution than the other variants of the macro-averaged F1 [28].

We use the multi-task setup to compare our performance on the various datasets. For a fair evaluation of our method, we use the data splits for the respective datasets as mentioned in the respective studies. All results reported hereon are the average over multiple data splits. Tables 1 and 2 compare the results of our approach on the MoodyLyrics and MER datasets, respectively. These results demonstrate the far superior performance of our method when compared to studies that have attempted the same task. We also compare the performance of our approach by validating on an additional dataset, the AllMusic dataset comprising 771 songs provided by [24]. We follow the same procedure of training on the MER dataset and evaluating on the AllMusic dataset as mentioned by the authors. We obtain an improved F1-score of 75.40% compared to their reported 73.60% on single-task Quadrant classification, in addition to an improved accuracy of 76.31% compared to the accuracy of 74.25% reported, albeit on a subset of the AllMusic dataset, in [5]. Our multi-task method demonstrated a comparable F1-score and accuracy of 72.70% and 73.95% when compared to our single-task Quadrant classification.

Ablation Study: Owing to the large size and quadrant representativeness of the MoodyLyrics dataset, we perform extensive analysis with different architecture types and sequence lengths. In the initial set of experiments, we aimed to find the best model, comparing our baseline model with the BERT transformer at the same sequence length of 512, which resulted in inferior performance, with an F1-score lower by around 1.3%. We also compare the performance of our baseline model with our multi-task setup. Table 3 shows that the multi-task setup performs similarly to our baseline method but converges considerably faster, yielding a large improvement in training speed. The single-task setup also requires training each task from scratch, which makes it inefficient.

In this study, we have demonstrated the robustness of our novel transformer-based approach for music emotion recognition using lyrics on multiple datasets when compared to hitherto used approaches. Our multi-task setup helps with faster convergence and reduces model overfitting; however, the single-task setup performs marginally better, albeit at the expense of computational resources. This study can help in improving applications such as the generation of playlists of music with similar emotions. Hybrid music recommendation systems, which predominantly utilize acoustic content-based and collaborative filtering approaches, can also benefit from incorporating the emotional connotations of lyrics for retrieval. This approach can be extended in future work to multilingual lyrics.
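As referenced above, the following is a minimal, self-contained sketch of the macro-averaged F1 from Equation 2; the function name and the toy labels are illustrative, not from the paper. The same quantity is computed by sklearn.metrics.f1_score with average="macro", and average="micro" recovers the overall accuracy mentioned in the evaluation.

```python
# Minimal sketch of the macro-averaged F1 in Equation 2 (per-class F1, then averaged over classes).
from typing import Sequence


def macro_f1(y_true: Sequence, y_pred: Sequence, classes: Sequence) -> float:
    per_class_f1 = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        per_class_f1.append(f1)
    return sum(per_class_f1) / len(per_class_f1)  # average over categories


# Toy example with the four emotion quadrants.
truth = ["Q1", "Q2", "Q3", "Q4", "Q1", "Q3"]
preds = ["Q1", "Q2", "Q4", "Q4", "Q2", "Q3"]
print(round(macro_f1(truth, preds, classes=["Q1", "Q2", "Q3", "Q4"]), 3))
```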
References
[1] Spotify hits 130 million subscribers amid Covid-19
[2] Emotion classification of song lyrics using bidirectional LSTM method with GloVe word representation weighting
[3] Sentiment analysis of Twitter data
[4] Sentiment analysis of online reviews using bag-of-words and LSTM approaches
[5] MoodyLyrics: A sentiment annotated lyrics dataset
[6] Music mood dataset creation based on last.fm tags
[7] BB_twtr at SemEval-2017 Task 4: Twitter sentiment analysis with CNNs and LSTMs
[8] Transformer-XL: Attentive language models beyond a fixed-length context
[9] Music mood detection based on audio and lyrics with deep neural net
[10] BERT: Pre-training of deep bidirectional transformers for language understanding
[11] Prediction of multidimensional emotional ratings in music from audio using multivariate regression models
[12] A comparison of the discrete and dimensional models of emotion in music
[13] Lyrics segmentation: Textual macrostructure detection using convolutions
[14] Musical preferences. Oxford Handbook of Music Psychology
[15] Codex: Combining an SVM classifier and character n-gram language models for sentiment analysis on Twitter text
[16] When lyrics outperform audio for music mood classification: A feature analysis
[17] Lyric-based song emotion detection with affective lexicon and fuzzy clustering method
[18] EmotionX-IDEA: Emotion BERT, an affectional model for conversation
[19] Comparison of traditional machine learning and deep learning approaches for sentiment analysis
[20] Oh oh oh whoah! Towards automatic topic detection in song lyrics
[21] Japanese opinion surveys: The special need and the special difficulties
[22] Decoupled weight decay regularization
[23] Music emotion recognition from lyrics: A comparative study. 6th International Workshop on Machine Learning and Music (MML13)
[24] Emotionally-relevant features for classification and regression of music lyrics
[25] Individual differences in music reward experiences
[26] Personality correlates of music audio preferences for modelling music listeners
[27] Sentiment classification of reviews using SentiWordNet
[28] Macro F1 and macro F1
[29] Multi-modal music emotion recognition: A new dataset, methodology and comparative analysis
[30] A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts
[31] Sentiment analysis by using recurrent neural network
[32] GloVe: Global vectors for word representation
[33] Deep contextualized word representations
[34] Personality predicts words in favorite songs
[35] Sentiment analysis in news articles using sentic computing
[36] A circumplex model of affect
[37] How to fine-tune BERT for text classification?
[38] Sentiment vector space model for lyric-based song sentiment classification
[39] A re-examination of text categorization methods
[40] XLNet: Generalized autoregressive pretraining for language understanding
[41] An overview of multi-task learning