title: EmotionGIF-Yankee: A Sentiment Classifier with Robust Model Based Ensemble Methods
authors: Wang, Wei-Yao; Chang, Kai-Shiang; Tang, Yu-Chien
date: 2020-07-05

This paper presents a method for sentiment classification with robust, model-based ensemble methods. We preprocess tweet data to enhance the coverage of the tokenizer. To reduce domain bias, we first further train the pre-trained language model on the tweet dataset. Moreover, since each classifier has its strengths and weaknesses, we leverage different types of models with two ensemble methods: averaging and the power weighted sum. Our experiments show that this approach has a positive effect on sentiment classification. Our system reached third place among 26 teams in the evaluation of the SocialNLP 2020 EmotionGIF competition.

Natural language is often indicative of one's emotions. Hence, detecting emotions in textual conversations has been one of the popular topics in the sentiment domain of natural language processing (NLP). Sentiment classifiers can help researchers study users' feelings. There are various sentiment classification tasks. For example, Riloff et al. (2005) present an information extraction (IE) system that exploits subjectivity classification to automatically filter extractions. For opinion extraction, Zhai et al. (2011) extract different opinion features, including sentiment words, substrings, and key-substring-groups, to help improve sentiment classification performance. More recently, Hazarika et al. (2018) propose a multimodal emotion detection framework, the interactive conversational memory network (ICON), which extracts multimodal features for emotion detection.

In SocialNLP 2020 EmotionGIF, the challenge is to use a tweet's text and reply to recommend exactly 6 GIF categories. In this paper, we propose an architecture for this shared task: we preprocess the original tweet data, further train a pre-trained language model on it, and then fine-tune a multi-label classification model. To build a comprehensive emotion classifier, we design an ensemble scheme to obtain higher performance.

The shared task includes a first-of-its-kind dataset of 40,000 two-turn Twitter threads. Each thread contains 5 fields: idx, text, reply, categories, and mp4:

• idx: a unique identifier of each tweet
• text: the text of the original tweet
• reply: the text content of the response tweet
• categories: the categories of the response GIF, containing 1 to 6 categories out of a list of 43 categories
• mp4: the hash file name of the response GIF

The dataset is split into three JSON files: train-gold, dev-unlabeled, and test-unlabeled. The first, containing 32,000 threads, is the training data; the other two, containing 4,000 threads each, are the validation and testing data. The difference between train-gold and the other two files is that the former contains all 5 fields, while dev-unlabeled and test-unlabeled contain only 3: idx, text, and reply.

Figure 1 is a subset of the correlation table, which records how often any two categories co-appear. The figure illustrates the correlation between different categories; we can observe that some categories have a strong connection while others have a weak one.
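As a concrete illustration of the data format, the following sketch loads the training split and counts category frequencies. It is not from the paper's released code; the file name train_gold.json and the exact field layout are assumptions based on the description above.

```python
import json
from collections import Counter

# Hypothetical file name for the "train-gold" split described above.
with open("train_gold.json") as f:
    threads = json.load(f)  # assumed: a list of dicts with the 5 fields

category_counts = Counter()
for thread in threads:
    category_counts.update(thread["categories"])  # 1 to 6 labels per thread

# The frequency table exposes the category imbalance noted later in the paper.
for category, count in category_counts.most_common(10):
    print(f"{category}: {count}")
```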
Our study mainly involves three topics: multi-label classification, pre-trained models, and ensemble methods.

Multi-label classification is a generalization of multiclass classification. Nowadays, multi-label classification is increasingly used in many fields of NLP, such as semantic scene classification and sentiment classification. There are two main categories of multi-label classification approaches: problem transformation (PT) methods and algorithm adaptation (AA) methods. Generally speaking, problem transformation methods transform the multi-label classification problem into one or more single-label classification problems (Zhang and Zhou, 2014), while algorithm adaptation methods use algorithms that have been adapted to the multi-label setting and need no problem transformation. Among problem transformation methods, a good strategy is Classifier Chains (CC) (Read et al., 2011), which decides whether an instance belongs to each label through a chain of binary classifiers and is thereby able to capture the interdependencies between the labels. Among algorithm adaptation methods, Multi-label k-Nearest Neighbors (ML-kNN) (Zhang and Zhou, 2007) is one of the most popular. It applies the maximum a posteriori (MAP) principle on top of the k-Nearest Neighbors (kNN) algorithm, a well-known traditional machine learning method, to determine the label subset of each instance. Thanks to its promising results and simplicity, it has been applied to many practical text classification tasks.

Most existing multi-label classification approaches solve emotion classification by training a model on a large dataset. The idea is to find informative features that reflect the emotion expressed in the text, so most studies aim to find efficient features that lead to better performance (Jabreel and Moreno, 2016). Deep learning models have also been introduced for multi-label classification, and it has been shown that such models can extract high-level features from raw data. For instance, Baziotis et al. (2018), the winners of the SemEval-2018 Task 1 competition (Affect in Tweets), propose a Bi-LSTM architecture with an attention mechanism and leverage a set of word2vec word embeddings trained on a dataset of 550 million tweets.

Pre-trained models have been widely applied in a variety of NLP systems and dramatically improve performance on downstream tasks. They have three major advantages. First, since pre-training is unsupervised, a virtually unlimited corpus is available for training. Second, a strong pre-trained language model generates deep contextual word representations, meaning that a word token can have different representations in different sentences; hence, fine-tuning improves downstream tasks more efficiently. Last but not least, using pre-trained models saves enormous architecture engineering: we do not need to design a deep network ourselves and pre-train it at massive cost.

BERT (Devlin et al., 2018), Bidirectional Encoder Representations from Transformers, is one of the state-of-the-art (SOTA) pre-trained models. Its pre-training stage has two tasks. The first, Masked LM (MLM), replaces 15% of the tokens in each sequence with a [MASK] token, and the model must predict these masked tokens; the encoder learns contextual representations during this stage. The second, Next Sentence Prediction (NSP), takes pairs of sentences as input and learns to predict whether the second sentence of the pair is the subsequent sentence in the original document. In detail, for 50% of the training inputs the pair is consecutive in the original document, while for the other 50% a random sentence from the corpus is chosen as the second sentence.
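As a quick illustration of the MLM objective (not part of our pipeline), Hugging Face's fill-mask pipeline lets a pre-trained BERT predict a masked token:

```python
from transformers import pipeline

# Illustrative only: a pre-trained BERT predicting the [MASK] token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("I am so [MASK] about this movie."):
    print(prediction["token_str"], round(prediction["score"], 3))
```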
There are variant models based on BERT, such as RoBERTa (Liu et al., 2019) and DistilBERT (Sanh et al., 2019). DistilBERT, a distilled BERT, reduces the size of a BERT model by 40% while retaining 97% of its language understanding capability and being 60% faster; it keeps half the number of layers and removes the token-type embeddings and the pooler. Instead of focusing on efficiency, RoBERTa, a robustly optimized BERT approach, finds that BERT is undertrained, which is why its authors carefully modify key hyperparameters to improve performance. Since prior work disagrees about whether to remove NSP (Devlin et al., 2018; Lample and Conneau, 2019; Yang et al., 2019; Joshi et al., 2020), RoBERTa runs experiments and finds that removing NSP slightly improves downstream tasks. Furthermore, RoBERTa uses bytes instead of unicode characters as the base subword units (Radford et al., 2019), which allows the model to learn a larger subword vocabulary.

In general, supervised learning can be defined as finding hypotheses (classifiers) that are close to the true function representing all the data points in the training data. However, learning algorithms that output only one hypothesis face three major problems: statistical, computational, and representational. Fortunately, ensemble methods construct a set of classifiers and then classify new data points by taking a vote over their predictions, which usually addresses all three problems (Dietterich, 2000). The first problem is that learning algorithms may produce different hypotheses with the same accuracy; by constructing an ensemble out of all of these accurate classifiers, the algorithm can use a simple and fair voting mechanism to reduce the risk of choosing the wrong classifier. The second problem is that many learning algorithms perform local search, which may stop even when the best solution found is not optimal; an ensemble constructed by running the local search from many different starting points may provide a better approximation to the true unknown function than any individual classifier. The third problem is that in most applications of machine learning, the true function cannot be represented by any single hypothesis; by forming weighted sums of hypotheses, it may be possible to expand the space of representable functions. Hagen et al. (2015) introduce an ensemble approach to Twitter sentiment detection. Their ensemble method is a voting scheme over the actual classifications of the individual classifiers rather than an average of confidences, and their system proved a strong baseline in the SemEval 2015 evaluation.

These studies motivate us to cast the EmotionGIF task as a multi-label classification problem, since the task requires inferring the 6 most likely categories for each tweet. To avoid heavy architecture engineering, we adopt pre-trained models and focus on the preprocessing and postprocessing stages, such as ensemble methods, to achieve better performance in the competition. In addition to solving the problems mentioned above, ensembles exploit the fact that each classifier has its strengths and weaknesses: if we combine different types of classifiers so that one's strengths cover another's drawbacks, we can obtain a highly accurate classifier from less accurate ones. By combining these three techniques, we can build a robust system for the EmotionGIF task.
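To make the voting idea concrete, here is a minimal sketch of hard voting in the style of Hagen et al. (2015), where each classifier casts one vote per predicted label. The category names are hypothetical, and our own system instead averages confidences, as described next.

```python
from collections import Counter

def majority_vote(predictions_per_model, k=6):
    """Hard voting: each model's predicted labels count as votes, and the
    k labels with the most votes win. Illustrative sketch only."""
    votes = Counter()
    for labels in predictions_per_model:
        votes.update(labels)
    return [label for label, _ in votes.most_common(k)]

# Hypothetical label sets from three classifiers for one tweet.
predictions = [
    ["agree", "thank you", "thumbs up", "applause", "yes", "oops"],
    ["agree", "thank you", "oops", "applause", "yes", "scared"],
    ["agree", "thumbs up", "applause", "scared", "yes", "do not want"],
]
print(majority_vote(predictions))
```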
The main goal of the present work is to predict the 6 most likely categories for each tweet in the EmotionGIF task. We propose an architecture, shown in Figure 3, that consists of three stages: preprocessing, model framework, and ensemble methods.

Tweet data do not have the same structure as a formal corpus (e.g., Wikipedia). There are multiple ways to clean up the original tweet data. We normalize the data with five steps, applied in order, but we do not convert text to lower case:

1. Transform unusual punctuation into standard equivalents.
2. Expand apostrophe contractions into their original words. For example, hasn't is converted to has not.
3. Map unknown punctuation that is not in the tokenizer's vocabulary. For example, β is unknown to the RoBERTa tokenizer and is transformed into the word beta.
4. Demojize: convert emoji symbols into their corresponding meanings. If an emoji is duplicated, we retain only one occurrence to represent the duplicates.
5. Detweetize and convert further words: some words in the dataset are in tweet style, i.e., seldom seen in a formal corpus. We manually replace them with common representations; for example, idk is replaced with I don't know. Moreover, recent terms such as COVID have never been seen by the tokenizers, so we transform such words into common words like virus that can be tokenized correctly.

The model framework is composed of two parts: an enhanced pre-trained language model and a fine-tuned multi-label classification model. Pre-trained models are trained on formal corpora such as Wikipedia rather than on tweets. To avoid overfitting and domain bias on the training data, we use the provided 32,000-thread training set to further train the pre-trained language model; the enhanced language model understands tweet-style sentences better.

We treat the EmotionGIF task as a multi-label classification problem and fine-tune the enhanced pre-trained model into a multi-label classification model for the downstream task. To properly handle multi-label classification, we select BCEWithLogitsLoss as our loss function. BCEWithLogitsLoss combines a sigmoid layer with the BCELoss and takes advantage of the log-sum-exp trick for numerical stability, as in Equations (1) and (2):

$$\ell(x, y) = \frac{1}{N} \sum_{n=1}^{N} l_n \quad (1)$$

$$l_n = -\left[ y_n \cdot \log \sigma(x_n) + (1 - y_n) \cdot \log\left(1 - \sigma(x_n)\right) \right] \quad (2)$$

where N is the batch size and σ is the sigmoid function.

Since our goal is better performance rather than efficiency, we use RoBERTa-base, BERT-base-cased, and BERT-base-uncased, individually training a language model for each and fine-tuning it into a multi-label classification model. RoBERTa and BERT use different input formats, and our dataset has a pair of sequences, text and reply, for each tweet, so we convert the input sentences according to the corresponding model. The BERT format adds the special token [CLS] at the beginning and [SEP] between the sentences and at the end. The RoBERTa format adds <s> at the beginning, </s></s> between the sentences, and </s> at the end. An example representation is shown in Table 1.

Since each classifier has its strengths and weaknesses, combining different types of classifiers so that one's strengths cover another's drawbacks lets us obtain a highly accurate classifier from less accurate ones. To attain the desired results, we combine three types of models: RoBERTa-base, BERT-base-cased, and BERT-base-uncased. On account of different dropout weights in each training run, the performance of individually trained models can differ widely. By training 10 models of the same type with different dropout weights and averaging their predictions, we lower the risk of depending on a single model with bad performance.
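As a minimal sketch (ours, with hypothetical ensemble weights and array shapes), the two-stage ensemble can be written as within-type averaging followed by the power weighted sum of Equation (3) below:

```python
import numpy as np

def average_runs(runs):
    """Stage 1: average predictions of same-type models trained with
    different dropout weights. Each run is a (num_tweets, 43) score array."""
    return np.mean(np.stack(runs), axis=0)

def power_weighted_sum(per_type_scores, weights, n):
    """Stage 2: weighted sum of scores raised to the power n, following
    Equation (3); the weights and n would be tuned on validation data."""
    return sum(w * p ** n for w, p in zip(weights, per_type_scores))

# Toy stand-ins for real model outputs (4,000 tweets x 43 categories).
rng = np.random.default_rng(0)
roberta = average_runs([rng.random((4000, 43)) for _ in range(10)])
bert_cased = average_runs([rng.random((4000, 43)) for _ in range(5)])
bert_uncased = average_runs([rng.random((4000, 43)) for _ in range(5)])

scores = power_weighted_sum([roberta, bert_cased, bert_uncased],
                            weights=[0.5, 0.25, 0.25], n=3)
top6 = np.argsort(-scores, axis=1)[:, :6]  # the 6 predicted category indices
```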
After training and averaging the three types of models, we combine their outputs with the power weighted sum in Equation (3):

$$P = \sum_{i=1}^{3} w_i P_i^{N} \quad (3)$$

where P_i, for i = 1, 2, 3, are the average prediction scores from RoBERTa-base, BERT-base-cased, and BERT-base-uncased respectively, and w_i, for i = 1, 2, 3, are the weights corresponding to each model.

Figure 4: Visualization of the equation y = x^n with n = 1/4, 1/3, 1/2, 1, 2, 3, 4.

To choose a reasonable N, we look into the properties of the power function. Figure 4 shows that the further a probability is from 1, the faster it is pushed toward 0, and vice versa. The probabilities that remain highest in the end are those whose relative agreement (weighted down by the probability and the power) across the ensemble models is highest (Laurae). We take advantage of this power weighted sum to enhance the performance of the model.

In EmotionGIF, ground-truth labels are available only for the training data, so we use dev-unlabeled as our validation data. That is, we tune hyperparameters based on the validation data and use the best models from tuning to predict the testing data, test-unlabeled.

In this section, we present reasonable results from our experiments. The source code for this paper is available as a GitHub repository. For both the pre-trained language model and the multi-label classification model, we use Adam (Kingma and Ba, 2014) as the optimizer with epsilon 1e-8 and learning rate 4e-5. Gradient accumulation steps and warmup ratio are 1 and 0.06. Max sequence length, number of epochs, and batch size are set to 113, 4, and 16. For the pre-trained language model, we set the block size to 96. For the multi-label classification model, early stopping is used: training stops when the early stopping metric, eval loss (which should be minimized), fails to improve for 3 consecutive evaluations (an early stopping patience of 3). Most of these configurations are the default arguments in Simple Transformers. To apply the ensemble methods, we train 10 RoBERTa-base, 5 BERT-base-cased, and 5 BERT-base-uncased models, all with the above configurations.

The metric used to evaluate entries is Mean Recall at k with k = 6 (MR@6). Table 2 shows an example of how we evaluate our predictions. For each output, we predict 6 categories out of the list of 43, as in the Prediction column of Table 2, and count how many of the predicted categories (N) are identical to ones in the answer. The MR@6 of a thread is N divided by the total number of categories in the answer, and the final result is the average MR@6 over all Twitter threads.

Table 2: An example of MR@6 evaluation.
Answer: agree, thank you, thumbs up
Prediction: oops, scared, thank you, you got this, do not want, agree
MR@6: 1/3

To check the preprocessing methods, we validate with the RoBERTa-base tokenizer's coverage, shown in Table 3. Table 4 shows the first 6 out-of-vocabulary (OOV) tokens. Although some unknown tokens remain, we observe that words like medium-dark might still be tokenized into expected tokens such as medium, -, and dark.

Exploratory data analysis (EDA) is an initial investigation of the data to discover patterns or check assumptions with the help of statistics. Through EDA, we find that converting words to lower case may cause unexpected tokens from the tokenizer. For example, the word Hug can be tokenized correctly at different positions, while the word hug cannot be tokenized as we expected: hug is split into h and ug. Hence we do not convert all words to lower case in the EmotionGIF task.
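This kind of check is easy to reproduce; here is a small sketch, assuming the Hugging Face tokenizer for roberta-base:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Inspect how casing and tweet-style words affect the subword split;
# the exact output depends on the tokenizer's learned BPE merges.
for word in ["Hug", "hug", "medium-dark", "idk"]:
    print(word, "->", tokenizer.tokenize(word))
```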
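Before turning to the results, note that the MR@6 metric defined above reduces to a few lines; this is our sketch of the stated definition, not the official scorer:

```python
def mr_at_6(predictions, answers):
    """Mean Recall at 6: for each thread, count the predicted categories
    that appear in the answer, divide by the answer's size, then average."""
    scores = [len(set(pred) & set(gold)) / len(gold)
              for pred, gold in zip(predictions, answers)]
    return sum(scores) / len(scores)

# Toy example with hypothetical categories: 2 of the 3 answers are hit.
print(mr_at_6([["agree", "hug", "yes", "oops", "wink", "shrug"]],
              [["agree", "hug", "thumbs up"]]))  # 0.666...
```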
The experimental results on the validation data are shown in Table 5, and Table 6 reports our system's predictions on the testing set. The ensemble models achieve an MR@6 score of about 0.5662, while using only a single type of model yields 0.5404. This indicates that a single type of model may be slightly worse on the testing data, and applying ensemble methods does address this problem. Overall, our proposed system outperforms both fine-tuning the original pre-trained language model and the official EmotionGIF baseline, achieving a high MR@6 score on both the validation and testing data in this competition.

In this work, we propose a system architecture that combines preprocessing, a model framework, and ensemble models for the EmotionGIF task. We deliberately convert some words into our desired format and increase the coverage of words recognized by the tokenizer. On top of the preprocessed data, we apply multi-label classification and pre-trained models during training to make our work more sophisticated. Moreover, we show that ensemble models with the power weighted sum outperform any single model trained with the same parameters.

In Section 2, we observed an imbalance between categories; the present work does not deal with it. Furthermore, in future work we consider replacing multi-label classification with ranking classification because of its ability to model dependencies. The probabilities in multi-label classification are treated as independent, so there is no correlation among categories, whereas ranking classification is the opposite. Since categories are connected to each other, as Figure 1 shows, we assume it would be better to let our model treat the dependencies between categories as critical.

References

Baziotis et al. (2018). NTUA-SLP at SemEval-2018 Task 1: Predicting affective content in tweets with deep attentive RNNs and transfer learning.
Devlin et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
Dietterich (2000). Ensemble methods in machine learning.
Hagen et al. (2015). Webis: An ensemble for Twitter sentiment detection.
Hazarika et al. (2018). ICON: Interactive conversational memory network for multimodal emotion detection.
Jabreel and Moreno (2016). SentiRich: Sentiment analysis of tweets based on a rich set of features.
Joshi et al. (2020). SpanBERT: Improving pre-training by representing and predicting spans.
Kingma and Ba (2014). Adam: A method for stochastic optimization.
Lample and Conneau (2019). Cross-lingual language model pretraining.
Laurae. Reaching the depths of (power/geometric) ensembling when targeting the AUC metric.
Liu et al. (2019). RoBERTa: A robustly optimized BERT pretraining approach.
Radford et al. (2019). Language models are unsupervised multitask learners.
Read et al. (2011). Classifier chains for multi-label classification.
Riloff et al. (2005). Exploiting subjectivity classification to improve information extraction.
Sanh et al. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.
Yang et al. (2019). XLNet: Generalized autoregressive pretraining for language understanding.
Zhai et al. (2011). Exploiting effective features for Chinese sentiment classification.
Zhang and Zhou (2014). A review on multi-label learning algorithms.
Zhang and Zhou (2007). ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition.