key: cord-0979700-p50n4bnn
authors: Srinivasan, R.; Subalalitha, C. N.
title: Sentimental analysis from imbalanced code-mixed data using machine learning approaches
date: 2021-03-20
journal: Distrib Parallel Databases
DOI: 10.1007/s10619-021-07331-4
sha: e5e8050935c4aedd6d8f9352726fcfcf220df3be
doc_id: 979700
cord_uid: p50n4bnn

Knowledge discovery from various perspectives has become a crucial asset in almost all fields. Sentiment analysis is a classification task that classifies a sentence according to the meaning it carries in context. This paper addresses the class imbalance problem, which is one of the important issues in sentiment analysis. Few works have focused on sentiment analysis with an imbalanced class label distribution. The paper also focuses on another aspect of the problem, which involves a concept called "code mixing". Code-mixed data consists of text alternating between two or more languages. An imbalanced class distribution is commonly observed in code-mixed data. Existing works have focused more on analyzing sentiments in monolingual data than in code-mixed data. This paper addresses these issues and proposes a solution for analyzing sentiments in class-imbalanced code-mixed data using a sampling technique combined with the Levenshtein distance metric. Furthermore, this paper compares the performance of several machine learning approaches, namely the Random Forest classifier, Logistic Regression, the XGBoost classifier, Support Vector Machine and the Naïve Bayes classifier, using F1-score.

Due to the rapid development of social media, users can share, discuss and communicate information very easily [1]. The information spans many interests, from politics and product reviews to academics and much more. Due to the COVID-19 lockdown, the number of online users on social media has increased extensively.
Social media sites also provide free, open-source and user-friendly approaches so that users can write in their own native language or in code-mixed text [2]. Code-mixing is a way of writing that uses more than one language to express views on social media.

Example 1
Code-mixed data: Trailer late ah parthavanga like podunga.
Translated data: Those who watched the trailer late, please like it.

In Example 1, the English words "Trailer" and "like" are mixed with Tamil words to form the sentence. It has been observed that, mainly in India, people prefer code-mixed language when communicating on social media [3]. This is due to the fact that India is a multilingual country where people speak different Indian languages while their medium of education is English. This is the main reason why Indians widely mix English with their native language to express their views on social media. The same scenario is observed in many other multilingual countries.

The application areas of code-mixed data include Machine Translation (MT), Mixed Script Information Retrieval (MSIR), Language Identification and sentiment analysis. Sentiment analysis (also known as opinion mining) is used to identify the emotions present in a given text. Sentiment analysis of code-mixed social data is one of the challenging tasks among Machine Learning (ML) applications. Preprocessing techniques normally used in sentiment analysis of monolingual text, such as stemming, Part-of-Speech (POS) tagging and morphological analysis, are insufficient for analyzing sentiments in code-mixed data. The reason is that code-mixed data does not follow a well-defined grammatical structure and also contains many unseen lexical items. To sum up, the major challenges of extracting sentiments from code-mixed social text are:

(1) No word order: In code-mixed data, word order is completely lost.
Users can construct their own structure to form a sentence. For example, "Thalaivaru vera vera vera level pannitaru" literally means "The leader has performed extremely well". In this code-mixed example, one English word, "level", is present, whereas the remaining five words are Tamil words transliterated into English script. The word "vera" appears three times to express the superlative performance of the leader. However, the meaning remains the same even when the word "vera" is used once. This induces varied sentence structures that increase the complexity of analyzing sentiments in a code-mixed sentence. Tamil sentences already have partially free word order, which at times makes the context intractable to capture. Code-mixed sentences worsen the situation even more.

(2) Spelling variations: As the words used in code-mixed sentences are user specific, there are no proper spelling rules. This creates a major problem when normalizing those words while analyzing the sentiments. For example, the word "super", which is commonly used by Tamilians, can be written as "superu", "sooper" and "suuuuupar". Such words are essential for capturing the sentiments, yet they are difficult to normalize.

(3) Creative spellings: Apart from the native language, users also create their own spellings for English words while writing their opinions on social media. For example, "Youtube" is written as "Utube" and "Great" as "gr8". This also adds complexity to sentiment analysis.

(4) Abbreviations: On social media, people use abbreviations to represent a phrase. For example, FDFS represents "first day first show". This should be handled while performing sentiment analysis.

(5) No capitalization: Users on social media often do not follow capitalization, which makes it difficult to identify sentence beginnings.

The above challenges pertain to sentiment analysis of code-mixed data.
This paper also addresses another problem, "imbalanced class label distribution", which arises in sentiment analysis with Machine Learning approaches even for monolingual texts [4]. Imbalance refers to the availability of more data tagged with one class than with the other classes. Popular imbalanced datasets include credit card fraud data, software defect prediction data [5] and airline data [6]. Such imbalance in class labels affects the accuracy of the classification task. Removing the minority class labels would also affect the overall accuracy of the classification task. The proposed work uses a sampling technique to address the class imbalance problem in code-mixed data.

The main contributions of our work are as follows:
(1) To separate code-mixed data from non-code-mixed data in the given corpus using an enhanced spell-checking algorithm.
(2) To create a lexicon dictionary for this code-mixed corpus. With the help of the dictionary, words with spelling variations in the corpus are normalized using the Levenshtein distance metric.
(3) To extract sentiments using various machine learning techniques and apply sampling methods to address the class imbalance problem.

The rest of the paper is organized as follows: Sect. 2 highlights the related work, Sect. 3 describes the preprocessing and feature extraction techniques of the proposed methodology, Sect. 4 presents the result analysis of the proposed work, and Sect. 5 presents the conclusion and future enhancements.

The literature survey covers three dimensions: works on sentiment analysis of monolingual data, works on sentiment analysis of code-mixed data, and works on sentiment analysis of data with an imbalanced class label distribution. Nasukawa et al. were the first to coin the term sentiment analysis, in 2003 [7], but linguists had already done some research on sentiment before 2000 [8].
Sentiment analysis was a highly challenging task in the field of NLP at the beginning of the twenty-first century. Later, sentiment analysis started reaching fields other than computer science. For instance, in 2007, Archak et al. used sentiment analysis in management studies to derive the price of the product [9]. Later, sentiment analysis was applied to many tasks in management studies to automatically obtain opinions on products, price predictions and product feedback [10] [11] [12] [13]. Liu et al. identified three levels at which sentiment analysis can be done, namely document level, sentence level and entity level. Since then, different techniques have been investigated and applied at all these levels. The techniques can be broadly classified into three categories, namely dictionary-based approaches, machine learning techniques and hybrid approaches.

Dictionary-based approaches mainly focus on predetermined lexicons. They can work well on monolingual data [14], as standard lexicons are built and stored in a dictionary. Dictionary-based approaches do not need training data, but they are extremely hard to apply to cross-domain or multilingual data [15]. Because dictionary-based approaches are entity specific, they are not suitable for all domains. Machine learning methods, including supervised, unsupervised and semi-supervised approaches, have been widely applied to sentiment analysis of monolingual data. The main disadvantage of the machine learning approach is that it needs a large amount of training data related to the specific domain. Pollyanna Goncalves et al. compared eight different types of hybrid approaches for sentiment analysis [16]. The above-mentioned techniques are well suited to analyzing sentiment in monolingual data. The problem addressed in this paper, however, is to analyze sentiments in code-mixed data.
Code-mixing is a current trend in the field of sentiment analysis and transliteration. Extracting the proper sentiment from code-mixed data is difficult [17]. Vijay et al. carried out research on Hindi-English code-mixed social text data [18]. Code-mixed data needs many more preprocessing steps than monolingual data. The authors proposed several preprocessing techniques related to the mixed script and achieved 58.2% accuracy using a Support Vector Machine (SVM) classifier. Shalini et al. proposed a distributed representation for extracting sentiments from different code-mixed texts such as Kannada-English, Hindi-English and Bengali-English [19]. The authors applied a variety of techniques, such as SVM, FastText, Bi-directional Long Short-Term Memory (Bi-LSTM) and Convolutional Neural Network (CNN), to Sentimental Analysis on Indian Languages (SAIL). CNNs applied with different filter sizes achieved an accuracy of 71.5% on the Kannada-English dataset, whereas Bi-LSTM produced an accuracy of 60.2% on the Hindi-English dataset and 72.2% on the Bengali-English dataset. Choudhary et al. proposed a novel approach for Sentimental Analysis of Code-Mixed Text (SACMT) that outperformed state-of-the-art approaches by 7.6% in accuracy and 10.1% in F-score [20]. Mishra et al. proposed different Machine Learning and Deep Learning approaches for sentiment analysis of Indian languages [21]. The authors reported an F1-score of 69% on the Bengali-English dataset and 58% on the Hindi-English dataset. Building on the aforementioned methods, we introduce Levenshtein distance to preprocess code-mixed social text data.

Another problem addressed in this paper is imbalanced class distribution. To overcome the class imbalance problem in code-mixed social text, we review approaches to class imbalance in the following section.
Class imbalance is a major problem in classification tasks, leading to minority samples being wrongly classified [22]. Many machine learning approaches have been used to solve the class imbalance problem in the past decade. Haixiang et al. addressed state-of-the-art techniques for solving the imbalance problem using two predominant methods [23]. The first method makes changes in the preprocessing stage and applies cost-sensitive learning methods to solve the class imbalance problem. The second method makes minor changes to existing machine learning algorithms or builds a hybrid model to solve the class imbalance issue. Li et al. proposed a minor change to the oversampling technique to solve the imbalance problem in sentiment analysis [24]. The novel oversampling technique produced an improved F1-score of 81.5% compared to state-of-the-art approaches. Liu et al. introduced a new smoothing technique named Random Over Sampling Expected Smoothing (ROSE) to handle imbalanced prior probability values [25]. The authors showed that the ROSE technique is more powerful than the other smoothing approaches. Lu et al. proposed a complexity measure called the Individual Bayes Imbalance Impact Index (IBI3) to check whether it is worth applying sampling methods to a dataset [26].

In the proposed work, after the preprocessing is complete, vector space resampling methods are applied to the dataset to find the best F1-score value. In particular, the oversampling method is considered for solving the class imbalance problem in code-mixed social text data. A variety of hybrid machine learning techniques have been proposed to solve the class imbalance issue, such as ROSE, oversampling methods and vector space resampling techniques. To the best of our knowledge, Levenshtein distance has not so far been used as a preprocessing technique to resolve spelling variations in Tamil-English code-mixed data.
In addition, resampling techniques have not previously been attempted on code-mixed data to solve the class imbalance problem. The proposed work uses Levenshtein distance and resampling techniques to explore Tamil-English code-mixed data and then analyzes the sentiments present in it using various machine learning approaches.

Chakravarthi et al. have created a dataset (Tamil-English) to extract sentiments from code-mixed social text data. The authors created bilingual datasets for Indian languages, namely Tamil-English and Malayalam-English [27, 28]. The dataset was scraped from YouTube comments using the YouTube Comment Scraper tool. The proposed work mainly focuses on the Tamil-English dataset and extracts sentiment from it. Table 1 describes the dataset used in the proposed approach. The dataset consists of 15,744 sentences and is divided into three parts: 11,335 sentences are used as training data, 1260 sentences as the validation set and 3149 sentences as test data.

Data preprocessing is an important step that helps to enhance the data and extract meaningful insights from it. Data preprocessing (also called data cleaning) helps to remove the errors and inconsistencies present in the data [29]. Inconsistencies in the data can produce illogical or missing information that affects accuracy. The following actions are taken in the preprocessing stage:
(1) Sentences are divided into tokens and special characters are removed.
(2) The characters are converted to a single case, either uppercase or lowercase.
(3) Stop word removal usually forms one of the most important steps in preprocessing. Since stop word removal was found to be a difficult procedure, we replaced it by removing words that are less than 2 characters long.

Fig. 1 Architecture of proposed methodology

The dataset is first divided into Tamil-English script and English (monolingual) script. Sentences that are not in Tamil-English script are eliminated in the initial step. The sentence classification algorithm is used to separate code-mixed (β) and non-code-mixed (α) sentences. First, a sentence (S) is given as input to the algorithm and is divided into words $w_1, w_2, \ldots, w_n$. Each word in the sentence is checked with a language detector to determine whether it is English. A sentence containing non-English words is counted as a code-mixed sentence. After applying the sentence classification algorithm to the dataset, 541 sentences were identified as English script (α) and removed. This is one of the reasons for the increased F1-score.

On social media, people tend to use many variants of a single word to express varied emotion. For example, people type "wowww" for the word "wow" to express an increased degree of emotion. These word variations need to be normalized prior to feature extraction, which is essential for capturing the sentiments more precisely. Recurrent characters are removed by comparing the word with its base form. Similarly, recurrent words are also removed. For example, "Thala vera vera vera level panitaru" means that the person has performed extremely well. The word "vera" is repeated to express a greater degree of applause. Since this paper aims to detect positive emotion rather than to classify varying degrees of positive emotion, these repeated words are removed as part of the normalization procedure.

The term Levenshtein distance, also referred to as edit distance, was coined in 1965 and is used to measure the distance between words [30]. Levenshtein distance is the minimum number of single-character edits required to normalize a word to its base word [30].
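The preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the language detector is replaced by a lookup in a hypothetical English word set (`english_vocab`), recurrent characters are collapsed with a simple "three or more in a row" rule, and the lexicon passed to `normalize` is a hypothetical list of base words. All function names are introduced here for illustration only.

```python
import re


def is_code_mixed(sentence, english_vocab):
    """True if any token is absent from the (hypothetical) English word set."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return any(tok not in english_vocab for tok in tokens)


def collapse_repeated_chars(word):
    """Shrink runs of three or more identical characters to a single one."""
    return re.sub(r"(.)\1{2,}", r"\1", word)


def drop_repeated_words(tokens):
    """Remove consecutive duplicate tokens, keeping the first occurrence."""
    out = []
    for tok in tokens:
        if not out or out[-1] != tok:
            out.append(tok)
    return out


def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        curr = [i]
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            curr.append(min(prev[j] + 1,           # delete from source
                            curr[j - 1] + 1,       # insert into source
                            prev[j - 1] + cost))   # identity or substitution
        prev = curr
    return prev[-1]


def normalize(word, lexicon, max_distance=2):
    """Map a word to its closest base word if within max_distance, else keep it."""
    word = collapse_repeated_chars(word.lower())
    if not lexicon:
        return word
    best = min(lexicon, key=lambda base: levenshtein(word, base))
    return best if levenshtein(word, best) <= max_distance else word


# Illustrative usage with a made-up lexicon of base words.
lexicon = ["super", "thalaivar", "wow"]
tokens = drop_repeated_words("talaivar vera vera vera level suuuuupar".split())
print([normalize(tok, lexicon) for tok in tokens])
```

The `max_distance=2` default mirrors the edit-distance limit of two used in this paper; words with no base form within that limit are left unchanged rather than forced onto an unrelated entry.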
Levenshtein distance has four operations, namely identity, insertion, substitution and deletion. The distance can be calculated using the standard recurrence in Eq. 1:

$$d(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0 \\ \min\{\, d(i-1, j) + 1,\; d(i, j-1) + 1,\; d(i-1, j-1) + c(i, j) \,\} & \text{otherwise} \end{cases} \tag{1}$$

In Eq. 1, $d(i, j)$ is the distance between the first $i$ characters of string $S_1$ and the first $j$ characters of string $S_2$, and $c(i, j)$ is 0 if $S_1[i] = S_2[j]$ and 1 otherwise. For example, let $S_1$ = Talaivar and $S_2$ = Thalaivar be two strings, where $S_1$ is the source string and $S_2$ is the target string (base word), written as $S_1[1 \ldots m]$ and $S_2[1 \ldots n]$ with $m$ and $n$ the lengths of $S_1$ and $S_2$. When each character in $S_1$ is compared with the target string $S_2$, the character 'h' has to be inserted into $S_1$ to match $S_2$. The minimum edit distance required to normalize the source string to the target string is therefore 1. In this paper, we set the maximum edit distance to two because it was observed that, when the edit distance is greater than 2, the word is transformed into a different word that does not match the meaning of the base word. These normalization procedures helped reduce the computational complexity and also resulted in an increased F-score.

Resampling techniques are very important for solving the class imbalance problem. Resampling can be done in different ways, namely the randomized exact test (also called the permutation test), cross-validation, jackknife and bootstrapping [32]. Furthermore, hybrid techniques can be generated with the help of bootstrapping techniques. Oversampling and under-sampling techniques are very helpful for handling imbalanced class label distributions in multi-class classification [33]. This paper also addresses the problem of imbalanced class label distribution in a multi-class classification task for code-mixed data. The problem with under-sampling is that the number of sentences in each majority class is reduced to the number of sentences in the minority class. In this training data, the minority class, named "Unknown state", consists of 368 sentences.
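The effect of the two resampling directions can be illustrated with a short sketch. It is a simplified illustration, not the procedure used in the experiments: both functions pick sentences at random, and the over-sampler only duplicates existing sentences rather than generating synthetic ones. The function names and the `seed` parameter are assumptions made for this example. As a worked check of the arithmetic, if the corpus has five sentiment classes (an assumption here) and the smallest holds 368 sentences, under-sampling leaves 5 × 368 = 1840 training sentences.

```python
import random
from collections import Counter, defaultdict


def _group_by_label(sentences, labels):
    """Collect sentences into one list per class label."""
    groups = defaultdict(list)
    for sentence, label in zip(sentences, labels):
        groups[label].append(sentence)
    return groups


def random_under_sample(sentences, labels, seed=0):
    """Keep only as many sentences per class as the smallest class contains."""
    rng = random.Random(seed)
    groups = _group_by_label(sentences, labels)
    target = min(len(items) for items in groups.values())
    sampled = []
    for label, items in groups.items():
        sampled.extend((s, label) for s in rng.sample(items, target))
    rng.shuffle(sampled)
    return [s for s, _ in sampled], [lab for _, lab in sampled]


def random_over_sample(sentences, labels, seed=0):
    """Duplicate random sentences until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(sentences, labels)
    target = max(len(items) for items in groups.values())
    sampled = []
    for label, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        sampled.extend((s, label) for s in items + extra)
    rng.shuffle(sampled)
    return [s for s, _ in sampled], [lab for _, lab in sampled]


# Toy data: 5 positive, 2 negative and 1 unknown-state sentence.
sentences = [f"sentence {i}" for i in range(8)]
labels = ["positive"] * 5 + ["negative"] * 2 + ["unknown_state"]
print(Counter(random_under_sample(sentences, labels)[1]))
print(Counter(random_over_sample(sentences, labels)[1]))
```

Under-sampling discards most of the majority-class sentences, while over-sampling keeps every original sentence at the cost of repeated minority examples.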
After applying the under-sampling technique, 1840 of the 11,335 sentences are used as training data. The ratio of sentences selected for training to the overall sentences is thus reduced to approximately 1:6, so under-sampling shrinks the dataset. Oversampling is the opposite of under-sampling: it increases the size of the training dataset by duplicating the minority class data. Oversampling could be the best option for balancing data that suffers from an imbalanced class label distribution [33]. This paper considers two oversampling techniques, namely the Synthetic Minority Over-sampling Technique (SMOTE) and Adaptive Synthetic sampling (ADASYN).

Term Frequency-Inverse Document Frequency (Tf-Idf) is a method for converting text into vector form. Tf-Idf was proposed by Jones in 1972 and is useful for information retrieval and text classification [31]. Tf-Idf is a statistical measure that assigns a weight to each word in the corpus. The term frequency (TF) is defined as the number of times a word occurs in a document, whereas the Inverse Document Frequency (IDF) increases the weight of rarely occurring words and decreases the weight of frequently occurring, unimportant words. Feature extraction for sentiment analysis can also be done using Bag of Words (BoW), Word2Vec or Global Vectors for Word Representation (GloVe). BoW gives higher weight to frequently occurring words in the document. Word2Vec and GloVe are pretrained models that convert text into vectors. Since the data is code-mixed, no pretrained word embeddings are available for it. Also, code-mixed data usually contains words that have low frequency but need to be given weight, and hence TF-IDF was chosen for feature extraction.

After the preprocessing is complete, the next task is to apply ML algorithms to classify the sentiments of the code-mixed data.
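To make the Tf-Idf weighting concrete, the following sketch computes it for a handful of tokenized sentences. It uses the textbook form, term count multiplied by log(N / document frequency), which is an assumption of this example; library implementations often apply smoothed or normalized variants, so their numbers can differ. The toy sentences are made up for illustration.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per tokenized document (tf * log(N / df))."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Made-up code-mixed toy sentences, already tokenized.
docs = [
    "padam semma super".split(),
    "padam mokka".split(),
    "trailer super super".split(),
]
for vector in tfidf(docs):
    print(vector)
```

A term that occurs in every document receives a weight of zero under this form, while a term confined to a single document receives the largest weight per occurrence, which matches the motivation for preferring Tf-Idf over raw counts on code-mixed text.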
There are at least two problems that can occur when applying ML algorithms to any data. The first is to find a relevant feature extraction technique and the second is to find a suitable classification algorithm. It was observed that, when the most suitable feature extraction technique is chosen, any classification algorithm was able to classify the sentiments in the code-mixed data. The proposed work was analyzed using various classification algorithms, namely a probabilistic model (Naïve Bayes classifier (NB)), a linear classifier (Support Vector Machine classifier (SVM)), decision-based models (Random Forest classifier and XGBoost classifier) and a statistical model (Logistic Regression).

This section describes the evaluation metrics and the results obtained for the dataset. Precision, recall, F1-score and accuracy are commonly used evaluation measures for any classification problem, but when machine learning algorithms are applied to an imbalanced dataset, an appropriate evaluation metric has to be chosen. Various evaluation metrics were analyzed, such as accuracy, macro-average F1-score and micro-average F1-score. Since accuracy and micro-average F1-score are not preferred metrics for evaluating class-imbalanced data, the macro-average F1-score is chosen along with recall and precision as evaluation metrics. The evaluation metrics are defined as follows. Recall is the ratio of correctly predicted sentences to all observations in the actual class. Precision is the ratio of correctly predicted sentences to the total predicted sentences. The F1-score is the harmonic mean of precision and recall:

$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

In order to emphasize the importance of identifying the code-mixed data while analyzing the sentiments, we identified sentiments with and without separating the code-mixed data. Section 4.1 discusses the results obtained without classifying the code-mixed and non-code-mixed data.
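The reason for preferring the macro-average F1-score on imbalanced data can be seen in a small sketch. It computes precision, recall and F1 per class in a one-vs-rest fashion and averages the per-class F1 values with equal weight. The function names and the toy labels are assumptions made for this illustration; undefined ratios (no predictions or no true instances of a class) are set to 0 here by convention.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed one-vs-rest per label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of sentences whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy imbalanced test set and a classifier that always predicts "positive".
y_true = ["positive"] * 8 + ["negative", "unknown_state"]
y_pred = ["positive"] * 10
print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```

In this toy case the accuracy looks high because the majority class dominates, whereas the macro-average F1-score stays low since two of the three classes are never predicted. For single-label multi-class data the micro-average F1-score coincides with accuracy, so it shares the same weakness.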
It was observed that performing sentiment analysis after classifying the code-mixed and non-code-mixed data and removing the latter resulted in a better F1-score. The next section describes the results obtained without classifying code-mixed and non-code-mixed data.

This dataset consists of 11,335 training sentences and 3149 test sentences. Two resampling techniques, SMOTE and ADASYN, were used and their performance was compared. Table 2 shows the results obtained using the SMOTE sampling technique and Table 3 shows the results obtained using the ADASYN technique. It was observed that these resampling techniques improved the F1-score by 50%. It can be observed from Tables 2 and 3 that Logistic Regression performs better than the other techniques, whereas Naïve Bayes fails to predict any class except positive. It can also be observed that, even after applying sampling techniques, the classification algorithms remain biased towards the positive class. This class imbalance is due to the non-separation of code-mixed and non-code-mixed data. The next section describes the influence of identifying the sentiments after classifying the code-mixed and non-code-mixed data.

Out of 15,744 sentences, 541 are non-code-mixed sentences, which are identified and removed before preprocessing the data. This resulted in a better distribution of sentiments across all categories, overcoming the hurdles observed in Sect. 4.1. This can also be seen in Tables 4 and 5, where an increased F1-score was obtained for each class, which also resulted in an improved average F1-score computed over all classes. The average F1-score increased by 2% after removing the non-code-mixed data. The code-mixed sentences in this dataset also contained other Indian languages such as Hindi and Telugu.
After removing the non-code-mixed data, the sampling techniques were observed to perform better even for the "Not_Tamil" category compared to the previous results in Tables 2 and 3, due to the use of the SMOTE and ADASYN resampling techniques. However, it was observed that the class imbalance still existed despite all these efforts.

So far, all the experiments were done excluding the Levenshtein distance preprocessing step. It was observed that Levenshtein distance had a major role in identifying the code-mixed data. In order to highlight the contribution of Levenshtein distance compared to the other preprocessing techniques, a separate analysis was done using Levenshtein distance, the results of which are shown in Tables 6 and 7. As discussed in Sect. 3.2.4, Levenshtein distance helps to minimize spelling errors using the minimum edit distance. In code-mixed data, spelling variation is a major problem, but it can be alleviated using Levenshtein distance. Tables 6 and 7 present the results after applying Levenshtein distance. The overall performance is almost similar to the previous results shown in Table 6. Since the code-mixed data also contained Hindi, Telugu and Malayalam words, many code-mixed words could not be identified. For example, in "PINK FILM COPY HAI", "Hai" is confused with the English "Hai". Since the corpus was built by extracting YouTube comments, there are many inconsistencies such as spelling mistakes and words code-mixed from many languages. Figure 2 depicts the performance of each relation using various ML approaches.

Levenshtein distance has overcome the spelling mistakes present in the data, and the many strong features obtained for each class have alleviated the class imbalance problem to an extent. Still, the class imbalance problem was not completely tackled, as features contributing to one class sometimes pull down another class. For instance, consider "semaya irukumnu ethirpaarthen" (I thought it would be so good).
This sentence can be pulled towards either the positive or the negative class due to the ambiguous multi-word context. Furthermore, it was observed that a single ML algorithm performed well for a particular class but not across all the classes. This was the main reason for trying many ML classifiers to check their multi-class handling capability. Out of all the ML techniques, Logistic Regression showed better performance than the rest of the classifiers.

The proposed work has attempted to classify sentiments in code-mixed data that contains mainly Tamil and also Hindi, Telugu and Malayalam words. The proposed work has yielded many observations on processing code-mixed data at various levels, namely preprocessing, feature extraction and classification. This paper has proposed Levenshtein distance as a preprocessing technique for Tamil-English code-mixed data. This improved the results, as it worked well in identifying the spelling variations that persist in code-mixed data written on social media. The experiments with the proposed approach also revealed that the class imbalance problem can be alleviated by removing the non-code-mixed data. The influence of resampling techniques such as SMOTE and ADASYN was also discussed. The combination of Levenshtein distance with sampling techniques helped to increase the F1-score, but a gap remains in the class imbalance problem. Future work can aim at improving the F1-score by finding stronger feature extraction techniques or hybrid approaches that can help solve the class imbalance problem in code-mixed social text data.
Sentiment and emotion classification over noisy labels
Detection of hate speech text in Hindi-English codemixed data
Unsupervised sentiment analysis for code-mixed data
10 Challenging problems in data mining research
Sentiment analysis on imbalanced airline data
Using class imbalance learning for software defect prediction
Sentiment analysis: capturing favorability using natural language processing
Sentiment analysis and opinion mining
Deriving the pricing power of product features by mining consumer reviews
Machine learning based energy management at Internet of Things network nodes
Online Consumer Review: Word-of-mouth as a new element of marketing communication mix
Opinion mining using econometrics: a case study on reputation systems
The effect of on-line consumer reviews on consumer purchasing intention: the moderating role of involvement
Emotion detection in Hinglish (Hindi+English) code-mixed social media text
Sentiment analysis on medical text using combination of machine learning and SO-CAL scoring
Comparing and combining sentiment analysis methods
Artificial immune systems-based classification model for code-mixed social media data
Machine learning based resourceful clustering with load optimization for wireless sensor networks
Sentiment analysis for code-mixed Indian social media text with distributed representation
Sentiment analysis of code-mixed languages leveraging resource rich languages
Code-mixed sentiment analysis using machine learning and neural network approaches
An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics
Learning from class-imbalanced data: review of methods and applications
Imbalanced text sentiment classification using universal and domain-specific knowledge
Performance evaluation of routing algorithm for Manet based on the machine learning techniques
Bayes imbalance impact index: a measure of class imbalanced dataset for classification problem
Corpus creation for sentiment analysis in code-mixed Tamil-English text
A sentiment analysis dataset for code-mixed Malayalam-English
Sentiment analysis of extremism in social media from textual information
Binary codes capable of correcting deletions, insertions and reversals
A statistical interpretation of term specificity in retrieval
Resampling methods: concepts, applications, and justification
A bootstrap-based iterative selection for ensemble generation