title: The Unreasonable Effectiveness of Machine Learning in Moldavian versus Romanian Dialect Identification
authors: Guaman, Mihaela; Ionescu, Radu Tudor
date: 2020-07-30

Motivated by the seemingly high accuracy levels of machine learning models in Moldavian versus Romanian dialect identification and the increasing research interest on this topic, we provide a follow-up on the Moldavian versus Romanian Cross-Dialect Topic Identification (MRC) shared task of the VarDial 2019 Evaluation Campaign. The shared task included two sub-task types: one that consisted of discriminating between the Moldavian and Romanian dialects and one that consisted of classifying documents by topic across the two dialects of Romanian. Participants achieved impressive scores, e.g. the top model for Moldavian versus Romanian dialect identification obtained a macro F1 score of 0.895. We conduct a subjective evaluation by human annotators, showing that humans attain much lower accuracy rates compared to machine learning (ML) models. Hence, it remains unclear why the methods proposed by participants attain such high accuracy rates. Our goal is to understand (i) why the proposed methods work so well (by visualizing the discriminative features) and (ii) to what extent these methods can keep their high accuracy levels, e.g. when we shorten the text samples to single sentences or when we use tweets at inference time. A secondary goal of our work is to propose an improved ML model using ensemble learning. Our experiments show that ML models can accurately identify the dialects, even at the sentence level and across different domains (news articles versus tweets). We also analyze the most discriminative features of the best performing models, providing some explanations behind the decisions taken by these models. Interestingly, we learn new dialectal patterns previously unknown to us or to our human annotators. Furthermore, we conduct experiments showing that the machine learning performance on the MRC shared task can be improved through an ensemble based on stacking.
In recent years, we have witnessed an increasing interest in spoken or written dialect identification, proven by a high number of evaluation campaigns [1, 2, 3, 4, 5, 6, 7, 8] with more and more participants. In this paper, we explore the Moldavian versus Romanian Cross-Dialect Topic Identification (MRC) shared task, which was introduced in the VarDial 2019 evaluation campaign [6], following the release of the MOROCO data set [9]. The shared task included two sub-task types: one that consisted of discriminating between the Moldavian and the Romanian sub-dialects and one that consisted of classifying documents by topic across the two sub-dialects of Romanian. However, our primary focus is on the Moldavian versus Romanian dialect identification task, which was further explored in the Romanian Dialect Identification (RDI) shared tasks held at VarDial 2020 [7] and VarDial 2021 [8]. While MOROCO is a relatively recent data set, the number of works studying Romanian dialect identification from a computational perspective [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22] has grown constantly due to the organization of annual evaluation campaigns on the topic.
Romanian, the language spoken in Romania, belongs to a Balkan-Romance group that emerged in the fifth century [23], after it separated from the Western Romance branch of languages. The Balkan-Romance group is formed of four dialects: Aromanian, Daco-Romanian, Istro-Romanian, and Megleno-Romanian. We underline that, within its group, Romanian is referred to as Daco-Romanian. Noting that Moldavian is a sub-dialect of Daco-Romanian, spoken in the Republic of Moldova and in northeastern Romania, the Moldavian versus Romanian dialect identification task is actually a sub-dialect identification task. The Moldavian sub-dialect can be delimited from Romanian largely by its phonetic features and only marginally by morphological and lexical features [24]. Hence, it is much easier to distinguish between the spoken Moldavian and Romanian dialects than between their written forms. This is a first hint that discriminating between Moldavian and Romanian is not an easy task, at least from a human point of view. It is important to add that Romania and the Republic of Moldova have the same literary standard [25]. In this context, some linguists [24] believe that a dialectal division between the two countries is not justified. However, the literary standards in the two countries are continuously evolving and, for example, this has led to different spellings of words containing the vowel sound 'â' (in Romanian) or 'î' (in Moldavian) (see Table 8). Moreover, due to the geographical division between the two countries, people often use different words to denote the same concept (see Section 5), and they may not understand each other when the discussion involves the respective concept. These differences justify a sub-dialectal division between Romanian and Moldavian. Although we often refer to Romanian and Moldavian as dialects to simplify the writing, they are, strictly speaking, sub-dialects. Hence, we study the challenging Moldavian versus Romanian written sub-dialect identification task, since the data set available for the experiments is composed of written news articles [9]. We naturally assume that the news articles follow the literary standards. Furthermore, named entities are masked in the entire corpus. Considering all these facts, the dialect identification task should be very difficult. We analyze the difficulty of the task from a human perspective by asking human annotators from Romania and the Republic of Moldova to label news articles with the corresponding dialect. Given that the average accuracy of the human annotators is around 53%, the human evaluation confirms the difficulty of the task. Interestingly, the machine learning (ML) methods proposed so far [9, 10, 11, 12, 13, 15, 18, 22] attain much higher accuracy rates. For example, the top scoring system in the VarDial 2019 evaluation campaign [12] obtained a macro F1 score of 0.895 for Moldavian versus Romanian dialect identification. Furthermore, the string kernels baseline proposed by Butnaru and Ionescu [9] seems to perform even better, with a macro F1 score of 0.941. We therefore consider the machine learning systems for Moldavian versus Romanian dialect identification to be unreasonably effective. We can naturally suppose that the high accuracy rates of the ML systems are influenced by several factors.
The first factor to consider is that the ML models have access to a large training set from which many discriminative features can be learned, including features unrelated to the dialect identification task, such as features specific to the author style. The second factor is that the samples are full-length news articles formed of several sentences. This increases the chance of finding discriminative features in just about every sample. The third factor is that the news articles are collected from different publication sources from Romania and the Republic of Moldova, and an ML system could simply learn to discriminate among the publication sources. Our main goal is to determine if the machine learning models catch any dialectal clues or if the high accuracy rates are purely based on alternative factors, such as those exemplified above. In order to explain the unreasonable effectiveness of machine learning systems, we conduct a series of comprehensive experiments on MOROCO, considering all the enumerated factors. First, we perform experiments considering only the first sentence in each news article, significantly reducing the length of the text samples. Second, we test the systems on a new set of tweets from Romania and the Republic of Moldova collected from a different time period, making sure that the publication sources in the training and the test set are different. This generates a cross-domain (or cross-genre) dialect identification task, with the training (source) domain being represented by news articles and the test (target) domain being represented by tweets. Our findings indicate that, even in this difficult cross-domain setting, the ML systems still outperform humans by a significant margin. We therefore delve into analyzing and visualizing the discriminative features of one of the best-performing ML systems. Our analysis indicates that the machine learning models take their decisions mostly based on morphological and lexical features, many of which were previously unknown to us. Our second goal is to establish if further performance boosts are possible by combining highly accurate models in a single pipeline. To this end, we first reimplement and evaluate most of the top scoring methods from the related literature [9, 11, 12, 13, 15, 18, 22]. Then, we proceed to combine the state-of-the-art methods through ensemble learning, considering an ensemble based on plurality voting and an ensemble based on classifier stacking. Our empirical results show that classifier stacking is useful, indicating that the features captured by the various machine learning models, ranging from string kernels to convolutional, recurrent and transformer networks, can complement each other towards making better decisions. Different from prior works on Romanian dialect identification [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], we make the following important contributions:
• We introduce MOROCO-Tweets, a new set of over 5,000 Moldavian and Romanian tweets, enabling us and future works to study Romanian dialect identification in a cross-genre scenario.
• We study the Romanian dialect identification task in new scenarios, considering models trained on sentences (instead of full news articles) and applied on sentences or tweets, showing how performance degrades as the scenario gets more difficult.
• We propose an ensemble based on stacked generalization for Romanian dialect identification.
• We study how native Romanian or Moldavian speakers compare to the ML models for dialect identification and categorization by topic, showing that there is a significant performance gap in favor of the ML models for dialect identification.
• We present Grad-CAM visualizations [26] revealing dialectal patterns that explain the unreasonable effectiveness of the ML models. The newly discovered patterns were not known to us or to the human annotators.
The remainder of this paper is organized as follows. We present related work on dialect identification in Section 2. We describe the machine learning systems and the ensemble learning methods in Section 3. We present the experiments in Section 4, followed by a discussion of the most discriminative features in Section 5. Finally, we draw our conclusions in Section 6. Dialect identification has been acknowledged in the computational linguistics community as an important task, with multiple events and shared tasks materializing this acknowledgement [2, 4, 5, 6, 7, 27, 28]. Naturally, some of the most widespread languages also tend to be the most well-studied in terms of dialect identification from a computational linguistics perspective. To our knowledge, Arabic is one of the most studied languages, considering modern setups, such as social media [29], large and diverse corpora, such as QADI [30], dialect recognition from speech [31, 32] or dialect identification from travel text and tweets [33]. Preliminary works dealing with Arabic dialect identification used various handcrafted and linguistic features. For instance, Biadsy et al. [34] employed a phonotactic approach to differentiate among four Arabic dialects with good accuracy. In the same direction of study, we can also mention the efforts involving experiments on the Arabic Online Commentary Dataset [35, 36]. More recently, Guellil and Azouaou [37] proposed an unsupervised approach for Algerian dialect identification. Another interesting study is conducted by Salameh et al. [38], where the city of each speaker is identified based on the spoken dialect. The evaluation campaigns [1, 2, 4, 5, 39] represent one more proof that dialect identification is of much interest from the Arabic language perspective, as these campaigns included a shared task for Arabic dialect identification. We note that one of the most successful approaches in the Arabic dialect identification shared tasks is based on string kernels [40, 41, 42]. Among the well-studied languages from a dialectal perspective, there is also Chinese. Tsai and Chang [43] proposed a Gaussian Mixture Bigram Model to differentiate among three major Chinese dialects spoken in Taiwan. Later, Ma et al. [44] attempted to distinguish among three different Chinese dialects from speech. A semi-supervised approach, outperforming the initial Gaussian Mixture Models (GMM) for dialect identification, is introduced by Mingliang et al. [45]. In [46], gender is employed as a factor in deciding the dialect of different Chinese utterances. A more recent work [47] employed deep bottleneck features, which are related to the phoneme level. Through deep bottleneck features, the author attempts to suppress the influence of information that is redundant for dialect identification. A number of works targeting dialect identification were also published for Spanish. The first such work [48] aims to differentiate the Cuban and Peruvian dialects of Spanish. The same task is addressed later by Torres-Carrasquillo et al.
[49], with an approach based on GMMs, which is, however, less accurate than that of Zissman et al. [48]. In Huang and Hansen [50], GMMs with mixture and frame selection are used for Latin-American Spanish dialect identification. More recently, Francom et al. [51] introduced the ACTIV-ES corpus, with informal language records of Spanish speakers from Argentina, Mexico and Spain. MOROCO [9], the data set on which the current study is based, comes as a response to the increasing interest in dialect identification, with many research efforts for languages such as Arabic [35, 52, 53], Spanish [51], Indian [54] and Swiss [55], trying to attract interest towards under-studied languages such as Romanian. The classification of Romanian into four dialects, i.e. Daco-Romanian, Istro-Romanian, Aromanian and Megleno-Romanian, has been studied from a purely linguistic perspective for a few decades [56, 57, 58]. In a modern linguistic work [59] that studied Romanian and its dialects, the authors addressed the subject from a geographical, historical and etymological angle. In another modern study, Nisioi [60] introduced a quantitative approach in the investigation of the syllabic structure of the Aromanian dialect, proposing a rule-based algorithm for automatic syllabification. The aforementioned works are valuable studies performed from a social sciences perspective. However, we are interested in the computational nature of differentiating among Romanian and its dialects or sub-dialects. In this regard, to our knowledge, there is a single work [61] studying Romanian dialects from a computational linguistics perspective before the VarDial 2019 evaluation campaign [6]. Ciobanu and Dinu [61] offer a comparative analysis of the phonetic and orthographic discrepancies between various Romanian dialects. However, the data set used in their endeavour to automatically differentiate among the aforementioned dialects is rather small, containing only 108 words. Butnaru and Ionescu [9] introduced MOROCO, a data set of 33,564 online news reports collected from Romania and the Republic of Moldova. For each news article, the data set provides dialect labels as well as category labels. The authors applied two effective approaches in tackling the problems of dialect identification and categorization by topic: a character-level convolutional neural network, inspired by Zhang et al. [62], and a simple Kernel Ridge Regression with custom string kernels, following Popescu and Ionescu [63]. We note that the data set proposed by Butnaru and Ionescu [9] was also used as a benchmark in the first shared task on Moldavian versus Romanian Cross-Dialect Topic Identification (MRC), generating an additional set of publications [10, 11, 12, 13]. In the MRC shared task, the following sub-tasks were proposed: binary classification by dialect and cross-dialect categorization by topic. The participants proposed various approaches for the MRC shared task, ranging from deep learning models based on word embeddings [11] or character embeddings [12] to shallow Support Vector Machines based on character n-grams [13] and voting schemes based on a set of handcrafted statistical features [10]. The set of tweets collected for this work was used as the test set in the Romanian Dialect Identification (RDI) task organized at VarDial 2020 [7].
The original training and validation sets in MOROCO [9] were used as the training set, whereas for validation, we provided both the test set from the original MOROCO split and 200 tweets from MOROCO-Tweets. Our logic was that the provided in-domain data would help participants achieve better performance in the final evaluation round. Among the interesting submissions received for RDI 2020, we acknowledge the SVM model of Popa and Ștefănescu [15], which combines the powers of three multilingual transformer models trained on Romanian text samples (i.e. BERT, XLM and XLM-R) and two monolingual models (cased and uncased Romanian BERT) that targeted only Romanian during training. Furthermore, the authors used each sentence in the examples provided for training as a standalone sample, and they also employed decision thresholds at prediction time, aiming to maximize the macro F1 score. The architectural choices and preprocessing placed Popa and Ștefănescu [15] second in the RDI track organized at VarDial 2020. Another ensemble that participated in the 2020 RDI task was composed of two TF-IDF encoders and a five-layer neural network [16]. With separate encoders for Romanian and Moldavian, the text peculiarities in each of the two dialects are independently captured. The results, however, show a rather poor ability of the chosen model to discriminate between Romanian and Moldavian sentences. One explanation for the generalization issues of the model is the lack of strong textual markers to differentiate, in writing, between the two dialects of interest. In the same edition of the RDI task, Jauhiainen et al. [17] proposed a method that relies on the product of relative frequencies of character n-grams. Their approach achieves an F1 score that is 10% higher than the one obtained by Rebeja and Cristea [16] and more than 10% below the F1 score obtained by Popa and Ștefănescu [15]. A different approach is proposed by Zaharia et al. [18], who employ features ranging from character embeddings to FastText word embeddings [64] and transformer embeddings obtained through the fine-tuning of Romanian BERT [65]. All these complementary types of embeddings are then fed into a Bidirectional LSTM network suited for the classification by dialect of the provided samples. Perhaps surprisingly, the Naïve Bayes model trained on character n-grams presented by Ceolin and Zhang [19] achieves a macro F1 score of 0.667, surpassing most of the previously described solutions that rely, to a certain degree, on deep learning or at least on more complex techniques. At VarDial 2021 [8], the RDI task was reiterated for the third time, with more training data consisting of the entire MOROCO data set [9]. The set of tweets collected for the current work was provided as validation data, while for the final testing, we collected a new set of tweets. Our intuition was that providing participants with more tweets for validation, which could also be used for training, would lead to an important performance boost. However, compared to the overall results obtained at the RDI task of VarDial 2020, the results did not improve by a significant margin. The solution submitted at RDI 2021 by Jauhiainen et al. [20] achieved the best performance, with a macro F1 score of 0.777, which did not fall far from the top scoring systems at RDI 2020. Their approach employed a Naïve Bayes model trained using the product of relative frequencies of character n-grams and the language model adaptation method of Jauhiainen et al. [66].
Using an approach based on transformers and knowledge distillation, Zaharia et al. [22] ranked second at RDI 2021, with an F1 score of 0.732. Interestingly, Ceolin [21] ranked third in the competition, bringing some improvements to the previously proposed CNN architecture by using a data augmentation technique consisting of random swaps of the words in each sentence. In our study, we consider the best performing models in the MRC and RDI shared tasks [11, 12, 13, 15, 18, 22] along with the baselines proposed by Butnaru and Ionescu [9], combining these approaches into ensemble models based on voting or stacking. Aside from dialect identification, we also perform intra-dialect and cross-dialect categorization by topic throughout this work. Thus, we consider it appropriate to include related work on topic classification. Text classification is the task of labeling natural language text as pertaining to a predefined number of categories [67, 68]. As one of the most fundamental tasks in NLP, text classification has been widely studied [69]. Examples of setups and applications include (but are not limited to) social media [70], healthcare [71, 72, 73], information retrieval [74], sentiment analysis [75, 76, 77, 78, 79], content-based recommender systems [80], document summarization [81, 82], various business and marketing applications [83, 84, 85], and legal document categorization [86]. A variety of languages were targeted over time for the popular text classification task, including well-studied languages such as Arabic [87, 88], Turkish [83, 89, 90, 91], French [71, 92], Spanish [72] and Indian [93], as well as under-resourced languages such as Romanian [94]. The applied classification techniques range from shallow methods, such as Logistic Regression [95], SVM [96] and Naïve Bayes [97], to more complex and resource-hungry deep neural networks, such as CNNs [62, 98], Hierarchical Attention Networks [99] and the powerful transformer-based methods that started to dominate the landscape in recent years [100, 101]. For the resource-rich English language, the research community had the means to explore various topic classification techniques, from shallow methods, such as k-nearest neighbors, Multinomial Naïve Bayes and decision trees [102], to deep forests [103] and Bayesian networks [104]. Non-English languages are targeted as well for topic classification. For example, advanced and powerful methods such as Hierarchical Attention Networks [105], transformers [106] or hybrid Latent Dirichlet Allocation approaches [107] are employed in the classification by topic of Chinese text samples. For Spanish, we find a few works focused on topic classification, considering both the linguistic approach [108] and the computational alternatives, e.g. an ensemble of shallow methods [109]. Categorization by topic has even been explored for understudied languages such as Korean [110], Indonesian [111] or Romanian [112], although the number of works is comparatively lower. Remarkably, there are a few recent attempts at language-agnostic methods for topic classification [113, 114]. We emphasize that in-domain and cross-domain topic classification are common topics in the NLP research community [115]. Perhaps less common, at least for the Romanian language, is the setup considered in this work, in which categorization is performed in cross-dialect and intra-dialect scenarios.
To our knowledge, all the other works targeting Romanian cross-dialect and intra-dialect categorization by topic are related to this paper, in that they use the same data set, MOROCO, for training and evaluation [11, 12]. Throughout this section, we present in detail the most relevant models from the related literature [9, 11, 12, 13, 15, 18, 22], which we have selected to build an ensemble. From Butnaru and Ionescu [9], we select the Kernel Ridge Regression based on string kernels, since this is their best baseline. From Tudoreanu [12], the winner of the Moldavian versus Romanian dialect identification sub-task, we select the character-level convolutional neural network (CNN), which is similar in design to the character-level CNN presented by Butnaru and Ionescu [9]. Onose et al. [11] applied three different deep learning models: a Long Short-Term Memory (LSTM) network, a Bidirectional Gated Recurrent Units (BiGRU) network and a Hierarchical Attention Network (HAN). Since these deep models are quite diverse, we included all of them in our study. From Wu et al. [13], we consider the Support Vector Machines based on character n-grams. For efficiency reasons, we employed the dual form of their SVM, which is given by string kernels. Finally, from the more recent approaches proposed at RDI 2020 and 2021 [15, 18, 22], we decided to include a fine-tuned Romanian BERT in our experiments. We did not include other recent models in our ensemble, as they represent minor variations of the previously selected models. We underline that the considered methods form a broad variety that includes both shallow models based on handcrafted features and deep models based on automatically-learned character or word embeddings. Nevertheless, all methods are essentially based on two steps, data representation and learning, although in some models, e.g. the character-level CNN, the steps are performed in an end-to-end fashion. We next provide details about the data representations and the learning models considered in our experiments.
Word Embeddings. Some of the first statistical learning models for building vectorial word representations were introduced in [116, 117]. The goal of vectorial word representations (word embeddings) is to associate similar vectors to semantically related words, allowing us to express semantic relations mathematically in the generated embedding space. After the preliminary work of Bengio et al. [116] and Schütze [117], various improvements have been made to the quality of the embeddings and the training time [118, 119, 120, 121], while some efforts have been directed towards learning multiple representations for polysemous words [122, 123, 124]. These improvements, and many others not mentioned here, have been extensively used in various NLP tasks [125, 126, 127, 128, 129, 130, 131]. In the experiments, we use pre-trained word embeddings as features for the LSTM, BiGRU and HAN models. In the feature extraction step, we employ the same set of distributed word representations as Onose et al. [11]. We note that these representations are learned from Romanian corpora, such as the corpus for contemporary Romanian language (CoRoLa) [132, 133], Common Crawl (CC) and Wikipedia [134], as well as from data coming from the Universal Dependencies project [135], which is added to the Nordic Language Processing Laboratory (NLPL) shared repository. In the remainder of this paper, we refer to these representations as CoRoLa, CC and NLPL, respectively.
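To make the feature extraction step concrete, the following minimal sketch shows one plausible way to load such pre-trained vectors and build an embedding matrix for the word-based models. This is our illustration rather than the code used in the experiments, and the file name cc.ro.300.vec is an assumption, corresponding to the publicly released Romanian fastText vectors trained on Common Crawl and Wikipedia.

```python
# Minimal sketch (not the authors' code): loading pre-trained Romanian word
# vectors and building an embedding matrix for a word-level neural model.
# The file name cc.ro.300.vec is an assumption, matching the fastText vectors
# released at https://fasttext.cc; CoRoLa and NLPL can be loaded analogously.
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("cc.ro.300.vec", binary=False)

def build_embedding_matrix(vocabulary, dim=300):
    """Map each word in the task vocabulary to its pre-trained vector;
    out-of-vocabulary words receive a small random vector."""
    matrix = np.random.normal(scale=0.01, size=(len(vocabulary), dim))
    for index, word in enumerate(vocabulary):
        if word in vectors:
            matrix[index] = vectors[word]
    return matrix
```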
The distributed representation of the words in CoRoLa was learned using a feed-forward neural network, with words being initially represented as sums of the character n-grams [64] in each word [133]. The CoRoLa embeddings used in this work have an embedding size of 300, with a vocabulary size of 250,942 tokens. With the same embedding size, but with a vocabulary that is almost 10 times larger (i.e. 2 million words), CC [134] is the second set of pre-trained Romanian word vectors that we have tried out. As we have previously mentioned, CC has been trained on Common Crawl and Wikipedia, using FastText [64, 136]. The third set of word embeddings used in our experiments comes from the NLPL repository and contains vectors of size 100, with the biggest vocabulary of the three, i.e. 2,153,518 words. The NLPL embeddings are trained on the Romanian CoNLL17 corpus [137], using the Skip-gram model from word2vec as the learning technique [119].
Character Embeddings. Some of the pioneering works in language modeling at the character level are [138, 139]. To date, characters have proved useful in a variety of neural models, such as Recurrent Neural Networks (RNNs) [140], LSTM networks [141, 142], CNNs [62, 143] and transformer models [144]. Characters are the smallest units necessary to build the words in a vocabulary, regardless of language, as the alphabet changes only slightly across many languages. Thus, knowledge of words, semantic structure or syntax is not required when working with characters. Robustness to spelling errors and to out-of-vocabulary words [141] constitutes another advantage, explaining the growing interest in using characters as features. In our paper, we employ three models working at the character level: an SVM and a KRR based on character n-grams [13], as well as a character-level CNN [9, 12]. The CNN is equipped with a character embedding layer, generating a 2D representation of text that is further processed by the convolutional layers. We provide additional details about the CNN in Section 3.2.
String Kernels. Lodhi et al. [145, 96] introduced string kernels as a means of comparing two documents, based on the inner product generated by all substrings of length n, typically known as n-grams. Of interest in determining the similarity are the n-grams that the two documents have in common. The authors applied string kernels in a text classification task with promising results. Since then, string kernels have found many applications, from protein classification [146] and learning semantic parsers [147] to tasks as complex as recognizing famous pianists by their playing style [148] or dynamic scene understanding [149]. Other applications of the method include various NLP tasks across different languages, e.g. sentiment analysis [150, 151, 152], authorship identification [153], automated essay scoring [154], sentence selection [155], native language identification [63, 156, 157, 158] and dialect identification [9, 40, 42]. Many improvements have also been added, incrementally, to the original method. These target the space usage [159], versatility [160] and time complexity [152, 161]. In this work, we employ string kernels as described in [9], specifically using the efficient algorithm for building string kernels of Popescu et al. [152]. We emphasize that the number of character n-grams is usually much higher than the number of samples, so representing the text samples as feature vectors may require a lot of space.
String kernels provide an efficient way to avoid storing and using the feature vectors (primal form), by representing the data through a kernel matrix (dual form). Each cell in the kernel matrix represents the similarity between two text samples x_i and x_j. In our experiments, we compute the similarity as the presence bits string kernel [63]. For two strings x_i and x_j over a set of characters S, the presence bits string kernel is defined as follows:

$k^{0/1}_n(x_i, x_j) = \sum_{g \in S^n} \#(x_i, g) \cdot \#(x_j, g)$,    (1)

where n is the length of the n-grams and #(x, g) is a function that returns 1 when the number of occurrences of n-gram g in x is greater than 0, and 0 otherwise. While there is a broad spectrum of machine learning models, e.g. [162, 163], we only consider models that have been previously used with success for Romanian dialect identification. Additionally, we integrate the individual models presented below into ensembles.
Support Vector Machines. The Support Vector Machine (SVM) is a binary classifier that separates the data points provided in the training phase into two classes [164]. To ensure a good generalization capability, the SVM aims at maximizing the margin that separates the points in both classes. The margin is chosen based on the points that are closest to the decision boundary. These points are called support vectors, and, not only do they give the name of the method, but they also influence the orientation and position of the hyperplane that is eventually used for classification during inference. Through the kernel trick, the SVM gains the power to classify data that is not linearly separable, since the data is mapped into a higher-dimensional space, where it becomes separable using a hyperplane [165]. For multi-class classification, multiple SVM classifiers need to be trained in a one-versus-one or one-versus-rest scheme. In our text categorization by topic experiments, we employ the one-versus-one scheme. Instead of using a standard kernel, we employ the SVM with the custom string kernel based on character n-grams defined in Equation (1). We note that our dual SVM based on string kernels is mathematically equivalent to the primal SVM based on character n-grams employed by Wu et al. [13]. We prefer the dual SVM because it is more computationally efficient, as explained in detail by Ionescu et al. [157].
Kernel Ridge Regression. Ridge Regression [166], or linear regression with L2 regularization for overfitting prevention, has been combined with the kernel trick [167], enabling the method to capture non-linear relations between features and responses. The kernel version, known as Kernel Ridge Regression (KRR), is a state-of-the-art technique [165] used in several recent works [9, 151, 157] with very good results. KRR can be seen as a generalization of simple Ridge Regression, learning a function in the Hilbert space described by the kernel. The learned function is either linear or non-linear, with respect to the original space, depending on the considered kernel [165]. Although KRR can be used with any kernel function, we employ the KRR based on the kernel defined in Equation (1), as previously proposed by Butnaru and Ionescu [9]. In order to repurpose the trained regressor as a (binary) classifier, we round the predicted continuous values to the values in the set {−1, 1}. For the multi-class text categorization by topic tasks, we employ KRR in a one-versus-rest scheme.
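To illustrate the dual representation, we include below a minimal sketch that computes the presence bits string kernel of Equation (1) over character n-grams. This is our own illustration, not the efficient algorithm of Popescu et al. [152], which avoids materializing the explicit n-gram sets used here.

```python
# Minimal sketch (our illustration, not the efficient algorithm of Popescu
# et al. [152]) of the presence bits string kernel in Equation (1):
# K[i, j] counts the character n-grams shared by samples x_i and x_j,
# each shared n-gram contributing exactly once (presence bits).
import numpy as np

def char_ngrams(text, n=6):
    """Return the set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def presence_bits_kernel(samples, n=6):
    """Compute the kernel matrix K with K[i, j] = |ngrams(x_i) & ngrams(x_j)|."""
    grams = [char_ngrams(s, n) for s in samples]
    size = len(samples)
    K = np.zeros((size, size))
    for i in range(size):
        for j in range(i, size):
            K[i, j] = K[j, i] = len(grams[i] & grams[j])
    return K

# The matrix can be plugged into dual learners, e.g.
# sklearn.svm.SVC(kernel="precomputed") for the SVM, while a kernel ridge
# regressor can be turned into a classifier by taking the sign of its output.
```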
Convolutional Neural Networks. A type of artificial neural network based on convolving multiple sets of filters in a sequential manner is represented by the convolutional neural network (CNN). The rectified outputs yielded by the convolution operation are called activation maps, and they are subject to pooling operations, which provide a downscaled version of the activation maps, implicitly reducing the number of parameters and computations further used in the network. After repeating a number of convolutional blocks consisting of convolutions and pooling operations, a sequence of fully-connected layers typically follows, with the last layer having a number of units equal to the number of classes in the data set. Because CNNs are inspired by the mammalian visual cortex [168, 169], such models were initially found suitable for image classification [170, 171, 172, 173]. Afterwards, this approach was adapted for natural language processing (NLP) problems [62, 174]. In NLP, the meaning of the inputs changes: instead of image pixels, we have documents represented as a matrix, using either word [175] or character embeddings [62]. One of the models that we employ in the experiments is a character-level CNN [62] with squeeze-and-excitation (SE) blocks, introduced by Butnaru and Ionescu [9]. Our motivation for this choice of algorithm lies in (i) the good results obtained on MOROCO by Butnaru and Ionescu [9] and by Tudoreanu [12], and also in (ii) the interpretability of the model through visualization techniques. We use the latter feature to get a better understanding of the CNN model's effectiveness in Section 5, based on Grad-CAM visualizations [26].
Long Short-Term Memory Networks. Recurrent Neural Networks (RNNs) [176] represent a type of neural model that operates at the sequence level, achieving state-of-the-art performance on language modeling tasks [177, 178], among other problems involving time series. Their effectiveness is constrained by the length of the input sequence. RNNs must use context in order to make predictions, while they also need to learn the context itself, which can lead to vanishing gradient problems [179], a major drawback of simple RNNs. This is solved in Long Short-Term Memory networks (LSTMs) [180], which rely on an RNN architecture that uses a more complex structure for its base units. An LSTM unit has a cell that acts as a memory element, remembering dependencies in the input. The amount of information stored in this cell and its overall impact is controlled through three gates acting as regulators. The input and output gates control and select the information flowing into and out of the cell. Later versions of LSTMs also use forget gates, enabling the cell to reset its state for optimization reasons [181, 182]. With these modifications in terms of structure and computation, LSTMs are able to selectively capture long-term dependencies without the technical challenges faced when working with simple RNNs, i.e. exploding and vanishing gradients. Onose et al. [11] showed that LSTMs are also useful in the dialect identification and categorization sub-tasks on the MOROCO data set. Hence, our use of this type of network in the experiments is inspired by Onose et al. [11].
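As a concrete illustration of such a recurrent classifier, the minimal sketch below (our own illustration, with assumed hyperparameters) shows a word-level LSTM model that consumes pre-trained embeddings; the bidirectional GRU described next can be obtained with a one-line change.

```python
# Minimal sketch (our illustration, with assumed hyperparameters) of a
# word-level LSTM classifier in the spirit of Onose et al. [11]:
# pre-trained embeddings -> LSTM -> linear layer over the final hidden state.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, embedding_matrix, hidden_size=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(
            torch.tensor(embedding_matrix, dtype=torch.float), freeze=True)
        self.lstm = nn.LSTM(embedding_matrix.shape[1], hidden_size,
                            batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # (batch, seq_len, emb_dim)
        _, (hidden, _) = self.lstm(embedded)  # hidden: (1, batch, hidden_size)
        return self.fc(hidden[-1])            # class logits

# A BiGRU variant replaces nn.LSTM with nn.GRU(..., bidirectional=True)
# and doubles the input size of the final linear layer accordingly.
```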
Bidirectional Gated Recurrent Units. Gated Recurrent Units (GRUs) [183] implement a simplified version of LSTMs, having only input and forget gates, i.e. the output gate is excluded. With fewer parameters than LSTMs, the performance achieved by GRUs on various tasks, e.g. speech recognition, is similar to the one achieved by LSTMs [184]. Moreover, GRUs tend to outperform LSTMs on small data sets [177]. The roles seem reversed for problems such as language recognition [178] or neural machine translation [185]. We note that GRUs, as well as other types of RNNs, can use a bidirectional architecture, an adjustment made with the aim of addressing the need to know both the previous and the next context in order to understand the current word. Thus, a bidirectional Gated Recurrent Unit (BiGRU) model is composed of two vanilla GRUs, one with forward activations (i.e. getting information from the past) and one with backward activations (i.e. getting information from the future) [186]. BiGRUs are among the models that proved their efficiency in the experiments conducted by Onose et al. [11] on MOROCO, which is why we decided to include the BiGRU architecture in our set of models.
Hierarchical Attention Networks. Proposed by Yang et al. [99], Hierarchical Attention Networks (HANs) were initially applied to document classification. The success obtained on this task is explained by the natural approach taken in HANs, reflecting the structure of documents through attention mechanisms applied at two levels: for words that form sentences and for sentences as components of documents. In the case of HANs, the attention mechanism uses context to spot relevant sequences of tokens in a given sentence or document. Essentially, the same algorithms, namely encoding and selection by relevance, are applied twice, at the word level and also at the sentence level [99]. As for the previously described methods, i.e. LSTM and BiGRU, the inclusion of HAN in our set of models is motivated by the results obtained by Onose et al. [11].
Romanian BERT. The transformer architecture was introduced by Vaswani et al. [187] and showed a remarkable boost in performance compared to the state-of-the-art models of the moment. One year later, Devlin et al. [188] applied the bidirectional training of transformers to address language modeling. They named the new model BERT (Bidirectional Encoder Representations from Transformers), and ever since, BERT has been adopted by many NLP researchers as a state-of-the-art transformer-based model. Perhaps one of the most beneficial features of BERT is its multilingual training setup, comprising more than 100 languages. In recent years, more and more monolingual flavours of BERT started to be released, e.g. BERTje for Dutch [189], CamemBERT [190] and FlauBERT [191] for French, AlBERTo for Italian [192], among others. Of particular interest to our work is the Romanian adaptation of BERT [65], which was trained on more than 15 GB of Romanian data and has been shown to surpass its multilingual counterpart in many tasks, e.g. named entity recognition [65]. In this work, we fine-tune the Romanian BERT (Ro-BERT) to distinguish between the Romanian and Moldavian dialects. Another classification setup for which we fine-tune Ro-BERT is categorization by topic, where the model learns to discriminate among the six categories available in the data set. In order to obtain the probability for each class, we append a Softmax layer to BERT, with either 2 neurons for dialect identification or 6 neurons for classification by topic.
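A minimal sketch of this fine-tuning setup is shown below. It is our illustration, and the HuggingFace checkpoint name is an assumption (it corresponds to a publicly released Romanian BERT), as the paper does not specify implementation details at this level.

```python
# Minimal sketch (our illustration) of fine-tuning Romanian BERT: a
# classification head with 2 outputs (dialect) or 6 outputs (topic) is
# placed on top of BERT, and a softmax yields the class probabilities.
# The checkpoint name below is an assumption, not taken from the paper.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "dumitrescustefan/bert-base-romanian-cased-v1"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2)  # 2 for dialect identification, 6 for topics

batch = tokenizer(["Un exemplu de propozitie dintr-un articol de stiri."],
                  truncation=True, padding=True, return_tensors="pt")
probabilities = torch.softmax(model(**batch).logits, dim=-1)
# During fine-tuning, the cross-entropy loss over these outputs is
# back-propagated through the whole network.
```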
Figure 1: Overview of the proposed pipeline based on stacked generalization. Although the pipeline is illustrated for the task of dialect identification, the same architecture is trained for in-domain and cross-domain categorization by topic. Best viewed in color.
Ensemble Models. The main idea behind ensemble models is to combine multiple learning techniques in order to obtain a model that achieves better results than any of its individual components [193, 194]. The model obtained via ensemble learning is typically more stable and robust [195]. There is evidence that a significant diversity among the component models of an ensemble leads to better results than in the case where similar techniques are brought together into an ensemble [196, 197]. We build on this hypothesis in the experiments conducted in this work. More precisely, our models cover different features as input, from the basic character-level properties of string kernels to the hierarchical selection of words and sentences of HAN. Furthermore, not only do we employ diverse types of features, but we also use different, complementary learning techniques, ranging from shallow models, such as SVM and KRR, to deep models, such as CNN, RNN and BERT. We underline that our first motivation for including the above models in our ensemble is that diversity is more likely to generate an ensemble that surpasses its components in terms of accuracy. Our second motivation for including the specified models is that these have been used in top ranking systems at the MRC shared task. Hence, the ensemble is more likely to achieve state-of-the-art performance. Plurality voting is one of the ensemble approaches that we choose for our experiments. In this approach, the models in the ensemble simply vote with equal weights. The second ensemble learning approach that we consider for the experiments is stacking. Stacked generalization, or stacking, is an ensemble learning method that learns how to blend the predictions provided by multiple models through meta-learning [198]. In our case, the predictions from all the models presented above (SVM, KRR, CNN, LSTM, BiGRU, HAN and Ro-BERT) are taken into consideration. These are known as level-zero models. We underline that we use three types of pre-trained word embeddings for LSTM, BiGRU and HAN, generating a total of nine recurrent models. We note that stacking is different from bagging in that the machine learning models forming the ensemble are different from each other, while being trained on the same data. Stacking uses a meta-model (also known as a level-one model) to harness the capabilities of the level-zero models. We employ Multinomial Logistic Regression as our meta-classifier. As input to the meta-model, we consider a vector containing the hard labels as well as the soft scores (class probabilities) provided by each model. Through the meta-model, stacking can learn when to use or trust each model in the ensemble, thus being able to make predictions that are superior to those of any single model in the ensemble. Our ensemble learning pipeline based on stacking is illustrated in Figure 1. We emphasize that stacking is suitable when there are multiple distinct models achieving good performance levels, but on different data samples. In other words, if the predictions of the level-zero models have a low correlation, then stacking is likely to achieve superior results. Since our level-zero models are different from each other, we believe that stacked generalization is a suitable method to obtain a good ensemble. Although stacking is designed to increase performance, to our knowledge, there is no guarantee that it will lead to superior results in all cases.
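The sketch below illustrates how such a meta-model can be trained. It is our own simplified rendering of the stacking procedure, and the value of C is only an example of the validated settings reported in Section 4.

```python
# Minimal sketch (our simplified rendering) of stacked generalization:
# each level-zero model contributes its hard label and its soft class
# probabilities, and a Logistic Regression meta-learner blends them.
import numpy as np
from sklearn.linear_model import LogisticRegression

def meta_features(level_zero_predictions):
    """Concatenate hard labels and soft scores from all level-zero models.
    `level_zero_predictions` is a list of (labels, probabilities) pairs,
    one pair per model, all computed on the same set of samples."""
    parts = []
    for labels, probabilities in level_zero_predictions:
        parts.append(np.asarray(labels).reshape(-1, 1))
        parts.append(np.asarray(probabilities))
    return np.hstack(parts)

# With the default lbfgs solver, scikit-learn's LogisticRegression is
# multinomial for multi-class problems; C = 0.1 is just an example value.
meta_model = LogisticRegression(C=0.1)
# meta_model.fit(meta_features(validation_predictions), validation_targets)
# predicted = meta_model.predict(meta_features(test_predictions))
```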
Finally, we underline that ensemble learning has not been studied on MOROCO before our work. Hence, this is the first study to test the effectiveness of ensemble learning in Romanian dialect identification. Although we are primarily interested in the dialect identification task, we present results for the full range of tasks proposed by Butnaru and Ionescu [9], namely:
• binary discrimination between Romanian (RO) and Moldavian (MD);
• Romanian intra-dialect categorization by topic;
• Moldavian intra-dialect categorization by topic;
• cross-dialect categorization by topic using Moldavian as source and Romanian as target;
• cross-dialect categorization by topic using Romanian as source and Moldavian as target.
In this paper, we introduce an additional data set composed of tweets collected from Romania and the Republic of Moldova, which allows us to evaluate the machine learning models in a cross-genre dialect identification setting. The tweets were collected from a different time period, helping us to reveal any overfitting behavior of the models. All tweets are pre-processed for named entity removal. We did not collect any topic labels from Twitter, since we are mostly interested in cross-genre dialect identification. We first evaluate the considered machine learning models on MOROCO, using the complete news articles, as in all previous works [9, 11, 12, 13, 15, 18, 22]. Since our aim is to determine the extent to which machine learning models attain good performance levels, we consider an additional scenario in which we keep only the first sentence from each news article. This essentially transforms all tasks into sentence-level classification tasks. As we keep the same number of data samples, the sentence-level classification accuracy rates are expected to drop, essentially because there are fewer patterns in the data. We include an even more difficult evaluation setting, testing the models trained at the sentence level on tweets, while considering only the dialect identification task. As evaluation metrics, we employ the classification accuracy and the macro F1 score. We note that the macro F1 score is the official metric chosen for the VarDial evaluation campaigns that featured dialect identification tasks using MOROCO as support data set [6, 7, 8]. We have borrowed as many of the hyperparameters as possible from the works [9, 11, 12, 13, 15, 18, 22] proposing the models considered in our experiments, trying to replicate the previously reported results as closely as possible. When sufficient details to replicate the results were missing, we tuned the corresponding hyperparameters on the validation data. We next present the hyperparameter choices for each machine learning model. We train an SVM with a pre-computed string kernel with C = 10^2, which has been selected via grid search from a range of values starting from 10^-3 to 10^3, considering a multiplication step of 10. The string kernel is based on character 6-grams. For KRR, the only parameter that requires tuning is the regularization λ. From a set of potential values ranging from 10^-5 to 10^-1, with a multiplication step of 10, the best λ for our setup is 10^-2. As for the SVM, the string kernel used in KRR is based on character 6-grams.
Character-level CNN. We employ the same architecture and hyperparameters as Butnaru and Ionescu [9]. In HAN, we set the maximum sequence length to 150 words, which is also valid for the other word-based models.
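As an illustration of the grid search procedure described above for the SVM, the following minimal sketch (our own; data loading is omitted) sweeps C over powers of 10 and keeps the value that maximizes the validation macro F1 score.

```python
# Minimal sketch (our illustration) of the grid search described above:
# the SVM regularization parameter C is swept over powers of 10 and the
# value maximizing the validation macro F1 score is kept.
from sklearn.metrics import f1_score
from sklearn.svm import SVC

def select_C(K_train, y_train, K_valid, y_valid):
    """K_train: train-vs-train kernel matrix; K_valid: validation-vs-train
    kernel matrix (one row per validation sample)."""
    best_C, best_f1 = None, -1.0
    for exponent in range(-3, 4):  # C in {10^-3, ..., 10^3}
        svm = SVC(C=10.0 ** exponent, kernel="precomputed")
        svm.fit(K_train, y_train)
        macro_f1 = f1_score(y_valid, svm.predict(K_valid), average="macro")
        if macro_f1 > best_f1:
            best_C, best_f1 = 10.0 ** exponent, macro_f1
    return best_C
```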
Ensemble Models. While the plurality voting strategy requires no hyperparameter tuning, the meta-learner used in model stacking, namely Logistic Regression, requires the tuning of the regularization parameter C and of the penalty. As penalty, we generally obtain better validation results with L2 over L1, except for Moldavian intra-dialect categorization by topic. The parameter C is validated between 10^-3 and 10^3, considering a multiplication step of 10. Depending on the task, we typically obtain the best validation results with C = 10^-1 or C = 1. An exceptional case is the sentence-level dialect identification task, where the optimal C is 10^-3. In Table 1, we present the dialect identification results of various ML methods in three different scenarios. In the first scenario, in which the models are trained and tested on full news articles, there are four individual models that surpass the 90% threshold for both evaluation metrics, namely the SVM, the KRR, the character-level CNN and the fine-tuned Romanian BERT. The ensemble models also go beyond this threshold. In general, it seems that the dialect identification task on entire news articles is fairly easy. However, the high accuracy rates could also be explained by many other factors, namely by the fact that the models actually discriminate the news articles based on author style, publication source or the discussed subjects, which might be different in the two countries. In order to diminish the effects of such additional factors, we considered two additional scenarios, one that involves training and testing at the sentence level, and one that involves a cross-genre evaluation. In the second scenario, in which the models are trained and tested on sentences, we observe significant performance drops with respect to the first scenario. Indeed, the accuracy rates and the macro F1 scores drop by roughly 10% for almost all models. The only model that does not register such a high performance decrease is HAN, but its scores in the first scenario are quite low. Although it is much harder to recover the author style, the publication source or the subject from the first sentence of each news article, these patterns are not completely eliminated. We therefore consider the third evaluation scenario, in which the models are trained on sentences from MOROCO and tested on tweets collected from different sources and from a different time period. We observe further performance drops in the third scenario. While some models are close to a random chance prediction, e.g. HAN, other models are close to 70% in terms of both accuracy and macro F1. As shown in Table 4, the human-level performance in the Moldavian versus Romanian dialect identification task is far below that of the best performing ML models evaluated on tweets. In order to understand and explain this difference, we analyze the Grad-CAM visualizations [26] for one of the best performing models, namely the character-level CNN, in Section 5. Considering all three evaluation scenarios, the individual models attaining the best results are the SVM and the KRR, both being based on string kernels. These two models are closely followed by the state-of-the-art Romanian BERT, which even outperforms SVM and KRR at the sentence level. We can say for sure that Ro-BERT is ahead of the character-level CNN, the latter model ranking fourth among the individual models. The plurality voting strategy attains mixed results, failing to surpass the top two individual models in all three evaluation scenarios.
However, our ensemble based on stacking seems to be more powerful, achieving the best results in each and every case. The last two columns in Table 1 indicate the training and inference times for the experiments conducted on full news articles. For each method, the reported training time (measured in minutes) represents the total amount of time required to extract features and to learn the model until convergence, while the inference time (measured in milliseconds) represents the average time required to extract features and to predict the label for one news article. All times are measured on a machine with an Intel Xeon E5-2687W v4 3.00 GHz CPU, two Nvidia GeForce GTX 1080Ti GPUs, and 256 GB of RAM. In terms of training time, the most efficient model is the char-CNN, which is followed by the LSTM and BiGRU models based on various word embeddings. In terms of inference time, the most efficient models are the char-CNN with 1 ms and the BERT model with 5 ms. These models are followed by the LSTMs based on word embeddings, each requiring less than 10 ms, and the SVM and KRR based on string kernels, each requiring 26 ms. The least efficient individual model is clearly HAN. We underline that it is natural for the ensembles based on voting and stacking to take more time during training and inference, as they combine multiple individual models.
Table 2: Accuracy rates and macro F1 scores of various machine learning models obtained at test time for the intra-dialect categorization by topic tasks. The results are reported for two evaluation scenarios: (i) full articles: training and testing on full news articles; (ii) sentences: training and testing on the first sentence from each news article. The best results on each column are highlighted in bold.
We report the intra-dialect categorization by topic accuracy rates and macro F1 scores of various models in Table 2. First of all, we note that the models generally attain better results within the Moldavian dialect as opposed to the Romanian dialect. In the first evaluation scenario, which is based on full news articles, all models, except HAN, surpass the 90% threshold in terms of accuracy rate for the Moldavian news articles. On both dialects, the best accuracy rates in the first evaluation scenario are obtained by the LSTM based on CoRoLa embeddings, surpassing even the ensemble models. The LSTM based on CC embeddings attains the top accuracy rates on both dialects in the second evaluation scenario, which is conducted at the sentence level. In general, we observe that deep learning models attain better accuracy rates than the shallow SVM and KRR, while the latter models are ranked second and third, after the Romanian BERT model, in terms of macro F1 scores among all individual models. We underline that the large differences between the classification accuracy, which is equivalent to the micro F1 score, and the macro F1 score of each classifier can be explained by the fact that the topic distribution in MOROCO is unbalanced [9]. The macro F1 score is considered more relevant by the VarDial shared task organizers [6], as it assigns equal weights to each class. Although SVM and KRR surpass other individual models, the best macro F1 scores in most intra-dialect categorization experiments are attained by the ensemble based on classifier stacking. An exceptional case is the categorization of sentences written in Moldavian, where the macro F1 score obtained by Ro-BERT marginally surpasses the score achieved by our ensemble.
Comparing the categorization results of the ML models in the second evaluation scenario with those reported for the human annotators in Table 4, we observe that the performance gap in favor of the machine learning models is smaller than the gap observed in the dialect identification experiments. We conjecture that this observation indicates that the dialectal features are likely more subtle than the topical features. In Table 3, we present the accuracy rates and the macro F1 scores of the considered ML models for cross-dialect categorization by topic in two scenarios, one based on full articles and one based on sentences. In general, we notice that most of the patterns observed in the intra-dialect categorization experiments shown in Table 2 also apply to the cross-dialect experiments. Indeed, we observe that the deep learning methods typically yield superior accuracy rates with respect to the shallow methods based on string kernels, the best approach in most cases being the LSTM network. Nevertheless, the SVM and the KRR compensate by attaining better macro F1 scores than most of the deep learning models. The two kernel approaches are consistently surpassed by Ro-BERT. As for the in-domain experiments, the ensemble based on stacking yields the top macro F1 scores for both cross-dialect tasks performed on full articles. Ro-BERT slightly surpasses the ensemble meta-learner when it comes to the cross-dialect categorization of sentences. In summary, we consider that the idea of combining the models into an ensemble via classifier stacking is very useful. Comparing the cross-dialect categorization results of the ML classifiers at the sentence level with those reported for the human annotators in Table 4, we emphasize that, at least in terms of the macro F1 metric, humans are generally better. We have asked ten human subjects to manually annotate a subset of 120 randomly selected samples from the MOROCO data set. Another fact about the data set is that the samples considered for annotation contain only the first sentence of the original news articles. This made the task more challenging from a human perspective, as we took away most of the context available in the full articles.
Table 3: Accuracy rates and macro F1 scores of various machine learning models obtained at test time for the cross-dialect categorization by topic tasks, namely MD→RO and RO→MD. The results are reported for two evaluation scenarios: (i) full articles: training and testing on full news articles; (ii) sentences: training and testing on the first sentence from each news article. The best results on each column are highlighted in bold.
The summary of the human annotation is presented in Table 4. For dialect identification, the worst results are just below random chance, the accuracy of annotators #A1 and #A4 being 48.3%. Moreover, the accuracy averaged over all ten annotators is only around 53%.
Table 4: Accuracy rates and macro F1 scores of ten human subjects that were asked to annotate 120 sentences with dialectal and categorical labels. The last row indicates the average values computed over all ten annotators.
The best human annotator, #A3, is the only one who gets close to the results reported for the ML models in Table 1. We believe it is fair to compare the human performance at the sentence level with the performance of ML models applied on tweets. We hereby note that the accuracy of the best human annotator exceeds the accuracy of LSTM and HAN. However, SVM and KRR provide accuracy rates and macro F1 scores that are about 10% higher than those of annotator #A3.
The ensemble based on stacking is even better. This large gap between the ML models and the Romanian- and Moldavian-speaking annotators indicates that there are some subtle patterns undetected by humans. In order to discover these patterns, in Section 5, we analyze Grad-CAM visualizations pointing out what the models, particularly the character-level CNN, focus on.

Evaluating the human annotations for the categorization by topic task, we observe that the annotators are much better at discriminating between the six topics than at identifying the dialect, with accuracy rates between 63.3% and 75.0% and macro F1 scores between 62.7% and 74.8%. These results are comparable to the ones obtained by the ML models in the intra-dialect and cross-dialect categorization experiments presented in Tables 2 and 3, respectively. The previous statement is especially valid for the results reported in the second scenario, in which models are trained and tested at the sentence level. Most of the annotators are native Romanian speakers, hence the bias towards labeling more samples as Romanian, unless they found clues indicating otherwise. Additionally, the poor results confirm the difficulty of this binary classification task from a human perspective.

Figure 2(b) displays the sum of the confusion matrices computed for the ten human annotators on the categorization by topic task. For the sports category, the annotators were able to correctly classify almost all sentences, with an average of 1.6 false negatives per annotator. Since sports is less related to the other categories, and sports news likely contains semantic clues regarding the category right from the first sentence, it seems natural for people to find this category more distinctive. The same does not hold for categories such as finance, politics, science or tech. Indeed, the highest confusions are between finance and politics and between science and tech, respectively.

In Table 5, we display six samples selected from the data set provided to the human annotators. Among the presented samples, the first three belong to the Moldavian dialect, while the last three belong to the Romanian dialect. For better comprehension, the English translation of each sample is also included in Table 5. We selected the samples considering three different cases: (d.i) most annotators agree on the label, but the plurality vote label does not match the ground-truth label; (d.ii) most annotators agree on a label that matches the ground-truth label; (d.iii) there are strong disagreements among annotators, such that a majority cannot be determined. The first and the sixth rows in Table 5 are representative for case (d.i). For instance, one of these samples was presumably labeled as Romanian because it refers to funds received from the European Union, and Romania is involved in receiving such funds. Finally, samples #S3 and #S5 are representative for case (d.iii). We notice that, in sample #S5, there is simply not enough context to infer the dialect, while sample #S3 does not bear any clues indicating the dialect, although the sentence is longer. Interestingly, neither we nor the annotators were able to spot any dialectal clues in the presented samples. Although some samples were labeled correctly, the clues indicating the correct dialect are more related to the subject than to the dialect. Up to this point, we conclude that either the dialectal patterns are missing or they are very hard for humans to spot. The analysis provided in Section 5 reveals that the character-level CNN does learn some interesting dialectal clues, which we were not aware of.
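For completeness, the aggregated view in Figure 2(b) can be obtained by summing the per-annotator confusion matrices, as in the following minimal sketch; the gold labels and annotations are hypothetical placeholders.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    TOPICS = ["culture", "finance", "politics", "science", "sports", "tech"]

    def summed_confusion(gold, annotations):
        """Sum the per-annotator confusion matrices (as in Figure 2(b)).

        gold: list of ground-truth topic labels for the annotated sentences;
        annotations: one list of predicted labels per annotator.
        """
        total = np.zeros((len(TOPICS), len(TOPICS)), dtype=int)
        for predicted in annotations:
            total += confusion_matrix(gold, predicted, labels=TOPICS)
        return total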
In Table 6, we present sentences with category labels for two different cases: (c.i) the correct category is chosen unanimously; (c.ii) there are disagreements among annotators, regardless of the final result of the plurality voting. Each of these two cases is exemplified through one sentence for each of the six categories.

Samples #S7, #S9, #S11, #S13, #S15 and #S17 are representative for case (c.i). The nouns "muzică" (music) and "poezie" (poetry) in example #S7 are strong clues for the culture category, hence the unanimity of votes in this direction. In sample #S9, the keyword "economie" (economy) gives the strongest clue for the finance category, while in sample #S11, the noun phrase "alegeri parlamentare" (parliamentary elections) suggests that the sentence belongs to the politics category. However, sample #S13 does not seem to contain any specific phrase that can be considered a strong indicator for the science topic. Here, it is the entire context that reveals the nature of the sentence. The name of a famous football player escaped our named entity removal process, which explains why sentence #S15 was unanimously classified under the sports topic. In example #S17, the construct "telefoane inteligente", which translates to "smartphones", is a very strong indicator for the tech category. Therefore, the annotators unanimously labeled #S17 as belonging to the tech category.

Examples #S8, #S10, #S12, #S14, #S16 and #S18 are representative for case (c.ii), having at least one wrong label among the manual annotations provided by the ten annotators. Only two out of ten annotators correctly labeled sample #S8 as belonging to the culture topic. The other annotators were misled by the fact that sample #S8 contains the word "politica" (politics), suggesting the politics label, or the word "podium", suggesting the sports label. The annotations of sample #S10 confirm the confusion between finance and politics observed in the confusion matrix depicted in Figure 2(b). In sample #S10, we observe a clue suggesting the politics label, namely the presence of the noun phrase "coaliţia la guvernare" (governing coalition). Had the annotators considered the noun "accizele" (taxes) more relevant, they would have found the correct category, i.e. finance. Sample #S12 contains very few words along with many placeholders for named entities. However, most of the annotators know that "membri observatori" (observer members) denotes a political function inside the European Union, hence the label politics. Sample #S14 exhibits strong disagreements among the annotators. This is expected, as the sentence is very short, lacking sufficient context to label the example. Misclassified sports samples were very few in the data set, as we can also see in Figure 2(b). Sample #S16 is one of the few where three annotators did not mark the text as belonging to the sports category. Leaving aside the lack of context in sample #S18, we note that the noun phrase "piaţa online" (online market) might suggest either the finance or the tech topic. The labels provided by the annotators are divided between these two topics, confirming our hypothesis about "piaţa online".

So far, it remains unclear if there are any dialectal clues in the news articles from MOROCO. One hypothesis (H1) is that there are no dialectal clues, since Romanian speakers had a hard time distinguishing between the two dialects, as shown in Table 4. In this case, the good performance of the machine learning models can be explained through other factors, e.g. subjects specific to each of the two countries.
The alternative hypothesis (H2) is that the samples contain dialectal clues, since the machine learning models trained on news articles are able to classify tweets collected from a different time period. In this case, the low performance of the human annotators can be explained if we consider that the dialectal clues are harder to spot than expected. In order to find out which hypothesis is valid, we analyze the discriminative features learned by the character-level CNN, which is among the top three individual dialect identification systems. We opted for the character-level CNN in favor of the better performing SVM and KRR, as it allows us to look inside the model through Grad-CAM visualizations.

Table 6: Examples of sentences with ground-truth labels, as well as labels assigned by humans, for the categorization by topic task. Corresponding English translations are also provided for better comprehension.

We quantized the importance of each character using 10 shades of blue (for Romanian) or 10 shades of red (for Moldavian), with darker shades representing more relevant features and lighter shades representing less relevant features. In order to extract the importance of each character, we used the weights learned by the last convolutional layer in the network, as well as the spatial localization preserved in the activation maps resulting from convolving filters of predefined size over the input fed to the model. In the remainder of this discussion, we try to explain why the features considered important by the character-level CNN also make sense from a human perspective.

We provide a set of visualizations for Romanian sentences in Table 7. In sample #R1, the model focuses on the first four words, but the one indicating the dialect is "demarat". This word, which translates to "started", is used in Romanian to indicate the start of a construction process. In Moldavian, the word "început" would probably have been used instead to express the same thing. We note that the word "început" is also commonly used in Romania, but typically in different contexts. Perhaps this is why the model also highlights the neighboring word "producţia" (production). Sample #R2 contains an entire Romanian proverb which is, as a whole, predictive for this dialect. It refers to people doing useless jobs, e.g. "cutting leaves to the dogs". In sample #R3, the CNN focuses on two separate groups of words, but we believe the dialectal clue is the expression "contra cost", which is typically used in Romanian to express the fact that some product or service is not free, requiring a payment from the customer. Example #R4 contains a type of news that dominated the Romanian media for months, namely the protest against the Romanian government on August 10th, 2018. Therefore, the features highlighted by the CNN bear no dialectal clues, except perhaps for the word "miting" (rally), which is preferred instead of the synonym "protest", the latter being more common in the Republic of Moldova. In samples #R5 and #R10, the CNN focuses on the nouns "compania" (singular of "company") and "companii" (plural of "company"), respectively. From our observations, Moldavian news reports use "întreprindere", while Romanian news reports prefer the synonym "companie". We note that "companie" and "întreprindere" exist in both Romanian and Moldavian, but the preference for one or the other depends on the dialect.
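Before discussing the remaining samples, we sketch how such character-level relevance maps can be computed. The following is a minimal Grad-CAM-style sketch in PyTorch, assuming a hypothetical char-CNN (model) whose last convolutional layer (last_conv) is known; it is not the exact implementation used in our experiments.

    import torch
    import torch.nn.functional as F

    def char_relevance(model, last_conv, one_hot_text, target_class):
        """Grad-CAM-style relevance of each input character for target_class.

        one_hot_text: tensor of shape (1, alphabet_size, text_length), i.e. the
        one-hot encoded input of a char-CNN; model and last_conv are
        hypothetical placeholders for the network and its last conv layer.
        """
        cache = {}
        fwd = last_conv.register_forward_hook(
            lambda module, inp, out: cache.__setitem__("act", out))
        bwd = last_conv.register_full_backward_hook(
            lambda module, gin, gout: cache.__setitem__("grad", gout[0]))

        logits = model(one_hot_text)              # (1, num_classes)
        model.zero_grad()
        logits[0, target_class].backward()        # gradients w.r.t. the class
        fwd.remove(); bwd.remove()

        act, grad = cache["act"], cache["grad"]   # (1, channels, reduced_len)
        weights = grad.mean(dim=2, keepdim=True)  # per-channel importance
        cam = F.relu((weights * act).sum(dim=1, keepdim=True))

        # Upsample the map to the input length and bucket it into 10 shades.
        cam = F.interpolate(cam, size=one_hot_text.size(-1), mode="linear",
                            align_corners=False).squeeze()
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return (cam * 9).long() + 1               # shade index in {1, ..., 10}

The gradient-weighted activation maps are upsampled to the input length, so each character receives a relevance value that can be rendered with the 10 intensity shades described above.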
Sample #R6 refers to what was a controversial topic in Romania, namely changing the definition of family ("definirea familiei") in the constitution of Romania. We can safely say this is not a dialectal topic. In sample #R7, the model focuses on the noun phrase "coaliţia la guvernare" (governing coalition). In the Republic of Moldova, the same concept is expressed through the noun phrase "coaliţia de guvernămînt". Sample #R8 contains a Romanian saying, namely "nu stau cu mainile in san" (they do not sit idly by), which is used to express the fact that the bankers took action instead of waiting for something to happen. Sample #R9 contains the noun "torenţi" (torrents), which is never used in the Republic of Moldova with the meaning of weather torrent, only with the meaning of web torrent. The CNN also considers relevant the word "valabile" (valid), which is rarely used in the Republic of Moldova. Hence, sample #R9 contains more than one dialectal pattern. In summary, we find that the CNN discovers some interesting dialectal patterns, which we were unaware of before seeing the Grad-CAM visualizations. However, there is a small percentage of sentences, namely #R4 and #R6, that bear no dialectal patterns, but are correctly labeled by the CNN because their subjects are related to events in Romania.

We provide a set of visualizations for Moldavian sentences in Table 8. Sample #M1 contains a highlighted noun phrase that is a clear indicator of the Moldavian dialect. Indeed, the noun phrase "cabinetul de miniştri" (the cabinet of ministers) is almost never used in Romanian, where the alternative "guvernul" (the government) is preferred. The noun "migranţi" (migrants) is also unusual in Romanian, the forms "emigranţi" or "imigranţi" being used instead, depending on the context. In samples #M2, #M3, #M4 and #M6, we can observe a few highlighted words, such as "mîna" (hand), "făcîndu" (doing), "sînt" (are), "mîncare" (food) and "sfîrşit" (end), that reveal the same pattern used only in the Moldavian dialect, namely the use of the vowel "î" inside words. We note that, in Romanian, the vowel "î" is used only at the beginning of words. The same sound is spelled by the vowel "â" anywhere else in the word, so the aforementioned Moldavian words would be written as "mâna", "făcându", "mâncare" and "sfârşit", respectively. For the verb "sînt" (are), even the sound is different, the correct Romanian spelling being "sunt". In addition, sample #M3 contains the word "împărtăşite" (distributed), which would likely be replaced by "partajate" in Romanian. In sample #M4, the CNN model focuses on the phrase "cei mai mulţi bani" (the most money), the distinctive pattern being the placement of this phrase at the beginning of the sentence. In Romanian, the same sentence would be written as follows: "locuitorii capitalei cheltuie cei mai mulţi bani pe mâncare" (the inhabitants of the capital spend the most money on food). In sample #M5, we can understand why the network highlighted the phrase "ale republicii" (of the republic) as a strong indicator for Moldavian, namely because Moldova is commonly referred to as a republic, whereas Romania was referred to in this way only during the communist regime. Hence, example #M5 does not contain any dialectal patterns. In sample #M7, the verb phrase "a ajuns în faliment" (went bankrupt) is distinctive for the Moldavian dialect. In the Romanian dialect, the verb "a intrat" would be used instead of "a ajuns". Another distinctive verb phrase for the Moldavian dialect is present in sample #M8, namely "a avizat pozitiv" (approved).
In Romanian, this verb phrase would be replaced by the verb "a aprobat", the adverb "pozitiv" being implied by the verb. In sample #M9, we can observe that "briefing" is used to denote a short press conference. To express the same concept, a Romanian speaker would use "declaraţie de presă" or "conferinţă de presă". Sample #M9 contains another dialectal pattern: in Moldavian, a political party is typically referred to as "formaţiune", whereas in Romanian, it is referred to as "partid". In sample #M10, the only highlighted dialectal pattern that we found interpretable is the use of the noun "condiţionalităţi" (conditions), since we would rather use "condiţii" in the Romanian dialect. As for the Romanian sentences, we notice that the character-level CNN finds some relevant patterns of the Moldavian dialect. We confess that we were not aware of many of the distinctive patterns between the two dialects discovered through the Grad-CAM visualizations. The same applies to our annotators. While both dialects contain roughly the same words, differences regarding the preferred synonym used to express a certain concept seem to play a very important role in distinguishing between the two dialects. This also explains why people living in Romania or the Republic of Moldova have such a hard time distinguishing between the dialects. Many of the presented sentences are grammatically and syntactically correct in both dialects, but some word choices in one dialect seem rather unusual in the other dialect. We believe that untrained people can easily mistake such dialectal patterns for the style of the author. We consider that the presented examples elucidate the mystery behind the unreasonable effectiveness of machine learning in Moldavian versus Romanian dialect identification, revealing some interesting dialectal patterns, previously unknown to ourselves. In summary, we consider hypothesis H2 to be true.

In this article, we studied dialect identification and related sub-tasks, e.g. cross-dialect categorization by topic, for an under-studied language, namely Romanian. We experimented with several machine learning models, including novel ensemble combinations, attaining very good performance levels, especially with the ensemble based on model stacking. For example, our ensemble based on stacking attains dialect identification scores above 94% on news articles, above 86% on sentences, and up to 70% on tweets. Comparing the ML models with native Romanian or Moldavian speakers, we found a significant performance gap, the average performance of the human annotators being barely above the random chance baseline. For instance, the average accuracy of humans for dialect identification is about 53%. In order to find out why the ML models attain significantly better results compared to humans, we analyzed Grad-CAM visualizations of the character-level CNN model. The visualizations revealed some interesting dialectal clues, which were too subtle to be observed by the human annotators or by us. We therefore reached the conclusion that the effectiveness of the ML models is explainable in large part through dialectal patterns, although the models can occasionally distinguish the samples based on their subject. In this regard, we believe that the newly introduced cross-genre setting, in which the models are trained on sentences from MOROCO and tested on tweets collected from a different time span, is more representative of a fair and realistic evaluation.
While our current study is focused on written dialect identification, we aim to address spoken dialect identification in future work. Since the spoken dialect bears more distinctive clues, it will allow us to include other Romanian sub-dialects in our study, e.g. those spoken in the Ardeal or Oltenia regions.

The authors thank the reviewers for their valuable feedback, which led to significant improvements of the manuscript.

References

Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter
Findings of the VarDial Evaluation Campaign
Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign
A Report on the Third VarDial Evaluation Campaign
A Report on the VarDial Evaluation Campaign
Findings of the VarDial Evaluation Campaign
MOROCO: The Moldavian and Romanian Dialectal Corpus
The R2I_LIS Team Proposes Majority Vote for VarDial's MRC Task
SC-UPB at the VarDial 2019 Evaluation Campaign: Moldavian vs. Romanian Cross-Dialect Topic Identification
Ensemble based on skip-gram and triplet loss neural networks for Moldavian vs. Romanian cross-dialect topic identification
Language Discrimination and Transfer Learning for Similar Languages: Experiments with Feature Combinations and Adaptation
Dialect Identification under Domain Shift: Experiments with Discriminating Romanian and Moldavian
Applying Multilingual and Monolingual Transformer-Based Models for Dialect Identification
A dual-encoding system for dialect classification
Experiments in Language Variety Geolocation and Dialect Identification
Exploring the Power of Romanian BERT for Dialect Identification
Discriminating between standard Romanian and Moldavian tweets using filtered character ngrams
Naive Bayes-based Experiments in Romanian Dialect Identification
Comparing the Performance of CNNs and Shallow Models for Language Identification
Dialect Identification through Adversarial Learning and Knowledge Distillation on Romanian BERT
Istoria Limbii Române (History of the Romanian Language)
Limba română: unitate în diversitate (Romanian language: unity in diversity)
Miniature Empires: A Historical Dictionary of the Newly Independent States
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
A Report on the DSL Shared Task
Overview of the DSL Shared Task
Arabic Dialect Identification in Social Media
Arabic Dialect Identification in the Wild
Spoken Arabic dialect recognition using X-vectors
ADI17: A Fine-Grained Arabic Dialect Identification Dataset
Arabic Dialect Identification for Travel and Twitter Text
Spoken Arabic Dialect Identification Using Phonotactic Modeling
The Arabic Online Commentary Dataset: An Annotated Dataset of Informal Arabic with High Dialectal Content
Arabic dialect identification with an unsupervised learning (based on a lexicon)
Fine-Grained Arabic Dialect Identification
The MADAR Shared Task on Arabic Fine-Grained Dialect Identification
UnibucKernel Reloaded: First Place in Arabic Dialect Identification for the Second Year in a Row
UnibucKernel: An Approach for Arabic Dialect Identification based on Multiple String Kernels
Learning to Identify Arabic and German Dialects using Multiple Kernels
Discriminative training of Gaussian mixture bigram models with application to Chinese dialect identification
Chinese Dialect Identification Using Tone Features Based on Pitch Flux
Semi-supervised learning based Chinese dialect identification
Chinese dialect identification based on gender classification
Chinese dialect identification based on DBF
Automatic dialect identification of extemporaneous conversational, Latin American Spanish speech
Dialect identification using Gaussian Mixture Models
Gaussian Mixture Selection and Data Selection for Unsupervised Spanish Dialect Classification
ACTIV-ES: a comparable, cross-dialect corpus of 'everyday' Spanish from Argentina, Mexico, and Spain
DART: A Large Dataset of Dialectal Arabic Tweets
The MADAR Arabic Dialect Corpus and Lexicon
Automatic Identification of Closely-related Indian Languages: Resources and Experiments
ArchiMob: A Corpus of Spoken Swiss German
Compendiu de dialectologie română (nord şi sud-dunăreană) (Compendium of Romanian Dialectology: North and South Danubian). Editura Ştiinţifică şi Enciclopedică
Studii de dialectologie şi toponimie (Studies in Dialectology and Toponymy)
Limba română. Privire generală. I (The Romanian Language: A General Overview). Minerva
Romanian-Speaking Communities Outside Romania: Linguistic Identities
On the syllabic structures of Aromanian
A Computational Perspective on the Romanian Dialects
Character-level Convolutional Networks for Text Classification
The Story of the Characters, the DNA and the Native Language
Enriching word vectors with subword information
The birth of Romanian BERT
HeLI-based Experiments in Swiss German Dialect Identification
Text classification and classifiers: a survey
A survey on text classification: From shallow to deep learning
Who is tweeting on Twitter: human, bot, or cyborg? In: Proceedings of ACSAC
Deep learning versus conventional machine learning for detection of healthcare-associated infections in French clinical narratives
Transfer learning applied to text classification in Spanish radiological reports
A Venture Towards the Lesser Error in Classifying Medical Self-Reporters on Twitter
Automatic text classification in information retrieval: A survey
Thumbs up? Sentiment Classification using Machine Learning Techniques
A survey of opinion mining and sentiment analysis. Mining Text Data
Sentiment analysis of movie reviews using machine learning techniques
Sentiment analysis for software engineering: How far can we go?
Sentix: A sentiment-aware pre-trained model for cross-domain sentiment analysis
Multi-label classification for recommender systems. Trends in Practical Applications of Agents and Multiagent Systems
Multi-document summarization based on link analysis and text classification
Improving multi-document summarization via text classification
Text classification in the Turkish marketing domain for context sensitive ad distribution
Automatic crowdsourcing-based classification of marketing messaging on Twitter
Application of text classification and clustering of Twitter data for business analytics
A dataset for Brazilian legal documents classification
Arabic Text Classification Using N-Gram Frequency Statistics: A Comparative Study
Arabic text classification using deep learning models
Text classification of web based news articles by using Turkish grammatical features
The evaluation of word embedding models and deep learning algorithms for Turkish text classification
Tuning the Turkish Text Classification Process Using Supervised Machine Learning-based Algorithms
Toxic Comment Classification For French Online Comments
Indian language text representation and categorization using supervised learning algorithm
Clustering Word Embeddings with Self-Organizing Maps. Application on LaRoSeDa: A Large Romanian Sentiment Data Set
Applied Logistic Regression
Text Classification Using String Kernels
Some Effective Techniques for Naive Bayes Text Classification
Verbal aggression detection on Twitter comments: convolutional neural network for short-text sentiment analysis
Hierarchical attention networks for document classification
Generative adversarial learning for robust text classification with a bunch of labeled examples
Text Classification of Manifestos and COVID-19 Press Briefings using BERT and Convolutional Neural Networks. arXiv preprint 2020
Topic Classification from Text Using Decision Tree, K-NN and Multinomial Naïve Bayes
Optimizing Semantic Deep Forest for tweet topic classification
News topic classification using mutual information and Bayesian network
Hierarchical hybrid attention networks for Chinese conversation topic classification
Pre-trained Contextualized Representation for Chinese Conversation Topic Classification
A hybrid Latent Dirichlet Allocation approach for topic classification
A linguistic approach for determining the topics of Spanish Twitter messages
Ensembles of Methods for Tweet Topic Classification
A Comparison of Oversampling Methods on Imbalanced Topic Classification of Korean News Articles
Label Topic Classification of Hadith of Bukhari (Indonesian Language Translation) Using Information Gain and Backpropagation Neural Network
Topic classification in Romanian blogosphere
Language-Agnostic Topic Classification for Wikipedia
Toward any-language zero-shot topic classification of textual documents
Cross-Lingual Classification of Topics in Political Texts
A Neural Probabilistic Language Model. Proceedings of NIPS
A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning
Efficient Estimation of Word Representations in Vector Space
Distributed Representations of Words and Phrases and their Compositionality
GloVe: Global Vectors for Word Representation
Improving Word Representations via Global Context and Multiple Word Prototypes
Multi-Prototype Vector-Space Models of Word Meaning
A Probabilistic Model for Learning Multi-Prototype Word Embeddings
ShotgunWSD 2.0: An Improved Algorithm for Global Word Sense Disambiguation
Word embeddings quantify 100 years of gender and ethnic stereotypes
Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach
Vector of Locally-Aggregated Word Embeddings (VLAWE): A Novel Document-level Representation
Learning Word Embeddings from Wikipedia for Content-Based Recommender Systems
WSABIE: Scaling up to Large Vocabulary Image Annotation
Using Word Embeddings in Twitter Election Classification
The Reference Corpus of the Contemporary Romanian Language (CoRoLa)
Computing distributed representations of words using the CoRoLa corpus
Learning Word Vectors for 157 Languages
Universal Dependencies v1: A Multilingual Treebank Collection
Bag of Tricks for Efficient Text Classification
Shared task: Multilingual parsing from raw text to universal dependencies
Lossless Compression Based on the Sequence Memoizer
A Stochastic Memoizer for Sequence Data
Generating Text with Recurrent Neural Networks
Improved Transition-Based Parsing by Modeling Characters instead of Words with LSTMs
Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation
Character-Aware Neural Language Models
Character-Level Language Modeling with Deeper Self-Attention
Text Classification Using String Kernels
Application of String Kernels in Protein Sequence Classification
Using String-Kernels for Learning Semantic Parsers
Using String Kernels to Identify Famous Performers from Their Playing Style
Dynamic Scene Understanding for Behavior Analysis Based on String Kernels
Single and Cross-domain Polarity Classification using String Kernels
Improving the results of string kernels in sentiment analysis and Arabic dialect identification by adapting them to your test set
HASKER: An efficient algorithm for string kernels. Application to polarity classification in various languages
Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation
Automated essay scoring with string kernels and word embeddings
Sentence selection with neural networks using string kernels
Can characters reveal your native language? A language-independent approach to native language identification
String Kernels for Native Language Identification: Insights from Behind the Curtains
Can string kernels pass the test of time in Native Language Identification?
A Framework for Space-Efficient String Kernels
A Fast Gapped k-mer String Kernel Using Counting
Big Data Classification Efficiency Based on Linear Discriminant Analysis
Rao-SVM Machine Learning Algorithm for Intrusion Detection System
Support-vector networks
Kernel Methods for Pattern Analysis
Ridge regression: Biased estimation for nonorthogonal problems
Ridge Regression Learning Algorithm in Dual Variables
Learning deep architectures for AI. Foundations and Trends in Machine Learning
Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position
ImageNet Classification with Deep Convolutional Neural Networks
Face recognition: A convolutional neural-network approach
Backpropagation Applied to Handwritten Zip Code Recognition
Learning methods for generic object recognition with invariance to pose and lighting
Convolutional Neural Networks for Sentence Classification
Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts
Generalization of backpropagation with application to a recurrent gas market model
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
On the Practical Computational Power of Finite Precision RNNs for Language Recognition
Gradient flow in recurrent nets: the difficulty of learning long-term dependencies
Long Short-Term Memory
Learning to forget: Continual prediction with LSTM
LSTM: A search space odyssey
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
Light Gated Recurrent Units for Speech Recognition
Massive Exploration of Neural Machine Translation Architectures
Acoustic Modeling Using Bidirectional Gated Recurrent Convolutional Units
Attention is all you need
Pre-training of Deep Bidirectional Transformers for Language Understanding
A Dutch BERT Model. arXiv preprint 2019
CamemBERT: a Tasty French Language Model
Unsupervised Language Model Pre-training for French
Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets
Popular ensemble methods: An empirical study
Ensemble-based classifiers
Decision tree ensemble: Small heterogeneous is better than large homogeneous
Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy
Learning with ensembles: How overfitting can be useful
Stacked generalization