Investigating Transformers for Automatic Short Answer Grading
Leon Camus, Anna Filighera
Artificial Intelligence in Education, 2020-06-10. DOI: 10.1007/978-3-030-52240-7_8

Recent advances in deep learning for natural language processing have made it possible to apply novel architectures, such as the Transformer, to increasingly complex natural language processing tasks. Combined with unsupervised pre-training tasks such as masked language modeling, sentence ordering, and next sentence prediction, these models have become even more accurate. In this work, we experiment with fine-tuning different pre-trained Transformer-based architectures. We train the newest and, according to the GLUE benchmark, most powerful Transformers on the SemEval-2013 dataset. We also explore the impact on generalization and performance of transferring a model fine-tuned on the MNLI dataset to the SemEval-2013 dataset. We report up to 13% absolute improvement in macro-average F1 over state-of-the-art results. We show that models trained with knowledge distillation are feasible for use in short answer grading. Furthermore, we compare multilingual models on a machine-translated version of the SemEval-2013 dataset.

Online tutoring platforms enable students to learn individually and independently. To provide users with individual feedback on their answers, the answers have to be graded. Large tutoring platforms cover an abundance of domains and questions, which makes building a general system for short answer grading challenging, since domain-related knowledge is frequently needed to evaluate an answer. Additionally, the increasing accuracy of short answer grading systems makes it feasible to employ them in examinations. In that scenario it is desirable to achieve the maximum possible accuracy even at a relatively high computational budget, whereas in tutoring a less computationally intensive model is desirable to keep costs down and increase responsiveness. In this work, we experiment with fine-tuning the most common Transformer models and explore the following questions: Does the size of the Transformer matter for short answer grading? How well do multilingual Transformers perform? How well do multilingual Transformers generalize to another language? Are there better pre-training tasks for short answer grading? Does knowledge distillation work for short answer grading?

The field of short answer grading can mainly be categorized into two classes of approaches: traditional approaches based on handcrafted features [14, 15], and deep learning based approaches [1, 8, 13, 16, 18, 21]. One of the core constraints of short answer grading remains the limited availability of labeled, domain-relevant training data. This issue was mitigated by transfer learning from models pre-trained on unsupervised tasks, as shown by Sung et al. [21], who outperformed previous approaches by about twelve percent. In this study, we aim to extend the insights provided by Sung et al. [21].

We evaluate our proposed approach on the SemEval-2013 dataset [5]. The dataset consists of questions, reference answers, student answers, and three-way labels representing the correct, incorrect, and contradictory classes. We additionally translate it into German with the winning system of WMT19 [2]. For further information, see Sung et al. [21].
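The paper does not state which toolkit or checkpoint was used for this translation step, so the following is only a minimal sketch of how such a translation could be performed, assuming the Hugging Face port of Facebook FAIR's WMT19 English-German system; the model name, decoding parameters, and example sentence are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative sketch only: translating English answers into German with the
# Hugging Face port of FAIR's WMT19 en-de system. The paper does not specify
# which implementation of the WMT19 winner [2] the authors actually used.
from transformers import FSMTForConditionalGeneration, FSMTTokenizer

MODEL_NAME = "facebook/wmt19-en-de"  # assumed checkpoint, not confirmed by the paper
tokenizer = FSMTTokenizer.from_pretrained(MODEL_NAME)
model = FSMTForConditionalGeneration.from_pretrained(MODEL_NAME)

def translate(text: str) -> str:
    """Translate a single English sentence into German."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    generated = model.generate(**inputs, num_beams=5, max_length=256)
    return tokenizer.decode(generated[0], skip_special_tokens=True)

# Hypothetical student answer, used only to demonstrate the call.
print(translate("Separating the charges creates a potential difference."))
```

Applied to every question, reference answer, and student answer, this kind of pipeline would yield a German version of the corpus such as the one evaluated below.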
We also perform transfer learning from a model previously fine-tuned on the MNLI dataset [22]. For training and later comparison we utilize a variety of models, including BERT [4], RoBERTa [11], ALBERT [10], XLM [9], and XLM-RoBERTa [3]. We also include distilled versions of BERT and RoBERTa in the study [19]. Furthermore, we include a RoBERTa-based model previously fine-tuned on the MNLI dataset. For fine-tuning we add a classification layer on top of every model. We use the AdamW optimizer [12] with a learning rate of 2e-5 and a linear learning rate schedule with warm-up. For large Transformers we extend the number of epochs to 24, but we also observe notable results with 12 epochs or fewer. We train on a single NVIDIA 2080 Ti GPU (11 GB) with a batch size of 16, utilizing gradient accumulation; larger batches did not seem to improve the results. To fit large Transformers into GPU memory we use a combination of gradient accumulation and mixed precision with 16-bit floating point numbers, provided by NVIDIA's Apex library. We implement our experiments using Hugging Face's Transformers library [23] and will release our training code on GitHub. To ensure comparability, all of the presented models were trained with the same code, setup, and hyperparameters (Table 1).

Does the size of the Transformer matter for short answer grading? Large models demonstrate a significant improvement compared to Base models (Table 1). The improvement most likely arises from the increased capacity of the model, as more parameters allow it to retain more information from the pre-training data.

How well do multilingual Transformers perform? The XLM-based models [9] do not perform well in this study. The RoBERTa-based models (XLM-RoBERTa) seem to generalize better than their predecessors. XLM-RoBERTa performs similarly to the base RoBERTa model, falling behind in the unseen questions and unseen domains categories. Subsequent investigations could include fine-tuning the large variant on MNLI and SciEntsBank; due to GPU memory constraints, we were not able to train the large variant of this model.

How well do multilingual Transformers generalize to another language? The models with multilingual pre-training show stronger generalization across languages than their English counterparts. We observe that the score of the multilingual model increases on languages it was never fine-tuned on, while the monolingual model does not generalize.

Are there better pre-training tasks for short answer grading? Transferring a model from MNLI yields a significant improvement over the same version of the model not fine-tuned on MNLI. It improves the model's ability to generalize to a separate domain. The model's performance on the German version of the dataset is also increased, despite the use of a monolingual model. The reason for this behavior should be investigated further.

Does knowledge distillation work for short answer grading? Using models pre-trained with knowledge distillation yields a slightly lower score. However, since the model is 40% smaller, a maximum decrease in performance of about 2% relative to the previous state of the art may be acceptable in scenarios where computational resources are limited.
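To make the training setup above concrete, the following minimal fine-tuning sketch uses Hugging Face's Transformers library with the hyperparameters reported above (AdamW, learning rate 2e-5, linear schedule with warm-up, effective batch size of 16 via gradient accumulation). It is not the authors' released code: the input composition (reference answer paired with student answer, in the spirit of Sung et al. [21]), the warm-up proportion, the accumulation steps, the label mapping, and the toy data are illustrative assumptions, and mixed precision is omitted for brevity.

```python
# Minimal fine-tuning sketch, not the authors' released training code.
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          get_linear_schedule_with_warmup)

MODEL_NAME = "roberta-large"   # any of the compared checkpoints could be used here
NUM_LABELS = 3                 # correct / incorrect / contradictory
EPOCHS = 12                    # the paper reports notable results with 12 epochs or fewer
MICRO_BATCH = 4                # small per-step batch to fit an 11 GB GPU
ACCUM_STEPS = 4                # 4 * 4 = effective batch size of 16

# Toy data standing in for the SemEval-2013 answer pairs (assumed mapping: 0 = correct).
reference_answers = ["Separating the charges creates a potential difference."] * 16
student_answers = ["The charge separation makes a voltage."] * 16
labels = [0] * 16

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)
model.to(device)

# Encode each reference/student answer pair as a single sequence-pair input.
encodings = tokenizer(reference_answers, student_answers,
                      truncation=True, padding=True, return_tensors="pt")
dataset = TensorDataset(encodings["input_ids"], encodings["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=MICRO_BATCH, shuffle=True)

optimizer = AdamW(model.parameters(), lr=2e-5)
total_steps = max(1, EPOCHS * len(loader) // ACCUM_STEPS)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * total_steps), num_training_steps=total_steps)

model.train()
for epoch in range(EPOCHS):
    for step, (input_ids, attention_mask, batch_labels) in enumerate(loader):
        outputs = model(input_ids=input_ids.to(device),
                        attention_mask=attention_mask.to(device),
                        labels=batch_labels.to(device))
        (outputs.loss / ACCUM_STEPS).backward()      # accumulate gradients over micro-batches
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```

In a comparison like the one reported in Table 1, the same loop would be rerun for each checkpoint with only MODEL_NAME changed, which is what keeps code, setup, and hyperparameters identical across models.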
In this paper, we demonstrate that large Transformer-based pre-trained models achieve state-of-the-art results in short answer grading. We show that models trained on the MNLI dataset are capable of transferring knowledge to the task of short answer grading. Moreover, we were able to increase a model's overall score by training it on multiple languages. We show that the skills developed by a model trained on MNLI improve generalization across languages, and that cross-lingual training improves scores on SemEval-2013. Finally, we show that knowledge distillation allows for good performance while keeping computational costs low, which is crucial when evaluating answers from many users, as on online tutoring platforms. Future research should investigate the impact of context on the classification. Including the question or its source may help the model grade answers that were not considered during the creation of the reference answers.

References
[1] Automatic text scoring using neural networks
[2] Findings of the 2019 Conference on Machine Translation (WMT19)
[3] Unsupervised cross-lingual representation learning at scale
[4] BERT: pre-training of deep bidirectional transformers for language understanding
[5] SemEval-2013 Task 7: the joint student response analysis and 8th recognizing textual entailment challenge
[6] ETS: domain adaptation and stacking for short answer scoring
[7] SoftCardinality: hierarchical text overlap for student response analysis
[8] Earth mover's distance pooling over Siamese LSTMs for automatic short answer grading
[9] Cross-lingual language model pretraining
[10] ALBERT: a lite BERT for self-supervised learning of language representations
[11] RoBERTa: a robustly optimized BERT pretraining approach
[12] Fixing weight decay regularization in Adam
[13] Creating scoring rubric from representative student answers for improved short answer grading
[14] Learning to grade short answer questions using semantic similarity measures and dependency graph alignments
[15] Text-to-text semantic similarity for automatic short answer grading
[16] Siamese recurrent architectures for learning sentence similarity
[17] Generating reference texts for short answer scoring using graph-based summarization
[18] Sentence level or token level features for automatic short answer grading? Use both
[19] DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
[20] Fast and easy short answer grading with high accuracy
[21] Improving short answer grading using transformer-based pre-training
[22] A broad-coverage challenge corpus for sentence understanding through inference
[23] HuggingFace's Transformers: state-of-the-art natural language processing

Acknowledgements. We would like to thank Prof. Dr. rer. nat. Karsten Weihe, M.Sc. Julian Prommer, the Department of Didactics, and Nena Marie Helfert for supporting and reviewing this work.