Title: Text Classification for Predicting Multi-level Product Categories
Authors: Jahanshahi, Hadi; Ozyegen, Ozan; Cevik, Mucahit; Bulut, Beste; Yigit, Deniz; Gonen, Fahrettin F.; Başar, Ayşe
Date: 2021-09-02

Abstract. In an online shopping platform, a detailed classification of the products facilitates user navigation. It also helps online retailers keep track of price fluctuations in a certain industry or special discounts on a specific product category. Moreover, an automated classification system may help to pinpoint incorrect or subjective categories suggested by an operator. In this study, we focus on product title classification of grocery products. We perform a comprehensive comparison of six different text classification models to establish a strong baseline for this task, testing both traditional and recent machine learning methods. In our experiments, we investigate the generalizability of the trained models to the products of other online retailers, the dynamic masking of infeasible subcategories for pretrained language models, and the benefits of incorporating product titles in multiple languages. Our numerical results indicate that dynamic masking of subcategories is effective in improving prediction accuracy. In addition, we observe that using bilingual product titles is generally beneficial, and that neural network-based models perform significantly better than SVM and XGBoost models. Lastly, we investigate the reasons for the misclassified products and propose future research directions to further enhance the prediction models.

E-commerce platforms have become increasingly popular over the years. The interest in e-commerce has only grown with the COVID-19 pandemic, which resulted in the proliferation of e-commerce companies [4]. This in turn increased the competition in the e-commerce field and led to significant investments by companies to enhance their platforms. To facilitate user navigation, e-commerce platforms list their items within appropriate categories. From the sellers' perspective, proposing the appropriate category for a product given its description can be cumbersome and time-consuming. The complexity of the task increases further with the introduction of multi-level categorization. For instance, a milk product can be categorized under the dairy category and the milk subcategory. As the number of products sold on an e-commerce platform increases, it becomes more difficult for online platforms to keep track of hundreds or thousands of categories. Product category classification models aim to automate finding the right category for a given product. In most cases, the only available information is the product title and description. To automatically categorize products, online retailers can use product category classification models instead of manually scanning all categories to find the suitable one for each product. At first glance, the product title classification problem can be considered a variant of the widely studied text classification problem. While there are certain similarities between the two, in the product category classification problem the input titles might differ significantly in terms of the length of each instance, the distribution of lengths, and the grammatical structure of the input [24].
Thus, various strategies have been proposed to extract the most information from short texts. These strategies include using context-relevant concept word embeddings [22], using both word-level and character-level features to capture fine-grained subword information [17], using word-cluster embeddings [14], and data augmentation [13]. Some examples of product titles and their corresponding category and subcategory labels are shown in Table 1. Such a product categorization has three significant benefits for an online retailer. First, it can assist buyers in navigating online platforms. A high-quality categorization of the products results in a more efficient and satisfactory user experience. Second, it allows online retailers to control sales and marketing operations in a more organized manner. They can easily add new products to their system and track aggregated information about various product categories in real time. Finally, online retailers can classify and trace the available products of other online retailers. Using the predictions of product title classification models on other online retailers' datasets, companies can track aggregated information about the availability of various product categories.

Research Goals. We consider a specific application of product title classification over grocery products. We aim to study the grocery product title classification task in detail. We summarize the contributions of our study as follows.

• To the best of our knowledge, this is the first study that focuses on grocery product title classification.
• We perform a comprehensive comparison of six different text classification models on the grocery product title classification task. Our experiments establish a strong baseline for this task by testing both traditional and recent NLP methods.
• We investigate several strategies, such as leveraging product titles in multiple languages and dynamically masking the infeasible subcategories for pretrained language models, to obtain better predictive performance for the product title classification task.
• We measure the generalizability of product title classification models by evaluating the trained models on six different datasets obtained from different online retailers.
• We identify the challenges of grocery product title classification through a detailed analysis of the model predictions.

Structure of the Paper. The rest of the paper is organized as follows. Section 2 provides a brief discussion of the relevant literature on hierarchical product category classification, followed by a description of the employed methodology and datasets in Section 3. Section 4 explores the results of within- and cross-platform multilevel product category classification, as well as insights into the products whose labels the models fail to predict. Finally, Section 5 describes the limitations and threats to validity, and Section 6 provides concluding remarks along with future research directions.

Hierarchical product category classification is a challenging task. It requires product instances to be carefully assigned to multiple levels of categories. Over the past years, interest in this problem has increased with the rise of online shopping and the availability of large datasets. Yu et al. [24] provide one of the first studies in this area. They conduct an extensive numerical study to illustrate how linear SVMs can be used for large-scale multi-class title classification and identify the differences between product title classification and text classification.
They use a dataset from a large internet company, which contains 29 classes, and propose a multi-class SVM model for the classification task. They also compare the effectiveness of different feature representations. Their experiments show that stemming and stop word removal are harmful, whereas bigrams are effective.

There have been significant improvements in NLP models over the past decade. For word representations, methods such as GloVe [12] and Word2Vec [11] became increasingly popular. More recently, advanced NLP models such as BERT [3], RoBERTa [9], and XLM [10] have been shown to achieve state-of-the-art performance on many language tasks. These models, also known as pre-trained language models (PTMs), are used somewhat differently from the machine learning models previously considered for NLP tasks. They are first trained on large-scale unlabeled corpora to acquire a good understanding of natural language. Then, depending on the task, a few layers are attached to the end of the "pre-trained" base model. Afterwards, the full network is fine-tuned end-to-end on a smaller task-specific corpus. There are additional advantages of using PTMs over traditional methods. The same base model can be used for many NLP tasks with computationally inexpensive task-specific fine-tuning. Furthermore, in most cases, a small hyperparameter tuning setup that includes a range of batch sizes, learning rates, and numbers of epochs is recommended for fine-tuning these models [3].
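As an illustration of this fine-tuning recipe, the sketch below attaches classification heads to a pretrained base model using the TensorFlow and Transformers libraries; it also previews the two-head design described in Section 3. This is a minimal example, not the paper's exact configuration: the sequence length, the label counts, and the [CLS]-token pooling are assumptions made for the sake of the example.

# A minimal fine-tuning sketch: a multilingual base model with two
# classification heads (category and subcategory) attached on top.
# Checkpoint, sequence length, and label counts are illustrative.
import tensorflow as tf
from transformers import TFAutoModel

base = TFAutoModel.from_pretrained("bert-base-multilingual-uncased")

input_ids = tf.keras.Input(shape=(64,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(64,), dtype=tf.int32, name="attention_mask")

# Contextual representation of the title; here we pool the first ([CLS]) token.
hidden = base(input_ids=input_ids, attention_mask=attention_mask)[0][:, 0, :]

num_categories, num_subcategories = 20, 150  # hypothetical label counts
cat_out = tf.keras.layers.Dense(num_categories, activation="softmax", name="category")(hidden)
sub_out = tf.keras.layers.Dense(num_subcategories, activation="softmax", name="subcategory")(hidden)

model = tf.keras.Model([input_ids, attention_mask], [cat_out, sub_out])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss="sparse_categorical_crossentropy",  # the same loss is applied to both heads
)
# The full network (base model plus heads) is then fine-tuned end-to-end
# with model.fit(...) on the task-specific corpus.

Because the same base model serves both heads, task-specific fine-tuning remains computationally inexpensive relative to pretraining.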
The adoption of pre-trained language models can also be seen in the most recent work in the product category classification domain [26, 16]. Most of the recent literature on product category classification problems can be found in the "Semantic Web Challenge" competition and the case studies published by the competing teams [26]. The second part of the challenge focuses on multi-level product category classification. The dataset considered in the competition contains more than 15,000 product instances randomly sampled from 702 vendors' websites. The products are labeled according to the GPC hierarchy. As baseline models, teams tested the same configuration proposed by [16]. The best-performing approach employed a Dynamic Masked Softmax function that explicitly considers the dependencies among different category levels [26]. The dynamic masking of the subcategories based on the predicted category reduces the complexity of the optimization problem by filtering out the child categories unrelated to the predicted parent category. To the best of our knowledge, our work constitutes the first study on multi-level classification for predicting grocery product categories. We create an extensive list of text classification models to address this problem. Unlike previous works, we leverage bilingual models to improve prediction performance based on Turkish and English product titles. Finally, we discuss the challenges of the grocery product category classification task through a detailed numerical study with datasets from different online retailers.

In this section, we briefly discuss the datasets and the methods used for our multilevel product category classification task. Moreover, we provide more details on our experimental setup, including the evaluation metrics and parameter settings.

Our datasets are obtained from online grocery markets in Turkey. Specifically, we mined product information from seven online grocery websites in this domain. We use one platform as the training set and the others as the test sets. Table 2 shows the number of unique products mined from these websites. The mining phase was carried out at different times of the year to ensure that all products sold by the companies are included. As there are inconsistencies in the category and subcategory naming across websites, we include in the test sets only those products whose categories and subcategories are available in the training set. The number of products, categories, and subcategories before and after the cleaning phase is shown in Table 2. We select a medium-sized dataset as the training set to test the performance of the algorithms. This process can be replicated using any other test set as the training set. However, we deliberately avoid the most comprehensive datasets, e.g., Test Set-5, since the model may then indicate a performance level that is not generalizable to other datasets. Another factor in choosing this specific dataset is its multilingual platform. As we aim to examine model performance when using both Turkish and English titles, we select a platform that provides this feature. Figure 1 shows the distribution of the products' textual information, which is used as the independent feature in the classification models. The textual data follow a similar length pattern and are mostly short. The average number of words in each product title is 6.2. It is worthwhile to note that having few words per title may degrade model performance and make the learning process more difficult. Figure 2 shows the most frequent bigrams of product titles together with their frequencies in the training set. We observe that the most frequent bigrams correspond to common grocery items.

In our analysis, we consider various traditional text classification models such as XGBoost, SVM, and LSTMs, which we briefly summarize below.

XGBoost with Weighted Word Embedding. XGBoost, a scalable tree boosting method [1], creates a group of weak trees by adding instances with the highest contribution to the model's learning process. Textual information cannot be used directly by XGBoost; it must first be converted to numeric values. TF-IDF is frequently used to this end. However, frequency-based approaches overlook the semantics and syntax of the words. Therefore, as suggested by Stein et al. [15], we use word embeddings as the numeric representation and compute a weighted average of the word vectors in a product title, where the weights are the TF-IDF scores of each word [6]. By using TF-IDF scores as weights, we aim to give higher weights to the more important words. In our preliminary analysis, we compared this representation with a TF-IDF-only version as well as a simple word embedding average, and found it to be more efficient, with better overall performance. Moreover, we experiment with different word embeddings to identify the best-performing approach for converting the textual information given its context.

SVM with Weighted Word Embedding. Support Vector Machine (SVM) is a widely used text classification method in different domains [19, 18, 5]. Support vectors (i.e., the data points closest to the hyperplane) are selected such that the classifier's margin is maximized. This model is able to independently learn feature space dimensions and can be used without feature selection. Similar to XGBoost, SVM needs a numeric representation of the words. In our experiments, we employ the same approach, i.e., a word embedding weighted average using TF-IDF, for the sake of consistency.
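To make this representation concrete, the following is a minimal sketch of a TF-IDF-weighted embedding average, assuming a plain dictionary as the embedding lookup; in practice the vectors would come from pretrained Word2Vec, GloVe, or FastText models, and the function name and dimensionality here are hypothetical.

# A sketch of the TF-IDF-weighted word-embedding representation used for
# XGBoost and SVM. The embedding lookup is a plain dict here; in practice it
# would be loaded from pretrained vectors (e.g., Word2Vec or GloVe).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def weighted_title_vectors(titles, embeddings, dim=300):
    """Represent each title as the TF-IDF-weighted mean of its word vectors."""
    tfidf = TfidfVectorizer()
    weights = tfidf.fit_transform(titles)          # sparse (n_titles, n_vocab)
    vocab = tfidf.vocabulary_                      # word -> column index
    vectors = np.zeros((len(titles), dim))
    for i, title in enumerate(titles):
        total = 0.0
        for word in tfidf.build_analyzer()(title):
            if word in vocab and word in embeddings:
                w = weights[i, vocab[word]]        # TF-IDF weight of this word
                vectors[i] += w * embeddings[word]
                total += w
        if total > 0:
            vectors[i] /= total                    # weighted average
    return vectors

# Usage with a toy embedding table (hypothetical values):
emb = {"tea": np.random.rand(300), "sugar": np.random.rand(300)}
X = weighted_title_vectors(["black tea", "cube sugar"], emb)

The resulting fixed-length vectors can then be passed directly to XGBoost or SVM classifiers.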
Bi-directional LSTM. Long short-term memory networks (LSTM), a particular type of recurrent neural network, can capture both long- and short-term effects in textual information using input, output, and forget gates [8]. The LSTM's ability to decide when to learn new, relevant information and when to forget old, irrelevant information makes it a suitable tool for text classification tasks. Since the title of a product can be lengthy and may include less relevant information, the forget gates can filter out this kind of information. In this study, we use Bi-directional LSTM (BiLSTM) units, which learn the textual information from both directions and then combine it into a single representation using convolutional neural networks [7]. In our analysis, we employ two different LSTM architectures: one that predicts the labels using Turkish titles, and a bilingual parallel LSTM that is fed both Turkish and English titles (see Figure 3). We use two independent networks for category and subcategory prediction. Their inputs are product names in English and Turkish; accordingly, the word embeddings of the proper language are used. However, their output dimensions differ and are equal to the number of categories and subcategories, respectively. Each network is thus given the product name and expected to return the associated category or subcategory. In this approach, we have two separate models for the English and Turkish titles.

For all the large pretrained language models (e.g., BERT, XLM, XLM-RoBERTa), we fine-tune the models by attaching two fully connected layers, for the category and the subcategory, to the output of the base models. The contextual representation of the model input (e.g., the product title) is obtained from the base model.

XLM. The XLM model is a cross-lingual language model pretrained with Masked Language Modeling (MLM) and Translation Language Modeling (TLM) objectives [3].

XLM-RoBERTa. Liu et al. [9] propose a variety of enhancements over the original BERT architecture and achieve better results on various NLP benchmark datasets. Their primary modifications to the BERT model include using additional datasets, changing some initial hyperparameters, removing the next-sentence pretraining objective, and training with larger batch sizes. For our experiments, we use the multilingual version of this architecture, pretrained on 2.5 TB of CommonCrawl data in 100 languages using a masked language modeling (MLM) objective [20].

The architecture we employ for multi-level classification using pretrained language models is illustrated in Figure 4. In the standard approach, two separate Softmax layers produce the category and subcategory predictions, where C is the number of categories and S is the number of subcategories. We then compute the Dynamic Masked Softmax instead of the regular Softmax for the subcategory predictions as follows:

y_s = P(s | c; θ) = mask(c, s) · exp(z_s) / Σ_{s'=1}^{S} mask(c, s') · exp(z_{s'})

where c and s are the category and subcategory labels, z_s is the model's score for subcategory s, mask(c, s) equals 1 if subcategory s is a child of category c and 0 otherwise, θ denotes the model parameters, and y_s is the predicted probability of the subcategory. This design can also be extended to more than two levels if needed. In our numerical analysis, we experiment with different configurations to measure the impact of the Dynamic Masked Softmax by training three language models.
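A small sketch of this masking step is given below, assuming the category-to-subcategory relationships are encoded in a binary matrix M; the matrix contents and variable names are illustrative.

# A sketch of the Dynamic Masked Softmax described above. M is a binary
# (num_categories x num_subcategories) matrix encoding which subcategories
# are feasible under each category; its contents here are hypothetical.
import numpy as np

def dynamic_masked_softmax(subcategory_logits, predicted_category, mask_matrix):
    """Renormalize subcategory scores over the children of the predicted category."""
    feasible = mask_matrix[predicted_category]      # 0/1 vector over subcategories
    logits = np.where(feasible == 1, subcategory_logits, -np.inf)
    exp = np.exp(logits - np.max(logits))           # numerically stable softmax
    return exp / exp.sum()

# Toy example: 2 categories, 4 subcategories.
M = np.array([[1, 1, 0, 0],     # category 0 owns subcategories 0 and 1
              [0, 0, 1, 1]])    # category 1 owns subcategories 2 and 3
z = np.array([2.0, 0.5, 1.5, -1.0])
print(dynamic_masked_softmax(z, predicted_category=0, mask_matrix=M))
# All probability mass falls on subcategories 0 and 1.

Because infeasible subcategories receive zero probability, the subcategory head only has to discriminate among the children of the predicted category.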
Our experimental setup is illustrated in Figure 5 and consists of two parts. In the first part, we apply five-fold cross-validation to the training set described in Section 2. At this stage, we perform experiments to measure the category and subcategory prediction accuracy of different models and word embeddings. Additionally, we investigate the advantage of using bilingual product titles and, finally, the benefit of applying dynamic masking of subcategories for the pretrained language models. In the second part, we take the best models trained on the training data and evaluate their performance on the test sets to measure the generalizability of the trained models.

Evaluation Metrics. We use two generic metrics to assess model performance: accuracy and the weighted-average macro F1-score (WAF1). Accuracy is an easy-to-interpret metric that shows how often the model is correct. We compute the accuracy for both category and subcategory predictions. However, when comparing the models, we consider the weighted-average F1-scores calculated over all the categories at each classification level. We compute the F1-score as the harmonic mean of the precision and recall scores:

F1-score = 2 × (precision × recall) / (precision + recall)

F1-scores are calculated for each class independently, and a weighted average is taken to obtain the WAF1 score. We rank the models using the same aggregated metric proposed by Zhang et al. [26], i.e., by averaging the WAF1 for the category and the subcategory.

Parameter Settings. For the implementation of the BiLSTM and the pretrained language models, we use the TensorFlow and Transformers [21] libraries. We fine-tune the bert-base-multilingual-uncased, xlm-mlm-100-1280, and jplu/tf-xlm-roberta-base versions of the pre-trained language models in the Transformers library for BERT, XLM, and XLM-RoBERTa, respectively. For all the pretrained language models, we use the Adam optimizer with an initial learning rate of 3e-05 and a batch size of 16. During the training of each model, early stopping is applied to avoid overfitting: training is stopped when no performance improvement is observed on the validation set for 10 epochs. We then store the model weights corresponding to the best performance on the validation set. We use grid search to tune the parameters of the SVM, XGBoost, and BiLSTM as well. This process is done using a separate validation set, as discussed in Sections 4.1 and 4.4. We use the scikit-learn library in Python for these implementations. Table 3 lists the hyperparameters used for each model.
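The following sketch outlines this evaluation loop with scikit-learn, assuming a hypothetical build_model factory whose fit and predict interface wraps any of the classifiers above; X, y_cat, and y_sub stand for the title representations and the two label levels as NumPy arrays.

# Stratified 5-fold cross-validation with a 90-10 train/validation split in
# each fold, scored by WAF1 per level and the aggregate ranking metric.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import f1_score

def cross_validate(build_model, X, y_cat, y_sub):
    aggregate_scores = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for train_idx, test_idx in skf.split(X, y_sub):
        # Carve a validation set out of the training fold for tuning/early stopping.
        tr_idx, val_idx = train_test_split(
            train_idx, test_size=0.1, stratify=y_sub[train_idx], random_state=42)
        model = build_model()
        model.fit(X[tr_idx], y_cat[tr_idx], y_sub[tr_idx],
                  X[val_idx], y_cat[val_idx], y_sub[val_idx])
        cat_pred, sub_pred = model.predict(X[test_idx])
        waf1_cat = f1_score(y_cat[test_idx], cat_pred, average="weighted")
        waf1_sub = f1_score(y_sub[test_idx], sub_pred, average="weighted")
        # Aggregate ranking metric: mean of category and subcategory WAF1.
        aggregate_scores.append((waf1_cat + waf1_sub) / 2)
    return np.mean(aggregate_scores)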
We focus on two particular sets of experiments: finding the best model using cross-validation on the baseline dataset, and measuring generalizability on multiple datasets mined from various other online retail stores. We obtained our datasets from Getir, an online food and grocery delivery company that originated in Turkey and recently expanded its operations to the United Kingdom and the Netherlands. Since Getir primarily operates in Turkey, the collected products usually have Turkish titles. As such, the classification models can benefit significantly from multilingual word embeddings. In our numerical study, we first explore the performance of the baseline models together with recent deep learning approaches and choose the top classifiers for the next step. Then, we utilize those models to predict the categories and subcategories of the other online grocery retailers. Finally, we provide detailed insights on the model performance, i.e., where the models fail in their predictions and how different word embeddings affect their performance.

The baseline dataset provides full information for 3,228 unique products that are sold online. For this experiment, we use the products' Turkish and English titles, categories, and subcategories. We take advantage of the availability of bilingual product titles and investigate different word embedding approaches. We also utilize a comprehensive list of multi-class classification algorithms, including more traditional machine learning algorithms, e.g., SVM and XGBoost, together with more recent deep learning algorithms, e.g., LSTM and BERT. Using stratified cross-validation, we split the dataset into five folds, where one fold is used as the test set and the rest as the training set; this process is repeated five times over all folds. Using a 90-10 division, we create a validation set out of the training set and optimize the model parameters accordingly. This approach enables us to find suitable parameters while avoiding overfitting. The self-prediction phase leads to a list of candidate models to be applied in the cross-platform multilevel classification task. The results are broadly consistent with those reported in earlier studies [26, 23]. We also note that dynamic masking of subcategories increases the performance of all the pretrained language models. However, the use of bilingual titles only increases the accuracy of the BERT architecture. These results are discussed further in Section 4.2 and Section 4.3. Overall, the results show that the BERT architecture with bilingual titles and dynamic masking performs best, followed by the BiLSTM model with bilingual titles.

For the pretrained language models, we investigate the benefits of dynamic masking on multi-level product category classification. We train three pretrained language models with and without masking and measure category and subcategory prediction performance. The results are provided in Table 4, where the models with dynamic masking carry the superscript M. We also train the models on Turkish titles with and without the masking. Table 4 shows that masked models always lead to better overall performance. This observation is intuitive, since masking simplifies the classification task by reducing the number of subcategory classes to those under the predicted category. Thus, we observe that the category accuracy remains similar while the subcategory accuracy increases when the mask is employed.

When the dataset contains product titles in multiple languages, it is possible to leverage this information for better prediction performance. Thus, we train BiLSTM, BERT M, XLM M, and XLM-RoBERTa M with bilingual titles as well. For the BiLSTM model, we use the GloVe embeddings, which perform better than the Word2Vec and FastText embeddings. The BiLSTM model that uses both English and Turkish embeddings performs significantly better than the BiLSTM models that only use Turkish or English embeddings (see Table 4). For the bilingual pretrained language models, the Turkish and English product titles are provided to the model jointly.
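A compact Keras sketch of such a bilingual parallel architecture is shown below. It merges the two BiLSTM branches by concatenation and uses a single classification head, which simplifies the two-network, CNN-combined design described in Section 3; all dimensions, vocabulary sizes, and the function name are illustrative.

# A sketch of a bilingual parallel BiLSTM in the spirit of Figure 3: Turkish
# and English titles are encoded by separate BiLSTM branches whose outputs
# are combined before the classification head.
import tensorflow as tf
from tensorflow.keras import layers

def bilingual_bilstm(vocab_tr=20000, vocab_en=20000, emb_dim=300,
                     num_labels=150, max_len=20):
    inp_tr = tf.keras.Input(shape=(max_len,), name="turkish_title")
    inp_en = tf.keras.Input(shape=(max_len,), name="english_title")

    # Each language gets its own embedding table (pretrained vectors in practice).
    enc_tr = layers.Bidirectional(layers.LSTM(128))(layers.Embedding(vocab_tr, emb_dim)(inp_tr))
    enc_en = layers.Bidirectional(layers.LSTM(128))(layers.Embedding(vocab_en, emb_dim)(inp_en))

    merged = layers.Concatenate()([enc_tr, enc_en])
    out = layers.Dense(num_labels, activation="softmax")(merged)
    return tf.keras.Model([inp_tr, inp_en], out)

model = bilingual_bilstm()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")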
As a market analysis, we crawl data related to six online retailers. The description of each dataset after data cleaning is reported in Table 2. Not all online grocery shops have an English version; therefore, we use only Turkish titles to be consistent across all the platforms. Moreover, the mined datasets contain some categories and subcategories that do not exist in the baseline dataset. Accordingly, we only consider the products whose categories and subcategories exist in our baseline dataset. We use 80 percent of our dataset as the training set and the rest as the validation set to identify suitable model parameters. We then examine the feasibility of cross-platform multilevel product category classification. BiLSTM, BERT M, XLM M, and XLM-RoBERTa M are selected as the top models based on our experiments for within-platform classification. Table 5 summarizes the performance of these models on each dataset, with the best results highlighted.

To investigate the products in the other (test) datasets for which the models fail to predict the category or subcategory, we visually compare the predicted values with the ground truth. Our observations on the misclassifications are as follows.

• If a product exists in the test set but not in the training set, or is worded differently than in the training set, a misclassification may occur.
• Some brand names have a general meaning that affects the model's prediction. For instance, the manufacturer "Doğuş", whose name means "nativity" in Turkish, produces beverage products, while its literal meaning conveys a different notion to the prediction models.
• Product categories can be subjective. For instance, one company categorizes a product as a dairy product, whereas another assigns it to beverages. This issue cannot be addressed in the cleaning phase, as we deal with an extensive list of product names and categories in this work; on the other hand, a manual check would still carry the aforementioned subjectivity. Therefore, we rely on the category naming as is.
• Some product titles convey a meaning that can be interpreted differently by a machine learning algorithm. For instance, the title of a book about cooking might be categorized as food.

Table 6 lists some products for which our proposed model fails to predict the exact category or subcategory. "Report to El Greco" is the name of a book and is categorized under Newspapers and Magazines; however, the model suggests paper products as its subcategory. The presence of the word "report" explains this suggestion. The Doğuş company is famous for its diverse tea products in Turkey; therefore, the model associates it with the tea subcategory, even though the word "sugar" appears in the product name. Granola includes almonds and cashews, which also commonly appear in snacks. Even though the model misclassifies this product, the prediction might still be considered logical. Moreover, high-protein vanilla milk clearly belongs to the milk subcategory, as suggested by the model; however, it was originally categorized as "Fitness and Form". We investigate the rationale behind such a category selection and find that this product contains zero sugar, is high in protein, and is lactose-free. Therefore, the retailer decided to categorize it as a dairy product under the "fit & form" category. Lastly, the model categorizes "Menemen mixture", a prepared ingredient mix for Turkish omelet, as "Canned and pickled", whereas it was originally categorized as "Spices". The rationale behind the proposed subcategory is that this mixture is also sold in a jar; the model suggests an acceptable alternative to the current label without being aware of this fact. This evidence corroborates the subjectivity of the naming conventions. In this regard, including more details about the product content than the pure title provides might be a viable strategy to improve prediction performance. Overall, we note that although there are cases where the model is unable to give the exact category or subcategory name, the predicted values are justifiable given the input provided to the model.

In this study, we evaluated a comprehensive list of text classification techniques to address the multilevel product category classification task.
We made sure to cover both well-established and novel approaches in this area. However, NLP is a fast-developing field, and we aim to closely follow the trends and apply other methods to our prediction task in the future. Moreover, a more in-depth grid search for parameter tuning of the current models might prove fruitful. In terms of construct validity, we used repeated stratified 5-fold cross-validation to alleviate the issue of random heterogeneity of subjects. Regarding external validity, we mined datasets from six different online retail platforms in Turkey with distinct characteristics to ensure the generalizability of our findings. Note that we cover the most successful online retail companies in the field. Nonetheless, replication of our study for other languages and other countries might yield fruitful insights. Moreover, we consider only the titles of products for this task. This information can be expanded by adding product descriptions, specifications, and prices; incorporating such additional information is set as a future step to enhance predictive performance. The datasets for the other companies were extracted once a month from October 2019 to January 2021. We consider the unique list of products that were available during this period. However, some products might have been out of stock while we crawled the websites, or not accessible to non-registered visitors. Therefore, it is important to note that our analysis applies to the products that were open to public access during that period.

When exploring marketing strategies, companies do not typically have access to full information about the products available in the marketplace. As such, they need to predict the missing information and match it with their category definitions to get a better sense of the market. In addition, companies may aim to identify incorrectly classified products based on the products existing in their database to understand recent market trends. In this paper, we studied text classification strategies to automate the prediction of product categories and subcategories using the available information, such as product titles. We analyzed the mined datasets related to top online grocery platforms in Turkey and utilized different machine learning algorithms to address the problem. We also designed a bilingual deep learning architecture that uses both English and Turkish product titles. After comparing the results of the models, we investigated the cases where the models fail to predict the expected categories, which can be particularly useful for pinpointing cases where the current ground truth labels (i.e., categories) might be controversial. We plan to extend this research by adding further information on the products, e.g., descriptions, prices, and ingredients, to enhance the predictive performance. A relevant avenue for future research would be designing strategies to achieve better performance for certain categories. For instance, the trained models had low accuracy on the "Newspaper & Magazine" subcategory. Pre-training on a dataset about books, or training an additional book/non-book classifier, could increase the performance for this category without sacrificing performance on other categories. Finally, the similarity of some categories presents a problem both to the models and to the practitioners. A refined strategy can be developed to quickly determine categories that are likely to contain very similar products. Based on this information, companies can categorize their products more effectively.
References

XGBoost: A scalable tree boosting system
Cross-lingual language model pretraining
BERT: Pre-training of deep bidirectional transformers for language understanding
To buy or not buy food online: The impact of the COVID-19 epidemic on the adoption of e-commerce in China
A novel active learning method using SVM for text classification
Auto response generation in online medical chat services
News text classification based on improved Bi-LSTM-CNN
Bidirectional LSTM with attention mechanism and convolutional layer for text classification
RoBERTa: A robustly optimized BERT pretraining approach
XLM-T: Scaling up multilingual machine translation with pretrained cross-lingual transformer encoders
Efficient estimation of word representations in vector space
GloVe: Global vectors for word representation
A data augmentation approach to short text classification
Improving medical short text classification with semantic expansion using word-cluster embedding
An analysis of hierarchical text classification using word embeddings workshop data challenge
Combining knowledge with deep convolutional neural networks for short text classification
Research on web text classification algorithm based on improved CNN and SVM
Influence of word normalization and chi-squared feature selection on support vector machine (SVM) text classification
HuggingFace's Transformers: State-of-the-art natural language processing
Transformers: State-of-the-art natural language processing
Incorporating context-relevant concepts into convolutional neural networks for short text classification
BERT with dynamic masked softmax and pseudo labeling for hierarchical product classification
ProBERT: Product data classification with fine-tuning BERT model
MWPD2020: Semantic Web challenge on mining the Web of HTML-embedded product data

Acknowledgments. The authors would like to thank the Getir company for supporting this study and providing the data and feedback throughout.