key: cord-0259582-7xqyci87
authors: Ma, Yixuan; Liu, Shuang; Jiang, Jiajun; Chen, Guanhong; Li, Keqiu
title: A Comprehensive Study on Learning-Based PE Malware Family Classification Methods
date: 2021-10-29
journal: nan
DOI: 10.1145/3468264.3473925
sha: b52e5b040f3dde3dfc8745d00446eea2805497d8
doc_id: 259582
cord_uid: 7xqyci87

Driven by the high profit, Portable Executable (PE) malware has been consistently evolving in terms of both volume and sophistication. PE malware family classification has gained great attention and a large number of approaches have been proposed. With the rapid development of machine learning techniques and the exciting results they have achieved on various tasks, machine learning algorithms have also gained popularity for the PE malware family classification task. The three mainstream learning-based approaches, categorized by the input format they take, are image-based, binary-based and disassembly-based approaches. Although a large number of approaches have been published, there is no consistent comparison of them, especially from the perspective of practical industry adoption. Moreover, there is no comparison in the scenario of concept drift, which is a fact of life for the malware classification task due to the fast-evolving nature of malware. In this work, we conduct a thorough empirical study on learning-based PE malware classification approaches on 4 different datasets under consistent experiment settings. Based on the experiment results and an interview with our industry partners, we find that (1) no individual class of methods significantly outperforms the others; (2) all classes of methods show performance degradation under concept drift (by an average F1-score of 32.23%); and (3) the prediction time and high memory consumption hinder existing approaches from being adopted for industry usage.

With the rapid development of information technology, more and more personal information and property are transferred, stored and even shared on the Internet. Driven by the high profit, underground malware industries have been consistently growing and expanding. It is reported by AV-TEST [9] that more than 350 thousand new malicious samples are registered every day, and that the total number of malware samples has grown roughly 11 times over the past nine years, i.e., from 99 million in 2012 to 1,139 million in 2020. Various malware families evolve fast, and tend to exploit social and health events to enlarge the damage they can cause. According to the 2020 SonicWall Cyber Threat Report [55], attacks related to COVID-19 increased over the past half year. The biochemical systems at an Oxford University research lab studying the COVID-19 pandemic have been breached [5]. It is well recognized that malware families evolve fast, and concept drift, i.e., the change in the statistical properties of an object in unforeseen ways, has become a rather challenging issue in the domain [28].

Malware family classification is of great importance for understanding malware family evolution trends and for better detecting malware [22]. This is particularly significant for Portable Executable (PE) malware, as Windows is one of the most widely used operating systems. According to MalwareBazaar [7], a website that offers free malware sample upload and download services, and some existing research findings [18, 62], PE files account for around 50% of the malware uploaded by users.
A large number of research approaches have been proposed to solve the PE malware classification problem, among which learning-based methods [18, 40, 62] have recently become popular, benefiting from the success of machine learning and especially deep learning techniques. Learning-based PE malware family classification techniques can be categorized into three classes based on the malware file format they adopt. The first class of methods converts PE malware files into images and then applies image classification techniques [8, 18, 39, 40, 62]. The second class of methods directly takes the PE malware file as a sequence of binary code, and sequential models from the Natural Language Processing (NLP) domain are usually adopted [10, 26, 43, 44, 46, 51]. More recently, a third type of methods has emerged, which decompiles the PE malware file into assembly code and then applies graph structure analysis to the control flow graph (CFG) of the assembly code [20, 30, 31, 65]. These three categories of methods are evaluated independently with different datasets, and they all report good performance on the evaluated datasets. The most recent approaches [62] are reported to achieve more than a 98% F1-score on the BIG-15 [50] dataset.

There are some survey papers that summarize existing methods for malware family classification [17, 45, 56, 61], and Iadarola et al. [23] conduct an evaluation of image-based texture feature extraction methods. However, there is still no study that systematically evaluates all three classes of learning-based methods under the same evaluation settings with the same datasets. Moreover, given that those PE malware family classification approaches report excellent performance on existing public or manually crafted datasets, it is not clear (1) which class of methods is more accurate and efficient; (2) how different kinds of methods perform under concept drift, which is a common and realistic problem for fast-evolving malware families; and, more importantly, (3) what the status of industry adoption of those methods is and what future research directions are needed to support industry requirements.

In this work, we seek to answer these questions with a well-designed empirical study. We first briefly review existing approaches and select the 9 most representative approaches from the three categories: 4 image-based approaches (VGG-16 [62], ResNet-50 [11, 48, 54, 62], Inception-V3 [62], IMCFN [62]), 2 binary-based approaches (CBOW+MLP [43], MalConv [32, 44]) and 3 disassembly-based approaches (MAGIC [65], Word2Vec+KNN [10, 13], MCSC [34, 41]). Our results suggest that (1) no individual class of methods outperforms the others, and the binary-based method CBOW+MLP [43] achieves the best performance across different datasets; (2) all classes of methods show performance degradation, by an average of 32.23%, under concept drift, and CBOW+MLP [43] shows the sharpest performance degradation (by 69.62%); (3) industry currently mainly adopts sandboxes and pattern databases to detect malware families, and the long prediction time, fragility to malware evolution and high resource consumption hinder the current learning-based methods from being applied in industry practice.

The main contributions of this work are as follows:
• We conduct the first large-scale and systematic empirical study on learning-based PE malware family classification methods.
• We create two new PE malware family classification datasets, one for the normal classification setting and one for the concept drift setting, and we make them public.
• We are the first to conduct evaluations in the concept drift situation with a large number of representative methods.
• We provide practical suggestions and future research directions to promote applicable research in the area of malware family classification.

All experimental data and results are publicly available to encourage future research in the community: https://github.com/MHunt-er/Benchmarking-Malware-Family-Classification.

Image-based techniques first transform the malware binary files into either grayscale or color images, and then adopt image classification models for malware family classification. Nataraj et al. [40] propose to transform malware binary files into grayscale images and employ a K-Nearest Neighbor (KNN) model for malware classification by leveraging the texture features of the images. This is also the first work that visualizes malware files as images for classification purposes. Since then, many approaches have been proposed to further improve it. For example, Ahmadi et al. [8] adopt boosting tree classifiers to better capture multiple novel features, such as Haralick and Local Binary Patterns (LBP). Narayanan et al. [39] study the performance of various machine learning models with texture features extracted via Principal Component Analysis. However, those methods are typically inefficient due to the high overhead of extracting complex texture features. With the recent advances of deep learning models in image classification tasks, such models have also attracted attention in the security community and have been employed for malware family classification. For example, Convolutional Neural Networks (CNN) have been widely adopted by many approaches [18, 25, 29, 62, 63] and have shown better performance compared to traditional machine learning approaches [18]. Vasan et al. [62] propose to adopt transfer learning, which fine-tunes a CNN pre-trained on the ImageNet dataset [15] with images of malware files, to better fit the malware family classification task. Their empirical results show that their architecture stands out among several deep learning models. Image-based approaches require no domain-specific knowledge, and many existing well-developed image classification models can be used directly. However, they also have drawbacks. For example, transforming malware into images introduces new hyper-parameters (e.g., image width) and imposes spatial correlations between pixels in different rows that might be false [17].

Binary-based approaches take the binary code of malware as input, which is treated as sequential information. Therefore, existing sequential models, especially from the Natural Language Processing (NLP) domain, are usually adopted for classification. Jain et al. [26] propose to extract n-grams from the raw byte sequences as features, and leverage several basic machine learning models such as Naïve Bayes [49] and AdaBoost [16] for malware detection. However, as n increases, the computation cost grows exponentially. Raff et al. [44] propose the first end-to-end shallow CNN model that directly takes the entire binary file as input for malware classification. As a result, it requires a large amount of memory for large malware files and thus has limited processing capability.
To address this problem, an approach that focuses only on the PE header of binaries [46] is proposed. It selects 328 bytes from the PE header as input and thus is not affected by the size of malware files. It is also shown to perform better than a compared approach that depends on domain-knowledge features extracted by the PortEx library [19]. Qiao et al. [43] treat each malware binary file as a corpus over 256 words (0x00-0xFF), adopt the Word2Vec model [38] to obtain word embeddings, represent the malware as a word embedding matrix sorted in ascending byte order, and finally classify malware using a Multi-Layer Perceptron (MLP).

Table 1: Selected learning-based PE malware family classification methods, their related work, input formats and evaluation datasets.

| Category | Method | Related work | Input format | Dataset |
|---|---|---|---|---|
| Image-based | VGG-16 | Vasan et al. [62] | color image | Malimg |
| Image-based | ResNet-50 | Vasan et al. [62] | color image | Malimg |
| Image-based | ResNet-50 | Singh et al. [54] | color image | Self-constructed dataset * |
| Image-based | ResNet-50 | Rezende et al. [48] | color image | Malimg |
| Image-based | ResNet-50 | Bhodia et al. [11] | grayscale image | Malicia, Malimg |
| Image-based | Inception-V3 | Vasan et al. [62] | color image | Malimg |
| Image-based | IMCFN | Vasan et al. [62] | grayscale image, color image | Malimg |
| Binary-based | CBOW+MLP | Qiao et al. [43] | byte embedding ascending matrix | BIG-15 |
| Binary-based | MalConv | Raff et al. [44] | raw byte values | Closed dataset ★ |
| Binary-based | MalConv | Krcál et al. [32] | raw byte values | Closed dataset ★ |
| Disassembly-based | MAGIC | Yan et al. [65] | attributed control flow graph | BIG-15, YANCFG |
| Disassembly-based | Word2Vec+KNN | Awad et al. [10] | opcode sequences | BIG-15 |
| Disassembly-based | Word2Vec+KNN | Chandak et al. [13] | 20 most frequent opcodes | Malicia, Self-constructed dataset * |
| Disassembly-based | MCSC | Ni et al. [41] | opcode sequences Simhash image | BIG-15 |
| Disassembly-based | MCSC | Kwon et al. [34] | opcode sequences Simhash image | BIG-15 |

* The malware samples of the self-constructed datasets come from various open-source data release websites.
★ The dataset was provided by an anti-virus industry partner with whom the authors work, and cannot be accessed publicly.

Binary-based approaches do not require domain knowledge and consider the contextual information in malware binaries. However, representing a malware sample as a sequence of bytes presents some challenges compared to the other categories of methods. First, by treating each byte as a unit in a byte sequence, a malware byte sequence may reach several million time steps [44], which is rather resource consuming. Second, adjacent bytes are expected to be correlated spatially, which may not always hold due to jumps and function calls, and thus there might be discontinuities in the information within the binary files.

Disassembly-based approaches first disassemble binary files into assembly code, and perform malware classification based on features such as the Function Call Graph (FCG) and Control Flow Graph (CFG) extracted from the assembly code. Kinable et al. [30] propose to calculate the similarity score between two FCGs through existing graph matching techniques and use it as the distance metric for malware clustering. Kong et al. [31] present a generic framework that first abstracts malware into Attributed Function Call Graphs (AFCGs) and then learns discriminant malware distance metrics to evaluate the similarity between two AFCGs. The above approaches are computation-intensive when calculating the similarity between graphs, which brings a huge performance overhead and does not generalize well. Recently, Hassen et al. [20] propose to cluster similar functions of FCGs by using MinHash [12], and then represent the graphs as vectors for classification by leveraging deep learning models. Similarly, Yan et al.
[65] employ a Deep Graph Convolutional Neural Network (DGCNN) [66] to aggregate the attributes on the ACFGs extracted from disassembly files, which is the first attempt to apply DGCNN to malware classification tasks. There are also some approaches that extract features directly from assembly code; specifically, the opcode sequence is usually adopted. Santos et al. [51] propose a feature representation method based on the frequency of appearance of opcode sequences. Awad et al. [10] treat the opcode sequence of each disassembly file as a document and apply the Word2Vec model [38] to generate a computational representation of the document. Finally, they use the Word Mover's Distance (WMD) [33] metric and K-Nearest Neighbour (KNN) to classify these documents. SimHash [41] and MinHash [57] adopt hash projections to convert opcode sequences into vectors, which are then visualized as images for classification. Disassembly-based techniques can better capture code structure features compared to other methods, but they usually require domain knowledge, such as assembly language and its corresponding analysis methods.

RQ1: How do different PE malware family classification methods perform on different datasets? Although a large number of learning-based approaches have been proposed to solve the malware family classification task, they are only evaluated independently on some specific dataset, e.g., the public dataset BIG-15 [50], some manually crafted dataset [54], or datasets provided by industry partners that are not publicly available [44]. There is a lack of systematic studies that evaluate the performance of different approaches consistently under the same experiment settings with multiple datasets.

RQ2: How is the classification performance of various models affected by malware concept drift? Concept drift [28], which is the change in the statistical properties of an object in unforeseen ways, is a realistic and critical issue in the PE malware family classification task. It is thus important to evaluate the performance of different approaches in the application scenario of concept drift.

RQ3: What factors hinder the current learning-based PE malware classification approaches from being deployed in industry, and what are the corresponding improvement directions? With the gaps identified by the previous research questions, our ultimate goal is to provide suggestions on how to make the current learning-based PE malware classification approaches applicable in real industry scenarios.

In order to systematically study the performance of different techniques, we select 9 state-of-the-art learning-based PE malware family classification methods, which cover image-based, binary-based and disassembly-based techniques, for our empirical study. Table 1 lists the details of all methods adopted.

CBOW+MLP: This method uses byte embeddings of the raw binary for malware family classification [43]. The key idea is that the relationships among bytes in samples from the same family are similar, and are distinctly different from those in samples of other families. Therefore, the embedding matrix of raw bytes is a valid feature for malware classification. The raw binary is first pre-processed by removing runs of 5 or more consecutive 0x00 or 0xCC bytes (meaningless bytes). Each file is then taken as a corpus, which is considered to be composed of 256 words from 0x00 to 0xFF. The Continuous Bag-of-Words (CBOW) model in Word2Vec [38] is used to obtain embedding vectors for the 256 bytes in the file, and each file is represented as a byte embedding matrix in ascending byte order. An MLP takes these matrices as input and outputs the corresponding family categories. The MLP consists of 3 fully-connected layers (512 × 512 × N, where N is the number of predicted classes), and a dropout layer is added after each of the first two fully-connected layers. The MLP is the most intuitive and simplest deep neural network. Similar to VGG-16, although its structure is relatively simple, it has many parameters and can fit the training data well.
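The following is a minimal sketch of the byte-embedding pipeline described above, not the authors' implementation. The embedding size (16), the Word2Vec hyper-parameters and the helper names are illustrative assumptions; the run-stripping is only an approximation of the paper's filtering of 5+ consecutive 0x00/0xCC bytes.

```python
# Illustrative sketch of the CBOW+MLP idea (not the authors' code).
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec

def byte_words(path, strip=(0x00, 0xCC), min_run=5):
    """Read a binary and return its bytes as hex 'words', skipping the tail of
    long runs of 0x00/0xCC (approximation: the first few bytes of a run are kept)."""
    data = open(path, "rb").read()
    words, run_val, run_len = [], None, 0
    for b in data:
        run_len = run_len + 1 if b == run_val else 1
        run_val = b
        if b in strip and run_len >= min_run:
            continue
        words.append(f"{b:02x}")
    return words

def byte_embedding_matrix(path, dim=16):
    """Train CBOW on one file's byte corpus; stack embeddings of bytes 0x00..0xFF
    in ascending byte order and flatten into a feature vector."""
    w2v = Word2Vec([byte_words(path)], vector_size=dim, window=5, min_count=1, sg=0)  # sg=0 -> CBOW
    mat = np.zeros((256, dim), dtype=np.float32)
    for b in range(256):
        key = f"{b:02x}"
        if key in w2v.wv:
            mat[b] = w2v.wv[key]
    return mat.flatten()

class ByteMLP(nn.Module):
    """512 x 512 x N fully-connected classifier with dropout after the first two layers."""
    def __init__(self, in_dim, n_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, n_classes),
        )
    def forward(self, x):
        return self.net(x)

# Usage (hypothetical file name):
# feats = byte_embedding_matrix("sample.exe")
# logits = ByteMLP(feats.size, n_classes=9)(torch.from_numpy(feats).unsqueeze(0))
```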
MalConv: This is the first end-to-end malware detection model that takes the entire malware file as input. MalConv [44] first uses an embedding layer to map the raw bytes to fixed 8-dimensional vectors. In this way, it captures high-level location invariance in raw binaries by considering both local and global contexts. MalConv then uses a shallow CNN with a large filter width of 500 bytes combined with an aggressive stride of 500. This allows the model to better balance the computational workload in a data-parallel manner using PyTorch [3] and thus relieves the GPU memory consumption problem in the first convolution layer. As a shallow CNN architecture, MalConv overcomes one of the primary practical limitations, namely that reading all bytes of a malware file is memory consuming, while also capturing global location invariance in raw binaries. It allows the embedding layer to be trained jointly with the convolution layers for better feature extraction. A simplified architectural sketch is given after the method descriptions below.

MAGIC: This is an end-to-end malware detection method that classifies malware programs represented as Attributed Control Flow Graphs (ACFGs) using a Deep Graph Convolutional Neural Network (DGCNN) [66]. It first converts the ACFG, which abstracts the vertices of the Control Flow Graph (CFG) extracted from the malware disassembly file into discrete numerical value vectors, into a numerical vector. DGCNN then transforms these unordered ACFGs of varying sizes into tensors of fixed size and order for malware family classification.

Word2Vec+KNN: This method models a malware disassembly file as a malware language, extracts its opcode sequence as a malware document and uses the Word2Vec model to generate a computational representation of that document. This work chooses Word Mover's Distance (WMD) [33], which computes the cost of transporting all embedded words of one document to all embedded words of another, as the measure of semantic closeness between documents for KNN classification.

MCSC: It first extracts opcode sequences from disassembly files and encodes them to equal length based on SimHash [37]. Then, it takes each SimHash bit as a binary pixel and converts the SimHash bits into grayscale images. It trains a CNN structure modified from LeNet-5 [35] to classify these grayscale images. When training the CNN classifier, multi-hash and bilinear interpolation are used to improve the accuracy of the model, and major block selection is used to decrease the image generation time.
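Below is a minimal PyTorch sketch of a MalConv-style model as described above (8-dimensional byte embedding, a wide convolution with filter width 500 and stride 500, gating as in the original MalConv design, global max pooling and a classification head). The filter count, padding index and head size are illustrative choices, not details taken from this paper.

```python
# MalConv-style sketch; assumptions noted above.
import torch
import torch.nn as nn

class MalConvSketch(nn.Module):
    def __init__(self, n_classes, emb_dim=8, kernel=500, stride=500, channels=128):
        super().__init__()
        # 257 tokens: 256 byte values plus one padding token (index 256).
        self.embed = nn.Embedding(257, emb_dim, padding_idx=256)
        self.conv = nn.Conv1d(emb_dim, channels, kernel_size=kernel, stride=stride)
        self.gate = nn.Conv1d(emb_dim, channels, kernel_size=kernel, stride=stride)
        self.fc = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(),
                                nn.Linear(channels, n_classes))

    def forward(self, x):                              # x: (batch, length) byte values
        e = self.embed(x).transpose(1, 2)              # (batch, emb_dim, length)
        h = self.conv(e) * torch.sigmoid(self.gate(e)) # gated wide convolution
        h = torch.max(h, dim=2).values                 # global max pooling over positions
        return self.fc(h)

# Usage: logits = MalConvSketch(n_classes=9)(torch.randint(0, 256, (2, 1_000_000)))
```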
We employ four different datasets for evaluation, i.e., BIG-15 [50], Malimg [40], MalwareBazaar and MalwareDrift. The first two datasets are adopted from prior approaches, while the last two are newly constructed for our study.

Malimg [40] comprises 9,435 malware samples collected from 25 families in 2011. This dataset only contains the greyscale images converted from the malware binary files and is widely used by image-based malware classification approaches.

MalwareBazaar is a new dataset we constructed from the MalwareBazaar website [4], which provides free and unrestricted download services for malware samples. We first choose the top-6 malware families from the data released in 2020, and then download the 1,000 most recently uploaded malware samples for each family. Then, we filter out samples that are not in PE format and further leverage Joe Security [36] and AVClass [52] to check the label of each sample and remove noise samples for which different websites give inconsistent labels. As a result, we obtain a dataset with 3,971 PE malware samples from 6 families, which is summarized in Table 2. To make the dataset usable by the different approaches, we further use IDA Pro [2] to convert the malware into binary and disassembly files. Please note that since MalwareBazaar is constructed from the latest malware samples, it can better reflect the tendencies of recent malware, which may provide a different perspective and complement previous datasets such as BIG-15.

MalwareDrift is constructed based on the conclusions of Wadkar et al. [64]. Their experiments confirmed that code changes appear as sharp spikes in the χ² timeline statistic, which quantifies the weight differences of Support Vector Machines (SVM) trained with PE malware features over sliding time windows. Code changes can also be understood as the evolution or drift of a malware family. We adopt the dataset used in [64] as well as the corresponding χ² timeline statistic graph, based on which we can determine the evolution time period of each malware family according to the spikes, and divide the samples of each family into pre-drift and post-drift parts. After eliminating families without obvious sharp spikes and families with small sample sizes, we finally obtain the drift dataset, including 3,125 samples from 7 families, as shown in Table 3.

3.4.1 Data Pre-processing. In order to conform to the requirements of the different methods and provide a consistent experiment setting, we first perform data pre-processing on the adopted datasets. (1) For image-based methods, we transform the malware files into the color image format, as it has been shown that color images achieve better performance than greyscale images on the malware family classification task [62]. More concretely, we first transform the files into images whose width is chosen according to the file size, and then adopt the Nearest Neighbor interpolation resize method [42] to resize the images to 224*224 (an illustrative sketch of this conversion is given below). It has been shown that the original texture features remain sharp with this resize method [62]. (2) For the CBOW+MLP method, we follow the original settings [43] to remove all '0x00' and '0xcc' bytes that appear consecutively more than 5 times, and employ the Gensim library [47] for byte embedding. (3) For MalConv, we limit the malware file size to less than 2MB because larger files cause excessive GPU memory consumption [44].

3.4.2 Configurations. In order to provide a fair comparison, we adopt the default hyper-parameter settings for the methods if they are released in the original papers [10, 41, 65], which are said to achieve the best performance of the corresponding methods. We only extract the opcode sequences in the Word2Vec+KNN approach [10] for the sake of time. For the other methods, which do not report the hyper-parameters leading to the best performance, we perform an intensive manual tuning process and employ the settings that achieve the best performance. In particular, we apply an early stopping mechanism to fairly compare the learning efficiency of the different methods. Due to the space limit, we do not report the detailed experiment settings in the paper; the information is available in our open-source repository.
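The sketch below illustrates pre-processing step (1): turning a PE file's raw bytes into a fixed-size image. The width schedule is a hypothetical example, not the exact table used by the referenced work, and the grayscale-to-color conversion via channel replication is likewise an assumption; only the nearest-neighbor resize to 224x224 is taken from the text above.

```python
# Illustrative byte-to-image conversion; assumptions noted above.
import numpy as np
from PIL import Image

def pick_width(n_bytes):
    """Hypothetical width schedule: wider images for larger files."""
    for limit, width in [(10_000, 32), (100_000, 256), (1_000_000, 512)]:
        if n_bytes <= limit:
            return width
    return 1024

def pe_to_image(path, size=(224, 224), color=True):
    data = np.fromfile(path, dtype=np.uint8)
    width = pick_width(len(data))
    height = len(data) // width
    pixels = data[: width * height].reshape(height, width)   # one byte per pixel
    img = Image.fromarray(pixels, mode="L")
    if color:
        img = img.convert("RGB")                              # grayscale -> 3 channels
    return img.resize(size, resample=Image.NEAREST)           # nearest-neighbor keeps textures sharp
```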
As transfer learning is usually adopted for image-based methods in the malware family classification task to enhance performance, we also evaluate the effect of transfer learning in our study. Following the standard process, we fix the structure (i.e., the number of layers and neurons per layer) of each model. We first perform pre-training with the ImageNet dataset [15] and then fine-tune the models with our malware datasets in image format. To explore the effect of dataset size on performance, we use 10%, 50%, 80% and 100% of the malware data, respectively, for fine-tuning.

In particular, to evaluate the performance of different methods in the concept drift scenario, we use the pre-drift samples to train a model following a standard 8:2 data partition for training and testing, and report the performance of the models on the pre-drift data. Then, we load the trained model and test it on the post-drift data. Following the standard paradigm, we apply 10-fold cross-validation in our experimental comparison among different methods on different datasets, and 5-fold cross-validation in the concept drift experiment.

We utilize the macro-averaged Precision (P), Recall (R) and F1-score (F1) to measure the multi-class performance of a model on multiple malware families. Suppose there are m malware families (with a total of n samples) and we conduct k-fold cross-validation. First, we calculate the overall Precision (P_c), Recall (R_c) and F1-score (F1_c) of the k-fold cross-validation for each family class c (1 ≤ c ≤ m), as shown in formulas 1-3, where TP_c^i, FP_c^i and FN_c^i represent the true positives, false positives and false negatives of malware family c in the i-th fold:

P_c = \frac{\sum_{i=1}^{k} TP_c^i}{\sum_{i=1}^{k} (TP_c^i + FP_c^i)}  (1)

R_c = \frac{\sum_{i=1}^{k} TP_c^i}{\sum_{i=1}^{k} (TP_c^i + FN_c^i)}  (2)

F1_c = \frac{2 \cdot P_c \cdot R_c}{P_c + R_c}  (3)

Accuracy = \frac{\sum_{c=1}^{m} \sum_{i=1}^{k} TP_c^i}{n}  (4)

Based on the per-family precision (P_c), recall (R_c) and F1-score (F1_c) (1 ≤ c ≤ m) computed on the k-fold data, the macro-averaged Precision, Recall and F1-score over multiple families are calculated with formulas 5-7; we also compute the accuracy with formula 4:

Macro\_P = \frac{1}{m} \sum_{c=1}^{m} P_c  (5)

Macro\_R = \frac{1}{m} \sum_{c=1}^{m} R_c  (6)

Macro\_F1 = \frac{1}{m} \sum_{c=1}^{m} F1_c  (7)

3.4.3 Executing Environment. All of our experiments are conducted on a server with 2 Intel Xeon Platinum 8260 CPUs @2.30GHz, 8 Nvidia GeForce RTX 2080 Ti GPUs (11GB memory each), and 512GB RAM.
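For concreteness, the following small sketch implements formulas (1)-(7) above, assuming the per-fold true-positive, false-positive and false-negative counts have already been collected; the variable names are illustrative.

```python
# Macro-averaged metrics over k folds and m families (formulas 1-7).
import numpy as np

def macro_metrics(tp, fp, fn, n_samples):
    """tp, fp, fn: integer arrays of shape (k folds, m families)."""
    tp_c, fp_c, fn_c = tp.sum(axis=0), fp.sum(axis=0), fn.sum(axis=0)  # pool counts over folds
    p_c = tp_c / (tp_c + fp_c)                # formula (1), per family
    r_c = tp_c / (tp_c + fn_c)                # formula (2)
    f1_c = 2 * p_c * r_c / (p_c + r_c)        # formula (3)
    accuracy = tp_c.sum() / n_samples         # formula (4)
    return {"macro_precision": p_c.mean(),    # formula (5)
            "macro_recall": r_c.mean(),       # formula (6)
            "macro_f1": f1_c.mean(),          # formula (7)
            "accuracy": accuracy}
```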
Table 4 shows the experimental results of all adopted models on the three studied datasets. In particular, Malimg was only used for image-based models, as the original malware files are unavailable. From the table we can see that the CBOW+MLP model achieves the best performance in terms of F1-score compared with all the others on both the BIG-15 and MalwareBazaar datasets, indicating its ability to generalize across different datasets. On the other hand, when considering the performance of different categories, no individual category always outperforms the others. For example, among binary-based methods, though the CBOW+MLP model performs best, the MalConv model is not always better than the other methods. Based on this result, we can conclude that the data format of the malware is not a critical factor for classification performance, which is more likely decided by the model itself. Besides, comparing the experimental results across different datasets for each model, most models achieve stable performance except VGG-16 and MAGIC, whose F1-scores vary greatly between the BIG-15 and MalwareBazaar datasets.

We further analyze the results and observe that imbalanced data is a major reason for VGG-16. For example, it performs much worse on BIG-15 than on the other datasets, because there are only 42 samples belonging to the Simda family, which accounts for only 1.43%-10.55% of the other families and less than 0.4% of the total dataset. Therefore, VGG-16 can hardly learn deep semantics from such limited samples, as it has the largest number of parameters (see Table 6). For MAGIC, the reason is that it has large runtime GPU memory consumption, especially when the input file size is large. For BIG-15, the largest disassembly file is 140MB, while for MalwareBazaar, the file size can exceed 1GB. We can only set the batch size to 1 when processing the MalwareBazaar dataset due to the limitations of our GPU memory, and the small batch size could affect the performance of the method.

Considering each individual category, though IMCFN in general achieves the best performance (average F1-score: 96.81%) compared with the other image-based methods, there is no evidence that it significantly outperforms the others. The same holds for CBOW+MLP (average F1-score: 97.56%) and MCSC (average F1-score: 95.38%) among the binary-based and disassembly-based methods, respectively. The overall performance within each category is relatively close.

It has been demonstrated above that insufficient training data for large-scale networks (e.g., VGG-16) may cause a large performance drop. Transfer learning has been widely adopted in the image processing domain to tackle similar issues with good results, and thus we also investigate its impact on the PE malware classification task in this work. We utilize the model pre-trained on ImageNet [15] and use the corresponding malware image data to fine-tune the last fully-connected layer. The experimental results are shown in Table 5. Specifically, VGG-16, which did not perform well due to insufficient training data, was significantly improved from 87.28% to 92.46% on BIG-15 through transfer learning. However, according to the results, the impact of transfer learning was still limited, and the performance may drop largely in some cases, e.g., IMCFN. As a result, we further employ a more aggressive strategy that also permits updating the connection weights in the last representation layer during fine-tuning. Figure 1 shows the results for the IMCFN model. From the figure, we can observe that opening both the fully-connected layer and the convolution layers in general leads to better performance, especially on Malimg. We also measure the cost of fine-tuning the last fully-connected layer only versus the aggressive strategy, and find that the latter requires 1.5x-3x the training time (still less than 1 hour) compared with the former, yet gains an average F1-score increase of 4.14%. More importantly, the aggressive strategy improves the classification performance, in terms of F1-score, by 0.51%-2.94% compared with training from scratch, except on the Malimg dataset: 97.59% (aggressive) vs. 97.84% (from scratch). The results indicate that converted malware images differ, in terms of image features, from typical images in ImageNet. Therefore, the features trained on ImageNet are not directly applicable to malware images, and we need to open the feature extraction layers (i.e., the convolution layers in the models we studied) in the fine-tuning process for better performance (a fine-tuning sketch is given below).
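The sketch below illustrates the two fine-tuning strategies discussed above: (a) freezing the pre-trained backbone and training only the new classification head, versus (b) the "aggressive" strategy that also opens the feature extraction (convolution) layers. The choice of ResNet-50 here and the weight enum are illustrative, not the paper's exact configuration.

```python
# Fine-tuning sketch with torchvision; assumptions noted above.
import torch.nn as nn
from torchvision import models

def build_finetune_model(n_families, open_conv_layers=False):
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    # Replace the ImageNet head with a malware-family classification head.
    model.fc = nn.Linear(model.fc.in_features, n_families)
    for name, param in model.named_parameters():
        if name.startswith("fc."):
            param.requires_grad = True              # always train the new head
        else:
            param.requires_grad = open_conv_layers  # (a) frozen vs. (b) aggressive
    return model

# (a) head-only fine-tuning:  model_a = build_finetune_model(9, open_conv_layers=False)
# (b) aggressive fine-tuning: model_b = build_finetune_model(9, open_conv_layers=True)
```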
Finding 2. Transfer learning can potentially further improve the effectiveness of image-based PE malware classification methods, and the internal feature extraction layers of the model should be opened during fine-tuning to achieve better transfer-learning performance.

Besides effectiveness, another important factor that impacts the practicality of PE malware classification methods in industrial scenarios is efficiency and the required hardware environment, since some methods may need to work on resource-critical devices, such as a network gateway with limited computing resources. Table 4 and Table 5 show the time and resource consumption in detail for each method during training of the corresponding models, and the average runtime resource consumption of each method is shown in Table 6. From the tables we can see that image-based methods are CPU intensive compared to the other categories, while transfer learning is a possible remedy as it can reduce training time by 28.49%~96.47%. In addition, according to Table 6, image-based methods have very small pre-processing and prediction overhead, but their model sizes are relatively larger than those of the other methods. Disassembly-based methods take relatively longer pre-processing time, due to the time-consuming disassembly process. In particular, the prediction time of the Word2Vec+KNN method is much longer, because it requires computing the distances between the given input and all samples in the training set. All such factors restrict the application of existing methods in industry and should be taken into consideration in future research.

Table 7 shows the evaluation results of different methods in the concept drift scenario. In particular, we choose IMCFN as the representative of the image-based approaches, as it shows the best overall performance (highest average F1-score) in the previous research question. In the table, we visualize the impact of concept drift for different methods with a color gradient: the deeper the red color, the larger the score drop. From Table 7, we can observe that all existing methods suffer a large performance drop when confronting concept drift as it occurs in real industry scenarios, with reductions in F1-score of 27.07%-69.62%. The F1-scores of all methods on the post-drift dataset are no more than 45%, reflecting that existing methods fail to consider the concept drift scenario and that there is still much room for improvement. By comparing the results on the concept-drift and non-concept-drift datasets, we can see that it is vital to take the concept drift scenario into consideration when evaluating methods.

Finding 3. All existing methods suffer from poor performance when facing concept drift from real industrial scenarios, and thus concept drift should be seriously considered when evaluating PE malware family classification methods.

According to the results, the stability of the different methods also varies greatly. CBOW+MLP, though it performs best in the common machine learning scenario, has the sharpest decrease in the concept drift scenario, which is mainly due to its simple structure. On the contrary, MCSC and Word2Vec+KNN show the smallest decrease ratios. The reason is that both of them extract opcode sequences from disassembly files and focus on the local context of opcode sequences, which tends to be retained as malware evolves, and thus these methods show stable performance under concept drift. An interesting observation is that although Word2Vec+KNN also has a simple structure, it performs better than CBOW+MLP under concept drift, which is credited to the KNN algorithm that computes the similarity distances between the incoming sample and all training samples. As a result, it typically requires a much longer prediction time (see Table 6).

In order to investigate whether there is a significant difference across PE malware families, we present the results in Figure 2. We can observe that all methods tend to perform consistently across most families, except for the families in Group-2, which obtain the largest performance drop. Comparing the images, we can observe that the impact of concept drift is relatively smaller in Group-1 than in Group-2. We further use the difference hash similarity [27] to measure the similarity between the images before and after concept drift; the smaller the similarity value, the more similar the images. The values are 26 and 32 for Group-1 and Group-2, respectively, which indeed explains the large performance difference across families (an illustrative difference-hash sketch is given below).
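The following is a minimal sketch of a difference-hash (dHash) comparison between two malware images, as used above to quantify how much a family's visual appearance changes across the drift point. The 8x9 hash size is the common dHash default and an assumption here, not a detail taken from the paper; file names in the usage line are hypothetical.

```python
# Difference-hash similarity between two images; assumptions noted above.
from PIL import Image

def dhash(img, hash_size=8):
    """Resize to (hash_size+1) x hash_size and compare horizontally adjacent pixels."""
    small = img.convert("L").resize((hash_size + 1, hash_size), Image.LANCZOS)
    px = list(small.getdata())
    bits = []
    for row in range(hash_size):
        for col in range(hash_size):
            left = px[row * (hash_size + 1) + col]
            right = px[row * (hash_size + 1) + col + 1]
            bits.append(1 if left > right else 0)
    return bits

def dhash_distance(img_a, img_b):
    """Hamming distance between the two hashes; smaller means more similar."""
    return sum(a != b for a, b in zip(dhash(img_a), dhash(img_b)))

# Usage: dhash_distance(Image.open("family_pre_drift.png"), Image.open("family_post_drift.png"))
```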
The ultimate goal of our work is to encourage the deployment of research models in real industry scenarios. Therefore, we conduct an interview with our industry partner, who is in charge of the virus detection product of a security company. In particular, we ask the following questions based on our study results.
• What classification methods are currently adopted in the company, and why?
• What factors affect the choice of suitable methods in real application scenarios?
• Is concept drift common in real application scenarios, and what is the current status of handling concept drift?

Currently, two mainstream approaches are adopted in industry application scenarios, i.e., sandbox-based and pattern-based approaches. A sandbox can be described as a virtual environment that executes the malware and extracts runtime features for classification. It is accurate yet time and resource consuming. For instance, the sandbox can process around 5-10 samples per minute, which can be tolerated due to its high accuracy. Pattern-based approaches are static detection methods based on a pattern/feature database. They are efficient in terms of time and resource consumption, yet fragile to noise, obfuscation and concept drift.

Industry usage of PE malware classification methods is mainly limited by three factors, i.e., the prediction precision and recall, the prediction time, and the resource consumption, where the main resource concerns are runtime memory and CPU usage. The first two factors decide the user experience and thus whether the corresponding method can be adopted, and the resource consumption decides what kinds of devices the methods can be deployed on. As a concrete example, in one of their products that contains a learning-based malware classification model, they require the runtime memory to be below 1GB, which cannot be met by all of the methods we studied. Another requirement is to be able to predict a malware sample within 0.1s with an accuracy above 93%, which filters out most of the binary-based and disassembly-based methods.

Concept drift usually happens due to malware evolution, e.g., in scenarios where existing malware seeks to escape detection, and non-kernel functionalities such as the communication and message passing techniques may be changed. This happens frequently and raises challenges for malware family classification.
There is a lack of specific mechanisms to handle this case, and current practice usually resorts to sandbox methods for such scenarios. Another observation is that, in addition to concept drift, industry also needs to tackle the challenge of fast-emerging new malware families and features. Beyond the fine-grained family classification defined by existing academic datasets such as BIG-15, our industry partners are more interested in the detection of malware families based on their malicious behaviors, e.g., Trojan, Rootkit or Ransomware. However, there is a lack of research in this direction, most likely because no such datasets are available.

Finding 5. Real industry application scenarios require classification methods that can tackle the challenges brought by the fast evolution of malware families. Moreover, there should be a trade-off between resource consumption and prediction accuracy in consideration of the deployment environment and customer feedback. Therefore, future research should focus more on (1) how to handle the fast evolution of malware families rather than only evaluating on one or a few datasets; (2) more light-weight models with high prediction accuracy; and (3) malware family classification from the malicious behavior perspective.

Threats to internal validity mainly lie in the implementation of the different methods. In order to compare the results fairly, we employ the reported best configurations for each model if available, while for the others we report the best results obtained after an intensive manual tuning process. We believe this strategy mitigates the bias introduced by different model settings. Besides, due to the limit of physical memory, we set the batch size to 32 (originally 256) when training the MalConv model, which may affect the results. However, we argue that this is reasonable and acceptable, especially in real application scenarios, where resources are critical according to the feedback from industry. In addition, we publish all our experimental data for replication and to boost future research.

Threats to external validity mainly lie in the selection bias of the methods and datasets studied. In order to perform a systematic comparison, we adopt 9 different methods, covering the mainstream image-based, binary-based and disassembly-based techniques. To alleviate the impact of datasets, we employ two commonly-used datasets (i.e., BIG-15 and Malimg), and further construct two new datasets to reflect the latest progress of PE malware (MalwareBazaar) and the concept drift issue from real industry practice (MalwareDrift).

Threats to construct validity mainly lie in the randomness and measurements in our experiments. To reduce the impact of randomness, we applied 10-fold cross-validation in our comparison experiments on the BIG-15, Malimg and MalwareBazaar datasets, and 5-fold cross-validation when studying the effects of concept drift on the various classification methods, rather than repeating each experiment several times. Macro-averaging is widely adopted to measure the performance of multi-class classification.

PE malware family classification has gained great attention and a large number of approaches have been proposed.
In this paper, we identify the gap in applying learning-based PE malware classification approaches in industry through a systematic empirical study, in which we employ 9 different methods, covering the mainstream image-based, binary-based and disassembly-based techniques, and 4 different datasets. Based on the quantitative evaluation results in Sections 4.1 and 4.2 and the requirements from industry (Section 4.3), we conclude that: (1) no individual class of methods significantly outperforms the others; (2) all classes of methods show performance degradation under concept drift, which is vitally important in practice; and (3) the prediction time and high memory consumption hinder existing approaches from being adopted for industry usage. We further provide actionable guidance for future applied research: (1) focus more on how to handle the fast evolution of malware families; (2) explore light-weight models with high prediction accuracy; and (3) take malicious behavior features into account for malware family classification.

References
• Secure hash and message digest algorithm library
• Top File Types | Statistics of MalwareBazaar
• Novel Feature Extraction, Selection and Fusion for Effective Malware Family Classification
• Malware Statistics and Trends Report by AV-TEST. Retrieved March
• Modeling Malware as a Language
• Transfer Learning for Image-based Malware Classification
• On the resemblance and containment of documents
• A Comparison of Word2Vec, HMM2Vec, and PCA2Vec for Malware Classification
• A method for classifying medical images using transfer learning: A pilot study on histopathology of breast cancer
• ImageNet: A large-scale hierarchical image database
• A decision-theoretic generalization of on-line learning and an application to boosting
• The rise of machine learning for detection and classification of malware: Research developments, trends and challenges
• Using convolutional neural networks for classification of malware represented as images
• Robust static analysis of portable executable malware
• Scalable Function Call Graph-based Malware Classification
• Deep Residual Learning for Image Recognition
• Predicting signatures of future malware variants
• Image-based Malware Family Detection: An Assessment between Feature Extraction and Classification Techniques
• Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
• Convolutional neural networks and extreme learning machines for malware classification
• Byte level n-gram analysis for malware detection
• A comparative study on image similarity algorithms based on hash
• Transcend: Detecting Concept Drift in Malware Classification Models
• Malware classification with deep convolutional neural networks
• Malware classification based on call graph clustering
• Discriminant malware distance learning on structural information for automated malware classification
• Deep convolutional malware classifiers can learn from raw executables and labels only
• From word embeddings to document distances
• Backpropagation applied to handwritten zip code recognition
• Detecting near-duplicates for web crawling
• Efficient estimation of word representations in vector space
• Performance analysis of machine learning and pattern recognition algorithms for malware classification
• Malware images: visualization and automatic classification
• Malware identification using visualization images and deep learning
• Comparison of interpolating methods for image resampling
• Malware Classification Method Based on Word Vector of Bytes and Multilayer Perception
• Malware detection by eating a whole exe
• A Survey of Machine Learning Methods and Challenges for Windows Malware Classification
• Learning the pe header, malware detection with minimal domain knowledge
• Malicious software classification using transfer learning of resnet-50 deep neural network
• An empirical study of the naive Bayes classifier
• Microsoft malware classification challenge
• Opcode sequences as representation of executables for data-mining-based unknown malware detection
• AVclass: A Tool for Massive Malware Labeling
• Very deep convolutional networks for large-scale image recognition
• Malware classification using image representation
• A state-of-the-art survey of malware detection approaches using data mining techniques
• Deep learning and visualization for identifying malware families
• Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
• Going deeper with convolutions
• Rethinking the inception architecture for computer vision
• Survey of machine learning techniques for malware analysis
• IMCFN: Image-based malware classification using fine-tuned convolutional neural network architecture
• Image-Based malware classification using ensemble of CNN architectures (IMCEC)
• Detecting malware evolution using support vector machines
• Classifying malware represented as control flow graphs using deep graph convolutional neural network. Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE
• An end-to-end deep learning architecture for graph classification