title: On the Impact of Network Data Balancing in Cybersecurity Applications
authors: Pawlicki, Marek; Choraś, Michał; Kozik, Rafał; Hołubowicz, Witold
date: 2020-05-23
journal: Computational Science - ICCS 2020
DOI: 10.1007/978-3-030-50423-6_15

Abstract. Machine learning methods are now widely used to detect a wide range of cyberattacks. Nevertheless, the commonly used algorithms come with challenges of their own - one of them lies in network dataset characteristics. The dataset should be well-balanced in terms of the number of malicious data samples vs. benign traffic samples to achieve adequate results. When the data is not balanced, numerous machine learning approaches show a tendency to classify minority-class samples as majority-class samples. Since network traffic data usually contains significantly fewer malicious samples than benign samples, this work addresses the problem of learning from imbalanced network traffic data in the cybersecurity domain. A number of balancing approaches are evaluated, along with their impact on different machine learning algorithms.

1 Introduction

The importance of cybersecurity rises with every passing year, along with the number of connected individuals and the growing number of devices utilising the Internet for various purposes [1, 2]. The antagonistic forces, be it hackers, crackers, state-sponsored cyberforces or a range of other malicious actors, employ a variety of methods to cause harm to common users and critical infrastructure alike [3, 4]. The massive loads of data transmitted every single second exceeded the human capacity to deal with them a long time ago. Thus, a myriad of machine learning (ML) methods have been successfully implemented in the domain [5-7].

As rewarding as they are, AI-related approaches come with their own set of problems. One of them is the susceptibility to data imbalance. The data imbalance problem refers to a situation in which one or multiple classes have significantly more learning samples than the remaining classes. This often results in misclassification of the minority samples by a substantial number of classifiers, a predicament especially pronounced when the minority classes are the ones that bear the greatest importance - like malignant cancer samples, fraud events or, as in the case of this work, network intrusions. Additionally, the deterioration of a given model might go unnoticed if the method is only evaluated on the basis of accuracy.
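As a hypothetical illustration of this pitfall (our example, not the paper's), consider a degenerate model on a 97:3 benign-to-attack split:

```python
# Hypothetical illustration: a classifier that always predicts 'benign'
# scores high plain accuracy on imbalanced data while detecting no attacks.
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = [0] * 970 + [1] * 30   # 0 = benign (majority), 1 = attack (minority)
y_pred = [0] * 1000             # always predict 'benign'

print(accuracy_score(y_true, y_pred))           # 0.97 - deceptively high
print(balanced_accuracy_score(y_true, y_pred))  # 0.50 - chance level
```

Balanced accuracy, which averages per-class recall, immediately exposes the failure that plain accuracy conceals.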
With the significance of the above-mentioned difficulty in plenty of high-stakes practical settings, various methods to counter the issue have been proposed. These fall roughly into three categories: undersampling, oversampling and cost-sensitive methods. In this work numerous approaches to dataset balancing are examined, the influence each method has on a number of ML classifiers is highlighted and, in conclusion, the experimentally best approach for network intrusion detection is chosen.

The major contribution and unique value of this work lies in highlighting that the impact dataset balancing methods have on the behaviour of ML classifiers is not always straightforward and intuitive. A number of balancing approaches are thoroughly evaluated and their impact on both the dataset and the behaviour of classifiers is showcased, all of this in the context of a practical, vital domain: network intrusion detection. In the era of big data, undersampling approaches need to be thoroughly researched, as their reduced computational cost could become a major benefit in contrast to oversampling methods.

The paper is structured as follows: in Sect. 2 the pipeline of network intrusion detection is illustrated and described, and the ML algorithms utilised are succinctly introduced; in Sect. 3 the chosen balancing methods are characterised.

2 The Pipeline of ML-Based Network Intrusion Detection

The focus of this research lies on the impact that the balance of instance numbers among classes in a dataset has on the performance of ML-based classification methods. In general, the step-by-step process of an ML-based Intrusion Detection System (IDS) can be succinctly summarised as follows: a batch of annotated data is used to train a classifier. The algorithm 'fits' to the training data, creating a model. This is followed by testing the performance of the acquired model on the testing set - a batch of unforeseen data. In order to alleviate the data balancing problem present in the utilised IDS dataset, an additional step is undertaken before the algorithm is trained (as seen in Fig. 1).

The ML-based classifier block of Fig. 1 can be realised by an abundance of different machine learning methods. In fact, recent research showcases numerous novel approaches, including deep learning [7, 8], ensemble learning [9, 10], various augmentations to classical ML algorithms [11], etc. In this work three basic models were chosen to put the emphasis on the data balancing part:

- Artificial Neural Network [12, 13]
- Random Forest [14]
- Naive Bayes [15]

These represent three significantly different approaches to machine learning and were selected to cover possibly the widest range of effects dataset balancing could have on the effectiveness of ML. The ANN in use is set up as follows: two hidden layers of 40 neurons each, with the Rectified Linear Unit as the activation function, the ADAM optimizer, a batch size of 100 and 35 epochs. The setup emerged experimentally.
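For illustration, the described network could be instantiated as in the sketch below, assuming scikit-learn's MLPClassifier (the paper does not name the implementation it used; the hyperparameters follow the description above):

```python
# Sketch of the described ANN, assuming scikit-learn; the paper specifies
# only the architecture and training hyperparameters, not the library.
from sklearn.neural_network import MLPClassifier

ann = MLPClassifier(
    hidden_layer_sizes=(40, 40),  # two hidden layers of 40 neurons each
    activation='relu',            # Rectified Linear Unit
    solver='adam',                # ADAM optimizer
    batch_size=100,
    max_iter=35,                  # for 'adam', iterations correspond to epochs
)
# ann.fit(X_train, y_train) followed by ann.predict(X_test)
```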
In the cases suffering from the data imbalance problem, the number of training samples belonging to some classes is larger than for the other classes (Table 2 shows the class counts after SMOTE).

3 Dataset Balancing Methods

The conundrum of data imbalance has recently been deeply studied in the area of machine learning and data mining. In numerous cases, this predicament impacts machine learning algorithms and, as a result, deteriorates the effectiveness of the classifier [16]. Typically, in such cases classifiers achieve higher predictive accuracy on the majority class, but poorer predictive accuracy on the minority class. In general, solutions to this problem can be categorised as (i) data-related and (ii) algorithm-related. In the following paragraphs, these two categories of balancing methods are briefly introduced. The focus of the analysis was on the practical cybersecurity-related application that faces the data imbalance problem.

Two techniques belonging to the data-related category that are commonly used to cope with imbalanced data use the principle of deriving a new dataset out of the existing one. This is realised with data sampling approaches, of which there are two widely recognised kinds: data over-sampling and under-sampling.

Under-sampling balances the dataset by decreasing the size of the majority class. This method is adopted when the number of elements belonging to the majority class is rather high. In that way, one can keep all the samples belonging to the minority class and randomly (or not) select the same number of elements representing the majority class. In our experiments we tried a number of undersampling approaches; one of them was Random Sub-sampling. The effect random subsampling has on the dataset is illustrated in Fig. 3; the results the method achieves in conjunction with the selected ML algorithms are showcased in Table 3.

There are also approaches that introduce some heuristics into the process of sample selection. The algorithm called NearMiss [17] is one of them. This approach engages an algorithm for nearest-neighbour analysis (e.g. k-nearest neighbours) in order to select the dataset instances to be under-sampled. The NearMiss algorithm chooses those samples for which the average distance to the closest samples of the opposite class is the smallest. The effect the algorithm has on the dataset is illustrated in Fig. 3; the results obtained are found in Table 4.

Another example of an algorithm falling into the undersampling category is called TomekLinks [18]. The method performs under-sampling by removing Tomek's links. A Tomek's link exists if two samples are the nearest neighbours of each other. More precisely, a Tomek's link between two samples x and y of different classes is defined by d(x, y) < d(x, z) and d(x, y) < d(y, z) for any other sample z. The effect removing Tomek's links has on the dataset is illustrated in Fig. 4; the effect it has on the ML models is found in Table 5.

A different approach to under-sampling involves centroids obtained from a clustering method. In this type of algorithm, the samples belonging to the majority class are first clustered (e.g. using the k-means algorithm) and then replaced with the cluster centroids. In the experiments this approach is indicated as Cluster Centroids. The results of the clustering procedure are illustrated in Fig. 4 and in Table 6.

On the other hand, the oversampling method is to be adopted when the size of the original dataset is relatively small. In this approach, one takes the minority class and increases its cardinality in order to achieve balance among the classes. This can be done by using a technique like bootstrapping, in which case the minority class is sampled with repetitions. Another solution is to use SMOTE (Synthetic Minority Over-Sampling Technique) [19]. There are various modifications to the original SMOTE algorithm; the one evaluated in this paper is named Borderline SMOTE. In this approach, the samples representing the minority class are first categorised into three groups: danger, safe and noise. A sample x is considered to belong to the noise category if all nearest neighbours of x are from a different class than the analysed sample, to danger when half or more of them are, and to safe otherwise. In the Borderline SMOTE algorithm, only the danger (borderline) data instances are over-sampled [19]. The effect of this procedure on the dataset is presented in Fig. 2. The results are placed in Table 7.

A final note concluding this section is the observation that there is no silver bullet putting one sampling method above another. In fact, their applicability depends on the use-case scenario and the dataset itself.
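All of the resampling strategies discussed in this section have readily available implementations in the imbalanced-learn library. The following sketch (ours - the paper does not state which implementation was used) applies each of them to a stand-in imbalanced dataset:

```python
# Sketch of the discussed resampling strategies using imbalanced-learn;
# the synthetic data below stands in for the CICIDS2017 flows.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import (RandomUnderSampler, NearMiss,
                                     TomekLinks, ClusterCentroids)
from imblearn.over_sampling import BorderlineSMOTE

X, y = make_classification(n_samples=5000, n_classes=3, n_informative=6,
                           weights=[0.85, 0.10, 0.05], random_state=42)

samplers = {
    'Random Sub-sampling': RandomUnderSampler(random_state=42),
    'NearMiss': NearMiss(version=1),
    'Tomek links': TomekLinks(),
    'Cluster Centroids': ClusterCentroids(random_state=42),
    'Borderline SMOTE': BorderlineSMOTE(random_state=42),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))  # class counts after resampling
```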
For the sake of clear illustration, the original dataset's class distribution is depicted in Fig. 2; the results the ML algorithms achieved on it are found in Table 1.

Utilizing unsuitable evaluation metrics for a classifier trained on imbalanced data can lead to wrong conclusions about the classifier's effectiveness. As the majority of machine learning algorithms do not operate very well with imbalanced datasets, the commonly observed scenario is the classifier totally ignoring the minority class. This happens because the classifier is not sufficiently penalised for the misclassification of data samples belonging to the minority class. This is why algorithm-related methods have been introduced as modifications to the training procedure.

One technique is to use other performance metrics. Alternative evaluation metrics that are suitable for imbalanced data are:
- precision - indicating the percentage of data samples flagged by the classifier that are actually relevant,
- recall (or sensitivity) - indicating the percentage of all relevant instances that have been detected,
- f1-score - computed as the harmonic mean of precision and recall.

Another technique that is successfully used in the field is cost-sensitive classification. Recently, this learning procedure has been reported to be an effective solution to class imbalance in large-scale settings. Without loss of generality, let us define the cost-sensitive training process as the following optimisation formula:

\hat{\theta} = \min_{\theta} \sum_{i=1}^{N} C_i e_i

where \theta indicates the classifier parameters, e_i the error in the classifier's response for the i-th (out of N) data samples, and C_i the importance of the i-th data sample. In cost-sensitive learning, the idea is to give a higher importance C_i to the minority class, so that the bias towards the majority class is reduced. In other words, we produce a cost function that penalises the incorrect classification of the minority class more heavily than incorrect classifications of the majority class. In this paper we have focused on Cost-Sensitive Random Forest as an example of cost-sensitive meta-learning, mainly due to the fact that the Random Forest classifier in this configuration yields the most promising results. These can be found in Table 10.
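A minimal sketch of one way to realise such a cost-sensitive Random Forest, assuming scikit-learn's class_weight mechanism (the paper's experimentally derived weight setup is not reproduced here):

```python
# Sketch of a cost-sensitive Random Forest via per-class weights, assuming
# scikit-learn; the weights below are placeholders, not the paper's setup.
from sklearn.ensemble import RandomForestClassifier

# 'balanced' sets C_i inversely proportional to class frequencies; an
# explicit dict such as {13: 5.0} would raise the cost of class-13 errors.
rf = RandomForestClassifier(n_estimators=100,
                            class_weight='balanced',
                            random_state=42)
# rf.fit(X_train, y_train)
```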
4 The CICIDS2017 Dataset

CICIDS2017 [20] is an effort to create a dependable and recent cybersecurity dataset. Intrusion detection datasets are notoriously hard to come by, and the available ones display at least one of several frustrating shortcomings, like a lack of traffic diversity, insufficient attack variety, insufficient features, etc. The authors of CICIDS2017 offer a dataset with realistic benign traffic, created as an interpolation of the behaviour of 25 users using multiple protocols. The dataset is a labelled capture of 5 days of work, with 4 days putting the framework under siege by a plethora of attacks, including malware, DoS attacks, web attacks and others. This work relies on the captures from Tuesday, Wednesday, Thursday and Friday.

CICIDS2017 constitutes one of the newest datasets available to researchers, featuring over 80 network flow features. The imbalance ratio of the majority class to the sum of the sample counts of all the remaining classes was calculated to be 2.902. The dataset consists of 13 classes - 12 attacks and 1 benign class. As depicted in Fig. 2, there is a wide discrepancy among the classes in terms of the number of instances, especially between the benign class and the attack classes. The number of instances in the respective classes of the training set is displayed in Table 1.

5 Experiments and Results

During the tests, the initial hypothesis was that balancing the classes would improve the overall results. Random Subsampling (Table 3), along with a slew of other subsampling methods, was used to observe the influence dataset balancing has on the performance of three reference ML algorithms - an Artificial Neural Network (ANN), a Random Forest and a Naive Bayes classifier. Finally, Borderline SMOTE was conducted as a reference oversampling method. The results of those tests can be found in Tables 4, 5, 6 and 7.

It is immediately apparent from inspecting the recall on the unbalanced dataset (Table 1) that some instances of the minority classes are not recognised properly (classes 1 and 13). Balancing the benign class to match the number of samples of all the attacks combined changed both the precision and the recall achieved by the algorithm. It also became apparent that none of the subsampling approaches outperformed simple random subsampling in the case of CICIDS2017.

The tests revealed an interesting connection among precision, recall and the imbalance ratio of the dataset. Essentially, there seems to exist a trade-off between precision and recall that can be controlled by the number of instances of the classes in the training dataset. To evaluate that assertion, further tests were conducted. A Random Forest algorithm was trained on the unbalanced dataset, and then all the classes were subsampled to match the number of samples in one of the minority classes (Table 9 - 1174 instances per class, and Table 8 - 7141 instances per class). The tests proved that changing the balance ratio by undersampling the majority classes improves the recall of the minority classes but degrades the precision of the classifier on those classes. This basically means that dataset balancing causes the ML algorithms to misclassify the (previously) majority classes as instances of the minority classes, thus boosting the false positives.

Finally, a cost-sensitive Random Forest algorithm was tested. After trying different weight setups, results exceeding any of the previous undersampling or oversampling methods were attained (Table 10). It is noteworthy that the achieved recall for class 13 is higher while still retaining relatively high precision. A relationship between class 11 and class 13 was also discovered: setting a higher weight for class 13 results in misclassification of class 11 samples as class 13 samples, and the other way round.

To provide further insight into the effects of dataset balancing, statistical analysis was performed with regard to balanced accuracy [21].
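The sketch below shows how such a comparison could be set up with SciPy; the scores are placeholders rather than the paper's measurements, and the authors do not specify their exact test routine:

```python
# Illustrative two-sample t-test over balanced accuracy scores from repeated
# runs of two competing setups; the values are placeholders, assuming SciPy.
from scipy import stats

scores_cost_sensitive_rf = [0.91, 0.92, 0.90, 0.93, 0.91]
scores_random_subsampling = [0.89, 0.90, 0.88, 0.91, 0.89]

t_stat, p_value = stats.ttest_ind(scores_cost_sensitive_rf,
                                  scores_random_subsampling)
print(f"t = {t_stat:.5f}, p = {p_value:.6f}")  # significant if p < 0.05
```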
The tests revealed that:
- The cost-sensitive random forest performed better than simple random subsampling, with the t-value at 2.07484 and the p-value at 0.026308; the result is significant at p < 0.05.
- The random forest classifier over the dataset randomly subsampled down to 7141 samples in each majority class performed better than when just the 'benign' class was randomly subsampled, with the t-value at 2.96206 and the p-value at 0.004173; the result is significant at p < 0.05.
- The cost-sensitive random forest was not significantly better than the random forest trained on the randomly subsampled dataset in the 7141 variant (t-value 1.23569, p-value 0.11623; not significant at p < 0.05).
- Cutting the Tomek's links did not prove as good a method as random subsampling in the 7141 variant, with the t-value at 3.69827 and the p-value at 0.000823; the result is significant at p < 0.05.
- Removing the Tomek's links was not significantly better than just using the imbalanced dataset, with the t-value at 0.10572 and the p-value at 0.458486.
- Both the 7141 variant of random subsampling and the cost-sensitive random forest were better options than just using the imbalanced dataset, with the t-value at 2.96206 and the p-value at 0.004173 for random subsampling, and the t-value at 2.65093 and the p-value at 0.008129 for the cost-sensitive classifier.

6 Conclusions

In this paper, an evaluation of a number of dataset balancing methods for ML algorithms in the cybersecurity domain was presented. The conducted experiments revealed a number of interesting details about those methods. Firstly, in the case of the CICIDS2017 dataset, random subsampling was just as good as or better than the other undersampling methods, and its results were on par with Borderline SMOTE. Secondly, the final proportions of the dataset can have just as much impact on the results of ML classification as the choice of the balancing procedure itself. Thirdly, there is a relationship among the size of the majority classes and the precision and recall achieved, which is simply expressed by the number of majority samples falsely classified as minority samples.

References

1. Identifying core concepts of cybersecurity: results of two Delphi processes
2. Cybersecurity issues in implanted medical devices
3. Internet of things: a survey of technologies and security risks in smart home and city environments
4. A scalable distributed machine learning approach for attack detection in edge computing environments
5. Comparison of deep learning and the classical machine learning algorithm for the malware detection
6. Machine learning techniques applied to detect cyber attacks on web applications
7. Evaluation of convolutional neural network features for malware detection
8. Comparison of three deep learning-based approaches for IoT malware detection
9. Research on intrusion detection model using ensemble learning methods
10. An ensemble approach for intrusion detection system using machine learning algorithms
11. Machine learning approach to IDS: a comprehensive review
12. Introduction to Deep Learning. UTCS
13. A comparative performance evaluation of intrusion detection based on neural network and PCA
14. Random forests
15. Data Mining and Knowledge Discovery Handbook
16. Solution to data imbalance problem in application layer anomaly detection systems
17. KNN approach to unbalanced data distributions: a case study involving information extraction
18. Two modifications of CNN
19. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning
20. Toward generating a new intrusion detection dataset and intrusion traffic characterization
21. The balanced accuracy and its posterior distribution

Acknowledgement. This work is funded under the SPARTA project, which has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 830892.