key: cord-0471393-9dz2qged
authors: Tann, Wesley Joon-Wie; Wei, Jackie Tan Jin; Purba, Joanna; Chang, Ee-Chien
title: Filtering DDoS Attacks from Unlabeled Network Traffic Data Using Online Deep Learning
date: 2020-12-12
journal: nan
DOI: nan
sha: 54c2e34a5e1f89bea854b3d6167fb564048d05c3
doc_id: 471393
cord_uid: 9dz2qged

DDoS attacks are simple, effective, and still pose a significant threat even after more than two decades. Given the recent success in machine learning, it is interesting to investigate how we can leverage deep learning to filter out application-layer attack requests. There are challenges in adopting deep learning solutions due to the ever-changing attack profiles, the lack of labeled data, and constraints in the online setting. Offline unsupervised learning methods can sidestep these hurdles by learning an anomaly detector $N$ from the normal-day traffic $\mathcal{N}$. However, anomaly detection does not exploit information acquired during attacks, and its performance typically is not satisfactory. In this paper, we propose two frameworks that utilize both the historic $\mathcal{N}$ and the mixture traffic $\mathcal{M}$ obtained during attacks, consisting of unlabeled requests. We also introduce a machine learning optimization problem that aims to sift out the attacks using $\mathcal{N}$ and $\mathcal{M}$. First, our proposed approach, inspired by statistical methods, extends an unsupervised anomaly detector $N$ to solve the problem using estimated conditional probability distributions. We adopt transfer learning to apply $N$ on $\mathcal{N}$ and $\mathcal{M}$ separately and efficiently, combining the results to obtain an online learner. Second, we formulate a specific loss function more suited for deep learning and use iterative training to solve it in the online setting. On publicly available datasets, our online learners achieve a $99.3\%$ improvement on false-positive rates compared to the baseline detection methods.
In the offline setting, our approaches are competitive with classifiers trained on labeled data.

Distributed Denial-of-Service (DDoS) attacks are well established as a significant threat to present-day Internet security, denying legitimate users access to shared and essential resources. The earliest reported DDoS attack in 1999 [2] started a wave of denial-of-service attacks that are distributed in nature. Even though more than two decades have passed, these attacks (simple to set up, difficult to stop, and very effective) are more popular than ever as they remain potent [29]. Allocating extra resources is a typical protective approach to handle such DDoS traffic. However, provisioning excess resources goes hand in hand with additional costs. There is a growing urgency to find practical and inexpensive methods that can effectively abate disruptions by filtering malicious traffic during an attack. We focus on devising such a filtering system.

A primary form of DDoS defense, anomaly detection [18, 20], faces growing difficulty in detecting malicious traffic as application-layer attacks become increasingly sophisticated. There are several significant challenges. First, the attacks are often domain-specific with varying characteristics, making it hard for defense methods to generalize across them. Hence, a two-class classifier trained on historic attacks might not be applicable to different victims. Second, time is a constraint in the real-time environment: online models have to be computationally efficient to respond to requests rapidly during an ongoing attack. Third, while online machine learning seems to be an effective way to handle these evolving dynamics, the lack of labeled requests at the time of the attacks, which supervised learning algorithms require, poses another challenge for this learning paradigm. These challenges must be addressed before such defenses can be deployed in the real world.
Existing DDoS defenses can generally be classified as statistical-based or machine learning-based detection. Statistical methods typically involve some fast calculation of network traffic properties, such as entropy scoring of network packets [6, 11, 27, 34], IP filtering [19, 26], and IP source traceback [13, 35]. Deep learning, on the other hand, exploits intrinsic properties of the traffic but can be computationally intensive. Most deep learning approaches for DDoS in the literature adopt offline learning, where training is conducted well before the attack commences (e.g., at least a few days earlier) [14, 15, 24, 32, 37]. There are fewer works on online learning that perform incremental updates as more attack requests arrive. Çakmakçı et al. gave an up-to-date survey of online DDoS detection [4]. Most online learning methods employ unsupervised clustering (e.g., k-means) on the attack-day data [4, 7, 17].

In this paper, we explore how deep learning can be leveraged to enhance detection performance, especially on application-layer traffic, which exhibits intrinsic statistical properties. We formulate a two-class learning problem that utilizes data from two operational periods: the normal-day traffic $\mathcal{N}$ and the attack-day traffic $\mathcal{M}$, which contains a mixture of unlabeled legitimate and malicious traffic. We also assume a rough estimate $\alpha$ of the proportion of attack requests in $\mathcal{M}$. The learning objective is to accurately predict requests from $\mathcal{N}$ as normal, and an $\alpha$ proportion of the requests in $\mathcal{M}$ as attacks.

To solve this problem, we first employ techniques inspired by the statistical method PacketScore [33], which takes a principled approach to estimating conditional probabilities of network requests (the likelihood that a request is an attack given that it is observed in the mixture traffic). We perform the learning in a two-step process. (1) We apply unsupervised learning on the normal-day traffic $\mathcal{N}$ to obtain a model $N$ that learns the distribution of $\mathcal{N}$.
(2) When an attack commences, a similar unsupervised learning model $D$ rapidly learns the distribution of the mixture $\mathcal{M}$. Now, the attack and normal requests can be differentiated based on the differences between the two conditional probability distributions $N$ and $D$. We then rank each request $x$ by $N(x)/D(x)$, where a higher score indicates a higher chance of normality, as illustrated in Figure 1.

Figure 1: A framework that utilizes both $\mathcal{N}$ and $\mathcal{M}$ to rank network traffic based on their final $N/D$ scores. The higher the final score, the more likely the traffic is normal. To achieve computation speedup, the embedding of $N$ is transferred to $D$.

In our construction, both models $N$ and $D$ are LSTM recurrent neural networks, designed for application-layer DDoS, where a "request" consists of a sequence of events (e.g., an HTTP GET to the landing page, followed by subsequent HTTP GET requests). To meet the online requirements, we design model $D$ to achieve quick online updates by relying on transfer learning. When an attack occurs, the optimized embedding of $N$ is transferred to $D$, significantly reducing the training needed for $D$. We call this particular construction of online learning the N-over-D nets.

However, there are two main challenges in generalizing statistical methods to a deep-learning approach. First, the learning models normalize data values for effective training and smooth gradient flow, and even though the range of values is maintained, the resulting model-calculated probabilities are not exact. While we designed our LSTM to get close to the exact likelihood, the result could be distorted when taking the $N/D$ division. Second, models $N$ and $D$ are trained separately during each operational period. There is no joint training of the models, so we do not fully leverage all available information for each model. Hence, we formulate a machine learning loss function specifically to address these challenges.
As N-over-D follows a statistical method to obtain probabilistic measures, it might not attain optimality in the learning paradigm. We design an online iterative two-class classifier for this particular loss function that jointly trains on both the normal-day traffic $\mathcal{N}$ and the mixture traffic $\mathcal{M}$. This online classifier does not have ground-truth label information. Instead, it uses pseudo-ground-truth labels, taking all traffic during attacks as malicious and labeling it accordingly. We train the classifier, whose model architecture is similar to the LSTMs in N-over-D, on these inexact labels and improve it iteratively. This enhanced iterative classification approach achieves remarkable detection performance. Although the iterative classifier takes longer to train than N-over-D (3-5 times longer), it is nevertheless suitable for attacks with a short timeframe, and it can even be used for classifying unlabeled historic $\mathcal{M}$.

Comprehensive experiments demonstrate the ability of our approach to mitigate a range of DDoS attacks. For example, on the CICIDS2017 dataset, while the model $N$ achieves an accuracy and false-positive rate of (32.5%, 69.97%), the online N-over-D and iterative classifier models achieve markedly better results of (93.3%, 0.47%) and (93.5%, 3.40%), respectively. Moreover, both N-over-D and the iterative classifier outperform existing online detection methods [4, 5, 17] by a large margin in either accuracy or false-positive rate (see Section 7 for details).

Contributions. (1) We formulate DDoS detection on unlabeled data as a machine learning optimization problem with the normal-day traffic $\mathcal{N}$, the mixture traffic $\mathcal{M}$, and the estimated proportion of attack as training inputs. (2) We propose an online approach, N-over-D, inspired by statistical methods, that extends one-class unsupervised learning to a two-class learning method. It deploys an LSTM-based model on $\mathcal{N}$ and $\mathcal{M}$ separately and incorporates transfer learning to speed up training in the online setting.
(3) We design a specific loss function to address the challenges of N-over-D and propose an iterative online method for this loss function, which jointly trains on both the normal-day traffic $\mathcal{N}$ and the mixture traffic $\mathcal{M}$.

Organization. We organize the rest of the paper as follows. Section 2 presents the background and challenges of the DDoS problem. We introduce the problem in Section 3 and propose an appropriate conditional probability learning approach in Section 4. We then detail the iterative classification approach in Section 5 and the online LSTM architecture of our approaches in Section 6. In Section 7, we describe the datasets and evaluate our approaches on them, comparing with state-of-the-art methods. We also analyze the performance of our mitigation process and discuss several practical issues. Finally, we summarize related work in Section 8 and conclude the paper in Section 9.

This section gives an overview of application-layer network services and describes the challenges of detecting DDoS attacks in such systems. We also motivate the need for innovative defenses that can perform online network traffic monitoring.

Application-layer DDoS attacks, launched from widely distributed botnets, are surging and increasingly disruptive to current network services [29]. Symantec's telemetry shows that it is often small and medium-sized retailers, selling goods ranging from clothing to gardening equipment to medical supplies, that are on the receiving end of the majority of attacks [30]. It is a serious global issue that could potentially affect any business that operates online, especially essential services during critical times such as the COVID-19 pandemic. Q1 2020 saw an unexpected spike in the number of attacks, up 80% against Q1 2019, together with noticeable changes in the distribution of DDoS attacks by type [12].
Several unsupervised learning mitigation strategies [4, 7, 17] have been proposed to counter such attacks in an online manner. These learning-based methods employ rudimentary machine learning algorithms, e.g., random forest, logistic regression, and decision tree (see Sec. 8.2 for details), and perform clustering with simple distance measures over a few elementary features. While they can capture some basic network traffic properties, they are unable to learn deeper representations of legitimate and malicious requests that would better distinguish between the two types of traffic. Moreover, the optimization goal of these methods is not clear.

Common application-layer flooding attacks such as HTTP floods are designed to generate an overwhelming amount of traffic, increasing the servers' load to severely impact a target. A popular tactic used by attackers to disrupt legitimate access is to disguise their traffic as flash crowds. A flash crowd is a surge in traffic to a particular Web site that causes the site to be virtually unreachable [36]. Even though attacks and flash crowds originate from different motivations, the covertness of such attacks makes them hard to detect. These attacks typically send numerous HTTP packets abruptly and rapidly, flooding a target server with seemingly harmless HTTP requests and evading most defense systems. Any defense system designed to detect and mitigate such attacks must quickly differentiate between attack and normal traffic, minimize losses, and perform online mitigation.

In this section, we formalize the learning problem in the DDoS setting. We consider two operational periods, a normal period and an attack period. In practice, the attack period can be readily detected, as it is characterized by an unusually high volume of requests. Let $\mathcal{N}$ be the requests logged during the normal period, and $\mathcal{M}$ the requests logged during the attack period. The mixture $\mathcal{M}$ contains both legitimate requests and attack requests.
We assume that, based on the volume, we have a rough estimate of $\alpha$, the proportion of attack requests in $\mathcal{M}$. While $\mathcal{M}$ is unlabeled, all requests in $\mathcal{N}$ are normal. Given $\mathcal{N}$, $\mathcal{M}$, and $\alpha$, the goal is to derive a two-class learning method that differentiates attack from normal, so that $\mathcal{N}$ would be classified as normal, and an $\alpha$ proportion of $\mathcal{M}$ would be classified as attacks.

Offline vs Online Setting. The online setting consists of three stages.
• In the pre-processing stage, computationally intensive processing can be performed on $\mathcal{N}$, giving an intermediate model and an initial intermediate representation.
• The online learning stage is carried out when the attack commences. Time is divided into short intervals of ℓ minutes (e.g., ℓ = 1), and the mixture $\mathcal{M}$ is divided accordingly into $\mathcal{M}_1, \mathcal{M}_2, \ldots$, where $\mathcal{M}_i$ contains the mixture within the $i$-th interval. During the $(i+1)$-th interval, training is performed on $\mathcal{M}_i$ and the current intermediate representation, giving an updated representation and an updated model. There are stringent computing resource constraints, e.g., the training should be completed within ℓ minutes.
• In the filtering stage, the newly trained model is deployed to filter the incoming requests. To meet the real-time requirement, instead of applying the model to each request, it is applied to the mixture to obtain an updated blacklist of attack identities (e.g., IP addresses). The blacklist is then activated to filter out the attack requests.
In the offline setting, the training can utilize the full $\mathcal{M}$ and $\mathcal{N}$, and there is no real-time requirement.

DDoS in the application layer. A request in network-layer DDoS such as a SYN flood typically consists of a single packet. In contrast, a request in application-layer DDoS, which is the focus of this paper, could be a sequence $x = (x_1, x_2, \ldots, x_T)$ of sub-requests $x_t$. For instance, a visit to a web service could consist of an HTTP GET to the landing page, followed by others.
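The three online stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout `(timestamp, source_ip)`, the function names, and the choice to rank identities by their mean score are all assumptions made for illustration, and `normality_score` stands in for whatever trained model is available at that interval.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical record layout: (timestamp in seconds, source IP address).
Request = Tuple[float, str]


def split_into_intervals(requests: Iterable[Request],
                         ell_minutes: float = 1.0) -> Dict[int, List[Request]]:
    """Split the mixture into M_1, M_2, ... by arrival time (1-indexed)."""
    ordered = sorted(requests)
    if not ordered:
        return {}
    start, width = ordered[0][0], ell_minutes * 60.0
    intervals: Dict[int, List[Request]] = defaultdict(list)
    for ts, ip in ordered:
        intervals[int((ts - start) // width) + 1].append((ts, ip))
    return dict(intervals)


def build_blacklist(mixture: Iterable[Request],
                    normality_score: Callable[[Request], float],
                    reject_fraction: float) -> Set[str]:
    """Blacklist the identities whose mean normality score is lowest."""
    per_ip: Dict[str, List[float]] = defaultdict(list)
    for req in mixture:
        per_ip[req[1]].append(normality_score(req))
    # Rank identities from least to most normal (assumed aggregation: the mean).
    ranked = sorted(per_ip, key=lambda ip: sum(per_ip[ip]) / len(per_ip[ip]))
    return set(ranked[:int(len(ranked) * reject_fraction)])


def filter_requests(incoming: Iterable[Request], blacklist: Set[str]) -> List[Request]:
    """Drop requests whose source identity is on the blacklist."""
    return [req for req in incoming if req[1] not in blacklist]
```

Scoring the mixture once per interval and then filtering by identity keeps the per-request cost at a set lookup, which is the motivation the text gives for using a blacklist.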
The intrinsic characteristics of such a sequence can be learned to differentiate attacks. In our notation, we view each request as a sequence of sub-requests, and a user corresponds to an IP address. In evaluating performance, instead of measuring actual byte counts, the reported measures are calculated based on evaluation metrics such as accuracy, false-positive rate, and F1 score.

We propose the N-over-D framework to address the formulated learning problem, extending one-class unsupervised anomaly detection into a two-class classifier in a three-step process. A one-class learning model is trained on a training set containing samples of the same class, giving a predictive model that, on input $x$, outputs the probability of $x$ belonging to that class.
(1) During pre-processing, one-class learning is applied to $\mathcal{N}$ to obtain a model $N$ that predicts the distribution of normal requests.
(2) During online learning, a similar process is applied to $\mathcal{M}$ to obtain a model $D$ that predicts the distribution of requests in the mixture.
(3) During filtering, the predicted probabilities $N(x)$ and $D(x)$ are combined to give a model $A$ that predicts $A(x)$, the likelihood that $x$ is from the attack given that $x$ is observed in the mixture.
The last step calculates $A(x)$ from $N(x)$ and $D(x)$, following similar techniques in PacketScore [33]. Note that

$$A(x) = 1 - (1 - \alpha)\,\frac{N(x)}{D(x)}. \qquad (1)$$

From $A(\cdot)$, we can classify $x$ as an attack when $A(x) > 0.5$. However, the accuracy would rely on the accuracy of the estimated $\alpha$. Typically, during filtering, a threshold is set so that a certain predetermined percentage of requests is allowed. It is tolerable to accept attack requests as long as the system can handle the load. Such a threshold can be selected by sorting the probabilities $A(\cdot)$ of the received requests and determining the appropriate cutoff. Since only the ranking matters, we can ignore the constant terms and focus only on the ratio $N/D$, which does not depend on $\alpha$. Moreover, to facilitate fast online learning, we can deploy transfer learning from $N$ to $D$ so that $D$ can be trained readily.
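The argument that only the ratio matters can be checked numerically. The sketch below assumes the mixture density decomposes as D(x) = α·P_attack(x) + (1 − α)·N(x), so that Bayes' rule gives P(attack | x) = 1 − (1 − α)·N(x)/D(x); this decomposition, the function names, and the toy scores are illustrative assumptions rather than details taken from the paper.

```python
from typing import List, Sequence, Set


def attack_probability(n_x: float, d_x: float, alpha: float) -> float:
    """Posterior that x is an attack, given that x was observed in the mixture.

    Assumes D(x) = alpha * P_attack(x) + (1 - alpha) * N(x).
    """
    return 1.0 - (1.0 - alpha) * n_x / d_x


def rank_by_ratio(n_scores: Sequence[float], d_scores: Sequence[float]) -> List[int]:
    """Indices ordered from most normal (largest N/D) to most attack-like."""
    ratios = [n / d for n, d in zip(n_scores, d_scores)]
    return sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)


def accept_top(n_scores: Sequence[float], d_scores: Sequence[float],
               accept_fraction: float) -> Set[int]:
    """Admit the predetermined fraction of requests with the highest N/D."""
    order = rank_by_ratio(n_scores, d_scores)
    return set(order[:int(len(order) * accept_fraction)])


if __name__ == "__main__":
    n = [0.2, 0.05, 0.1]   # toy N(x) values
    d = [0.25, 0.5, 0.2]   # toy D(x) values
    for alpha in (0.2, 0.5, 0.8):
        probs = [attack_probability(a, b, alpha) for a, b in zip(n, d)]
        # The ordering by attack probability is the same for every alpha.
        print(alpha, sorted(range(3), key=lambda i: probs[i]))
    print(rank_by_ratio(n, d))
```

Because the posterior is a decreasing affine function of N(x)/D(x) for any fixed α in (0, 1), sorting by the ratio and sorting by the posterior give the same order, which is why a cutoff chosen by rank does not need an accurate α.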
We design model $D$ to achieve quick online updates. When an attack occurs, the optimized embedding of $N$ is transferred to $D$, thereby significantly reducing the training needed compared with a random initialization of $D$.

Choice of one-class classification. Not every one-class classifier is applicable in the N-over-D approach. A candidate should meet the following two requirements. (1) First, Equation (1) assumes that the output score of the model is the predicted probability. However, many models transform the probability and output a score that preserves the ranking but distorts the actual value. The ratio of such distorted values from the two models might not preserve the ranking of the true ratio. Hence, it is essential that the score be the probability, or the probability up to some additive and multiplicative constants. (2) Second, as indicated earlier, we need a mechanism to transfer learning from $N$ to $D$.

Limitations. The N-over-D approach is limited in that it relies on the accuracy of the one-class classification, and the classification is applied to the normal and mixture traffic almost independently (except for some information that is transferred from $N$ to $D$ via transfer learning). Intuitively, more accurate results could be attained via joint learning. For instance, some combination of feature values could have few samples from $\mathcal{N}$ but a large population from $\mathcal{M}$. When applied independently, $N$ might overgeneralize those features and predict a higher probability. However, with the additional information that there is indeed a large population from $\mathcal{M}$, a more refined and accurate boundary could be learned.

In this section, to address the limitations of the N-over-D nets, we propose a method that optimizes a specifically formulated loss function designed for deep learning, which directly trains on both $\mathcal{N}$ and $\mathcal{M}$. We employ a binary classification model, similar in architecture to $N$ and $D$, and iteratively update it. Let the trained classifier $C_i$ be the model obtained after the $i$-th iteration.
(1) At each interval of length ℓ during an attack, we assume all traffic is malicious. (2) We then train classifier $C_i$ on the newly observed data to obtain the updated model $C_{i+1}$. (3) Using classifier $C_{i+1}$, we predict and rank the observed data, selecting the top predicted attacks to retrain the classifier. By performing this procedure iteratively, the classifier directly minimizes our defined loss function (Equation 2).

Following the goal of the two-class learning problem (Section 3), we define a loss function (Equation 2) made up of three summations, where $A(x)$ is the model-predicted probability of the observed input $x$ being an attack, and $T(\mathcal{M})$ contains the $\alpha|\mathcal{M}|$ samples from $\mathcal{M}$ with the highest model-predicted probability $A(\cdot)$. The set $T(\mathcal{M})$ can be obtained by ranking the samples in $\mathcal{M}$ according to the predicted probabilities and selecting the highest $\alpha|\mathcal{M}|$. In other words, the three summations correspond to the three sets $\mathcal{N}$, $T(\mathcal{M})$, and $\mathcal{M} - T(\mathcal{M})$. The loss function favors the first and third sets being classified as "normal", and the second set being classified as "attack".

The iterative classifier is able to perform fast online learning because it uses a single supervised learning model, and it is suitable for attacks with a short timeframe. We treat all requests in the normal-day traffic as legitimate and all traffic during an attack period, which is a mixture of normal and attack traffic, as malicious, and label them accordingly. These assumptions give us the initial set of inexactly labeled training data. Next, we train the classifier using these inexact labels. We then apply the trained classifier to the same initial training data, retain the top 40% of traffic predicted as malicious to form a reduced attack dataset, and construct the next set of training data.
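The iterative procedure above can be sketched end to end on toy data. Everything below is an illustrative assumption rather than the paper's implementation: a small logistic regression stands in for the LSTM classifier, the loss is written as a three-term log loss (one plausible form consistent with the description of Equation 2), and mixture samples outside the top set are relabeled as normal, although the text could also be read as simply dropping them.

```python
import math
from typing import List, Sequence, Set, Tuple

Vector = Sequence[float]


def predict(w: Vector, b: float, x: Vector) -> float:
    """Model-predicted attack probability A(x); logits clamped for stability."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))


def train_logistic(xs: List[Vector], ys: List[float],
                   epochs: int = 200, lr: float = 0.5) -> Tuple[List[float], float]:
    """Stand-in classifier: logistic regression fit by batch gradient descent."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for x, y in zip(xs, ys):
            err = predict(w, b, x) - y
            gw = [g + err * xj for g, xj in zip(gw, x)]
            gb += err
        w = [wj - lr * g / len(xs) for wj, g in zip(w, gw)]
        b -= lr * gb / len(xs)
    return w, b


def iterative_filter(normal: List[Vector], mixture: List[Vector], alpha: float,
                     iterations: int = 5) -> Tuple[List[float], float, Set[int]]:
    """Retrain on pseudo-labels, keeping the top alpha*|M| predicted attacks."""
    k = max(1, int(alpha * len(mixture)))
    attack_idx = set(range(len(mixture)))  # start: all mixture traffic is "attack"
    for _ in range(iterations):
        xs = list(normal) + list(mixture)
        ys = [0.0] * len(normal) + [1.0 if i in attack_idx else 0.0
                                    for i in range(len(mixture))]
        w, b = train_logistic(xs, ys)
        ranked = sorted(range(len(mixture)),
                        key=lambda i: predict(w, b, mixture[i]), reverse=True)
        attack_idx = set(ranked[:k])
    return w, b, attack_idx


def three_set_loss(w: Vector, b: float, normal: List[Vector],
                   mixture: List[Vector], alpha: float) -> float:
    """Assumed log-loss over the sets N, T(M), and M - T(M)."""
    k = max(1, int(alpha * len(mixture)))
    scores = sorted((predict(w, b, x) for x in mixture), reverse=True)
    loss = -sum(math.log(1.0 - predict(w, b, x)) for x in normal)
    loss -= sum(math.log(p) for p in scores[:k])
    loss -= sum(math.log(1.0 - p) for p in scores[k:])
    return loss
```

The proportion passed as `alpha` decides how many mixture samples keep the attack label, so a poor estimate directly changes the training labels.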
By iteratively performing this prediction and reduction of the attack data, the iterative classifier repeatedly improves its performance, increasingly retaining the traffic during an attack that is most likely to be truly malicious.

Furthermore, we design another classification model that trains on the entire training data of each attack using actual ground-truth information. In this setting, we suppose that all normal and attack data is available during model training and that the labels accurately distinguish attack from normal traffic. This full classifier, which uses actual ground-truth labels, effectively provides a soft upper bound on performance. By taking a classification approach that leverages label information, we can (1) study how this extra information boosts our approach if labels are available, and (2) construct powerful comparison methods.

The N-over-D nets (Figure 2) take a coordinated predictive approach that leverages sequential modeling and online learning to discriminate between normal and DDoS traffic. Our approach is inspired by the classic statistical DDoS mitigation work [33] that assigns each network request a score based on the probabilities that it is legitimate and that it is malicious, which captures the differences in the score distributions of attack and legitimate requests. N-over-D consists of two "partners": a normal-day model $N$ and a model $D$ for the DDoS duration. The goal of $N$ is to capture the distribution over requests during normal network traffic conditions and predict the likelihood that the traffic is legitimate. At the same time, $D$ estimates the probability that each network unit of data is a malicious DDoS attack. Given incoming traffic, $N$ and $D$ evaluate the traffic for normality and abnormality, respectively, and make a joint decision $N/D$ on the legitimacy of the traffic. We provide details of the input features, model architectures, and design choices below (Sections 6.1 to 6.3).
The data processing procedure for online learning differs widely from that for a static dataset of requests. Hence, we prepare the data in a streaming form: sequences of sub-requests that are grouped by source and destination IP addresses. Let $t_0$ be the starting interval of the traffic and $t_i$ the following ℓ-min intervals, where $t_1$ is the first interval, $t_2$ the second, and so forth. By extracting the intervals of network traffic into arrays, we prepare the data for sequential online learning, which allows the LSTM to learn the distributions of both normal and DDoS attack traffic. In the rest of this section, we describe each stage of the preparation process in detail (see the Appendix for the data preprocessing algorithm).

During each interval $t_i$, the preparation procedure captures the network traffic collected over the allocated amount of time ℓ. The request sequences are then constructed from the sub-requests gathered during the interval. We give an example of a sequence of sub-requests in Table 1. The example sequence consists of sub-requests with both dynamic and static attributes. In our case, we chose eight dynamic attributes that capture the information that varies from request to request and two static attributes that are common to the requests of a sequence. Sequences longer than $T$ are split to form a new sequence, while sequences shorter than $T$ are zero-padded at the end.

The model $N$ for normal-day traffic is made up of two components: (1) a probabilistic sequential learning model that learns to predict sequences of temporal features, and (2) a histogram-based approximation of the frequency distribution that models the static features extracted from the network traffic. We model $N$ using a parameterized LSTM network and a Histogram method.

LSTM. At each interval $t_i$, the LSTM takes as input the sequences of requests $x = (x_1, x_2, \ldots, x_T)$ that hold the temporal features.
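The grouping, splitting, and zero-padding steps of the data preparation described above might look like the following sketch. The record layout `(src_ip, dst_ip, features)` and the function names are hypothetical; the paper's own preprocessing algorithm is in its Appendix and is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Features = List[float]                    # one numeric vector per sub-request
Record = Tuple[str, str, Features]        # hypothetical (src IP, dst IP, features)


def group_by_flow(records: Iterable[Record]) -> Dict[Tuple[str, str], List[Features]]:
    """Group sub-requests by (source IP, destination IP), keeping arrival order."""
    flows: Dict[Tuple[str, str], List[Features]] = defaultdict(list)
    for src, dst, features in records:
        flows[(src, dst)].append(features)
    return dict(flows)


def split_and_pad(seq: List[Features], max_len: int, n_features: int) -> List[List[Features]]:
    """Cut a sequence into chunks of max_len and zero-pad the last chunk."""
    chunks = [seq[i:i + max_len] for i in range(0, len(seq), max_len)]
    return [
        chunk + [[0.0] * n_features for _ in range(max_len - len(chunk))]
        for chunk in chunks
    ]


if __name__ == "__main__":
    seq = [[float(i), 0.5] for i in range(5)]
    for chunk in split_and_pad(seq, max_len=3, n_features=2):
        print(chunk)
```

Padding at the end with all-zero vectors keeps every chunk the same shape, so the chunks of one interval can be stacked into a single fixed-size array.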
We set $T = 200$, balancing the efficiency of the model against the memory required to learn the data. The LSTM learns from the inputs at each step to output the predictions $(\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T)$; the training process is illustrated in Figure 3, and the hyperparameters are detailed below.

Input layer: Recall from Section 6.1 that we preprocess the requests into intervals of sequences, simulating a real-time data streaming process. Creating an environment suitable for sequential online learning enables the LSTM to leverage its modeling capabilities to learn the correlation between traffic requests in the same sequence. This layer takes as input the data sequences of each interval $t_i$, represented as a 3D tensor in which each sequence of length $T$ is a matrix whose rows are the samples (sub-requests) and whose columns are the features.

Embedding layer: The inputs are then passed to the embedding layer, which creates a dense matrix that is learned through backpropagation during training. This layer reduces all features into a dense vector representation of dimension 16. The embedding, a dense matrix in a linear space, serves two important functions. First, by using an embedding with a much smaller dimension than the feature size, it reduces the dimension of the feature representations to the embedding size and reduces the model complexity. Second, the input feature embedding groups semantically similar features in a Euclidean space and allows for better feature representations.

LSTM layer: The outputs from the embedding layer then form the compact inputs to the LSTM layer, which is a powerful machine learning architecture for learning from sequence data. LSTMs are designed to predict the presented observations at each timestep sequentially. Taking a sequence $\{x_1, x_2, \ldots, x_T\}$ as input, the LSTM constructs a corresponding sequence of hidden state representations $\{h_1, h_2, \ldots, h_T\}$. A single-layer LSTM uses the hidden representations $\{h_1, h_2, \ldots, h_T\}$ for estimation and prediction. In our case, we use two LSTM layers, where the second layer uses the hidden states of the previous layer as inputs. Every hidden state in each layer performs memory-based learning to remember relevant features using previous inputs. Previous hidden states and current inputs are transformed into the new hidden state by

$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b),$$

where $W_h$, $W_x$, and $b$ are parameters of the layer and $\tanh(\cdot)$ represents the standard hyperbolic tangent function. The LSTM is made up of cells that are specifically designed for series data to trace the history from previous network requests. In order to overcome both the vanishing gradient and long-term dependency issues, each LSTM cell retains an internal cell state $c_t$ and a hidden state $h_t$ that is the output. Both $c_t$ and $h_t$ are computed via three gate functions to retain both long-term and short-term storage of information. The forget gate $f_t$, input gate $i_t$, and output gate $o_t$ control the flow of information. In this work, we use the LSTM without peephole connections to handle complex sequences with long-range structure. We design each cell to consist of 32 units, the dimensionality of the output space, which is passed to the prediction layer (see the Appendix for details).

Prediction layer: Following the process in Figure 3, the LSTM takes in $x_t$ at each step $t$ and makes a prediction $\hat{y}_t$ of the request at the next step $t + 1$. We set the output function to be the softmax function:

$$\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}},$$

where $K$ is the total number of possible outputs. The softmax function converts a real vector to a vector of categorical probabilities, where the elements of the output vector are in the range (0, 1) and sum to one. We can interpret the results as a probability distribution and take the maximum as the prediction $P(x_{t+1} \mid x_0, \ldots, x_t)$ of the next step in the sequence, where $x_0$ is the start-of-sequence symbol. We start all sequences with this symbol, and the probability of observing it at step 0 is one.
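For reference, the softmax output step described above can be written in a few lines. This is a generic, numerically stable formulation (subtracting the maximum logit before exponentiating), not code from the paper; the function names are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Map real-valued scores to categorical probabilities that sum to one."""
    m = max(logits)                       # shift by the max to avoid overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_next(logits: Sequence[float]) -> int:
    """Index of the most probable next request class."""
    probs = softmax(logits)
    return max(range(len(probs)), key=lambda j: probs[j])
```

Shifting by the maximum leaves the result unchanged mathematically, since the common factor cancels between numerator and denominator.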
Hence, by assuming that the state at each time step depends on the previous time steps, we can use standard conditional probability theory (the chain rule) to determine the joint probability of each sequence of requests by calculating

$$P(x) = \prod_{t=0}^{T-1} P(x_{t+1} \mid x_0, \ldots, x_t)$$

and assigning this computed probability as the LSTM prediction for the sequence $x$.

When training the LSTM network, our objective is to minimize the difference between the model prediction and the observed request at each timestep $t$. The loss function we minimize is defined as

$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{c=1}^{K} y_{t,c} \log \hat{y}_{t,c},$$

where $K$ is the total number of possible next requests, $y_{t,c}$ is the indicator of whether class $c$ is the ground-truth label for the observation at the $t$-th step of the sequence, and $\hat{y}_{t,c}$ is the LSTM-predicted probability of class $c$ at step $t$. As the number of possibilities $K$ can be exceedingly large, it could cause an explosive increase in the computational complexity of the model, thereby drastically increasing the time required to train it. A large $K$ also presents a potential model robustness issue on large datasets. Moreover, there could be other considerations, such as memory limitations and GPU computational constraints. Hence, we employ a feature hashing trick [28] to reduce $K$ and address the issues caused by a large number of request possibilities.

Histogram. For each sequence, identified by a unique pair of source and destination IP addresses, the static features are the inputs to the histogram model. The Extra Info and List of Protocols attributes are the responses of the server after receiving requests and the list of network protocols used. They generally remain constant throughout the entire sequence. Since the two features are observed to be static, we assume independence between them and the dynamic features. A histogram model is used to capture the approximate frequency distribution of the static attributes.
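The chain-rule computation and the hashing step described above can be sketched as follows. Working in log space is a common way to avoid underflow when multiplying many per-step probabilities; the SHA-256-based bucket function and the names below are illustrative assumptions, not the specific hashing scheme of [28] or of the paper.

```python
import hashlib
import math
from typing import Sequence


def hash_bucket(token: str, n_buckets: int) -> int:
    """Map a request token to one of n_buckets classes (deterministic hash)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def sequence_log_prob(step_distributions: Sequence[Sequence[float]],
                      observed: Sequence[int]) -> float:
    """log P(x) = sum over steps of log P(x_{t+1} | x_0, ..., x_t).

    step_distributions[t] is the predicted distribution at step t and
    observed[t] is the class index actually seen next. Probabilities are
    assumed strictly positive, as a softmax output would be.
    """
    return sum(math.log(dist[obs]) for dist, obs in zip(step_distributions, observed))


if __name__ == "__main__":
    dists = [[0.5, 0.5], [0.1, 0.9]]
    print(math.exp(sequence_log_prob(dists, [0, 1])))
    print(hash_bucket("GET /index.html", 64))
```

Hashing bounds the number of output classes at the cost of occasional collisions, where distinct requests share a bucket.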
We count the number of occurrences of each specific (Extra Info, List of Protocols) pair and calculate the histogram probability prediction of each pair:

$$P(s) = \frac{\mathrm{count}(s)}{n},$$

where $s$ is a static feature pair, $\mathrm{count}(\cdot)$ is a function that counts the number of observations of a particular pair, and $n$ is the total number of static feature pairs.

Model prediction. To predict how likely it is that a sequence belongs to the normal traffic, we incorporate both the LSTM prediction and the Histogram prediction. By taking the product of the two predictions, one on the dynamic features and one on the static features, we obtain $N(x)$, the probability of a sequence of network requests, which serves as the numerator of the N-over-D framework.

The model $D$, similar to the normal-day model $N$, also consists of a parameterized LSTM and a Histogram model that learn to predict the sequences of dynamic features and the static features of the traffic, respectively. The goal of $D$ is to learn a meaningful generalization of the mixture traffic during DDoS in order to predict the probability that each network request is malicious. It is designed to perform fast online learning, updating the models on the evolving attack traffic every minute. Its architecture differs from that of $N$ in one significant way, which supports the online learning design: we initialize the embedding layer with the embedding previously trained on normal traffic. Since the embedding is a compact representation of the input network requests, we considerably reduce the training of $D$ by transferring the learned embedding layer rather than initializing it randomly. This significantly reduces the number of model parameters to train, which decreases the computational cost and achieves fast learning and inference during attacks. In addition, we employ a similar Histogram model on the static features of network requests during the DDoS attacks.

Model prediction. To predict how likely it is that a sequence belongs to the malicious traffic, we incorporate both the LSTM prediction and the Histogram prediction.
By taking the product of the two predictions using dynamic features and static features:

$$D(x) = D_{\mathrm{LSTM}}(x) \cdot D_{\mathrm{Hist}}(x),$$

we obtain the probability of a sequence of network requests during DDoS attacks, which serves as the denominator of the N-over-D framework.

N-over-D works by first learning to model the normal traffic data distribution and predicting the likelihood $N(x)$ that the presented network request sequences are legitimate. Next, during DDoS attacks, the online learner $D$ is trained for accelerated prediction $D(x)$ of the incoming DDoS traffic to estimate the probability that each sequence is malicious. Finally, we adopt an integrated approach to decide the normality of a sequence by taking the division:

$$\frac{N(x)}{D(x)}.$$

Since a high $N(x)$ indicates high legitimacy and a low $D(x)$ suggests a small chance of maliciousness, the higher the ratio $N(x)/D(x)$, the more likely the network traffic is normal. Hence, we differentiate between normal and attack traffic by ranking the ratios of each sequence in the DDoS traffic of each 1-min interval, and discarding request sequences with relatively low ratios.

In this section, we present the evaluation of our approaches performed on various DDoS attacks. We demonstrate the effectiveness of the models and compare them with state-of-the-art methods. To validate the N-over-D and Iterative Classifier, we use three separate volumetric DDoS attacks from two publicly available network datasets, (1) the CICIDS2017 [25] dataset, which contains generated traffic resembling real-world attacks, and (2) the CAIDA07 dataset, which contains real-world attack traffic. The attacks last for 17 and 20 mins, respectively. In order to obtain normal traffic to learn a meaningful generalization of legitimate users, we retrieve traffic data prior to the attacks. For attack (1), we use normal network traffic on Wednesday (9:00 a.m. - 9:47 a.m.), and for attack (2), we take Friday (3:00 p.m. - 3:56 p.m.) data to train the model $N$.
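To make the N-over-D decision rule described above concrete, the sketch below scores each sequence in one interval by the ratio of the normal-day prediction to the mixture prediction and discards the lowest-ranked fraction. It is an illustration under assumptions rather than the authors' code: the function names and example probabilities are hypothetical, and working with log-probabilities (so the division becomes a subtraction) is a numerical-stability choice made here, not something the paper states.

```python
import math


def n_over_d_log_scores(log_n, log_d):
    """Return log N(x) - log D(x) for each sequence (higher = more normal)."""
    return [ln - ld for ln, ld in zip(log_n, log_d)]


def filter_interval(sequence_ids, scores, rejection_ratio):
    """Rank one interval's sequences by score and reject the lowest fraction."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_reject = int(rejection_ratio * len(scores))
    rejected_idx = set(order[:n_reject])
    accepted = [s for i, s in enumerate(sequence_ids) if i not in rejected_idx]
    rejected = [s for i, s in enumerate(sequence_ids) if i in rejected_idx]
    return accepted, rejected


# Hypothetical per-sequence probabilities for one 1-min interval.
ids = ["a", "b", "c", "d"]
log_n = [math.log(p) for p in (0.2, 0.001, 0.1, 0.002)]  # normal-day model
log_d = [math.log(p) for p in (0.05, 0.3, 0.04, 0.25)]   # online mixture model

scores = n_over_d_log_scores(log_n, log_d)
accepted, rejected = filter_interval(ids, scores, rejection_ratio=0.5)
print(accepted, rejected)  # ['a', 'c'] ['b', 'd']
```

Because the rule only ranks sequences within an interval, it needs no absolute threshold on the ratio; the rejection ratio alone controls how much traffic is dropped.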
In each attack, after training on the respective normal traffic, we perform online learning and detection using $D$ when an attack occurs, demonstrating progressive improvement of the detection performance. We leave the last five intervals as the test set, to evaluate our models and compare with state-of-the-art methods.

Figure 4: The x-axis is the rejection ratio (0-1), and the y-axis is the corresponding false rejection rate. If the rejection ratio is at 0.8, it means that 80% of the requests are rejected, and the graph shows the rate of false positives. The CAIDA07 test data (Table 2) shows 80% attack traffic.

We present a detailed evaluation of the proposed approaches on both the CICIDS2017 and CAIDA07 datasets. Setting the rejection rate at 50% and the online interval ℓ at 1 min, we measure the performance of the iterative classifier (ℓ = 1) and N-over-D (ℓ = 1). We also set the interval (ℓ = ∞) as all the data during the attack for both approaches, and present the Accuracy (ACC), False Positive Rate (FPR), and related metrics. Table 3 illustrates these results. As demonstrated, both N-over-D and the iterative classifier achieve low error rates and high accuracy scores across various attacks. These results validate the robust design of our approach. In Table 3, the results demonstrate that the iterative classifier, which directly optimizes the loss function (Equation 2), achieves better performance than N-over-D, which estimates conditional probabilities of the normal-day and mixture traffic distributions. The iterative classifier is also more robust in FP rates across various rejection ratios (0-1), as demonstrated in Figure 4. However, its empirical runtime (Table 4) is roughly three to five times slower than that of N-over-D. Moreover, the iterative classifier is sensitive to the estimated proportion of attack in the loss function, whereas N-over-D is not.
In our experiments, we set this estimated attack proportion to 0.6 as a standard, but in other attacks where the attack proportion could be much higher, setting an appropriate value is crucial to the detection performance. In addition, we compare our approaches, N-over-D and Iterative Classifier, with state-of-the-art DDoS detection methods on the CICIDS2017 and CAIDA07 datasets. We note that the other studied methods leverage the labels of the data for their classification models. However, as mentioned in the problem formulation, labeled data is a commodity that is often unavailable at the time of an attack.

CICIDS2017. We compare our approach with three online and four offline methods. The three online approaches include a CNN classifier, LUCID [5], that uses labels; an unsupervised method, Smart Detection [17], that employs simple decision tree learning algorithms; and E-KOAD [4]. A network flow graph-based detector, DeepGFL [31], and four deep-learning classifiers introduced by Cyber Security in IoT Networks [24] are offline methods. The four deep learning models are the MLP, 1D-CNN, LSTM, and 1D-CNN+LSTM models. We note that the closest comparison, E-KOAD, presented its performance on the attack in the CICIDS2017 dataset on which it achieved the best results. However, its smallest reported interval, ℓ, is 2 mins, while we used 1-min intervals. Hence, we report the results for ℓ = 2 min too. We present the evaluation results of all the models and contrast them with our proposed models (Table 5). The four learning models [24] are constructed by combining LSTM, CNN, and fully connected layers. Out of these architectures, designed for DDoS detection in Internet of Things (IoT) networks, the 1D-CNN+LSTM model performs the best. While it produces good classification results, the other models seem to suffer from low true-positive rates. As for DeepGFL [31], it uses a graph representation of low-order features to perform classification.
The results exhibit its weakness in identifying true positives, leading to low recall and F1 scores. Even though LUCID [5], a lightweight deep learning DDoS detection system, presents superior classification results, the model applies a CNN to network traffic flows to detect attacks and relies heavily on the accuracy of each traffic flow's labels to achieve such performance. On the other hand, N-over-D and Iterative Classifier achieve some of the highest scores in every evaluation metric while retaining low false-positive rates. This demonstrates competitive performance without utilizing the label information of the data.

Table 5: Performance comparison with existing methods on the CICIDS2017 dataset. We note that while N-over-D and Iterative Classifier do not use label information for identifying attacks, almost every other compared method leverages these labels to achieve the reported results.

CAIDA07. The compared methods include a self-organizing map detection scheme, SOM [1], a support vector machine classifier, SVM [23], a hybrid model, SVM-SOM [21], and an enhanced history-based IP filtering method, eHIPF [22]. We present the evaluation results of all the models and contrast them with the N-over-D performance (Table 6).

Table 6: Performance comparison with existing methods on the CAIDA07 real-world dataset. We note that while N-over-D does not use label information for identifying attacks, the compared methods rely on these labels to achieve such results.

Methods [1, 21, 23] that adopt the SOM and SVM models produced relatively lower evaluation results, with false-positive rates of around 4%-8% and true-positive rates of 92%-98%. As these are basic machine learning architectures, the models might not capture the high complexity of the network traffic as well as more complex neural architectures. Even though eHIPF [22] yields outstanding results, it is a system that invokes an intricate process that requires data labels.
The performance of N-over-D, without the need for label information, proves to be very close to that of eHIPF.

While we show in the above results (Table 3) that the N-over-D and Iterative Classifier are effective against a range of real-world and synthetic DDoS attacks, the contrast between traffic distributions during normal and DDoS periods that underlies this effect has not yet been illustrated. We study the performance under various online parameters ℓ, analyze these distributions from an empirical perspective, and discuss the pros and cons of the proposed approaches. First, we present the performance (Table 7) of the proposed online approaches based on various intervals ℓ. It clearly shows that the iterative classifier outperforms N-over-D. As the interval ℓ increases from 1 to 10 min, the iterative classifier's detection ability improves significantly. There is a notable increase in performance from ℓ = 1 to ℓ = 2 min. However, increasing the interval ℓ did not affect N-over-D much, and its performance is not sensitive to this online-learning parameter. Next, in Figure 5, we examine the differences in the normalized sequence score values of LOIC attack traffic. There is a clear distinction between the distribution of the values predicted by $N$ and the values predicted by $D$ for the same traffic. It can be seen that $N$ assigns lower values to the attack traffic, as it predicts the likelihood of the traffic being normal. On the other hand, both the online and offline $D$ assign higher scores to the same attack traffic, as they predict a high likelihood for these sequences to be attacks. Hence, by taking the $N/D$ scores, we obtain a more accurate measure of the probability that a sequence of traffic is normal, with a higher score indicating legitimacy. Our N-over-D approach provides effective protection with practical considerations in mind, but it is not without limitations.
Due to the use of LSTM recurrent neural networks and sequences, it requires some time to collect the traffic data and form it into sequences before we can begin the learning and mitigation process. This could potentially cause an undesired delay, which could be circumvented by using another type of learning model instead of the LSTM in this framework. Also, since accurately estimating the difference between the attack and normal distributions is fundamental in retrieving the scores, our approach is most effective on volumetric DDoS attacks. It is not designed for low-rate attacks, and we have yet to study the performance of N-over-D on these types of attacks.

In this section, we examine some of the most relevant works that take statistical and machine learning approaches to DDoS defenses. There has been a large body of work on statistical approaches to DDoS mitigation since the early 2000s [38]. Common methods generally involve some statistical measure of network traffic properties, such as entropy scoring of network packets [6, 11, 27, 34], IP filtering [19, 26], and IP source traceback [13, 35]. These statistical and packet-based defense mechanisms are widely popular in defending against DDoS attacks. Statistical methods leverage variations in the distributions of traffic attributes to discriminate between traffic during normal usage and attack periods, and thereby identify DDoS attacks. Entropy-based detection approaches generally score packets based on some statistical metrics. One such work [34] proposes a semi-Markov model for normal network browsing behavior. Through simulating the browsing dynamics of legitimate web browsers, the authors observed a critical condition for successful attacks: the number of active bots of the botnet must not be lower than the number of active legitimate users. Based on the findings, they defined a new fine correntropy metric to detect DDoS attacks.
Another proposed method [6] combines an entropy-based and a packet-score-based method for first characterizing incoming packets before filtering the malicious packets. Results show that the combined scoring method can differentiate between DDoS attacks and normal traffic. Similarly, a statistical solution [11] proposes a joint entropy-based security scheme against DDoS attacks. It generates baseline profiles, determines parameters when attacks are detected, and mitigates the DDoS attacks. IP filtering, based on IP source address filtering, is another practical way to defend against DDoS attacks. Unlike methods that are based on monitoring the traffic volume, the IP filtering scheme [26] uses a history of legitimate IP addresses to decide if incoming IP packets are anomalous. The authors present several heuristic methods to make the IP address database accurate and robust to increase the effectiveness of the scheme. Peng et al. [19] use a sequential nonparametric change point detection method to monitor the increase of new IP addresses, and demonstrate its effectiveness for DDoS attacks. IP source traceback methods tackle the problem of identifying the DDoS attack sources. As monitoring the traffic attributes and IP filtering may provide only limited information to prevent DDoS attacks, traceback methods [13, 35] are another approach for effective defense against these attacks. A common obstacle to these statistical techniques is the selection of appropriate model parameters. Given that DDoS attacks vary from one network to the next, it is especially challenging to identify the most suitable parameters that minimize false detection rates. Himura et al. [9] presented an automated parameter tuning of a statistical network traffic anomaly detection method, which determines a learning period for setting a parameter of the method.
However, many of the statistical solutions may not be appropriate in an online setting, as some traffic statistics cannot be computed during data collection periods. To overcome this limitation, a DDoS defense architecture [33] was proposed to support distributed detection and automated online attack characterization, which assigns a score to each packet that estimates the legitimacy of the packet given its attributes.

Another popular approach to DDoS defense is based on Machine Learning (ML) anomaly detection techniques. Learning algorithms such as the Support Vector Machine (SVM), Self-Organizing Map (SOM), Random Forest, K-Nearest Neighbor (KNN), Decision Tree, Gaussian EM, and Neural Network have achieved various degrees of success in network intrusion detection [7, 16]. Fundamental methods, SOM [1] and SVM [21, 22, 23], were first proposed to provide artificial intelligence in DDoS mitigation. Braga et al. [1] presented a SOM detection scheme with 4 and 6 tuples of attributes as a lightweight detection method based on traffic flow features. Another work [23] applied the SVM classifier for detecting attacks. Some others [21, 22] took a hybrid approach that combined the SVM and SOM to enhance the accuracy in differentiating normal flows from abnormal flows. A proposed method [16] employs a feedforward neural network with one hidden layer to detect abnormal traffic, and it achieves high accuracy and low false detection. In another work [7], the authors evaluate nine ML methods on the source side data in the cloud for DDoS attack detection. The features (e.g., DNS packages ratio, Diffie-Hellman key exchange packages, ICMP package rate) are first extracted from the traffic data. The extracted data is then fed into the ML algorithms to distinguish between benign and malicious network traffic. Even though these ML methods are promising, the reported results depend heavily on the selection of features and the evaluated datasets [10].
Hence, to improve the generalization of traffic data and reduce the significance of feature engineering, deep learning models have been increasingly used in recent years.

The recent advancement of machine learning has given rise to increasing exploration of deep learning (DL) models for DDoS detection. We can broadly categorize these DL models into two groups: those that adopt an offline learning approach [3, 14, 15, 24, 32, 37], where all attack data is observed and available for training the models, and online learning models [4, 5, 7, 17] that receive data incrementally as more attack traffic is observed over time. In one offline method [14], CNN models with different model depths were used to investigate the relationship between performance and the number of CNN layers. The authors observe that deeper structures do not improve performance, and that the LSTM model, a variant of the RNN, performs better. RNNs model network traffic through time in a sequential manner, retaining information about historical patterns instead of treating the inputs as independent packets. RNN-based intrusion detection systems [3, 15, 32] demonstrate that sequential learning models have superior modeling capabilities, and that they can detect some sophisticated attacks in intrusion detection tasks. In addition, a few deep learning methods [24, 37] that experiment with hybrid RNN and CNN models have slightly increased DDoS detection performance. However, none of these methods is designed for online systems, since the entire training data must be preprocessed before the models start training. They are not suitable for incremental updates when new data is observed. While there are many offline learning-based DDoS detection methods, few online approaches have been proposed.
One online method [5], which uses ground-truth labels for training, extracts traffic flow attributes (e.g., Packet Length, TCP Length, Window Size) as input for a deep CNN model. Only a few methods do not use data labels for training. He et al. [7] evaluated nine basic ML methods with an added online learning module on the source side data in the cloud for DDoS attack detection. In another proposed online approach [17], basic learning algorithms and simple decision tree methods were employed. Last but not least, Çakmakçı et al. [4] employ a kernel-based learning algorithm in addition to the Mahalanobis distance and a chi-square test. It was shown to be competitive with offline DDoS classification algorithms. However, these existing online approaches are constrained by their limited modeling capabilities, a limitation that deep learning methods are capable of overcoming.

This work presents a principled formulation of a machine learning optimization problem that optimizes the detection of attacks in DDoS situations, and it proposes two online learning approaches. First, we introduce N-over-D, an LSTM-based training algorithm that mitigates DDoS attacks by contrasting estimated normal and attack network traffic conditional probability distributions and ranking the unidentified traffic. However, this approach requires an exact likelihood calculation, and there is no joint training of the $N$ and $D$ models. Second, we propose an enhanced iterative two-class classifier and design a specific loss function more suited for deep learning that solves the presented optimization problem. Through extensive evaluation, we demonstrate the effectiveness of N-over-D and the iterative classifier against a range of DDoS attacks, actual and synthetically generated, offering a practical solution to the ever-evolving DDoS attacks.
We recognize the lack of ground truth labels that identify traffic during the attacks, which contains a mixture of legitimate and malicious traffic, raising serious concern about the effectiveness of existing machine learning detection methods that rely on these labels for the reported performance. Furthermore, we analyze the strengths and weaknesses of both approaches, which illustrate the practical considerations of designing a more functional and robust online DDoS mitigation system.

We provide the LSTM architecture operations, describe the network traffic data preprocessing procedure, and report the default parameter settings for the experiments implemented in this paper.

We use the LSTM architecture without peep-hole connections. The operations are computed as follows:

$$\begin{aligned}
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f), \\
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i), \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o), \\
\tilde{c}_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}$$

where $x_t$ is the input at time $t$, $h_t$ and $c_t$ are the hidden and cell states, $\sigma(\cdot)$ is the sigmoid function, $\tanh(\cdot)$ is the hyperbolic tangent function, and $\odot$ denotes the element-wise product. Each gate function has its own weight matrices and a bias vector. We denote the parameters with subscripts $f$ for the forget gate function, $i$ for the input gate function, and $o$ for the output gate function, respectively (e.g., $W_{xf}$, $W_{hf}$, and $b_f$ are parameters of the forget gate function).

Using the data processing procedure below, we prepare network traffic data into a streaming form, grouped by source and destination IP addresses, for online learning.

For the numerator model $N$, we split the training data into an 80/20 training and validation ratio. As for the online model $D$, every batch of data in a time interval is divided into a 90/10 training and validation ratio to maximize the training data. Table 8 lists the parameter settings of the numerator $N$.
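The LSTM operations without peep-hole connections described above can be illustrated with a single time step. The sketch below follows the standard LSTM formulation in plain Python; the parameter layout (one `(W_x, W_h, b)` triple per gate), the key `"g"` for the candidate cell update, and the toy dimensions are assumptions of this example, not the settings used in the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gate(w_x, w_h, b, x_t, h_prev, activation):
    """Compute activation(W_x x_t + W_h h_{t-1} + b) element-wise."""
    zx, zh = matvec(w_x, x_t), matvec(w_h, h_prev)
    return [activation(a + c + d) for a, c, d in zip(zx, zh, b)]


def lstm_step(params, x_t, h_prev, c_prev):
    """One LSTM step without peep-hole connections."""
    f = gate(*params["f"], x_t, h_prev, sigmoid)    # forget gate
    i = gate(*params["i"], x_t, h_prev, sigmoid)    # input gate
    o = gate(*params["o"], x_t, h_prev, sigmoid)    # output gate
    g = gate(*params["g"], x_t, h_prev, math.tanh)  # candidate cell state
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t


# Toy example: 3 input features, 2 hidden units, all-zero parameters.
n_in, n_hid = 3, 2
zero_gate = (
    [[0.0] * n_in for _ in range(n_hid)],   # W_x
    [[0.0] * n_hid for _ in range(n_hid)],  # W_h
    [0.0] * n_hid,                          # b
)
params = {name: zero_gate for name in ("f", "i", "o", "g")}
h_t, c_t = lstm_step(params, [1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0])
print(c_t)  # [0.5, -1.0]
```

With all-zero parameters every sigmoid gate outputs 0.5 and the candidate update is 0, so the cell state is simply halved, which makes the example easy to check by hand.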
[1] Lightweight DDoS flooding attack detection using NOX/OpenFlow
[2] Distributed denial of service: Trin00, Tribe Flood Network, Tribe Flood Network 2000, and Stacheldraht. CIAC-2319
[3] Comparative Study of CNN and RNN for Deep Learning Based Intrusion Detection System
[4] Online DDoS attack detection using Mahalanobis distance and Kernel-based learning algorithm
[5] LUCID: A Practical, Lightweight Deep Learning Solution for DDoS Attack Detection
[6] Entropy-score: A method to detect DDoS attack and flash crowd
[7] Machine Learning Based DDoS Attack Detection from Source Side in Cloud
[8] The CAIDA DDoS attack
[9] An Automatic and Dynamic Parameter Tuning of a Statistics-Based Anomaly Detection Algorithm
[10] Critical review of machine learning approaches to apply big data analytics in DDoS forensics
[11] JESS: Joint Entropy-Based DDoS Defense Scheme in SDN
[12] DDoS attacks in Q1 2020
[13] An Evaluation of Different IP Traceback Approaches
[14] An Empirical Study on Network Anomaly Detection Using Convolutional Neural Networks
[15] Detection and defense of DDoS attack-based on deep learning in OpenFlow-based SDN
[16] DDoS attack detection based on neural network
[17] Smart Detection: An Online Approach for DoS/DDoS Attack Detection Using Machine Learning
[18] Botnet: classification, attacks, detection, tracing, and preventive measures
[19] Detecting Distributed Denial of Service Attacks Using Source IP Address Monitoring
[20] Survey of network-based defense mechanisms countering the DoS and DDoS problems
[21] A Novel Hybrid Flow-Based Handler with DDoS Attacks in Software-Defined Networking
[22] Efficient Distributed Denial-of-Service Attack Defense in SDN-Based Cloud
[23] OpenFlowSIA: An optimized protection scheme for software-defined networks from flooding attacks
[24] Deep Learning Models for Cyber Security in IoT Networks
[25] Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization
[26] Protection from distributed denial of service attacks using history-based IP filtering
[27] An Entropy-Based Distributed DDoS Detection Mechanism in Software-Defined Networking
[28] Feature Hashing for Large Scale Multitask Learning
[29] Symantec Corporation
[30] Symantec Corporation
[31] DeepGFL: Deep feature learning via graph for attack detection on flow-based network traffic
[32] A Deep Learning Approach for Intrusion Detection Using Recurrent Neural Networks
[33] Packetscore: statistics-based overload control against distributed denial-of-service attacks
[34] Fool Me If You Can: Mimicking Attacks and Anti-Attacks in Cyberspace
[35] Traceback of DDoS Attacks Using Entropy Variations
[36] Discriminating DDoS Attacks from Flash Crowds Using Flow Correlation Coefficient
[37] DeepDefense: Identifying DDoS Attack via Deep Learning
[38] A Survey of Defense Mechanisms Against Distributed Denial of Service (DDoS) Flooding Attacks