Two Class Pruned Log Message Anomaly Detection
Amir Farzad and T. Aaron Gulliver
SN Computer Science, 2021-07-24. DOI: 10.1007/s42979-021-00772-9

Log messages are widely used in cloud servers and other systems. Millions of logs are generated each day, which makes them important for anomaly detection. However, they are complex unstructured text messages, which makes this task difficult. In this paper, a hybrid log message anomaly detection technique is proposed which employs pruning of positive and negative logs. Reliable positive log messages are first selected using a Gaussian mixture model algorithm. Then reliable negative logs are selected using the K-means, Gaussian mixture model and Dirichlet process Gaussian mixture model methods iteratively. It is shown that the precision for positive and negative logs with pruning is high. Anomaly detection is done using a deep learning long short-term memory network. The proposed model is evaluated using the well-known BGL, Openstack, and Thunderbird data sets. The results obtained indicate that the proposed model performs better than several well-known algorithms.

Companies and customers expect 24/7 connectivity to their cloud and software systems, and loss of access can have serious consequences. Thus, significant investments have been made to preserve the quality and availability of these services. This is achieved by generating log messages which indicate the status of the system. Logging is the process of storing records for audit or security [46]. Log messages are unstructured text data that consist of time stamps, verbosity, and raw content concerning the system status. Logs are unstructured because developers typically use free text to record events for ease and flexibility [46]. Thus, the structure of these logs can vary considerably, making it hard to identify abnormalities [42]. Log messages are used for several purposes including anomaly detection [17] and performance monitoring [43]. Most techniques employ rules to identify anomalies in logs, but this requires specialized knowledge of the area [41]. Some only consider one feature such as verbosity, which limits the ability to detect abnormalities. Anomaly detection can be carried out manually, but for large systems this is not practical due to the amount of data and the complexity [25]. As a result, automated log analysis methods are required to identify anomalies.

Deep learning (DL) is a subclass of machine learning (ML) which employs a network with several layers. DL can identify similarities in data [14], which makes it desirable for big data applications. DL has been shown to provide excellent results for speech recognition, image processing and text classification [22, 44]. DL methods have good recognition capabilities for large amounts of data and are better than other ML methods for feature representation [5]. ML methods can be discriminative, generative, or hybrid. Discriminative methods are typically used for supervised classification while generative methods are employed for unsupervised classification. Hybrid methods combine generative and discriminative methods. One of the main issues in anomaly detection is handling unlabeled data. Millions of log messages are produced daily in cloud and other systems, so it is typically not possible to label even a small portion of the data.
Thus, unsupervised approaches should be considered to deal with this unlabeled data. ML algorithms have been used to develop a range of anomaly detection methods. Elliptical envelope (EEnvelope) creates an elliptical area around the data mass center and has been used in redshift estimation to detect anomalies [20]. Local outlier factor (LOF) uses local data deviation and has been used to detect network flow anomalies to reduce the risk of internet attacks [31]. Support vector machine (SVM) and one-class support vector machine (OC-SVM) have been employed to detect unknown computer activity [30] and anomalies in networks [28]. A convolutional neural network (CNN) combined with a graph convolutional network was used for abnormal breast detection in mammograms [45] and COVID-19 detection in CT images [38]. A decision tree model was considered in [32] to detect faults using log messages. An improved supervised K-nearest neighbors (IKNN) method was employed in [36] to detect anomalies in log messages. However, using supervised methods is not always possible because of the lack of labeled data. A method to detect anomalies in log messages using isolation forest with two autoencoder networks for feature extraction was presented in [11]. However, a detection threshold is required for each data set, which is difficult to determine in practice.

Deeplog [7] uses a long short-term memory (LSTM) network [19] for anomaly detection. First, each log is mapped to its print statement (in the source code) using a log parser. Thus, each log is represented by a number (feature) and a session of logs is parsed to a sequence of numbers. Then, an LSTM network is trained as a multi-class classifier using a window with normal sequences (corresponding to normal system operation) [7]. The trained LSTM network is then used to predict the probabilities of numbers occurring at a given time step. If the actual number is unlikely to occur based on the LSTM prediction, then it is considered to be an anomaly. While this approach may be effective, it can be difficult to obtain labeled logs for normal system operation.

The proposed model uses both normal (positive) and abnormal (negative) log messages for training, and unsupervised algorithms are used to prune the positive and negative logs. Once the reliable positive and negative logs are selected, an LSTM network is used for anomaly detection. Further, while log parsing is used by most ML/DL log message anomaly detection models such as Deeplog, only simple text log preprocessing is used in the proposed model.

Since log messages are unstructured, they are usually parsed before being input to an ML model. There are several different log parsing methods including LogSig and IPLoM [18]. An example of a log message is given in Fig. 1. The purpose of log parsing is to differentiate between the constant and variable parts of a message so that the constant elements can be mapped to a list of log events. However, systems are continually changing, so it is difficult to develop effective automated log parsing methods [18]. Figure 2 shows examples of positive and negative logs from Openstack. Although the verbosity level for both logs is INFO, the positive log is a statement which may indicate that a claim was successful on a node, while the negative log may indicate that the system was terminated because it failed to start. This shows that considering just one log message component may be insufficient for detecting anomalies.
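Several of the classical detectors mentioned above (EEnvelope, LOF, and OC-SVM) are available off the shelf in scikit-learn, one of the libraries named later in the experiments section. The sketch below is purely illustrative and is not the evaluation code used in this paper; the feature matrix X and the contamination value of 0.1 (roughly the anomaly proportion mentioned later) are placeholder assumptions.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Placeholder feature matrix: one row per log message (assumed numeric features).
X = np.random.rand(500, 40)

# Each detector returns +1 for inliers (normal logs) and -1 for outliers (anomalies).
ee_labels = EllipticEnvelope(contamination=0.1).fit_predict(X)
lof_labels = LocalOutlierFactor(n_neighbors=20, contamination=0.1).fit_predict(X)
ocsvm_labels = OneClassSVM(nu=0.1).fit_predict(X)
```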
K-means is a well-known clustering method which has been widely used in tasks such as detecting network intrusions [34]. The Gaussian mixture model (GMM) is a clustering method that assumes the data was created from a combination of Gaussian distributions. It has been used to solve problems such as detecting anomalies in flight operation data [24]. The Dirichlet process Gaussian mixture model with variational inference (BGM) is a Bayesian mixture model (an extension of finite mixture models) which has been used for tasks such as anomaly detection in hyperspectral data [35].

An LSTM network [19] is a recurrent neural network (RNN) which uses a cell to preserve sequence information and recall long-term dependencies. LSTM networks have been used for various tasks such as text classification [37] and texture classification in images [4]. An LSTM network is suitable for sequential data such as log messages [10] and is effective with big data [8]. Another advantage of an LSTM network is robustness, so it can be used with complex data. However, training an LSTM network can be time-consuming [16].

In this paper, a hybrid model is proposed which employs the unsupervised K-means, GMM and BGM methods for data pruning and a supervised LSTM network for anomaly detection using the reliable data. First, reliable positive logs are obtained using a GMM. Although GMMs are widely used for anomaly detection, here a GMM is used only to prune positive logs in the first step. Then, reliable negative logs are obtained using pruning with the unsupervised K-means, GMM and BGM methods. Finally, a portion of the reliable positive and negative logs is used for anomaly detection with an LSTM network. The amount of data used for LSTM training is very small even though convergence with deep networks typically requires a significant amount of training data. The proposed model is evaluated using the accuracy, precision, recall and F-measure criteria [12], and three log message data sets, namely BlueGene/L (BGL), Openstack and Thunderbird, are considered. The parameters of the proposed model are the same for all data sets to illustrate the robustness of this approach. The main contributions of this paper are as follows.

1. An unsupervised algorithm is presented which uses a GMM method to select reliable positive logs.
2. An unsupervised algorithm is presented which employs the K-means, GMM, and BGM methods iteratively to select reliable negative logs.
3. An LSTM network is used with the pruned logs for anomaly detection.

The rest of this paper is organized as follows. In the next section, the K-means, GMM, BGM, and LSTM architectures are presented and the proposed model is described. Experimental results for the three data sets are given in the third section along with a discussion of the model performance. Finally, the fourth section provides some concluding remarks.

In this section, the K-means, GMM, BGM, and LSTM architectures are given and the proposed model is described. K-means is an iterative clustering method. Given $k$ classes, each cluster has a center which is the average of the samples in the cluster. The set of clusters is $S = \{S_1, S_2, \ldots, S_k\}$ and a sample is assigned to the cluster whose center it is closest to. First, the cluster centers are initialized randomly. Next, the Euclidean distances between each sample and the cluster centers $c_i$ are calculated. Then, each sample is reassigned to the closest cluster and new centers are calculated for each cluster.
This process continues until the clusters do not change or the maximum number of iterations is attained. This corresponds to minimizing the objective function

$$ J = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - c_i \rVert^2 $$

where $k$ is the number of clusters, $c_i$ is the center of cluster $S_i$, $x$ is a data sample, and $\lVert x - c_i \rVert$ is the Euclidean distance from $x$ to $c_i$.

The Gaussian mixture model (GMM) is a clustering method. It is assumed that each cluster consists of data with a normal (Gaussian) distribution. The goal is to estimate the distribution parameters of each cluster and determine the labels for the samples, i.e., which cluster each sample belongs to. The expectation maximization (EM) algorithm [6] is used to obtain estimates for these parameters. The GMM probability density function (PDF) of sample $x_j$ is given by

$$ p(x_j \mid \theta) = \sum_{i=1}^{m} \omega_i \, \mathcal{N}(x_j \mid \mu_i, \Sigma_i) $$

where $\omega_i$, $\mu_i$ and $\Sigma_i$ are the weight, mean, and covariance matrix of the $i$th distribution, respectively, $m$ is the number of distributions (clusters), and $\theta = \{\omega_i, \mu_i, \Sigma_i\}$ is the parameter set of the mixture model. For $d$ features, the Gaussian distribution of $x_j$ is

$$ \mathcal{N}(x_j \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2} \lvert \Sigma_i \rvert^{1/2}} \exp\!\left( -\frac{1}{2} (x_j - \mu_i)^{\mathsf{T}} \Sigma_i^{-1} (x_j - \mu_i) \right) $$

The EM algorithm iterates two steps, expectation (E-step) and maximization (M-step). First, the model parameters are randomly initialized and the expectation and maximization steps are performed. The E-step computes the responsibilities

$$ \gamma_{ij}^{(t)} = \frac{\omega_i^{(t)} \, \mathcal{N}(x_j \mid \mu_i^{(t)}, \Sigma_i^{(t)})}{\sum_{l=1}^{m} \omega_l^{(t)} \, \mathcal{N}(x_j \mid \mu_l^{(t)}, \Sigma_l^{(t)})} $$

where $t$ is the iteration number. Then the parameters are estimated in the M-step as

$$ \omega_i^{(t+1)} = \frac{1}{N} \sum_{j=1}^{N} \gamma_{ij}^{(t)}, \quad \mu_i^{(t+1)} = \frac{\sum_{j=1}^{N} \gamma_{ij}^{(t)} x_j}{\sum_{j=1}^{N} \gamma_{ij}^{(t)}}, \quad \Sigma_i^{(t+1)} = \frac{\sum_{j=1}^{N} \gamma_{ij}^{(t)} \big(x_j - \mu_i^{(t+1)}\big)\big(x_j - \mu_i^{(t+1)}\big)^{\mathsf{T}}}{\sum_{j=1}^{N} \gamma_{ij}^{(t)}} $$

where $N$ is the number of samples. These steps are repeated until the criteria are satisfied or a maximum number of iterations is reached.

The Dirichlet process Gaussian mixture model (BGM) is a non-parametric Bayesian mixture model that is an extension of finite mixture models. The number of clusters (classes) does not need to be explicitly predefined because it is a non-parametric model. BGM uses the Dirichlet process (DP), which is a generalized form of a Dirichlet distribution [13]. A DP is composed of a base distribution $G_0$ and a positive concentration scalar $\alpha$. Since this model is not a finite mixture model, variational inference is employed [3]. The model priors are specified by the mean $\mu_0$ and covariance $\Sigma_i$ of the Gaussian distribution, the scale matrix $s$, and the number of degrees of freedom $v$ of the Inverse-Wishart distribution [39].

An LSTM is a recurrent neural network [19] which has been used to solve sequential data problems [15]. Cells are used to store information and they are connected recurrently. The use of cells solves the vanishing gradient problem. Each LSTM block includes input, forget, and output gates. These gates allow information to be stored longer than in feed-forward neural networks, which improves performance [15]. A block of an LSTM network is shown in Fig. 3. The cell input at time $t$ is $x_t$ and the input, forget, and output gate outputs are

$$ i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) $$

respectively, where $h_{t-1}$ is the previous block output, $b$ is the bias vector, $W$ and $U$ are weight matrices, and $\sigma$ is the activation function (usually the sigmoid function). The block input at time $t$ is given by

$$ \tilde{C}_t = \tanh(W_C x_t + U_C h_{t-1} + b_C) $$

where $b_C$ is the bias vector, $W_C$ and $U_C$ are the weight matrices, and $\tanh$ denotes the hyperbolic tangent activation function. The cell state at time $t$ is

$$ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t $$

where $\odot$ denotes point-wise multiplication. The block output at time $t$ is

$$ h_t = o_t \odot \tanh(C_t) $$

The proposed model architecture has three steps. First, positive logs are pruned using an unsupervised GMM method. Second, negative logs are pruned through multiple rounds of the unsupervised GMM, BGM, and K-means methods. Finally, anomalies are detected using an LSTM network with the selected (reliable) positive and negative logs.
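Since the pruning steps described below rely on exactly these three clustering methods, the following minimal scikit-learn sketch shows how they can be instantiated (an illustration under assumptions, not the published implementation). The feature matrix X is a placeholder, and the component counts other than the k = 20 reported later for K-means are assumed values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

# Placeholder feature matrix: one row per log message, min-max scaled token indices.
X = np.random.rand(1000, 40)

# K-means with k clusters (k = 20 is the value reported in the parameter settings).
kmeans = KMeans(n_clusters=20, random_state=0).fit(X)
kmeans_labels = kmeans.labels_

# GMM fit with the EM algorithm; predict() returns the most likely cluster per sample.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
gmm_labels = gmm.predict(X)

# Dirichlet process GMM (BGM) with variational inference; n_components is an upper
# bound since the effective number of clusters is inferred from the data.
bgm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)
bgm_labels = bgm.predict(X)
```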
First, simple text pre-processing including changing letters to lowercase, removing hyphens and tokenization is applied to the data set $D$. Next, the sentences are padded to 40 tokens, and sentences with fewer than five tokens are removed. Then, the number of appearances of each token in the data set is computed and the tokens are ordered from most frequent to least frequent. Each token is given an index starting from zero and the indices are used as the data set features. Next, the features are normalized using a min-max scaler so all values are between 0 and 1, and the entries in the data set are shuffled. Then $D$ is divided into two sets, $t_1$ with 2% of the data for training and $r_1$ with the remaining 98% of the data. The set $t_1$ is small to keep the computational complexity low and leave more data for the rest of the algorithm. The proportion of negative and positive logs in these sets is the same as in $D$.

A GMM is used to prune the positive logs. It is trained with $t_1$ and tested with $r_1$. The logs predicted to be negative (predicted output $y = 0$) and positive (predicted output $y = 1$) are counted and these counts are denoted $c_0$ and $c_1$, respectively. If the number of logs predicted as positive is less than the number predicted as negative, then $c_0$ and $c_1$ are swapped. This is because it is known that the number of anomalies (negative logs) is much smaller than the number of positive logs (around 10%). The variance is given by

$$ \mathrm{var} = \frac{1}{F} \sum_{i=1}^{F} (x_i - \bar{x})^2 $$

where $x_i$ is the $i$th feature, $\bar{x}$ is the average of the features and $F$ is the total number of features in the data set. Let $a = c_1 / c_0$ and let the variances of the negative and positive predicted logs be $z_{var}$ and $o_{var}$, respectively. If $a > 3$ and $o_{var} \times c < z_{var}$ ($c$ is a constant), then the positive and negative predicted logs are added to the sets $o_0$ (reliable positive logs) and $z_0$ (rest of the data), respectively. The threshold for $a$ was chosen considering that the majority of the logs are positive. A high value of $c$ increases the probability of getting only positive logs, but if it is too high the algorithm criterion may not be satisfied. It was set to $c = 1.6$ for all data sets based on the experimental results obtained. The variance measures the spread of a data set. If the model predicted most of the positive logs correctly (small number of false positives), then the variance of the positive logs should be lower than that of the negative logs. A high variance may indicate that there is a mix of positive and negative logs, whereas a small variance indicates that there are mostly positive logs predicted correctly. If the criteria are met, the results are kept; otherwise the process is repeated.
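A minimal sketch of this positive-pruning step is given below for illustration (this is not the published Algorithm 1). The 2% training split, the ratio threshold a > 3, and the constant c = 1.6 follow the description above; the use of two GMM components and of the mean per-log variance within each predicted class are assumptions, since those details are not fully specified here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def prune_positive_logs(X, c=1.6, a_threshold=3.0, train_frac=0.02, seed=0):
    """Select reliable positive logs with a GMM (illustrative sketch of Algorithm 1)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    X_train, X_rest = X[idx[:n_train]], X[idx[n_train:]]

    # Two components assumed: one positive (majority) and one negative cluster.
    gmm = GaussianMixture(n_components=2, random_state=seed).fit(X_train)
    y = gmm.predict(X_rest)

    # Treat the larger cluster as positive (label 1); swap labels if needed.
    if np.sum(y == 1) < np.sum(y == 0):
        y = 1 - y
    c0, c1 = np.sum(y == 0), np.sum(y == 1)

    # Per-log variance over the features, averaged within each predicted class
    # (one interpretation of the variance criterion described above).
    o_var = X_rest[y == 1].var(axis=1).mean()   # predicted positive logs
    z_var = X_rest[y == 0].var(axis=1).mean()   # predicted negative logs

    a = c1 / c0
    if a > a_threshold and o_var * c < z_var:
        return X_rest[y == 1], X_rest[y == 0]   # (reliable positives o_0, rest z_0)
    return None  # criteria not met; the process is repeated
```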
The GMM, K-means and BGM methods are now used to select negative logs. These models were chosen because they are efficient unsupervised models for text data [1]. There are $n$ rounds and in each round, GMM, K-means and BGM are each run $m$ times. In the first round, the models are trained with $z_0$ from the previous step. In subsequent rounds, the results $z$ from the previous round are used for this purpose. The entropy of sample $x_j$ is given by

$$ H(x_j) = -\sum_{i=1}^{d} \frac{n_i}{M} \log \frac{n_i}{M} $$

where $n_i$ is the $i$th feature of the sample, $d$ is the length of the sample and $M$ is the sum of the sample features. For a model run, denote the average entropy of the logs predicted as negative and positive as $sh_0$ and $sh_1$, respectively. If $sh_0 < sh_1$, then the predicted negative logs are appended to $z_1$ and the predicted positive logs are appended to $o_1$. This is because small features (low-index, frequent tokens) appear frequently in positive logs, so they are more uniform and thus have a higher entropy. At the end of a round, $z_1$ is assigned to $z$ for use in the next round. The logs in $z$ are counted, ordered from most frequent to least frequent, and the repetitions are discarded. This is done so that each log appears at most once in $z$ and the logs that appear more often are used earlier, so the models in the next round can predict better. The logs in $z_1$ are discarded at the start of each round. In each round, the prediction of positive and negative logs is done using $z$. Thus, the number of positive logs is reduced and only reliable negative logs are kept. The final $z_1$ contains the predicted negative logs from the last round and these are used in the next step.

In this step, an LSTM network is used with the reliable negative ($z_1$) and positive ($o_0$) logs from the previous two steps for anomaly detection. Each log in $z_1$ and $o_1$ is counted, ordered from most frequent to least frequent, and the repetitions are discarded. The first $L$ logs in $z_1$ are selected and assigned to $z_2$ (the most reliable predicted negative logs), and the remaining logs are assigned to $o_2$. Logs which appear in $o_1$ but not in $z_1$ are placed in $o_3$. The reliable positive logs $o_0$ obtained in the first step are shuffled, and 10% are randomly assigned to $o_4$ and the remainder to $o_5$. The logs in $z_2$ are repeated four times and assigned to $x_n$. A portion of $o_4$ which is the same size as $x_n$ is randomly chosen and assigned to $x_p$. Thus, the reliable negative logs are oversampled so that the numbers of reliable positive and negative logs are the same. This is because LSTM networks work better with balanced data and should be trained with a sufficient number of positive and negative logs. The logs in $x_n$ and $x_p$ are labeled with $y = 0$ and $y = 1$, indicating negative and positive logs, respectively, and $x_n$ and $x_p$ are assigned to $t_2$. The remaining logs in $o_4$ are assigned to $o_6$, and $o_2$, $o_3$, $o_5$ and $o_6$ are assigned to $t_3$. The data set was initially scaled so all values are between 0 and 1, but this scaling is reversed for $t_2$ and $t_3$, which provide the training and testing sets, respectively, for the LSTM network. The LSTM network is trained with 90% of $t_2$, validated with the remaining 10% of $t_2$, and tested with $t_3$.

The parameters used are $k = 20$, $n = m = 5$, and $L = 10000$ for the BGL and Thunderbird data sets and $L = 3000$ for the Openstack data set. A different value of $L$ is used because the Openstack data set is much smaller than the BGL and Thunderbird data sets. For the BGL and Thunderbird data sets, an LSTM network with three hidden layers of size 256, batch size 128 and a maximum of 10 training epochs is used. To prevent overfitting, dropout with probability 0.5 and early stopping are used. The softmax activation function is applied in the last dense layer. The cross-entropy loss function and Adam optimizer are used for training. The Adam optimizer is used because it has been shown to provide good performance and fast convergence in DL algorithms [33]. For the Openstack data set, an LSTM network with a single hidden layer of size 512 and an embedding dimension of 512 is used. A single-layer network is used for this data set because it is smaller than the other data sets. The rest of the architecture for the Openstack data set is the same as above. All network parameters were chosen based on the experimental results obtained. An LSTM network is used for anomaly detection because it has been shown to provide good results in classifying sequential data [15]. However, other DL discriminative networks such as a CNN can be employed. The proposed model algorithms are given in Algorithms 1-3 and shown in Fig. 4. The data preparation for Algorithm 3 is shown in Fig. 5.
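As a concrete illustration of the network configuration described above (three LSTM layers of size 256, dropout 0.5, a softmax output layer, cross-entropy loss, the Adam optimizer, early stopping, batch size 128 and at most 10 epochs), a Keras sketch might look as follows. This is not the authors' implementation; the vocabulary size and the embedding dimension of 256 are assumptions (only the Openstack embedding dimension of 512 is stated), and the sequence length of 40 follows the preprocessing description.

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 5000   # assumed number of distinct tokens in the data set
seq_len = 40        # logs are padded to 40 tokens

model = keras.Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=256),  # embedding size assumed
    layers.LSTM(256, return_sequences=True),
    layers.LSTM(256, return_sequences=True),
    layers.LSTM(256),
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),  # positive (1) vs. negative (0) logs
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])

# Early stopping as described above; the patience value is an assumption.
early_stop = keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True)

# x_train/y_train correspond to the balanced training set t_2 in the text.
# model.fit(x_train, y_train, validation_split=0.1, epochs=10,
#           batch_size=128, callbacks=[early_stop])
```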
In this section, the proposed model is evaluated using the BGL, Openstack and Thunderbird data sets. Four performance criteria are considered, namely accuracy, precision, recall and F-measure [12]. The percentage of data correctly predicted is called the accuracy and is given by

$$ \mathrm{Accuracy} = \frac{T_p + T_n}{T_p + T_n + F_p + F_n} $$

where $T_p$ is the number of positive samples predicted by the model to be positive, $F_p$ is the number of negative samples predicted to be positive, $T_n$ is the number of negative samples predicted to be negative, and $F_n$ is the number of positive samples predicted to be negative. Then, the precision is

$$ \mathrm{Precision} = \frac{T_p}{T_p + F_p} $$

the recall is

$$ \mathrm{Recall} = \frac{T_p}{T_p + F_n} $$

and the F-measure is

$$ \mathrm{F\text{-}measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $$

All experiments were run on the Compute Canada Graham cluster with 32 CPU cores, two P100 GPUs and 124 GB of memory. The algorithms were implemented using Python, Keras and Scikit-learn. The hyperparameters of the proposed model were not tuned, so the default values were used in all experiments. Each experiment was repeated 10 times and the minimum, maximum and average testing accuracy, precision, recall, F-measure and computation time were obtained.
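For illustration, these criteria can be computed from the true and predicted labels with scikit-learn as sketched below (labels follow the convention above, with 1 for positive and 0 for negative logs; the example labels are placeholders).

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Example labels: 1 = positive (normal) log, 0 = negative (anomalous) log.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)

# Per-class precision, recall and F-measure; index 0 corresponds to negative logs
# and index 1 to positive logs, analogous to the per-class values in the tables.
precision, recall, f_measure, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1]
)
print(accuracy, precision[0], recall[0], f_measure[0])
```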
Table 1a gives the proposed model results for the BGL, Openstack and Thunderbird data sets using GMM for positive pruning. For comparison, the proposed model results using BGM for positive pruning are given in Table 1b with the order for negative pruning changed to K-means, GMM, and BGM. The results for negative logs with the Auto-LSTM [10], IKNN, nLSALog [40] and Deeplog algorithms for the (a) BGL, (b) Openstack, and (c) Thunderbird data sets are given in Table 2. Table 3 gives the average testing accuracy, precision, recall, F-measure and computation time with the BGM, EEnvelope, GMM, K-means, LOF and OC-SVM methods for the (a) BGL, (b) Openstack, and (c) Thunderbird data sets. Table 4 presents the positive log pruning results for the BGL, Openstack and Thunderbird data sets with (a) GMM and (b) BGM. Tables 5, 6 and 7 give the negative log pruning results for the BGL, Openstack and Thunderbird data sets, respectively, with (a) GMM, (b) K-means, and (c) BGM for $n = 5$ rounds (with a GMM for positive pruning); in these tables, the minimum, maximum and average (in parentheses) values over 10 runs are given, with positive labels denoted by 1 and negative labels by 0.

The BlueGene/L (BGL) data set has 4,399,502 positive logs and 348,460 negative logs. From these, 94,960 logs are used for the training set $t_1$ and 4,653,002 for the remaining set $r_1$. Using GMM for positive pruning, the final average testing accuracy is 99.5% with average precision, recall and F-measure of 95.6%, 97.8% and 96.7% for negative logs, and 99.8%, 99.6% and 99.7% for positive logs, respectively. Using BGM for positive pruning, the final average testing accuracy is 99.5% with average precision, recall and F-measure of 95.4%, 98.5% and 96.9% for negative logs, and 99.9%, 99.6% and 99.7% for positive logs, respectively.

The results with Auto-LSTM, IKNN and nLSALog for the BGL data set are given in Table 2a. The precision, recall and F-measure results for negative logs are better than the 92%, 91% and 92%, respectively, obtained with the improved K-nearest neighbors (IKNN) supervised algorithm [36]. They are also better than the 82.5%, 94.7% and 88.2%, respectively, obtained with the nLSALog algorithm [40]. The recall and F-measure results for negative logs are better than the 91.3% and 94.5% obtained with the Auto-LSTM algorithm [10], although the Auto-LSTM precision of 98% is higher. Several well-known models were also evaluated for anomaly detection. The average testing accuracy, precision, recall, F-measure and time with the BGM, EEnvelope, GMM, K-means, LOF, and OC-SVM methods for the BGL data set using 10-fold cross-validation are given in Table 3a. Among these methods, the GMM results for negative logs are the highest with precision, recall and F-measure of 38.2%, 50% and 43.3%, but these values are lower than those for the proposed model. The proposed model results are better because of the pruning of positive and negative logs and the use of DL. Because of the high complexity of the LOF and OC-SVM methods [2, 9, 29], only 5% of the data set was used for these models.

The Openstack data set has 137,074 positive log messages and 18,434 negative log messages. From these, 3111 logs are used for the training set $t_1$ and 152,397 for the remaining set $r_1$. Using GMM for positive pruning, the final average testing accuracy is 99.9% with average precision, recall and F-measure of 99.9%, 99.7% and 99.8% for negative logs, and 99.9%, 99.9% and 99.9% for positive logs, respectively. Using BGM for positive pruning, the final average testing accuracy is 99.9% with average precision, recall and F-measure of 99.7%, 99.9% and 99.8% for negative logs, and 99.9%, 99.9% and 99.9% for positive logs, respectively.

The results with Auto-LSTM and Deeplog for the Openstack data set are given in Table 2b. The precision, recall and F-measure results for negative logs are better than the 94%, 99% and 97% obtained with the Deeplog network [7]. They are also better than the 99.4%, 92.8% and 96%, respectively, obtained with the Auto-LSTM algorithm [10]. Several well-known models were also evaluated for anomaly detection. The average testing accuracy, precision, recall, F-measure and time with the BGM, EEnvelope, GMM, K-means, LOF, and OC-SVM methods for the Openstack data set using 10-fold cross-validation are given in Table 3b. Among these methods, the EEnvelope results for negative logs are the highest with precision, recall and F-measure of 53.4%, 44.9% and 48.8%, but these values are lower than those for the proposed model. The proposed model results are better because of the pruning of positive and negative logs and the use of DL.

From the Thunderbird data set, 3,000,000 positive log messages and 324,824 negative log messages are used.
Of these, 66,497 messages are used for the training set $t_1$ and 3,258,327 for the remaining set $r_1$. Using GMM for positive pruning, the final average testing accuracy is 99.8% with average precision, recall and F-measure of 98.9%, 99.6% and 99.3% for negative logs, and 99.9%, 99.8% and 99.9% for positive logs, respectively. Using BGM for positive pruning, the final average testing accuracy is 99.8% with average precision, recall and F-measure of 99%, 99.6% and 99.3% for negative logs, and 99.9%, 99.8% and 99.9% for positive logs, respectively.

The results with Auto-LSTM and IKNN for the Thunderbird data set are given in Table 2c. The precision, recall and F-measure results for negative logs are better than the 96% obtained for all three criteria with the IKNN supervised algorithm [36], and about the same as the 98.4%, 99.8% and 99.1%, respectively, obtained with the Auto-LSTM algorithm [10]. Several well-known models were also evaluated for anomaly detection. The average testing accuracy, precision, recall, F-measure and time with the BGM, EEnvelope, GMM, K-means, LOF, and OC-SVM methods for the Thunderbird data set using 10-fold cross-validation are given in Table 3c. Among these methods, the GMM results for negative logs are the highest with precision, recall and F-measure of 27.1%, 70% and 37.2%, but these values are lower than those for the proposed model. The proposed model results are better because of the pruning of positive and negative logs and the use of DL. Because of the high complexity of the LOF and OC-SVM methods, only 5% of the data set was used for these models.

The Gaussian mixture model (GMM), Dirichlet process Gaussian mixture model (BGM), and K-means are well-known clustering algorithms. Clustering algorithms have been shown to provide good results with text data [1], and logs are mostly text. In addition, clustering algorithms are faster to train than DL algorithms [26]. Here, an unsupervised GMM is used for pruning positive logs and the unsupervised GMM, BGM, and K-means methods are used for pruning negative logs. This eliminates the need to label log messages to detect anomalies. The positive and negative logs are selected in an unsupervised manner using Algorithms 1 and 2, respectively. If the conditions in these algorithms are satisfied, then the logs predicted to be positive and negative are added to $o_0$ and $z_0$, respectively, for Algorithm 1, and to $o_1$ and $z_1$, respectively, for Algorithm 2. Then reliable positive and negative logs are selected using $o_0$ and $z_1$, respectively, in Algorithm 3.

The amount of positive data is far greater than the amount of negative data, so the positive data can be accurately predicted using clustering algorithms. However, the negative cluster contains a lot of positive data, which is a disadvantage of using clustering methods with imbalanced data [23]. We take advantage of this data imbalance as a GMM can easily predict positive data. Another disadvantage of unsupervised clustering is that clusters may be labeled incorrectly [27]. Thus, not only do negative clusters include positive logs, but clusters may be incorrectly labeled in different runs. As a consequence, the average results shown in Table 3 for the BGM, GMM and K-means methods are poor. Comparing the results in Tables 1 and 3, it is evident that BGM, GMM and K-means alone do not provide good log message anomaly detection results. This is also due to the complexity of the unstructured log messages.
For negative pruning, our experimental results indicate that using just one model can limit the pruning process so that many unreliable logs are retained. Thus, multiple models are employed.

The precision for positive logs is the percentage of true positive logs among all logs predicted to be positive. Table 4 shows that with GMM and BGM for positive pruning, the average precision of positive logs is 99.9% for the BGL, Openstack and Thunderbird data sets, which is very high. Thus, most logs that are predicted to be positive are correct, and the number of negative log messages predicted to be positive ($F_p$) is low. This indicates that pruning positive logs using Algorithm 1 is effective. However, the average precision for negative logs for the BGL and Thunderbird data sets is around 76% and 50%, respectively, which is quite low.

In separate experiments, the value of the constant $c$ and the threshold for $a$ were varied to determine their effect on Algorithm 1. The first set of experiments considered $c = 1.4$, 1.5, 1.6, 1.7, and 1.8 with $a > 3$, and the second set of experiments considered thresholds for $a$ of 3, 4, 6, and 10 with $c = 1.6$. The criterion for this algorithm is that the set $z_0$ is not empty, because Algorithm 2 requires this data for training. In the first set of experiments, the criterion was satisfied for all values of $c$ with the Openstack and Thunderbird data sets and for all values except $c = 1.8$ with the BGL data set. In the second set of experiments, the criterion was satisfied for thresholds of 3, 4, and 6 with the BGL and Openstack data sets and for thresholds of 3 and 4 with the Thunderbird data set.

The effect of the negative log pruning algorithm is shown in Tables 5, 6 and 7 for the BGL, Openstack and Thunderbird data sets (with a GMM for positive pruning), respectively. Here, precision for the negative logs is the most important criterion. For the BGL data set, the average precision of the negative logs with the GMM, K-means and BGM methods increased from 82.4% to 100%, from 80.8% to 95.9%, and from 88.1% to 100%, respectively, over the five rounds. For the Openstack data set, the average precision of the negative logs increased from 99.9% to 100% for all three methods over the five rounds. For the Thunderbird data set, the average precision of the negative logs with the GMM, K-means and BGM methods was approximately the same (99.9-100%) over the five rounds. These results indicate that Algorithm 2 is very effective in pruning negative logs. Further, the BGL data set required five rounds to obtain good results, but only two rounds were sufficient for the Openstack and Thunderbird data sets. The GMM, K-means and BGM methods were used here to prune negative logs, but other unsupervised models can be employed.

The adjusted Rand index [21] is a measure of the similarity of two clustering results, with a maximum value of 1. A high index value means the results are very similar. For negative log pruning with the BGL data set, the average adjusted Rand index over all rounds for GMM and BGM was 0.94, for GMM and K-means was 0.33, and for BGM and K-means was 0.28. For negative log pruning with the Openstack data set, the average adjusted Rand index over all rounds for GMM and BGM was 0.74, for GMM and K-means was 0.49, and for BGM and K-means was 0.37. For negative log pruning with the Thunderbird data set, the average adjusted Rand index over all rounds for GMM and BGM was 0.93, for GMM and K-means was 0.19, and for BGM and K-means was 0.17.
These values show that the GMM and BGM results are more similar to each other than either is to the K-means results.

The anomaly detection results with the LSTM network using the reliable positive and negative logs are shown in Table 1. The final results with GMM and BGM positive log pruning were similar. The proposed model results are better than with Auto-LSTM because the data was balanced before it was input to the LSTM network for anomaly detection, whereas in [10] imbalanced data was used in the network. The amount of data used for LSTM training was very small (less than 2% for BGL, 3% for Thunderbird, and 18% for Openstack), whereas deep networks typically require a significant amount of training data for convergence. For the Openstack data set, a greater percentage of training data was required for convergence because it is small (more than 20 times smaller than Thunderbird and BGL).

The proposed hybrid model (with unsupervised selection of reliable logs) has three advantages over supervised methods. First, it is suitable for many practical applications as there is no need to label data. Second, labeling data is a time-consuming task and in many cases is not feasible. Third, using an unsupervised method eliminates the human error inherent in labeling. The default hyperparameters were used with the proposed model, so better results may be obtained with hyperparameter tuning.

Many millions of log messages are generated each day in cloud and other systems. These messages are important for system maintenance, which includes anomaly detection. Log messages consist of unstructured data which is mostly text. Thus, machine learning (ML) is a good choice for anomaly detection. In this paper, a hybrid log message anomaly detection technique using deep learning (DL) was proposed with pruning of positive and negative log messages. An unsupervised algorithm with a Gaussian mixture model (GMM) was used to prune positive logs. Then, an unsupervised algorithm was used to prune negative logs using the K-means, GMM, and Dirichlet process Gaussian mixture model (BGM) methods iteratively. The precision with the pruning algorithms for positive and negative logs was high, i.e., there were few false positives ($F_p$). The proposed model was tested on three different log message data sets, namely BGL, Openstack and Thunderbird. The results obtained show that this model is better than other well-known approaches. Future research can consider the effect of adding other unsupervised methods such as isolation forest to the proposed model. Further, a CNN can be used for anomaly detection instead of an LSTM network, and hyperparameter tuning can be investigated.

Funding: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
References

[1] A survey of text clustering algorithms.
[2] An improved outlier detection algorithm K-LOF based on density.
[3] Variational inference for Dirichlet process mixtures.
[4] Texture classification using 2D LSTM networks.
[5] A survey of deep learning and its applications: a new paradigm to machine learning.
[6] Maximum likelihood from incomplete data via the EM algorithm.
[7] DeepLog: anomaly detection and diagnosis from system logs through deep learning.
[8] The real-time big data processing method based on LSTM for the intelligent workshop production process.
[9] High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning.
[10] Log message anomaly detection and classification using auto-B/LSTM and auto-GRU.
[11] Unsupervised log message anomaly detection.
[12] An introduction to ROC analysis.
[13] A Bayesian analysis of some nonparametric problems.
[14] Deep learning.
[15] Supervised sequence labelling with recurrent neural networks.
[16] Machine-learning based methods in short-term load forecasting.
[17] Log-based anomaly detection of CPS using a statistical method.
[18] An evaluation study on log parsing and its use in log mining.
[19] Long short-term memory.
[20] Anomaly detection for machine learning redshifts applied to SDSS galaxies.
[21] Comparing partitions.
[22] A comprehensive survey on word recognition for non-Indic and Indic scripts.
[23] Imbalanced K-means: an algorithm to cluster imbalanced-distributed data.
[24] Anomaly detection via a Gaussian mixture model for flight operation and safety monitoring.
[25] Log clustering based problem identification for online service systems.
[26] 500+ times faster than deep learning (a case study exploring faster methods for text mining StackOverflow).
[27] Classification of outdoor 3D lidar data based on unsupervised Gaussian mixture models.
[28] Distributed online one-class support vector machine for anomaly detection over networks.
[29] Selecting training sets for support vector machines: a review.
[30] Detecting unknown computer worm activity via support vector machines and active learning.
[31] Local outlier factor use for the network flow anomaly detection.
[32] Mining unstructured log files for recurrent fault diagnosis.
[33] Super-resolution imaging using convolutional neural networks. In: Communications, signal processing, and systems.
[34] Unsupervised clustering approach for network anomaly detection.
[35] Fully unsupervised learning of Gaussian mixtures for anomaly detection in hyperspectral imagery.
[36] Log-based anomaly detection with the improved K-nearest neighbor.
[37] Chinese text sentiment analysis using LSTM network based on L2 and Nadam.
[38] COVID-19 classification by FGCNet with deep feature fusion from graph convolutional network and convolutional neural network.
[39] Bayesian forecasting and dynamic models, chap. 16: multivariate modelling and forecasting.
[40] nLSALog: an anomaly detection framework for log sequence in security management.
[41] Beehive: large-scale log analysis for detecting suspicious activity in enterprise networks.
[42] SherLog: error diagnosis by connecting clues from run-time logs.
[43] The research of log-based network monitoring system.
[44] A survey on deep learning for big data.
[45] Improved breast cancer classification through combining graph convolutional network and convolutional neural network.
[46] Tools and benchmarks for automated log parsing.

Conflict of interest: The authors declare no conflict of interest with regards to this paper.