key: cord-0426215-cag3zgt1 authors: Parto, Mahmoud; Saldana, Christopher; Kurfess, Thomas title: Real-Time Outlier Detection and Bayesian Classification using Incremental Computations for Efficient and Scalable Stream Analytics for IoT for Manufacturing date: 2020-12-31 journal: Procedia Manufacturing DOI: 10.1016/j.promfg.2020.05.136 sha: e5037de16a0634af1028838bd8d7c8761856eb70 doc_id: 426215 cord_uid: cag3zgt1 Abstract As the manufacturing industry progresses towards the Internet of Things (IoT) and Cyber-Physical Systems (CPS), current methods of historical data analytics face difficulties in addressing the new challenges that follow Industry 4.0. Industry 4.0 and IoT technologies facilitate the acquisition of ubiquitous data from machine tools and processes. However, these technologies also generate large volumes of data that are complex to analyze. Because of the streaming nature of IoT systems, stream analytics can extract features as the data are generated and published, which avoids the need to store the data and to perform advanced analytics that require high-performance computing. This manuscript demonstrates how traditional historical methods can be modified into stream analytics tools for IoT data streams. Since data analytics is a wide domain, this paper focuses on two lightweight methods that have been popular in industry: the Statistical Process Control Chart (SPCC) and Bayesian classification. The paper defines, tests, and evaluates the accuracy and latency of novel variations of these methods. It is concluded that, by modifying the traditional methods and defining incremental solutions, methods such as the Real-Time Dynamic Statistical Process Control Chart (RTDSPCC) and Incremental Gaussian Naïve Bayes (IGNB) can be formed that are highly beneficial for IoT applications, as they are highly scalable, require minimal storage, and can update the models in real time.
Analytics and process control in today's industry remain minimal and challenging. Due to their simplicity and reliability, however, the manufacturing industry still utilizes traditional Statistical Process Control (SPC) methods to detect out-of-control signals in operations. An example SPCC is shown in FIGURE 1. Despite the simple nature of SPCC, the process of creating, implementing, and updating the model in a machine shop involves manually sampling the data and computing the Upper Control Limit (UCL) and Lower Control Limit (LCL). 48th SME North American Manufacturing Research Conference, NAMRC 48 (Cancelled due to COVID-19). © 2020 by ASME. In stream analytics, as IoT systems publish their data, the amount of time it takes for an algorithm to process the data, known as latency, needs to be minimized. The progression of the industry toward Smart Manufacturing (SM) with the coming of Industry 4.0 ensures that conventional methods of data analytics will not apply. As it stands, the current methods for producing analytical models are time-consuming and have to be applied manually. Also, given the stream analytics and low-latency computation needs of complex IoT data, this sort of methodology will not address the scalability, automation, and real-time needs of smart manufacturing systems.
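The SPCC setup described above (sample the data, then derive the UCL and LCL from the mean and a three-sigma band) can be sketched in a few lines of Python. The function name `spcc_limits` and the sample values are illustrative, not taken from the paper:

```python
import statistics

def spcc_limits(samples, n_sigma=3.0):
    """Compute the SPCC center line and control limits from a training sample."""
    mean = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)  # population standard deviation
    return mean - n_sigma * sigma, mean, mean + n_sigma * sigma

# Train limits on a small sample, then flag points falling outside the band.
lcl, center, ucl = spcc_limits([10.1, 9.9, 10.0, 10.2, 9.8])
out_of_control = [x for x in [10.0, 12.5, 9.9] if not lcl <= x <= ucl]
```

In a shop-floor setting the training sample would come from the manually collected data the paragraph mentions; the three-sigma default matches the convention used later in the paper.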
To address these limitations, this manuscript defines, tests, and draws conclusions on several novel incremental methods applied to outlier detection and classification. More specifically, the Real-Time Dynamic Statistical Process Control Chart (RTDSPCC) and Real-Time Moving Statistical Process Control Chart (RTMSPCC), as well as their seasonal versions, are presented and evaluated based on their performance and ability to detect various anomalies in data, and are compared with the ordinary SPCC, Dynamic SPCC (DSPCC), Moving SPCC (MSPCC) and their seasonal versions, Autoregressive Integrated Moving Average (ARIMA), and Extreme Studentized Deviate (ESD). In addition, Incremental Gaussian Naïve Bayes (IGNB) is introduced as an incremental, real-time classification method, and its latency and accuracy are compared with other classification algorithms. Walking through the different methods and evaluating them provides a better understanding of the advantages of the methods and strategies proposed in this manuscript, which can be extended to other methods and algorithms to make them compatible with stream analytics needs. Control charts, also known as Shewhart charts, are statistical tools used to determine whether a manufacturing process is out of the normal control conditions known as process distributions. Many studies have created methods and algorithms to gain insights from manufacturing systems by combining SPC and Artificial Intelligence (AI), such as ANN [1]. When creating an SPCC, it is often assumed that the data follow the Gaussian (normal) distribution. The formula and the probability distribution of the Gaussian are shown in Equation (1) and FIGURE 2 [2], where \mu, \sigma, and \sigma^2 are the mean, standard deviation, and variance, respectively:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}   (1)

The UCL and LCL for SPCC in this study are often defined by the three-sigma deviation.
Higher sigma multiples are also used in applications where only very-high-certainty out-of-control data points need to be reported. ARIMA is one of the most important and widely used time-series models for forecasting and predicting future points in a series [3]. ARIMA models are very flexible and can be used as pure autoregressive (AR), pure moving average (MA), combined AR and MA (ARMA), or with integration (ARIMA) to make the time series stationary [4]. ARIMA models are generally denoted as ARIMA(p, d, q), where the parameters p, d, and q define the order of the autoregressive model, the degree of differencing, and the order of the moving-average model, respectively [3, 4]. In this study, values of 1, 1, and 0 were used for p, d, and q, respectively. These values were chosen to obtain a simplified first-order autoregressive model that can identify the pattern of the data with one order of nonseasonal differencing while requiring a low amount of computation. ARIMA has also been used in many studies for anomaly detection. A study by Bianco et al. demonstrates that their ARIMA-based procedure yields better estimation than classical methods based on maximum-likelihood-type estimates and Kalman filtering [5]. Another study, by Noskievičová et al., shows the use of an ARIMA control chart to control a blast furnace process [6]. ARIMA SPC is also used in this study, and its performance and accuracy for anomaly detection are compared with the other algorithms. Since many IoT applications in industry require a quick learning phase with a low number of data points, the Naïve Bayes (NB) classifier is a proper choice, as it can generate accurate predictions even with a small number of training data points. With given prior knowledge of the system, Bayesian statistics allows prediction of the likelihood of a given dataset.
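As a rough illustration of the ARIMA(1, 1, 0) configuration used here, the following sketch fits an AR(1) coefficient to the first differences by ordinary least squares and produces a one-step forecast. This is a simplified stand-in under stated assumptions (no intercept, no maximum-likelihood estimation, no library), not the implementation used in the study:

```python
def arima_110_forecast(series):
    """One-step forecast with a minimal ARIMA(1,1,0): AR(1) on first differences."""
    d = [b - a for a, b in zip(series, series[1:])]  # d = 1: first differencing
    # Least-squares AR(1) coefficient phi minimizing sum((d[t] - phi * d[t-1])^2)
    num = sum(x * y for x, y in zip(d, d[1:]))
    den = sum(x * x for x in d[:-1])
    phi = num / den if den else 0.0
    next_diff = phi * d[-1]        # predicted next difference
    return series[-1] + next_diff  # undo the differencing

pred = arima_110_forecast([1.0, 2.0, 4.0, 8.0, 16.0])
```

For anomaly detection, a point would be flagged when it falls far from such a forecast; a production implementation would use a proper estimation library rather than this least-squares shortcut.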
Bayes' theorem combines joint probability and conditional probability [10], as shown in Equation (2):

P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}   (2)

In this section, the novel algorithms studied in this research project are presented. These algorithms are divided into two categories: anomaly detection and classification. For each category, novel algorithms aimed at incremental, real-time computation are proposed and compared with the commonly used algorithms in that category. More specifically, in addition to SPCC, Seasonal ESD (SESD), and ARIMA for outlier detection, which were described in the background section, various methods are presented and used to create incremental moving statistical process control charts that either take the characteristics of all of the data points into account or perform computations on a specified moving batch of data points. For classification, a novel incremental Bayesian classifier is proposed, whose accuracy and performance are analyzed and compared with other classifiers such as Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Gaussian Naïve Bayes (GNB), Artificial Neural Network (ANN), Decision Tree (DT), Random Forest (RF), and AdaBoost (AB). To develop an automated and dynamically updatable algorithm, in this method the SPCC model is updated every time a new data point is received. This is done by keeping a historical buffer of the data points and recalculating the UCL and LCL. The user can define a period of time or a number of data points sufficient for the training phase and avoid this computation running for long periods of time. This algorithm may reduce the need for manual data collection and computation to set up the control chart, as this process can be automated in an application. However, this method is expected to be used for low-frequency data rates or for short training periods, since the computations may slow down over time as the data size grows.
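The dynamically updatable chart described above, which keeps a historical buffer and recalculates the UCL and LCL on every incoming point, might be sketched as follows; the class name `DSPCC`, the sample values, and the flag-before-update ordering are illustrative assumptions:

```python
import statistics

class DSPCC:
    """Dynamic SPCC sketch: keep a growing history, recompute limits per point."""
    def __init__(self, n_sigma=3.0):
        self.history, self.n_sigma = [], n_sigma

    def update(self, x):
        """Return True if x falls outside the current limits (O(n) per point)."""
        if len(self.history) >= 2:
            mean = statistics.fmean(self.history)
            sigma = statistics.pstdev(self.history)
            outlier = abs(x - mean) > self.n_sigma * sigma
        else:
            outlier = False  # not enough data yet to form control limits
        self.history.append(x)
        return outlier

chart = DSPCC()
flags = [chart.update(x) for x in [10.0, 10.1, 9.95, 10.05, 20.0]]
```

The O(n) recomputation per point is exactly the latency issue the next paragraphs address with the incremental variants.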
DSPCC is expected to detect anomalies better than regular SPCC in conditions where the process parameters may be unsteady. The mean and standard deviation for DSPCC can be calculated using (3) and (4), respectively. To solve the latency issue of DSPCC, rather than performing the computations on the historical log of all of the data points, a new algorithm, RTDSPCC, is proposed in which only key features of the data, such as the mean, standard deviation, variance, and number of data points, are stored. It is demonstrated that, given these key values of a series, they can be incrementally updated for each upcoming data point. Since the computations occur on a fixed number of parameters with this method, the computation latency is expected to be constant for any number of data points. In this method, the mean and standard deviation can be incrementally calculated using (5), (6), and (7). The derivations of these formulas are shown in the appendix of this manuscript. In some cases, the processing parameter of interest may be steady in the short term and unsteady in the long term. In these cases, short-term knowledge of the parameters may yield better process-control accuracy than all of the data points. In this regard, MSPCC is proposed, where the characteristics of the data in a moving buffer are used to create the UCL and LCL. In this method, a buffer of data points is dedicated, and its contents are shifted every time a new data point arrives. The mean and standard deviation can be calculated using (3) and (4), respectively, similar to DSPCC. In the case of a large buffer of data points for MSPCC, the computations could become extensive. The approach discussed for RTDSPCC can address this issue as well: with RTMSPCC, only the critical features of the defined buffer are stored and incrementally updated as new data points arrive.
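One plausible way to realize the constant-time mean and standard-deviation updates that RTDSPCC relies on is Welford's online algorithm, sketched below. The paper's exact formulas are derived in its appendix, so this is an assumed equivalent rather than a transcription of equations (5), (6), and (7):

```python
import math

class RTDSPCC:
    """RTDSPCC sketch: incremental (Welford-style) mean/variance, O(1) per point."""
    def __init__(self, n_sigma=3.0):
        self.n, self.mean, self.m2, self.n_sigma = 0, 0.0, 0.0, n_sigma

    def update(self, x):
        """Flag x against the current limits, then fold it into the statistics."""
        outlier = False
        if self.n >= 2:
            sigma = math.sqrt(self.m2 / self.n)  # population standard deviation
            outlier = abs(x - self.mean) > self.n_sigma * sigma
        # Welford's incremental update: no history buffer is kept at all
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return outlier

chart = RTDSPCC()
flags = [chart.update(x) for x in [10.0, 10.1, 9.95, 10.05, 20.0]]
```

Fed the same stream as the DSPCC sketch, this produces identical flags while storing only three scalars, which is what makes the per-point latency independent of the number of data points.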
Since the computations are performed on a fixed number of parameters, the latency with RTMSPCC is expected to be constant for any number of data points. Assuming the data form a sequence x_1, x_2, x_3, x_4, x_5, ..., x_{n-1}, x_n, with a moving buffer of size m, the buffer dataset for the kth data point takes the form [x_k, x_{k+1}, ..., x_{k+m}]. For this moving buffer, a middle dataset is also defined as [x_k, x_{k+1}, ..., x_{k+m-1}]. The mean and standard deviation of the moving buffer can therefore be incrementally computed using equations (7), (8), (9), (10), and (11) for RTMSPCC. The derivations of these formulas are also shown in the appendix. In case the process data of interest have seasonality or trends, the following algorithms are proposed, which are the seasonal forms of the previously presented algorithms: the Seasonal SPCC (SSPCC), SDSPCC, SRTDSPCC, SMSPCC, and SRTMSPCC. These algorithms are applied to the residual part of the signals to detect the indexes of the anomalous data points. The residual form of a signal is computed by subtracting the seasonal and trend components of the signal from the original signal, as shown in FIGURE 3. The trend and seasonal parts of a signal are computed by utilizing a moving average and by calculating the dominant frequency of the signal with a Fourier transformation, respectively. In this section, a novel Incremental Gaussian Naïve Bayes algorithm is presented. Considering the naïve Bayes formula (12), and assuming a normal distribution, the probability of an incoming data point can be calculated using the Gaussian distribution probability formula shown in (1). Rather than keeping a log of all of the data points, the mean and standard deviation of the data can be incrementally calculated for the Gaussian distribution as shown in (5), (6), and (7). The remaining parameters required to calculate the probability of each class for a given dataset are shown in (13) and (14).
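The moving-buffer idea behind RTMSPCC can be sketched with a running sum and sum of squares that are updated in O(1) as a new point enters and the oldest point leaves. This is an assumed implementation, not the paper's equations (7) through (11), and the sum-of-squares variance formula trades some numerical robustness for brevity:

```python
import math
from collections import deque

class RTMSPCC:
    """RTMSPCC sketch: sliding-window mean/std via running sum and sum of squares."""
    def __init__(self, m, n_sigma=3.0):
        self.buf = deque(maxlen=m)   # only the current window is stored
        self.s = self.s2 = 0.0       # running sum and running sum of squares
        self.n_sigma = n_sigma

    def update(self, x):
        """Flag x against the current window's limits, then slide the window."""
        outlier = False
        n = len(self.buf)
        if n >= 2:
            mean = self.s / n
            var = max(self.s2 / n - mean * mean, 0.0)  # clamp rounding error
            outlier = abs(x - mean) > self.n_sigma * math.sqrt(var)
        if n == self.buf.maxlen:     # evict the oldest point in O(1)
            old = self.buf[0]
            self.s -= old
            self.s2 -= old * old
        self.buf.append(x)
        self.s += x
        self.s2 += x * x
        return outlier

chart = RTMSPCC(m=3)
flags = [chart.update(x) for x in [10.0, 10.2, 9.9, 10.1, 30.0]]
```

Unlike RTDSPCC, the window itself must still be stored (to know which value to evict), but the statistics never need recomputing from scratch, so the per-point cost stays constant regardless of the buffer size.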
Note that since P(x|C) also follows a Gaussian distribution, a conjugate prior could also be used to estimate the posterior. However, since in this method the computation of the posterior is not complex, the formulas shown are utilized. The classifier can be defined as in (15). To evaluate the latency, negative predictivity, and accuracy of the algorithms, the following evaluation metrics and simulation data generated with Monte Carlo methods are considered. First, three general types of anomalies, as described below, are considered [11-13]: 1. Point anomaly, where an individual data point is anomalous compared to the rest of the data; 2. Contextual anomaly, where an individual data point is anomalous within a context; 3. Collective anomaly, where a collection of related data points is anomalous. Next, Monte Carlo methods were used to create simulation data with these three types of anomalies (point, contextual, and collective anomaly datasets). The simulated data for performance analysis and comparison of the classification algorithms consist of moon-shaped, circular-shaped, and linearly separable datasets, as shown in FIGURE 5, where dark blue and dark red represent the training data points, and light blue and light red represent the test data points, respectively [14, 15]. It is recommended to see the online version of this manuscript for the figures in color. These simulation datasets were generated by SciKit-Learn random functions. Counting the data points detected correctly and incorrectly yields the true/false positive and negative counts, from which the Negative Predictive Value (NPV), defined as NPV = TN / (TN + FN), is used as the main evaluation metric for outlier detection [16-19]. This is followed by TABLE 1, where the specifications of the computer on which the algorithms were analyzed with these metrics are shown.
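A minimal sketch of an incremental Gaussian naïve Bayes classifier in the spirit of IGNB keeps per-class, per-feature running counts, means, and variances (Welford updates), so training on a new labeled point costs O(number of features). The class and method names, the variance-smoothing floor, and the toy data are illustrative assumptions, not the paper's implementation:

```python
import math

class IGNB:
    """Incremental Gaussian naive Bayes sketch with per-class Welford statistics."""
    def __init__(self):
        self.stats = {}  # class label -> list of [n, mean, m2] per feature

    def partial_fit(self, x, y):
        """Fold one labeled sample into the running class statistics."""
        feats = self.stats.setdefault(y, [[0, 0.0, 0.0] for _ in x])
        for f, xi in zip(feats, x):
            f[0] += 1
            delta = xi - f[1]
            f[1] += delta / f[0]
            f[2] += delta * (xi - f[1])

    def predict(self, x):
        """Pick the class maximizing log prior + sum of Gaussian log-likelihoods."""
        total = sum(f[0][0] for f in self.stats.values())
        best, best_lp = None, -math.inf
        for y, feats in self.stats.items():
            lp = math.log(feats[0][0] / total)  # log prior from class counts
            for f, xi in zip(feats, x):
                var = max(f[2] / f[0] if f[0] > 1 else 1e-9, 1e-9)  # smoothing
                lp += -0.5 * math.log(2 * math.pi * var) - (xi - f[1]) ** 2 / (2 * var)
            if lp > best_lp:
                best, best_lp = y, lp
        return best

clf = IGNB()
for x, y in [([0.0, 0.1], "a"), ([0.2, 0.0], "a"), ([5.0, 5.1], "b"), ([5.2, 4.9], "b")]:
    clf.partial_fit(x, y)
pred = clf.predict([0.1, 0.05])
```

Because the model is only these fixed-size statistics, an incoming streamed point updates the classifier without any retraining over historical data, which is the property the latency results below measure.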
The evaluation results of the anomaly detection algorithms on the simulation data described above, for the three cases of point, contextual, and collective anomalies, are presented next. For each data type, the results of ARIMA, SESD, DSPCC, and MSPCC, as well as their RT and seasonal versions, are shown. Since presenting all of the figures would require a large amount of space, only summaries of the results are shown. First, the anomaly detection accuracy and latency of DSPCC and RTDSPCC are shown in FIGURE 6 and FIGURE 7, respectively. The rest of the algorithms behaved similarly for all of the different anomaly types. As shown in these two figures, note that the latency of the incremental algorithms such as RTDSPCC was constant, in contrast to batch-computing algorithms such as DSPCC, whose latency increases as more data points are considered. The accuracy and latency of the algorithms for the various anomaly types are shown in FIGURE 8 to FIGURE 11. This is followed by TABLE 2, which summarizes the best-performing algorithms. Considering the evaluation results of SESD and ARIMA:
• SESD had an acceptable latency, on the order of milliseconds, on all of the data types
• ARIMA, on the other hand, was quite slow, on the order of seconds, for all of the data types
• Both algorithms had acceptable accuracy in point anomaly detection for stationary datasets; SESD was also good at detecting contextual anomalies; however, neither was accurate in detecting any other types of anomalies or anomalies in trended datasets
• Based on the results, neither of these two algorithms is suitable for RT anomaly detection
Considering the results for point anomaly detection:
• SPCC, RTDSPCC, and RTMSPCC showed a very low training latency, on the order of microseconds, for each incoming data point
• The average training latency of DSPCC and MSPCC, however, was significant, on the order of hundreds of milliseconds for the 1000-data-point dataset
• The average training latency of RTDSPCC and RTMSPCC remained constant as more data points were trained, whereas this latency increased as more data were trained with DSPCC and MSPCC
• The results of the seasonal forms for point anomaly detection show that the SDSPCC and SRTDSPCC algorithms could achieve a negative predictivity of more than 80% on trended data, indicating that the seasonal forms were significantly more accurate than the regular forms of the algorithms
• Based on the results, it was concluded that RTDSPCC, RTMSPCC, SSPCC, SRTDSPCC, and SRTMSPCC were the only algorithms that could satisfy both the accuracy and low-latency needs of point anomaly detection; among these, RTMSPCC, SRTDSPCC, and SRTMSPCC also showed good accuracy on trended data
Considering the results for contextual and collective anomaly detection:
• The results show that SPCC, DSPCC, RTDSPCC, MSPCC, and RTMSPCC were all unsuccessful in detecting contextual or collective anomalies
• The seasonal forms of these algorithms, SSPCC, SDSPCC, SRTDSPCC, SMSPCC, and SRTMSPCC, had a negative predictivity of more than 40% and were more capable of detecting contextual and collective anomalies
• Based on the results, it was concluded that SSPCC, SRTDSPCC, and SRTMSPCC were the best-performing algorithms for contextual and collective anomalies in RT; SRTDSPCC and SRTMSPCC were also successful in detecting these anomaly types on trended data
The accuracy of the SVM, ANN, and GNB algorithms is compared with IGNB in FIGURE 12. KNN, DT, RF, and AB are likewise compared with IGNB in FIGURE 13. Note that the accuracy of each algorithm is shown on the bottom right-hand side of each image. It is recommended to see the online version of this manuscript for the figures in color. The training performance of the algorithms mentioned above was evaluated on streaming data to simulate IoT systems. This was done by measuring the time it takes for each algorithm to update the model with an incoming data point. For the ML algorithms SVM, ANN, GNB, KNN, DT, RF, and AB, which require historical data for training, the retraining time was measured, while for IGNB the time to incrementally train the model was measured as the data were streamed. These results are shown in FIGURE 14 to FIGURE 19. Considering the accuracy results of the classification algorithms:
• The accuracy of the proposed IGNB algorithm was 88%, 68%, and 72% for the moon-shaped, circle-shaped, and linearly separable datasets, respectively
• The accuracy of IGNB for the classification of the moon-shaped data was found to be similar to that of GNB, linear SVM, and ANN with 100 and 1000 maximum iterations; the accuracy of AB, RF, DT, and KNN, however, was higher for this type of data
• The accuracy of IGNB for the classification of the circle-shaped data was higher than that of linear SVM and ANN with 100 maximum iterations; the other algorithms were more accurate for this type of data
• The accuracy of IGNB for the classification of the linearly separable data was approximately 15% lower than that of the other algorithms
• Based on the results, it was concluded that the overall accuracy of IGNB was similar to that of the other considered algorithms; in some cases, the developed IGNB algorithm could classify datasets more accurately than linear SVM and ANN with fewer iterations, although the other algorithms were more accurate in classifying linearly separable datasets with high overlaps
Considering the latency results of the classification algorithms:
• The (re)training latency of all of the classification algorithms other than IGNB increased as more data points were trained; the training performance of IGNB was independent of the number of data points
• The incremental training latency of IGNB was measured as 10.27 ± 2.57 microseconds
• On a 1000-data-point dataset, the latency of retraining the models with an incoming data point was found to be 0.5 ms, 0.5 ms, 1.2 ms, 3.5 ms, 10 ms, 50 ms, 250 ms, and 400 ms for GNB, KNN, DT, SVM, RF, AB, ANN with 100 iterations, and ANN with 1000 iterations, respectively
The results show that IGNB is 48 times faster at updating the ML models with an incoming data point than the fastest-performing algorithm studied in this work. Since the performance of IGNB remains constant, this difference becomes more significant with a higher number of data points. Batch retraining, however, might in some cases be faster than incremental training: for instance, the retraining latency for 1000 data points with GNB was measured as 0.5 ms, while incrementally training the same number of data points with IGNB would require 10.2 ms.
Therefore, incremental training is best suited for streaming analytics, while batch computing is more acceptable for analyzing historical data. As the industry progresses towards smart manufacturing and stream analytics with IoT systems, traditional methods of historical data analytics will not address the new challenges that follow Industry 4.0. This work presented methodologies, such as incremental computations, that allow analytical models to be created in a more automated fashion, be updated as data points stream in, perform real-time computations, and require minimal storage and processing power. Sample algorithms following these methods for outlier detection and classification were presented, and it was successfully demonstrated that the training latency of incremental algorithms such as IGNB, RTDSPCC, RTMSPCC, and their seasonal versions is independent of the number of data points. The proposed methods are therefore highly beneficial for IoT applications in manufacturing and can be considered when designing frameworks where real-time training and prediction of machine learning models are needed.
References
[1] Integrating artificial intelligence into online statistical process control
[2] A visual representation of the Empirical (68-95-99.7) Rule based on the normal distribution
[3] Introduction to time series and forecasting
[4] Time series forecasting using a hybrid ARIMA and neural network model
[5] Outlier detection in regression models with ARIMA errors using robust estimates
[6] Statistical analysis of the blast furnace process output parameter using ARIMA control chart with proposed methodology of control limits setting
[7] On the detection of many outliers
[8] Percentage points for a generalized ESD many-outlier procedure
[9] Automatic anomaly detection in the cloud via statistical learning
[10] Machine learning, a probabilistic perspective
[11] Anomaly detection: A survey
[12] Contextual anomaly detection framework for big sensor data
[13] Data mining for anomaly detection
[16] WWRP/WGNE Joint Working Group on Forecast Verification Research
[17] An introduction to ROC analysis
[18] Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation
[19] Encyclopedia of machine learning