key: cord-0047438-xl0kqjgm
authors: Wu, Tingxian; Zhao, Ziru; Wei, Haoxiang; Peng, Yan
title: Research on PM(2.5) Integrated Prediction Model Based on Lasso-RF-GAM
date: 2020-07-11
journal: Data Mining and Big Data
DOI: 10.1007/978-981-15-7205-0_8
sha: 88fd1caab663bccda5af316490fb5a46debd9215
doc_id: 47438
cord_uid: xl0kqjgm

PM2.5 concentration is difficult to predict because it results from complex interactions among many factors. This paper combines the random forest-recursive feature elimination (RF-RFE) algorithm and lasso regression for joint feature selection and proposes a PM2.5 concentration prediction model based on the generalized additive model (GAM). First, the original data are standardized in the data input layer. Second, features are selected with the RF-RFE and lasso regression algorithms in the feature selection layer, and a weighted average method fuses the two feature subsets into the final subset, RF-lasso-T. Finally, a GAM is used to predict PM2.5 concentration on RF-lasso-T. Simulation experiments show that feature selection allows the GAM to run more efficiently. The deviance explained by the model reaches 91.5%, which is higher than when only the RF-RFE subset is used. The model also reveals the influence of various factors on PM2.5, which provides a decision-making basis for haze control.

In recent years, PM2.5 concentration has increased sharply on almost every continent, and studies show that the resulting damage is catastrophic and in some cases irreversible. Not long ago, data collected by the Chinese Ministry of Environmental Protection showed a sharp increase in both PM2.5 concentration and cardiovascular disease. Researchers have shown that high PM2.5 concentration is the main cause of lung cancer, respiratory disease, and metabolic disease [1].
PM2.5 concentration prediction has been studied in [2], where Joharestani and Cao collected PM2.5 air pollution data and climatic features. Using random forest and extreme gradient boosting, they built a model to predict PM2.5 in specific areas. However, meteorological phenomena reduced the accuracy of their data. The data in our experiment were professionally measured and provided by China's National Population and Health Science Data Sharing Service Platform.

The rest of the paper is arranged as follows. The second part introduces the relevant research work. The third part proposes a PM2.5 concentration prediction model that combines the RF-RFE method and the lasso model for joint feature selection and then constructs a GAM-based prediction model. The model is verified using the meteorological monitoring data and pollution index monitoring data of the Ministry of Environmental Protection, and the experimental results show the effectiveness of the integrated model. The fourth part concludes that the model can be used to predict PM2.5 concentration, which helps the relevant departments better understand the factors influencing PM2.5 concentration and provides an auxiliary decision-making basis for air pollution control.

RF (random forest) and RF-RFE (random forest-recursive feature elimination) are both machine-learning methods. RF handles high-dimensional problems and can model nonlinear relationships, but it often fails to identify strong predictors in the presence of other correlated predictors [3].
Random forest can be used for parallel operation, regression analysis, classification, unsupervised learning, and other statistical data analysis. Even when a large proportion of the data set is missing, it can estimate the absent values and maintain accuracy. Based on these characteristics, this study uses random forest as the base learner for RFE and applies RF-RFE to select the features that influence PM2.5 concentration.

The lasso (Least Absolute Shrinkage and Selection Operator) was first introduced by Robert Tibshirani. It combines regression shrinkage and selection, and is said to "minimize the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant" [4]. Lasso therefore achieves two main tasks: regularization and feature selection. In a recent study, Lingzhen Dai and colleagues used the adaptive lasso to identify PM2.5 components associated with blood pressure in elderly men [5]. The adaptive lasso is a more recent variant that applies different penalty weights to different predictors, which allowed the researchers in that study to identify the right subset model while satisfying asymptotic normality.

For the given data $(x_i, y_i)$, $i = 1, 2, \dots, N$, $x_i = (x_{i1}, \dots, x_{ip})^T$ are the predictor variables and $y_i$ are the responses. The mathematical expression of the lasso method is

$$\hat{\beta} = \arg\min_{\beta_0, \beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t, \tag{1}$$

or, equivalently, in penalized form,

$$\hat{\beta} = \arg\min_{\beta_0, \beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \alpha \sum_{j=1}^{p} |\beta_j|,$$

where $t \ge 0$ is a tuning parameter and $\beta_0$ is the intercept. $\alpha$ is the penalty parameter: the larger it is, the fewer features are finally selected. $\beta_j$ is the coefficient of the feature in the $j$-th column of the independent variable $x$. L1 regularization adds the L1 norm of the coefficient vector $\beta$ to the loss function as a penalty term. Because this penalty grows with the absolute size of every coefficient, it forces the coefficients of weak features to become exactly 0.
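The shrinking of weak coefficients to zero can be illustrated with a minimal coordinate-descent sketch. It is not the implementation used in this paper: it assumes centered responses and standardized features (so no intercept is fitted), and it scales the squared error by 1/(2N), which only rescales the penalty parameter α relative to the unscaled residual sum of squares. The function names and the toy data are hypothetical.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values inside [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1/(2N)) * sum_i (y_i - x_i . beta)^2 + alpha * sum_j |beta_j|.

    X is a list of N rows with p features each; y is assumed centered,
    so no intercept term is estimated (an assumption of this sketch).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                pred_others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_others)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, alpha) / (norm / n)
    return beta


# Hypothetical toy data: y = 2.0 * x1 + 0.1 * x2, with orthogonal columns.
X_toy = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y_toy = [2.1, 1.9, -1.9, -2.1]

# With alpha = 0.5 the strong coefficient is shrunk (to about 1.5)
# and the weak coefficient is set exactly to 0.
print(lasso_coordinate_descent(X_toy, y_toy, alpha=0.5))
```

Raising `alpha` in this sketch removes more features, which matches the inverse relationship between the penalty parameter and the number of selected features described above.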
To understand GAM (generalized additive models), one should be familiar with its close relative, GLM (generalized linear models). GLM generalizes linear regression to other response distributions through a link function; logistic regression is one example. The general form of GAM is

$$g\big(E(Y)\big) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p), \tag{2}$$

where $g$ is the link function, $\beta_0$ is the intercept, and $f_i$ represents smoothing functions such as smoothing splines, natural cubic splines, and local regression [6].

The RF-RFE-lasso and GAM algorithms are used to form the PM2.5 integrated prediction model. The model structure is shown in Fig. 1.

As presented in Table 1, the data items have different dimensions. For joint analysis, Z-score standardization is used to make the data dimensionless, as shown in Formula (3):

$$Z_{norm}(x_i) = \frac{x_i - \bar{X}}{\sigma(X)}, \tag{3}$$

where $x_i$ represents the original value, $\bar{X}$ represents the mean value of the original data, $\sigma(X)$ is the standard deviation of the original data, and $Z_{norm}(x_i)$ is the standardized result.

The complex nonlinear relationship between the air quality monitoring indexes and PM2.5 concentration, together with the multicollinearity among the air quality indexes, affects the performance of the model. In our experiment, the RF-RFE algorithm is used alongside the lasso algorithm to select a feature subset that is more strongly correlated with the results, so as to improve the prediction accuracy and efficiency of the model. The RF-RFE algorithm is used to filter the 19 variables in Table 1, and the resulting feature subset is the one corresponding to the minimum root mean square error (RMSE).
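The Z-score standardization applied in the data input layer, before feature selection, can be sketched as follows. This is a minimal illustration only: the paper does not say whether the population or the sample standard deviation is used, so the population form is assumed here, and the function name and sample values are hypothetical.

```python
from statistics import mean, pstdev


def z_score(values):
    """Return (x_i - mean) / std for each value in one data column."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("a constant column cannot be standardized")
    return [(v - mu) / sigma for v in values]


# Hypothetical daily readings for one indicator
pm10_readings = [80.0, 120.0, 95.0, 150.0, 60.0]
standardized = z_score(pm10_readings)

# After standardization the column has mean 0 and unit standard deviation,
# so indicators measured in different units become comparable.
print(standardized)
```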
The implementation process of the RFE method is as follows:
(1) Construct a feature matrix from the feature vectors in the data set, where each row represents a sample and each column corresponds to a feature;
(2) Set the control parameters of the RFE function, adopting the random forest function and the 20-fold cross-validation sampling method;
(3) Use the RFE algorithm to sort the features according to their correlation with PM2.5 concentration;
(4) Based on the ranking results, select the first N features (N is the number of features meeting the user's needs) with the highest correlation with PM2.5 concentration as the feature subset.

The final number of features is 9, namely PM10, NO2, RH_ave, SO2, RH_min, wind_ex, sun, O3, and wind_max, as shown in Fig. 2.

Lasso estimation is carried out on all independent variables, and the change process of the regression coefficients is shown in Fig. 3. The final selected feature subset is shown in Table 2.

Common feature subset fusion methods include the feature element intersection method and the weighted fusion method. The general form of the weighted fusion formula is

$$\frac{a_1 \cdot method_1 + a_2 \cdot method_2 + \dots + a_n \cdot method_n}{n}, \tag{4}$$

where $a_1, a_2, \dots, a_n$ are the weighting coefficients of the feature subsets and $a_1 + a_2 + \dots + a_n = 1$.

The weighted average method is used to fuse the RF-RFE and lasso feature subsets. The process is as follows: each feature is assigned a score according to its rank in the RF-RFE feature subset and in the lasso subset, with scores increasing with rank. The weight coefficients of both subsets are set to 50%, and the scores of each feature are summed and averaged. Finally, the 12 features with the highest average scores are selected as the final feature subset, which is named RF-lasso-T.
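The rank-based weighted fusion described above can be sketched as follows. The scoring rule (the best-ranked feature in a list of length n receives n points, the next n - 1, and so on, with features missing from a list receiving 0 from that list) is an assumption, since the paper does not spell out the exact scores. The function name, the example rankings, and the feature name temp_ave are hypothetical.

```python
def fuse_rankings(rankings, weights, top_k):
    """Fuse several ranked feature lists (best feature first) by weighted score.

    rankings: list of ranked feature-name lists, one per selection method
    weights:  one weight per method, expected to sum to 1
    top_k:    number of features kept in the fused subset
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    scores = {}
    for ranked, weight in zip(rankings, weights):
        n = len(ranked)
        for position, feature in enumerate(ranked):
            # Higher rank gives a higher score; absent features add nothing.
            scores[feature] = scores.get(feature, 0.0) + weight * (n - position)
    # Sort by descending score; ties are broken alphabetically for determinism.
    ordered = sorted(scores, key=lambda f: (-scores[f], f))
    return ordered[:top_k]


# Hypothetical rankings from the two selection methods
rf_rfe_rank = ["PM10", "NO2", "RH_ave", "SO2"]
lasso_rank = ["PM10", "SO2", "NO2", "temp_ave"]

fused = fuse_rankings([rf_rfe_rank, lasso_rank], weights=[0.5, 0.5], top_k=3)
print(fused)  # ['PM10', 'NO2', 'SO2']
```

With equal weights of 0.5, the weighted sum is already the average of the two scores, so any further constant divisor would not change the resulting order.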
The RF-lasso-T subset is shown in Table 3.

Based on the joint feature subset obtained by the RF-RFE and lasso methods, PM10, SO2, and NO2 concentrations and the other explanatory variables were introduced into the model. The effects of the explanatory variables are controlled for with smoothing spline functions, while seasonal dummy variables are introduced to eliminate periodic effects. The degrees of freedom are selected by minimizing the sum of the absolute values of the model's partial autocorrelation function (PACF). The GAM constructed on the joint feature subset RF-lasso-T is shown in Formula (5):

$$Y = \beta_0 + \sum_{i=1}^{J} S(X_i), \tag{5}$$

where $Y$ is the concentration of PM2.5, $X_i$ is the $i$-th variable, $i$ is the serial number of each variable, $J$ is the number of variables, $S$ is the smooth function of the model, and $\beta_0$ is the intercept. The entire model adopts the Gaussian iteration method.

The whole data set is divided into a training set and a test set in proportions of 80% and 20%. On the training set, two GAM models were constructed, one with the fused feature subset RF-lasso-T and the other with the full feature set. The execution times of the two models are compared in Fig. 4. The execution time of the integrated GAM model is lower than that of the GAM model alone, while the deviance explained by the GAM model based on RF-lasso-T is 91.5%, which is higher than that of the GAM model based on RF-RFE (90.9%). The test results of the RF-lasso-T-GAM model are shown in Table 4.

The trained model was used to predict the daily PM2.5 concentration for the 146 days of data in the test set. The model prediction fit is shown in Fig. 6. The mean observed PM2.5 value is 76.25, the mean predicted PM2.5 value is 76.27, and the root mean square error is 0.377.

The change in PM2.5 concentration can be affected by the interactions between influencing factors.
To investigate the relationship between these interactions and the concentration of PM2.5, smoothing spline functions are used to connect the influencing factors. Combining the various pollutants and meteorological factors, an RF-lasso-T-GAM interaction model was established. The deviance explained by the interaction model was 95.7%, with an adjusted coefficient of determination (adjusted R²) of 0.947. The results of the model show that the cross variables PM10 and NO2, PM10 and RH_ave, PM10 and O3, PM10 and sun, NO2 and SO2, NO2 and RH_min, NO2 and O3, NO2 and wind_ex, RH_ave and SO2, RH_min and O3, RH_ave and O3, and RH_ave and wind_ex are significant at the level of P < 0.001. The cross terms include the interactions between the air pollutants SO2, NO2, and PM10 and the meteorological elements, as well as the interactions among the air pollutants. These observations indicate that the change in PM2.5 concentration is affected by the interaction between air pollutants and meteorological elements.

Taking the pronounced interactions between NO2 and various factors as an example, the interaction model is plotted in Fig. 7.

As can be seen from Fig. 7(1), when the NO2 concentration is constant, the PM2.5 concentration first increases, then decreases, and then increases again as SO2 increases, showing a wave-like upward trend. When SO2 is constant, the PM2.5 concentration likewise first increases, then decreases, and then increases again as the NO2 concentration increases.

From Fig. 7(2), when the NO2 concentration is constant, the PM2.5 concentration first increases, then decreases, and then increases again as RH_min increases, but the overall trend is relatively stable. When RH_min is constant, the PM2.5 concentration shows a wave-like upward trend as the NO2 concentration increases, first increasing, then decreasing, and then increasing again.

From Fig. 7(3), when the NO2 concentration is constant, the PM2.5 concentration first decreases and then increases as wind_ex increases, then increases again after a large decrease. When wind_ex is constant, the PM2.5 concentration follows the same pattern as the NO2 concentration increases: it first decreases and then increases, then increases again after a large decrease.

As can be seen from Fig. 7(4), when the NO2 concentration is constant, the PM2.5 concentration first increases sharply and then decreases as O3 increases, and then gradually increases in a wave shape. When O3 is constant, the PM2.5 concentration shows a wave-like trend as the NO2 concentration increases: it first decreases and then increases, then decreases sharply, and then continues to rise.

In this study, the RF-RFE and lasso methods are used to select features from the air quality data set. The GAM model based on RF-lasso-T runs more efficiently than the GAM model constructed directly on the data set without feature selection, and the deviance it explains is higher than that of the GAM model using only the RF-RFE feature subset. The fitting results of this model can be further analyzed to obtain the meteorological and pollution factors with a significant influence on PM2.5, as well as the linear and nonlinear relationships between meteorological factors and PM2.5 concentration. Visual results are provided as an auxiliary decision-making basis for PM2.5 prediction and air pollution control.
[1] Acute effects of fine particulate matter (PM2.5) on hospital admissions for cardiovascular disease in Beijing, China: a time-series study
[2] PM2.5 prediction based on random forest, XGBoost, and deep learning using multisource remote sensing data
[3] Using recursive feature elimination in random forest to account for correlated variables in high dimensional data
[4] Regression shrinkage and selection via the lasso
[5] Differential DNA methylation and PM2.5 species in a 450K epigenome-wide association study
[6] Generalized additive models

Funding. This work was supported by the National Education Sciences Planning-Key Program of Ministry of Education (DLA190426).