title: Deciphering Environmental Air Pollution with Large Scale City Data
authors: Bhattacharyya, Mayukh; Nag, Sayan; Ghosh, Udita
date: 2021-09-09

Out of the numerous hazards posing a threat to sustainable environmental conditions in the 21st century, only a few have a graver impact than air pollution. Its importance in determining health and living standards in urban settings is only expected to increase with time. Various factors, ranging from traffic and power plant emissions to household emissions and natural causes, are known to be the primary causal agents or influencers behind rising air pollution levels. However, the lack of large scale data covering these major factors has hindered research on the causes and relations governing the variability of the different air pollutants. Through this work, we introduce a large scale city-wise dataset for exploring the relationships among these agents over a long period of time. We analyze and explore the dataset to bring out inferences that can be derived by modeling the data. We also provide a set of benchmarks for the problem of estimating or forecasting pollutant levels, using a set of diverse models and methodologies. Through our paper, we seek to provide a groundwork for further research into a domain that will demand our critical attention in the near future. The advancement of civilization has led to many human-generated interferences on earth. While many of these threaten the balance of the ecosystem and the overall climate of the planet, air pollution has severe and immediate impacts on us humans. PM2.5 and NO2, the two most common air pollutants, are well known to inflict irreversible respiratory diseases (2016).
Besides asthma attacks and cardiovascular issues, air pollution has been observed to cause or exacerbate cancers and diabetes (2018) and also to influence mortality in infants (2019). The major sources of pollutants like PM2.5, NO2 and O3 are emissions from automobiles, power plants and other heavy industries 1,2. The close proximity of industrial zones to highly populated metropolitan areas combines all the sources of these pollutants, creating very poor living conditions in quite a few cities in the world. Governments of multiple cities have tried methodologies ranging from artificial rain by cloud-seeding and partial traffic bans to giant air purifiers.
*Corresponding Author email: mayukh.bhattacharyya@stonybrook.edu
All these efforts showcase the rising importance of the issue with every passing year. A year-long stretch of lockdowns and work-from-home systems has suddenly demonstrated the improving effect that the absence of public human activity has on pollution levels in big metropolitan cities. Traffic, a big factor behind urban air pollution, was almost completely absent in the initial 1-2 months. This led to a significant drop in different pollutant levels, as demonstrated in studies such as (2020). These developments have garnered a lot of attention towards finding viable solutions, something inconceivable on a large scale earlier due to the lack of data with such variability. We feel this is an exciting opportunity to hasten the research in this domain further. Thus we present our dataset and analysis, which we believe will aid in our common mission. The primary contributions of the paper are two-fold: 1. We have introduced a curated spatio-temporal dataset encompassing daily levels of different pollutants and the major causal agents (both artificial and natural) behind them, for a duration of over 2 years, across cities in the United States.
Such a large-scale dataset will enable us to study the inter-relationships between the factors and understand how they influence air pollution in urban settings. 2. We have presented a holistic analysis of the dataset that we are introducing. With Bayesian modeling, we have captured the relative importance of the different factors in influencing the pollutant levels, as well as the uncertainties associated with the factors. Alongside that, with different methodologies, we have approached the problem of estimating pollutant levels from the various causal factors. In doing so, we have presented initial baselines for further research to build on. The study of factors leading to air pollution has gained momentum in recent times, although it has long been a persistent problem. Although these studies have taken on different problem statements, they have focused mainly on PM2.5 as the central theme. Though previous works existed for air quality forecasting, one of the first works to consider natural influencers like wind, humidity and temperature as well as gases like NO and CO was (2013). Traditional machine learning methods like Support Vector Machines (1995) have been used to forecast the Air Quality Index (AQI) as well as individual pollutant levels in the air (2020). Prediction of the concentration of specific pollutants like PM2.5 by a gradient boosting approach, from past PM2.5 concentrations and climate information, is presented in (2020). However, with the advent of RNNs (1985; 2020), most recent works have been using LSTMs (1997) for air pollution estimation (2020). (2019a) and (2020) utilised LSTM based systems to predict the concentration of air pollutants. A spatial-temporal DNN which takes surrounding conditions into consideration while predicting has been presented in (2018).
However, the drawback of most of these works is that they are either concentrated on a single region, which makes the models not universal, or they do not consider the influences of causal agents of pollution like automobile and industry emissions. Although a few studies like (Wang et al. 2018) evaluate the effectiveness of several thermal power plant control measures on air quality, a larger exploration or forecasting study is not available due to the lack of large scale data. We present a large dataset 3 for modeling the variation of air pollution at the daily level over multiple cities, involving data from most of the influencing agents, both natural and man-made. To our knowledge, the dataset is the largest in regards to the number of locations and days involved. Overall, the dataset contains a total of 35,596 unique sample points spanning 54 cities and 24 months, with each sample point representing a unique (date, city) combination. The data is collected and curated from multiple sources; hence, some cities and some dates do not have values for all the pollutants and features. The different aspects of the dataset are given in Table 1. The sources of the features and the data processing involved are described in the respective sections below: • Air Pollutants The daily levels of the pollutants, including O3 and CO, are obtained from the Air Quality Open Data Platform 4. The violin plots in Figure 1 illustrate the monthly distribution and variation of the aforementioned air pollutants. • Meteorological Factors Meteorological factors like humidity, wind speed, temperature and pressure have an impact on the concentration of pollutants in the atmosphere. The values of these meteorological factors are also obtained from the Air Quality Open Data Platform 4. They serve as input features for our models. The units of the different features are provided in the dataset itself. Figure 3 depicts the correlations among these meteorological factors and the respective AQI data.
• Traffic The corresponding daily traffic data is collected from the platform 5 provided by the Maryland Transport Institute (2020). The traffic data follows almost the same spatio-temporal granularity as the air pollutant data, apart from one aspect: it is provided at the county level, not the city level. But since we are dealing mainly with major metropolitan areas, we have taken the liberty of considering the traffic of a city to be the same as that of the county it lies within. The trip-based data from (2020) is processed by collating all the trips in a day to calculate the "million miles" travelled that day, which we treat as the measure of traffic in that city for that day.
4 https://aqicn.org/data-platform/covid19
• Power Plant Emission The data on the generation patterns of power plants could only be obtained at a monthly level, from the US EIA website 6. Considering that the production patterns of power plants do not change much at the daily level, we made a pragmatic approximation of averaging the monthly value down to the daily level. We have considered 11,833 power plants in the dataset. It should however be noted that we have only selected generation data of generators running on the fuel types Coal, Oil, Gas and Biomass, since these are the ones most frequently held responsible for air pollution. While we do provide the power-plant data in its granular raw form, we needed a single feature representing the effects of the power plants for a certain (city, date) pair. For that purpose, we design an intuitive inverse-distance metric to form the feature:

I_pp(c, t) = Σ_{p : r_cp ≤ R_limit} G_p / r_cp

where I_pp is the feature obtained from power plants for a city c on a date t, G_p is the average daily generating capacity of plant p for that month, and r_cp is the linear distance between the power plant and the centre of the city. We have taken R_limit as 30 km. We have chosen 3 different methodologies to analyse the dataset and estimate pollutant levels.
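The power-plant feature described above can be sketched as follows. The haversine helper and the dictionary-based plant/city records are illustrative assumptions for this sketch, not part of the released dataset's schema, and the inverse-distance weighting reflects the definitions given in the text.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def power_plant_feature(city, plants, r_limit=30.0):
    """I_pp(c, t): sum of average daily generating capacity G_p weighted by
    inverse distance r_cp, over all plants within r_limit km of the city centre."""
    total = 0.0
    for p in plants:
        r_cp = haversine_km(city["lat"], city["lon"], p["lat"], p["lon"])
        if 0.0 < r_cp <= r_limit:
            total += p["g_daily"] / r_cp
    return total
```

Plants outside the 30 km radius simply contribute nothing, so distant generators never dominate the feature.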
These methodologies attempt to unravel the linear, non-linear and sequential aspects of the presented data. They are outlined in the following sub-sections. For modeling the data, we have used Bayesian methods. In our dataset, we have used traffic data, natural factors and power-plant data as inputs (Figure 2). To keep things simple, for each city we have used a weighted sum of the power-plant capacity data, where the weights are a function of the distances between the locations of the power plants and the centre of the city. This is outlined in the Dataset section. Mathematically, we can define our problem as:

X_{c,t} = Σ_{i=1}^{N} W_i I_{i,c,t} + ε

Here, the W_i are the weights by which the N inputs, each given by I_{i,c,t} for any city c, are weighted respectively. X_{c,t} is the air pollutant level for that particular (c, t) instance, and ε represents independent and identically normally distributed noise variables. We can write the collection of the aforementioned model parameters as θ = {W_i}, ∀i = 1 : N. Therefore, given a model P(X|θ), we are interested in computing the posterior distribution of the model parameters, given as:

P(θ|X) = P(X|θ) P(θ) / ∫_Θ P(X|θ) P(θ) dθ

where Θ represents the parameter space of θ. However, the integral in the above expression is intractable. Hence, we resort to variational inference to approximately infer the model parameters. In order to find a close enough approximation (variational density) to the true posterior distribution, the Kullback-Leibler (KL) divergence (1951) between the two distributions needs to be minimized, given as:

KL( Q(θ; φ) || P(θ|X) ) = E_Q [ log Q(θ; φ) − log P(θ|X) ]

where Q(θ; φ) is the variational density parameterized by φ. However, the absence of an analytic form, owing to the direct involvement of the posterior in the KL divergence expression, leads to difficulty (2017). Therefore, the Evidence Lower Bound (ELBO) is maximized instead, which is equivalent to minimizing the KL divergence (2017; 2014).
6 https://www.eia.gov/electricity/data/eia923/
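A lightweight analogue of the weighted-sum model above can be sketched with scikit-learn's `BayesianRidge`, which gives posterior means and uncertainties for the weights W_i. This is a sketch under stated assumptions: it uses synthetic stand-ins for the inputs I_{i,c,t} and a conjugate Gaussian approximation rather than the paper's ELBO optimisation.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)

# Synthetic stand-ins for N = 6 input features I_{i,c,t}
# (e.g. traffic, humidity, wind speed, temperature, pressure, power-plant index).
X = rng.normal(size=(500, 6))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.3, 1.5])
y = X @ true_w + rng.normal(scale=0.1, size=500)  # pollutant level plus noise

model = BayesianRidge()
model.fit(X, y)

post_mean = model.coef_            # posterior means of the weights W_i
post_var = np.diag(model.sigma_)   # posterior variances: per-weight uncertainty
```

Inspecting `post_mean` against the corresponding feature correlations is what the weight-versus-correlation comparison later in the paper amounts to.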
The ELBO acts as the loss function, and is given as:

ELBO(φ) = E_Q [ log P(X|θ) ] − KL( Q(θ; φ) || P(θ) )

We model the distribution of each pollutant separately with Gradient Boosting Regression (2000), based on the environmental conditions in different cities at different times of the year. Each pollutant is inferred as:

f_M(X) = Σ_{m=1}^{M} W_m T(X; θ_m)

where X denotes the input features, M is the number of trees or additive steps, W_m is the weight of the m-th tree and θ_m is the tree parameter. T is the basis function, which we consider to be a regression tree. At each consecutive step m, θ_m and W_m are estimated to minimise the loss:

(θ_m, W_m) = argmin_{θ, W} Σ_{i=1}^{N} L( Ŷ_i, f_{m−1}(X_i) + W T(X_i; θ) )

Here, Ŷ_i is the ground truth of the i-th data point, ∀ i = 1, ..., N. The learned basis function W_m T(X; θ_m) boosts the estimator f_{m−1} as:

f_m(X) = f_{m−1}(X) + W_m T(X; θ_m)

For each pollutant, an additive model with gradual stage-wise expansion is obtained via grid search over the hyper-parameters, minimising the cross-validated mean-squared error. For sequential models, we based our explorations mainly on Long Short Term Memory (LSTM) models (1997). We wanted to explore 2 major aspects of sequential behaviour while estimating pollutant levels: • How much does each pollutant depend upon the data history from the previous days (excluding that of the pollutants themselves), and to what extent? • Does explicit information about the day of the week or the month help forecasting, or is that information already captured implicitly in the rest of the data? Taking these 2 questions into consideration, we have experimented with 4 different architectures, 2 for each case. Firstly, an LSTM is denoted as type E if it takes in explicit day-month information. Secondly, we have varied the sequence length of the LSTM.

Table 3: Performance of predictions from different models for all 6 pollutants. LSTM E is trained on explicit information of weekday and month, whereas the normal LSTM and other models are trained without those. The sequence length (number of past days) for each LSTM is given in brackets.
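The stage-wise fit with a cross-validated grid search described above can be sketched with scikit-learn. The small hyper-parameter grid and the synthetic data here are illustrative assumptions, not the grid or data used in the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 6))  # daily input features for one pollutant
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)

# Grid-search the number of additive steps M and the tree depth,
# minimising cross-validated mean-squared error.
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X, y)
best_gbm = grid.best_estimator_  # f_M(X); one such model is fit per pollutant
```

Fitting one model per pollutant, as the paper does, is just a loop over this procedure with the corresponding target column.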
These 4 cases help us understand both the lower and upper reaches of the various possible dependencies. In this section, we approach the problem of estimating pollution levels based on information about the causal and influencing factors. We have experimented with both non-sequential and sequential methodologies. For non-sequential ones like Gradient Boosting Machines and Bayesian Regression, the problem statement is straightforward: estimating a pollutant value based on the day's features. For sequential models, the input extends to include the features of the past few days. However, we have consciously not included information about the pollutants from past days in the input, as it opens up a host of other possibilities and should be part of a much more rigorous study focused solely on that. The results of Ordinary Least Squares (OLS) (2011) have been provided as a baseline. • Evaluation Data and Metrics: Since we have both sequential and non-sequential models, we needed a train-evaluation split of the whole dataset that would let us evaluate sequential models with the same ease as non-sequential traditional models. For time-series data, the prevalent norm is to select a later portion as the evaluation dataset. However, we realized that in doing so we would be restricting the evaluation to a particular season, with little daily variation caused by the features. Since we have both the years 2019 and 2020 in the dataset, we constituted the evaluation or test dataset by taking a continuous 60-day segment for each city, starting from the first week of March 2020. Since this time period marked the onset of the COVID lockdowns, we get much better variability in terms of the features and pollutants. Considering that some values in the test set might be missing for the reasons discussed before, the evaluation performance is calculated only on the available and valid test data samples.
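The per-city 60-day evaluation window might be constructed as below. The `city` and `date` column names and the exact start date (2020-03-02, a Monday in the first week of March 2020) are assumptions for this sketch; the paper only specifies "the first week of March 2020".

```python
import pandas as pd

def train_test_split_by_city(df, start="2020-03-02", days=60):
    """Hold out a contiguous `days`-long window starting at `start` as the
    test set; everything outside the window is training data. Because the
    window is the same for every city, each city contributes one segment."""
    start = pd.Timestamp(start)
    end = start + pd.Timedelta(days=days)
    mask = (df["date"] >= start) & (df["date"] < end)
    return df[~mask], df[mask]

# Minimal demonstration on a toy frame covering 2019-2020 for two cities.
dates = pd.date_range("2019-01-01", "2020-12-31", freq="D")
toy = pd.DataFrame({"date": list(dates) * 2,
                    "city": ["A"] * len(dates) + ["B"] * len(dates)})
train, test = train_test_split_by_city(toy)
```

Missing rows are simply absent from both halves, matching the paper's choice to evaluate only on available and valid test samples.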
We evaluate and compare all our methods with 2 metrics: Root Mean Square Error (RMSE) (2006; 2018) and Mean Absolute Percentage Error (MAPE) (2016). A combination of these two gives us a holistic picture of the performance of the models being evaluated. The results are presented in Table 3. Analysis Table 3 provides a good idea of the general fit and the importance of the inputs in terms of estimating pollutant levels. As we can see, both Gradient Boosting Machines (GBM) and the LSTM with a sequence length of 7 days perform well. The strong performance of GBM does raise the question of whether the features of past days influence future pollutant levels at all; perhaps the importance of the current features far overshadows the past features, which would lead to such a result. This needs to be explored further to understand the relations more deeply. The results of GBM and LSTM provide a solid benchmark for those future analyses. We also wanted to model the uncertainty in the data and explore other inferences. The following figures shed some light on the findings we have from the data and how the data can be useful. We have also computed the correlation of the input features (power plant, traffic and meteorological factors) with the respective pollutant data. The weights (means, µ) obtained as a result of Bayesian inference are shown in Figure 4 for NO2 and PM2.5, alongside the corresponding correlation values. This plot not only gives us an idea of the importance of each factor in our proposed Bayesian Regression model and the extent of the influence of each input feature on the pollutant levels, but also demonstrates a parity between the weights and the corresponding correlation values. It should be noted that we have also considered Population at Home as an input feature to the models for all the analyses.
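The two evaluation metrics can be computed as below. Skipping zero-valued targets in MAPE is one common convention assumed here; the paper does not specify how it handles zeros.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (in %), evaluated only where the
    ground truth is non-zero to avoid division by zero."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)
```

RMSE is scale-dependent, so it is only comparable within one pollutant, while MAPE allows rough comparison across pollutants with different units; reporting both is what gives the holistic picture mentioned above.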
This is because it can be assumed to be proportional to household heating emissions, which are also a potential factor behind air pollution. The visualizations shown in Figure 5 provide information about each city's conformity with the universal model, giving us leads to explore the context behind each outlier. We select one standard deviation (84th percentile) as the upper limit for the median values, and 3 standard deviations (99.9th percentile) for the maximum values of the pollutants for each day. We find that the primitive model does a fairly good job, barring PM2.5, where the actual median values have crossed our set limits by a greater proportion. While the high NO2 at Manhattan can be assumed to be mainly due to automobile emissions, the high SO2 at Honolulu is in fact natural, caused by volcanic emissions. Air pollution will be one of the crucial issues for society in the years to come. An early initiative to tackle the problem may make a big difference in the future. Through our dataset and methodologies, we have intended to establish a foundation for the community to build on. Our dataset captures the major causal and influencing factors behind urban air pollution. In this study, we have illustrated the impact of such factors on the air quality indices using both non-sequential models, such as Gradient Boosting Machines and Bayesian Regression, and sequential models like Long Short Term Memory. Our intention is to improve and extend the dataset with more data covering other emission sources. In this work, we have not included information about the pollutants from past days as inputs, since that calls for more exhaustive studies which will be part of future work. Furthermore, other future works may include exploration and analysis of this dataset using more complex models like transformers (2017; 2020; 2021; 2019b; 2020), informers (2021) and the recently introduced ∞-formers (2021). Spatio-temporal analysis of the data can also be considered as potential future work.
References
Impact of wildfire smoke on adverse pregnancy outcomes in Colorado
A new method for prediction of air pollution based on intelligent computation
Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology
The 2016 global and national burden of diabetes mellitus attributable to PM2.5 air pollution
Traffic transformer: Capturing the continuity and periodicity of time series for traffic forecasting
A machine learning approach to predict air quality in California
Support-vector networks
Mean absolute percentage error for regression models
Greedy function approximation: A gradient boosting machine
Long Short-Term Memory
Ordinary least-squares regression. In L. Moutinho and G. D. Hutcheson, The SAGE Dictionary of Quantitative Management Research
Another look at measures of forecast accuracy
Auto-Encoding Variational Bayes
Automatic differentiation variational inference
On information and sufficiency
Forecasting air quality in Taiwan by using machine learning
Novel analysis-forecast system based on multi-objective optimization for air quality index
Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting
∞-former: Infinite memory transformer
Impact of lockdown measures to combat COVID-19 on air quality over western Europe
A transformer self-attention model for time series forecasting
A long short-term memory (LSTM) network for hourly estimation of PM2.5 concentration in two cities of South Korea
Learning internal representations by error propagation
Air quality prediction using optimal neural networks with stochastic variables
Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network
Low-concentration PM2.5 and mortality: Estimating acute and chronic effects in a population-based study
Adaptive deep learning-based air quality prediction model using the most relevant spatial-temporal relations
Attention is all you need
Predicted impact of thermal power generation emission control measures in the Beijing-Tianjin-Hebei region on air pollution over Beijing, China
Deep transformer models for time series forecasting: The influenza prevalence case
An interactive COVID-19 mobility impact and social distancing analysis platform
Informer: Beyond efficient transformer for long sequence time-series forecasting