key: cord-0563294-hq9niluk
authors: Geng, Guannan; Xiao, Qingyang; Liu, Shigan; Liu, Xiaodong; Cheng, Jing; Zheng, Yixuan; Tong, Dan; Zheng, Bo; Peng, Yiran; Huang, Xiaomeng; He, Kebin; Zhang, Qiang
title: Tracking Air Pollution in China: Near Real-Time PM2.5 Retrievals from Multiple Data Sources
date: 2021-03-11
journal: nan
DOI: nan
sha: 6f862ea5c4bd7f27057a4cbe16062383ad6c51e1
doc_id: 563294
cord_uid: hq9niluk

Air pollution has altered the Earth's radiation balance, disturbed the ecosystem and increased human morbidity and mortality. Accordingly, a full-coverage high-resolution air pollutant dataset with timely updates and historical long-term records is essential to support both research and environmental management. Here, for the first time, we develop a near real-time air pollutant database known as Tracking Air Pollution in China (TAP, tapdata.org) that combines information from multiple data sources, including ground measurements, satellite retrievals, dynamically updated emission inventories, operational chemical transport model simulations and other ancillary data. Daily full-coverage PM2.5 data at a spatial resolution of 10 km are our first near real-time product. The TAP PM2.5 is estimated based on a two-stage machine learning model coupled with the synthetic minority oversampling technique and a tree-based gap-filling method. Our model has an averaged out-of-bag cross-validation R2 of 0.83 for different years, which is comparable to those of other studies, while improving performance at high pollution levels and filling the gaps caused by missing AOD on a daily scale. The full coverage and near real-time updates of the daily PM2.5 data allow us to track the day-to-day variations in PM2.5 concentrations over China in a timely manner. The long-term records of PM2.5 data since 2000 will also support policy assessments and health impact studies. The TAP PM2.5 data are publicly available through our website for sharing with the research and policy communities.

estimations and fill the gaps caused by missing AOD data. The WRF model version 3.9.1 and CMAQ model version 5.2 (https://www.cmascenter.org/cmaq/) are used in our work. The simulation domain covers all of China with a horizontal resolution of 36 km. The vertical grid has 46 sigma levels from the ground surface to 100 hPa in WRF, which are reduced to 28 vertical layers in CMAQ after processing by the Meteorology-Chemistry Interface Processor (MCIP). For the WRF model, the NCEP-FNL and NCEP-GFS data are used to provide the initial and boundary conditions, while the NCEP-GFS sea surface temperature (SST) reanalysis data and NCEP Automated Data Processing (ADP) global observational weather data are used for analysis, observation, and soil nudging. The parameterization scheme follows Cheng et al. 40, with the Kain-Fritsch cumulus physics scheme version 2 41 replaced by the Grell-Freitas ensemble scheme 42. For the CMAQ model, we use the CB05 gas-phase mechanism with the CMAQv5.1 update and the sixth-generation CMAQ aerosol mechanism (AERO6). The chemical initial and boundary conditions are derived from ASCII vertical profile data.

The dynamically updated anthropogenic emissions for mainland China are taken from the MEIC 34-36. Emissions for other Asian countries and regions are obtained from the MIX inventory 43. The simulated PM2.5 concentrations from our WRF/CMAQ model have been fully evaluated against ground measurements in our previous studies 10, 40, 45. The model performance statistics meet the recommended performance criteria, and the simulated results have been used for policy assessment and health impact studies in China 10, 40, 45.

A two-stage machine learning model coupled with the synthetic minority oversampling technique (SMOTE) developed in our previous study 46 is used to generate the TAP PM2.5 data, as presented in Figure 1. In the first stage, we define a high-pollution indicator to improve the PM2.5 estimations on highly polluted days, which are usually underestimated in statistical and machine learning models 23, 28. This high-pollution indicator is calculated based on PM2.5 observation data and describes whether the PM2.5 observations at each location exceed the monthly mean by two standard deviations. High-pollution events account for only 3.9% of our training dataset, which hinders the model's ability to characterize the associations between high-pollution events and other predictors; we therefore adopt the SMOTE technique to resample our dataset and obtain a balance between high-pollution and normal samples. The resampled dataset is then used to train the first-stage random forest model with all the input data except for the CMAQ simulations, after which the predicted full-coverage high-pollution indicator is passed to the second-stage model as one of the input data. In the second stage, we use the residuals between the PM2.5 measurements and the CMAQ PM2.5 simulations as the dependent variable to train the second-stage random forest model.

The predicted residuals combined with the CMAQ simulations represent the final PM2.5 estimations.
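The arithmetic behind the indicator and the residual target can be illustrated with a minimal sketch. It is not the TAP implementation: the function names are hypothetical, the indicator is assumed to be computed per station and per month, and the population standard deviation is an assumption, since the text does not say which estimator is used. The random forest models themselves are omitted.

```python
from statistics import mean, pstdev


def high_pollution_flags(monthly_obs, n_std=2.0):
    """Flag observations above the monthly mean plus n_std standard deviations.

    `monthly_obs` is assumed to hold one station's PM2.5 observations
    for a single month (an assumption of this sketch).
    """
    threshold = mean(monthly_obs) + n_std * pstdev(monthly_obs)
    return [1 if value > threshold else 0 for value in monthly_obs]


def residual_targets(observed, cmaq):
    """Second-stage dependent variable: measurement minus CMAQ simulation."""
    return [obs - sim for obs, sim in zip(observed, cmaq)]


def final_estimates(cmaq, predicted_residuals):
    """Final PM2.5 estimate: CMAQ simulation plus the predicted residual."""
    return [sim + res for sim, res in zip(cmaq, predicted_residuals)]
```

In this sketch a month with nine observations of 30 and one of 200 has a mean of 47 and a standard deviation of 51, so only the value of 200 exceeds the threshold of 149.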

Compared with the models presented in previous studies, our model has two major advantages. In the first stage, the SMOTE algorithm balances the uneven proportion of high-pollution and normal data, which could improve the model performance at high PM2.5 levels. In the second stage, using the residuals between simulated and measured PM2.5 enhances the variability of the dependent data, which could enhance the responses of predictors to PM2.5 variations, thus improving the prediction accuracy. We design a sensitivity test model (Sens) without the SMOTE technique and using PM2.5 measurements as the dependent variable to show our model improvements.
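The resampling idea behind SMOTE can be sketched in a few lines: a synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The code below is a simplified stand-in written for illustration only. The function name, the default of five neighbours and the fixed seed are assumptions, and a production workflow would more likely rely on an established library implementation.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority samples.

    `minority` is a list of equal-length numeric tuples (the feature vectors
    of the high-pollution class in this illustration).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base sample, excluding itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> partner
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, partner)))
    return synthetic
```

Because every synthetic point lies between two real minority samples, the oversampled class stays within the region of feature space already occupied by high-pollution observations.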

Our previous study 30 evaluated different gap-filling strategies and proposed a binary tree-based algorithm coupled with WRF/CMAQ simulations to fill the gaps in missing AOD. As the missingness of satellite AOD is primarily related to meteorological conditions (e.g., cloudy, rainy days) and PM2.5 pollution (e.g., highly polluted days), the tree-based algorithm can directly predict missing PM2.5 by mining the relationship between the availability status of satellite data, PM2.5 concentrations and other supporting information 47. This method is robust at characterizing the spatial patterns of PM2.5 without generating artificially oversmoothed PM2.5 spatial distributions and is efficient for use in a near real-time data product 30. In each step of our two-stage model, a dichotomous predictor defined by whether the satellite AOD is available is constructed as the cut point of the first layer of the decision tree. This predictor serves to build the associations between satellite AOD availability, PM2.5 concentration, and other supportive information, such as WRF/CMAQ simulations and meteorological conditions, and helps to fill the gaps in the final PM2.5 estimations.

Figure 1 shows the operational process for generating the near real-time PM2.5 product in TAP, which includes three steps: data downloading, data processing and PM2.5 modeling. Data from multiple sources (summarized in Table 1) are routinely downloaded to the cloud-computing platform every day once they are available. As these data are at different temporal and spatial resolutions, they are processed to match the 10-km grid defined in our work, as described in Sect.
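The dichotomous AOD-availability predictor used for gap filling can be illustrated as a simple feature-engineering step. This is a sketch under stated assumptions: the record layout, the field names `aod` and `aod_available`, and the placeholder value for missing AOD are all hypothetical, and forcing the flag to be the first split of each tree is not shown.

```python
def add_aod_availability(records, fill_value=-1.0):
    """Attach a 0/1 availability flag and replace missing AOD by a placeholder.

    Each record is a dict describing one grid cell on one day; a missing
    satellite retrieval is represented by `None` (assumption of this sketch).
    """
    prepared = []
    for record in records:
        aod = record.get("aod")
        available = aod is not None
        prepared.append({
            **record,
            "aod_available": 1 if available else 0,
            # The placeholder is never interpreted as a real AOD value:
            # a tree that splits on the flag first separates the two cases.
            "aod": aod if available else fill_value,
        })
    return prepared
```

With the flag in place, grid cells without a retrieval are predicted from the remaining inputs, such as the WRF/CMAQ simulations and meteorology, rather than being dropped.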

Multiple PM2.5 models are built to develop PM2.5 data from 2000 to date. For years when ground PM2.5 measurements are available (i.e., 2013-2020), individual models are developed for these years using input data within each year. For the hindcast of PM2.5 prior to 2013, when ground measurements are absent, a model trained with data from 2013 to 2019 is developed and validated to provide robust hindcasting power. For the near real-time product since Jan 2021, the training dataset contains data from the year 2020 and is updated every day on a rolling basis to include the most recent input data. The two-stage random forest model is retrained on the updated dataset every day, and then near real-time PM2.5 data are generated and uploaded to our website.
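The choice of model for a given day can be summarized as a small dispatch rule. The sketch below only encodes the periods stated in the text; the function name and the returned labels are hypothetical and do not correspond to identifiers in the TAP system.

```python
from datetime import date


def model_for(day):
    """Return a label for the model that would cover `day` (hypothetical labels)."""
    if day.year < 2000:
        raise ValueError("the data record starts in 2000")
    if day.year < 2013:
        # No ground measurements: hindcast model trained on 2013-2019 data
        return "hindcast_2013_2019"
    if day.year <= 2020:
        # Ground measurements available: one model per year
        return f"annual_{day.year}"
    # Since Jan 2021: model retrained daily on a rolling training dataset
    return "near_real_time_rolling"
```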

The performance of our two-stage model is evaluated through three cross-validation (CV) experiments: out-of-bag CV, spatial CV and by-year CV. The out-of-bag CV is the most commonly used CV for random forest models; it compares the PM2.5 measurements with the predictions for out-of-bag samples. Spatial CV evaluates the model's ability to make predictions at locations without monitors; all the monitoring stations are randomly divided into five subsets, and each time, the model is trained using data from four subsets and tested on the data from the remaining subset. Similarly, by-year CV evaluates the model's hindcast prediction ability by sequentially selecting one year of data for testing and training the model with the data from the remaining years. Table 2 shows the CV results of our two-stage random forest models at the daily level, including the R2 and root mean square error (RMSE) values between the CV estimates and the ground measurements. The PM2.5 predictions from the out-of-bag CV show good agreement with the observations, with R2 of 0.80-0.88 and RMSE of 13.9-22.1 μg/m3 for different years. The spatial CV R2 value decreases by 0.05-0.11 when compared with the out-of-bag CV, indicating that unobserved spatial trends contribute to the PM2.5 predictions. The model's hindcast performance further decreases in the by-year CV, with an R2 of 0.58 and RMSE of 27.5 μg/m3, reflecting a slight overfit in the hindcast of PM2.5 in years prior to 2013.
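The spatial and by-year splits and the two evaluation statistics can be sketched as follows. The helper names are hypothetical, model training is omitted, and R2 is computed here as the coefficient of determination, which is an assumption because the text does not state which definition is used.

```python
import math
import random


def spatial_folds(station_ids, n_folds=5, seed=0):
    """Randomly divide the monitoring stations into n_folds disjoint subsets."""
    ids = sorted(set(station_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def by_year_splits(years):
    """Yield (training years, held-out year) pairs, one per available year."""
    unique = sorted(set(years))
    return [([y for y in unique if y != held_out], held_out) for held_out in unique]


def rmse(obs, pred):
    """Root mean square error between observations and predictions."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r_squared(obs, pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot
```

Splitting by station rather than by sample keeps every record from a held-out monitor out of training, which is what makes the spatial CV a test of prediction at unmonitored locations.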

Our model's performance is comparable to that of models presented in other studies on the basis of the R2 and RMSE values shown in Table 2. The statistical or machine learning models at the 10-km grid on a daily scale have ten-fold CV R2 values ranging between 0.79 and 0.80 in China 19, 22, 24, which are similar to our out-of-bag CV results (i.e., 0.83 on average). Models with a 1-km grid have higher R2 values 23, 28, which might be partially explained by the correlations between PM2.5 and the 1-km AOD being higher than those between PM2.5 and the 10-km AOD, as well as by the substantially larger number of collocated AOD-PM2.5 pairs, and hence the larger sample size, at 1-km resolution than at 10-km resolution 48.

Our TAP PM2.5 product is the first near real-time PM2.5 database in China based on multisource data, including ground measurements, satellite AOD, high-resolution emission inventories (i.e., the MEIC inventory) and WRF/CMAQ simulations. Several factors support the timely update of PM2.5 data. First, the dynamic updates of anthropogenic emissions in China by the MEIC and the high-performance computer at Tsinghua University facilitate the operational simulation of the WRF/CMAQ model, which is an important data source for PM2.5 estimations, as has been evaluated in previous studies 25, 28. Second, we choose the tree-based algorithm to fill the gaps in PM2.5 concentrations because it is accurate and reasonably fast. Other methods for filling in AOD gaps, such as the multiple imputation method, make use of more PM2.5 observations in the training dataset 18; however, such a method is much slower to compute, and we found similar performances between these two gap-filling methods in our previous work 30. Finally, the cloud-computing platform makes it possible to develop the model online and allows users to conveniently access all the data products. The daily dataset of PM2.5 from TAP can be found through our website in near real time.

Our two-stage model coupled with the SMOTE technique improves the PM2.5 estimations on highly polluted days. Compared to the sensitivity test model (Sens) without SMOTE and using PM2.5 measurements directly as the dependent variable, the two-stage model has a similar R2 but a higher regression slope (0.97 vs 0.94) when evaluated against ground measurements. Figure 2 shows a detailed comparison between our two-stage model and the Sens model using year 2015 as an example. Usually, PM2.5 concentrations are underestimated on polluted days but slightly overestimated on clean days. After adopting our two-stage model, the mean biases over China decrease by 5.9 μg/m3 (Figure 2a). We also present examples of the estimated daily variations in PM2.5 concentrations from TAP and Sens and find that TAP is better able to capture the concentration peaks on polluted days.
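The two diagnostics used in this comparison can be written out explicitly. The sketch below assumes that the regression slope is the ordinary least-squares slope of predictions regressed on observations and that the mean bias is the average of prediction minus observation; both definitions and the function names are assumptions rather than details given in the text.

```python
def regression_slope(obs, pred):
    """Ordinary least-squares slope of predictions regressed on observations."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_pred = sum(pred) / n
    covariance = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    variance = sum((o - mean_obs) ** 2 for o in obs)
    return covariance / variance


def mean_bias(obs, pred):
    """Average of prediction minus observation (negative means underestimation)."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)
```

A slope below 1 combined with a positive intercept reproduces the pattern described above: low concentrations are slightly overestimated while high concentrations are underestimated, so a slope closer to 1 indicates better-captured peaks.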

The TAP PM2.5 database is also able to provide historical trends of PM2.5 from 2000 to the present (Figure 4). PM2.5 estimates prior to 2013 have larger uncertainties, as there are no observation data to calibrate and evaluate our models. The by-year CV indicates that the model's hindcast ability has a smaller R2 and larger RMSE than the out-of-bag CV. However, we use the year-by-year emission inventory from MEIC and the long-term CMAQ simulations as important input data to support the PM2.5 estimates before 2013, thereby providing the best available knowledge of the historical spatial and temporal trends of PM2.5 concentrations over China.

Moreover, the long-term satellite AOD dataset also provides valuable observational evidence of aerosol changes since 2000. We believe that the long-term trend of PM2.5 constrained by these two datasets is reliable. 

In this study, we develop the TAP PM2.5 database that couples real-time ground observations, near real-time satellite data and meteorological reanalysis data, and operational simulations from the WRF/CMAQ modeling system to provide PM2.5 concentration data that are updated in a timely manner. Based on a two-stage machine learning model and gap-filling method, TAP provides daily full-coverage PM2.5 concentrations at a spatial resolution of 10 km in near real time. All the data are publicly available through our website for sharing with the community.

Our work is subject to some limitations. First, our near real-time PM2.5 products rely on the near real-time updates of all the input data (except for the land use, population and elevation data, which have update frequencies of yearly or longer). Delays in any of these datasets would influence the updates of our PM2.5 data. Second, although we believe that the long-term spatial and temporal patterns of PM2.5 concentrations prior to 2013 are reliable due to the reasonableness of the input data, the uncertainties in PM2.5 on a daily scale are still larger than the daily PM2.5 estimates after 2013. Finally, previous studies have shown that using 1-km AOD estimates from the Multi-Angle Implementation of Atmospheric Correction (MAIAC) algorithm would improve the model performance, as a finer resolution would result in better correspondence between the AOD and PM2.5 48. However, building near real-time models at 1 km would cause exponential increases in the required computing resources and storage. Therefore, we choose the 10-km PM2.5 data as our first step for the TAP database.

In the future, we will continue to improve our methods and provide more air pollutant species and finer spatial resolution data. Accordingly, we will build the TAP database into a near real-time database of multiple air pollutants at different spatial and temporal resolutions based on multiple data sources. 

Global, regional, and national comparative emissions since 2010 as the consequence of clean air actions

Changes in China's anthropogenic emissions during the COVID-19 pandemic

The Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2)

40-Year (1978-2017) human settlement changes in China reflected by impervious surfaces from satellite remote sensing

Dominant role of emission reduction in PM2.5 air quality improvement in Beijing during 2013-2017: a model-based decomposition analysis

The Kain-Fritsch Convective Parameterization: An Update

A scale and aerosol aware stochastic convective parameterization for weather and air quality modeling

MIX: a mosaic Asian anthropogenic emission inventory under the international collaboration framework of the MICS-Asia and HTAP. Atmospheric Chemistry and Physics

The Model of Emissions of Gases and Aerosols from Nature version 2.1 (MEGAN2.1): an extended and updated framework for modeling biogenic emissions

Air quality improvements and health benefits from China's clean air action since

Separating emission and meteorological contribution to PM2.5 trends over East China during

Predicting Daily Urban Fine Particulate Matter Concentrations Using a Random Forest Model

A critical assessment of high-resolution aerosol optical depth retrievals for fine particulate matter predictions

Satellite-based estimates of ground-level fine particulate matter during extreme events: A case study of the Moscow fires in 2010

This work was supported by the National Natural Science Foundation of China (42005135, 42007189, 41921005, and 41625020).