1 Introduction

The Sixth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC AR6) [8] acknowledges that climate change is real, based on strong evidence such as declines in Arctic sea ice and other glaciers, thawing of permafrost, an increase in the frequency and severity of heat extremes accompanied by a decrease in cold extremes, a reduction in available water in dry seasons, and an increase in the occurrence of weather conditions that promote torrential rains or wildfires. Thus, global warming is one of the most significant concerns regarding climate change induced by modern society. Due to this phenomenon, practically all beings in the world are negatively affected, including people [1].

In this sense, although Greenhouse Gas (GHG) emissions are the most debated in the scientific community, land use and land cover change are also essential factors to be considered [9]. Humans have deforested vast areas to carry out their activities and utilize natural resources: extraction of natural resources, the establishment of farms, or expansion of urban sites [14]. Consequently, the most threatened areas have become ecological hotspots [11].

In the case of Brazil, and considering Myers’ classic definition [11], the Amazon region does not fit the hotspot criteria. Nevertheless, its rapid deforestation has raised significant concerns, as the collapse of this vegetation could have deleterious effects with global repercussions [6]. Hence, it is important to conduct studies in this area to demonstrate how anthropogenic alterations, encompassing changes in land, may be responsible for temperature fluctuations, aiming to scientifically support their preservation, as current policies on the subject seem not to have produced the desired effects [15].

Therefore, this study covers the period from 1990 to 2022 and utilized data on land occupation from the MapBiomas project [16] along with monthly temperature records from the Brazilian National Institute of Meteorology (INMET) in two different locations: Cajazeiras-PB and São Félix do Xingu-PA. The first city is Brazil’s Northeast region, with a semi-arid climate, while the second one is part of the Amazon rainforest biome. For both regions, the local station presented few data gaps and a significant contrast in land use and land cover change during the study period, with an increase in urban areas but primarily a large expansion of areas dedicated to agriculture and livestock, accompanied by a reduction in native forest.

In this vein, the main objective of the research is to gain insights from machine learning models on the effect of land use changes in local temperature. Since the exogenous variables are land occupation and the endogenous variable is temperature, an ARMAX model was used to perform regression on the temperature time series. Then, Random Forests and Multilayer Perceptron Networks (MPL) were trained with ARMAX outputs. Finally, by confronting the ARMAX model coefficients and XAI techniques for the machine learning models, we provide machine learning-powered quantification of the effects of urban, farming, and forest biome transformations on local temperature. The findings indicate a significant influence of urban and farming areas on increasing temperatures. However, each model provides a different perspective on the magnitude of the influence of each exogenous variable.

This work proposes a methodology to address the impacts of soil changes on local temperature. The remaining paper is organized as follows: Sect. 2 outlines related works on time series forecasting, focusing on temperature forecasting. Section 3 describes the data preparation steps. Section 4 presents the explainability methods applied in this work. Section 5 discusses the results achieved in the experiments. Section 6 provides the conclusions of the paper and suggests directions for further research.

2 Releated Works

Numerous studies have leveraged machine learning methods to enhance temperature prediction and downscaling across different geographical areas [2]. Cifuentes et al. [4] examined various machine learning approaches for temperature forecasting, emphasizing their critical role in planning and decision-making processes across governmental, industrial, and public sectors. This review noted that Deep Learning models generally exhibited lower error rates than conventional Artificial Neural Network structures, particularly at regional scales. At the same time, Support Vector Machines were favored worldwide for their simplicity and precision.

Hernanz et al. [7] expressed concerns about the extrapolation abilities of machine learning techniques when used for statistical downscaling of climate change projections. Through controlled experiments, they illustrated that widely used machine learning models, including artificial neural networks and support vector machines, often show limited performance under extrapolation conditions. This study stressed the importance of understanding the limitations inherent in applying machine learning to downscaling tasks.

Xinyi Li et al. [10] explored the efficacy of both machine learning (Long Short-Term Memory Networks and Support Vector Machines) and statistical (arithmetic ensemble mean and multiple linear regression) methods in downscaling daily temperatures in Ontario, Canada. They reported that machine learning and statistical methods effectively produced high-resolution daily temperature forecasts using multi-model ensembles, though challenges persisted in accurately predicting extreme temperature events.

While numerous studies highlight the utility of machine learning in temperature prediction and downscaling, there remains a significant gap in the literature, specifically concerning the impacts of urban expansion and deforestation on temperature increases. Recent works, such as those by [3, 5], and [13], focus on these areas, elucidating the relationship between local air temperatures and land use. Understanding and mitigating these impacts is crucial for developing strategies to mitigate the adverse effects of urbanization and land use changes on temperature trends.

Furthermore, research by Nascimento et al. [12] directly links temperature variations to land surface changes. This study utilized a network of thermometers spread throughout São Paulo-SP. It compared their readings with data from the LANDSAT 8 satellite’s TIRS sensor, affirming a correlation between air and surface temperatures. Notably, land cover, especially vegetated areas, significantly influences temperature, typically resulting in cooler conditions.

Building upon these insights, our study contributes to this field by proposing a methodology emphasizing soil data’s importance in temperature forecasting systems. Our approach demonstrates how changes in soil can be discerned through temperature variations over the years, providing a crucial understanding of how human land use impacts local temperatures. This methodology enhances our ability to predict temperature changes and deepens our understanding of the interplay between human activity and environmental conditions, paving the way for more informed urban and environmental planning.

3 Data Processing

3.1 Temperature Database: Cajazeiras-PB and São Félix do Xingu-PA Temperature Data

This article examines the average maximum temperatures for each month in the cities of Cajazeiras and São Félix do Xingu. The temperature data from Cajazeiras were collected from a weather station operated by the National Institute of Meteorology (INMET). This station (ID 82689) is situated in the São Gonçalo district, near Cajazeiras, within the municipality of Sousa-PB. The data from São Félix do Xingu were obtained from the INMET station with code 82688, located in the São Félix do Xingu Municipality, Pará State.

To mitigate the impact of seasonal patterns, both time series should undergo a 12-month differencing. However, to retain the long-term trend within the series, the differencing is conducted by comparing each month to the corresponding month in 1990. Consequently, the output yields two series of delta values relative to 1990, facilitating a more precise analysis of temperature changes over time while mitigating the effects of seasonal variability. Statistical tests such as Pettitt’s test were employed to analyze the behavior of the series. Pettitt’s test identified a significant change point in both series (CP) using a significance level of 0.05, indicating a lack of homogeneity. Figure 1 illustrates each differenced series with Pettitt’s u plotted over the series.

Fig. 1.
figure 1

Delta series with pettit’s change point.

Soil Database: Cajazeiras-PB and Sã o Félix do Xingu-PA Soil Ocuppation. For the analysis of land use and cover, data were sourced from a Brazilian government platform called MapBiomas. This platform classifies each 30 \(\times \) 30 m plot of soil across Brazil using satellite imagery. In our study, the soil data for Cajazeiras and São Félix do Xingu were collected and treated as exogenous variables. The data from MapBiomas are collected annually; however, a specific transformation was required to integrate them with the monthly frequency of the temperature records. This transformation involved adjusting the temporal granularity of the MapBiomas data to align with the monthly frequency of the temperature records. Figures 2 and 3 depict the evolution of soil for Cajazeiras and São Félix do Xingu, respectively.

Fig. 2.
figure 2

Soil Evolution in Cajazeiras-Paraíba from 1990 to 2020.

Fig. 3.
figure 3

Soil Evolution in São Félix do Xingu-Pará from 1990 to 2022.

Another essential aspect of our methodology is the consideration of the relative frequency of each biome near each city. Our soil processing approach involves grouping pixels corresponding to each biome to form relative frequency data for each biome within the area of the cities. Subsequently, we employ linear interpolation to interpolate the annual data from MapBiomas, assigning data values for each month. Another important aspect of our data processing involves the classification of soil. While MapBiomas originally categorized over 30 different soil classes, its methodology allows these classes to be consolidated into five super-classes. These super-classes are utilized as exogenous variables in the experiments detailed in this paper. The classes were grouped into super-classes as defined in Table 1.

Table 1. Map biomes classes

4 Fiting Model and Explainability Approach

Following the methodology described by Andrade et al. [5], we employed an ARIMAX model, initiating training fifty steps prior to the change point (CP) value to enhance our analysis of predictions near the change point identified by Pettit’s test. This approach was chosen to explore which exogenous variables exert the most influence near the change point. To determine the optimal parameters for the ARIMAX model, we conducted a grid search across the autoregressive term (p), integrative term (d), and moving average term (q). The model with the lowest Akaike Information Criterion (AIC) score was selected, ensuring the best fit for model simplicity and predictive accuracy. The specific CP values and the chosen p, d, and q parameters for each series are detailed in Table 2.

Table 2. ARIMAX training parameters

The exogenous variables were modeled using soil data, where the relative frequency of each biome within the city’s total area was pre-processed. Then, each biome was grouped into its superclass. For example, Biome 33, categorized as Mangrove, and Biome 1, categorized as Tropical Forest, were labeled under the superclass “Forest,” as described in Table 1. Table 3 lists the remaining biomes and their respective variation (2022 value divided by 1990 value) for each series.

Table 3. Biomes Variation from 1990 to 2022

Although both time series are categorized under the “Forest” biome, the specific forest types for each series differ significantly. São Félix do Xingu is part of the Amazon rainforest biome, characterized by dense, moisture-rich tropical forests. In contrast, the forest biome of Cajazeiras is composed of caatinga, a biome with characteristics well-suited to drier locations. For example, the outskirts of Cajazeiras, where this biome is prevalent, receive less than 1000 mm of precipitation annually, while São Félix do Xingu receives more than 2000 mm. These characteristics are crucial because they influence the biome’s ability to affect local temperatures. Since the primary vegetation in Cajazeiras is caatinga, it is expected to contribute less to urban cooling than the lush, dense vegetation found in the Amazon rainforest of São Félix do Xingu.

As noted in Table 3, the most significant soil variations in both cities were observed in the forest, farming, and urban biomes. Consequently, our approach will focus exclusively on these biomes for subsequent analyses, discarding the Non-Forest Natural Formation class due to its minimal impact on overall soil variation. Following this reduction of exogenous variables, the statistical model for each series was trained using a rolling ARMAX approach with zero integration (\(i=0\)). At each prediction point, coefficients of the exogenous variables were recorded for subsequent analysis to determine their influence on the response variable over time.

A Random Forest (RF) model and an Artificial Neural Network (ANN) were also trained to augment the analysis of the statistical model’s coefficients. The RF model utilized 15 maximum depth and 100 estimators, while the ANN was configured with two hidden layers. These machine-learning models used the output of the ARMAX model as the target and shared the same input variables as the statistical model. Two techniques of XAI (Explainable Artificial Intelligence) were used, SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations), to provide a deeper understanding of the model outputs. This approach aims to offer an alternative explanation to the traditional statistical coefficient analysis. Because these machine learning models were trained on the ARMAX outputs, the SHAP/LIME values can provide an alternative viewpoint on the statistical model outputs from both the Random Forest and ANN perspectives.

5 Results and Discussion

The initial analysis primarily focused on scrutinizing the coefficients of the statistical model, evaluating the magnitude of each coefficient across multiple prediction steps. However, before delving into the interpretation of the statistical model, it is imperative to assess the model’s adequacy. Table 4 presents the Mean Absolute Percentage Error (MAPE) for each model applied to the target time series. The metrics were collected in the test set.

Table 4. ARMAX Model MAPE.

Upon examination of Table 4, both models exhibit satisfactory fits. Thus, we proceed with the analysis of the coefficients of the statistical model. Notably, given the proximity of change points in both models, our analysis will concentrate specifically on the window from steps 230 to 260, encapsulating the change point for both time series. Figure 4 illustrates the logarithm of the coefficient magnitudes for each exogenous variable.

Fig. 4.
figure 4

Coefficient magnitude of exogenous variables.

The coefficients divulge that, for both datasets, the ARMAX model accords greater significance to the Urban Area, followed by the Farming biomes. In the case of the Cajazeiras time series, the coefficient magnitude for Farming closely rivals that of the Forest, albeit numerically higher. This suggests that the Urban biome significantly influences the observed temperature increase over the analyzed period. Another crucial aspect to consider is the employment of Explainable AI (XAI) techniques in the machine learning models, which were fitted to the statistical outputs generated by the ARMAX model. These models exhibited low errors, as corroborated in Table 5.

Table 5. Machine Learning models MAPE.

Furthermore, the utilization of Shapley Values in the Random Forest model, trained on the ARMAX model’s statistical output, offers insights into the impact of features. Higher positive SHAP values signify a positive contribution to the temperature increase from an exogenous variable, while higher negative SHAP values imply otherwise. Figure 5 provides an overview of the SHAP values measured for both time series.

Fig. 5.
figure 5

Summary of Shapley Values for Cajazeiras (up) and São Félix do Xingu (down).

The findings presented in Fig. 5 offer intriguing insights. Beyond the expected correlation indicating that higher urban areas correspond to increased temperatures, the SHAP values suggest that an increase in farming areas contributes to rising temperatures. Conversely, a decrease in forest coverage accelerates this increase. It is vital to acknowledge the distinct characteristics of the two cities analyzed. São Félix, nestled within the Amazon rainforest, benefits from the cooling properties of its dense forest.

In contrast, Cajazeiras, characterized by the caatinga biome, features sparser vegetation, offering lesser cooling effects. This distinction is mirrored in the SHAP values, which assign numerically higher importance to the forest of São Félix due to its more pronounced cooling effects. Additionally, a Multilayer Perceptron (MLP) was employed with LIME to analyze the data. The MLP underwent 250 epochs of training using the same inputs as the ARMAX statistical model. Figure 6 showcases the box plot results of the LIME analysis.

Fig. 6.
figure 6

Box plot of LIME results for Cajazeiras (left) and São Félix do Xingu (right).

Similar to the SHAP values applied to the Random Forest model, the LIME technique underscores the significant influence of increases in Farming and Urban Areas on temperature rises in both cities. Notably, both models not only identify the increase in urban and farming biomes as pertinent to the temperature increase but also acknowledge the forest biome’s role as a cooling factor for both cities.

6 Conclusion

This study presented a comprehensive analysis of the impact of land use changes on local temperatures in two distinctly different Brazilian cities, Cajazeiras and São Félix do Xingu, using advanced statistical and machine learning models. By employing an ARMAX model alongside Multilayer Perceptron (MLP) and Random Forest models enhanced with Explainable AI techniques (SHAP and LIME), we quantified the effects of urban, farming, and forest biomes on temperature variations. The findings indicate a significant influence of urban and farming areas on increasing temperatures, confirming the critical role of land use in local climate change.

The approach proposed in this study combines statistical modeling with machine learning outputs. Future research could explore how these outputs (ARMAX coefficients, SHAP values, and LIME explanations) might be integrated to create a new explicability score for exogenous variables, providing enhanced clarity and understanding of these variables within the models. Although the current methodology already offers valuable insights for various applications, with each technique revealing different aspects of the data, further development could lead to even more robust understanding and applications. Further development can also explore more complex data. The current analisys has only 3 exogenous variables, future works can explore data with more exogenous data.