1 Context

Water scarcity is a pressing issue in various regions worldwide. Even in areas not currently facing water shortage issues, efficient water management is crucial for environmental preservation and improved life quality of inhabitants.

Securing proper water usage is an intricate and multifaceted challenge, with significant efforts being undertaken from both public authorities and private initiatives focusing on Urban Water-Demand (UWD). Alongside improvements in treatment plants and distribution lines, which might include, for instance, the replacement of deteriorated pipework and proper maintenance of equipment, process monitoring, ranging from treatment to final consumption, starts to assume a key role. In this scenario, collected data might be employed, for example, to provide forecasts regarding water demand. If accurate, these forecasts can ultimately contribute to greater efficiency in water supply facilities. Predictions for UWD can be performed on different horizons, depending on the requirements.

There is no clear consensus in the literature regarding the actual definition and designation of the forecasting horizons used for UWD. Short-term and long-term forecasting are commonly employed with different meanings. Some authors, also include a third categorization, i.e., mid-term forecasting. This study adopts the framework proposed by Ghiassi et al. [3], for which short-term predictions encompass hourly and daily forecasts, whereas year-based and decade-based forecasts are deemed long-term. Monthly-based forecasts, considering up to 24 months, are categorized as mid-term forecasts by some authors.

In terms of their applicability, long-term Urban Water-Demand (UWD) forecasts can integrate economic, demographic, and climatic variations. This integration supports the planning and design of water supply infrastructure [11]. On the other hand, short-term forecasts can be used to optimize water treatment and supply management. According to Adamowski et al. [1], short-term forecasting can be employed to enhance understanding of factors influencing water consumption, optimize pump and reservoir management, reduce water loss through leakage by controlling pipeline pressure, and indicate the need for emergency water rationing policies. These measures can directly and indirectly contribute to water conservation and operational cost reduction, obtained from reduced energy consumption in pumping operations, for instance.

For UWD forecasting, predominant approaches are based on data-driven methodologies [7]. The review presented by Niknam et al. [9] identifies neural networks, support vector machines, traditional time series methods (including ARIMA and SARIMA), regression models, random forests, and dynamic systems as the primary data-driven approaches for this purpose. The authors also point out that there is no consensus in the literature regarding the prevalence of one method over others, given the diverse specific conditions of each case under study. However, according to the same authors, researchers in general agree that the methods employed for forecasting UWD should be adapted to the available data and the intended application. In this context, due to their unique characteristics, tourist cities with high seasonal population influx pose a challenge for UWD forecasting, requiring properly adjusted models to ensure reliable results.

Guaratuba is a coastal tourist city located in the state of Paraná, Brazil, that exhibits the aforementioned characteristics. Every year, during the summer season, which coincides with the scholar recess period in Brazil, the city receives thousands of tourists looking to enjoy the beach. As a result, the city, with approximately 42,000 inhabitants according to the latest census held in 2022, received over a million tourists in a single year, mainly concentrated in late December, January, and February. One of the challenges associated with such a sudden seasonal increase in population is managing its water supply facilities.

In this context, the present study aims to evaluate statistical methods for short-term urban water demand (UWD) forecasting in Guaratuba. Specifically, we compare classical methods such as ARIMA and SARIMA to predict a one-week (7-day) horizon considering a daily granularity. Results are compared with those presented in the work of Vieira et al. [12], for which Machine Learning methods were evaluated on the same data, with a regression oriented approach.

2 Related Work

In this section, relevant literature and studies exploring similar topics are reviewed. These provide a foundation for discussing the methodology adopted in this work and corresponding findings, presented in subsequent sections.

In the study conducted by Kavya et al. [7], the authors aimed to provide short-term water demand forecasts for smart water management in the city of Hubli (Hubballi), India. Utilizing 10-minute interval flow meter readings collected over two years (2020–2021), they compared nine machine learning and deep learning models. The study employed both univariate models, using only historical water consumption data, and multivariate models, incorporating climatic parameters and calendar inputs. The results indicate that deep learning models outperformed traditional machine learning models, with the Long Short-Term Memory (LSTM) model achieving the best prediction performance. Although ARIMA presented one of the worst performances in this particular study, it is important to note that no information whatsoever is provided by the authors regarding the hyper-parameters employed for this particular method.

In a comparative study of predictive models for short- and long-term urban water demand forecasting in Milan, Italy, Hao et al. [5] considered several Machine Learning models and SARIMAX to obtain short-term forecasting horizons of 1 and 7 days ahead. Additionally, they investigated whether coupling a Wavelet Data-Driven Forecasting Framework (WDDFF) could improve predictive capacity. Results showed that for the 1-day ahead forecast in the non-wavelet setting, the Artificial Neural Network (ANN) model and SARIMAX performed better, whereas in the wavelet setting, WA-ANN and WA-LSTM outperformed other models. For the 7-day ahead forecast, WA-LSTM significantly outperformed all other models, with SARIMAX showing the worst result. The authors state that ARIMA models can achieve satisfactory results for one-step-ahead prediction, but they have limited capacity for multi-step prediction tasks. Again, no information was found on the hyper-parameters employed by the authors to train the SARIMAX model in order to support these conclusions.

Another work in the area of water consumption forecasting, was developed by Gil-Gamboa et al. [4]. They aimed to predict next quarter demand for each costumer of a company in a specific area of Seville, Spain. In their work, a Deep Feed-Forward Neural Network (DFFNN) was proposed and compared with other predictions methods, including Isotonic Regression (IR), Polynomial Regression (PR), Random Forest (RF), Gradient Boost Tree (GBT) and Long Short-Term Memory (LSTM). The results showed that the DFFNN produced the smallest errors compared w.r.t the other models, outperforming even LSTM, which if commonly employed for short-term water demand forecasting.

Sahoo et al. [10] explored the application of several deep learning models for multi-step ahead urban water demand forecasting in London, Canada. The study focused on water demand for 1-day, 7-day, and 15-day horizons. Among the methods under investigation, the hybrid CNN-BiLSTM model (Bidirectional LSTM and CNN architecture) demonstrated superior performance, accurately capturing daily water demand patterns and outperforming the other models across all forecast horizons. The effectiveness of the model was attributed to its powerful feature extraction capabilities combined with the bidirectional learning approach of BiLSTM. The study highlights the CNN-BiLSTM model as a promising technique for accurate urban water demand forecasting.

The results presented by Vieira et al. [12] are of particular interest to our work, as they rely on the same dataset we employ here. The main objective of their work was to predict the water demand in Guaratuba, a coastal city in the state of Paraná, Brazil. A forecasting horizon of 7 days, with daily forecast frequency was considered by the authors. They employed four machine learning algorithms: linear regression (LR), k-nearest neighbors (kNN), support vector regression (SVR), and multilayer perceptron (MLP). No statistical model, such as ARIMA or SARIMA, was tested for comparative purposes. A validation approach based on sliding and expanding windows was employed, incorporating climate and calendar variables to predict water demand. The MLP model, coupled with expanding windows outperformed all other methods.

3 Materials and Methods

This section presents the materials and methods employed in this study. Sections 3.1 and 3.2 describe the dataset, including its structure and source, as well as the analysis carried out. Sections 3.3 to 3.6 cover the predictive models under comparison, the validation approach, alongside metrics and model selection.

3.1 Dataset

The dataset used in this work was obtained from SANEPAR (Companhia de Saneamento do Paraná), the company responsible for providing water supply and sanitation services to Guaratuba. This dataset has already been preprocessed, therefore, issues related to missing data, noise, and outliers have already been addressed. The dataset includes the daily volume of water produced at the water treatment plants (WTPs), along with additional variables related to meteorological and calendar data. It is important to make clear that only the historical data from the WTPs was analyzed and used as input for the predictive models developed in this work. That is, all models are univariate, taking as input past values for the target feature (volume of water produced by treatment).

The data is presented on a daily frequency, covering a period of four years, from 2016 to 2019. The first two years (2016 and 2017) are set apart for training the models, the third year (2018) is allocated for validation (including hyper-parameter tuning), and the final year (2019) is used exclusively for testing.

3.2 Exploratory Analysis

For all analyses, including the exploratory, the last year (2019) is excluded, in order to prevent any contamination or influence in subsequent evaluation steps.

The first step undertaken in the exploratory analysis aimed to understand the behavior of the time series itself. An initial examination of daily volume of water produced at the water treatment plants (WTPs) yielded insights on how to further proceed. To identify useful patterns for the models, a classical time series decomposition was applied, employing both additive and multiplicative forms. In addition, the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) were employed to identify patterns that were not readily identified through the decomposition process. These functions were also used to determine which lags could be relevant for posterior model optimization.

3.3 Statistical Models

ARIMA and SARIMA, well-known statistical time series models, were used in this work. These will be briefly discussed next. For further information, see [6].

ARIMA(p, d, q) is one of the most commonly used methods for time series forecasting, aiming primarily to describe the autocorrelation within the data. It combines differencing order (I) with autoregression (AR) and moving average (MA) models. The AR(p) parameter specifies the number of lag observations included in the model, indicating how many previous values influence the current value of the series. Differencing order (d) denotes the number of times raw observations are differenced to achieve a stationary time series. Similarly to AR(p), MA(q) indicates the number of lagged forecast errors in the model [6].

The SARIMA model incorporates the seasonal component into the ARIMA model and is defined as SARIMA(pdq)(PDQ)s. In addition to the original ARIMA hyper-parameters, the SARIMA model includes the seasonal hyper-parameters P, D, and Q, which represent the seasonal autoregressive, differencing, and moving average orders, respectively. The remaining hyper-parameter of the SARIMA model (s) determines the length of the seasonal cycle [6].

3.4 Model Evaluation

All models were evaluated employing a backtesting, similar to that employed by [12]. In brief, the evaluation process applies a cross-validation logic without disrupting the temporal order of the data. Two distinct methodologies were employed for backtesting the models, namely, Sliding Window (SW) and Expanding Window (EW). These are illustrated in Fig. 1, considering five iterations of the procedure, for each case. Details of each methodology are provided next.

In the SW methodology (left of the figure), a model is trained with a fixed-size window (constant during all process) in each iteration of the evaluation procedure. After each training and evaluation is performed, the training window is moved forward by a specified step size. The step size also determines the number of objects that will be dropped from the beginning of the previous training set. That is, the same amount of new data points added to the new training window is also removed, considering the oldest data points available.

Fig. 1.
figure 1

Sliding and Expanding Window procedures, respectively. The left schematic depicts a Sliding Window methodology, whereas the right one depicts a Expanding Window methodology. Five evaluation iterations are depicted for both cases, with training data highlighted in blue and test data in red. The figure is adapted from [12] (Color figure online).

In the EW methodology (right of the figure), the size of the training window increases by each iteration of the procedure. No data from the training window is ever dropped, with new data points being added after each step of evaluation procedure (comprised of training and testing). In brief, the training set dynamically increases in size over time, whereas the test set size remains constant.

Note that these two methodologies simulate a real application scenario. Moreover, it is clear that these procedures might be more or less appropriate depending on how data behaves over time. If there is no clear trend (upwards or downwards), an expanding window would, theoretically, be preferred, given that more training data would be available as time progresses. Conversely, if there is a distinguishable trend over time, a sliding window might be preferable, as past data would not reflect the phenomena under study with the passage of time.

The backtesting used in the study has three configurations, summarized in Table 1. One configuration uses EW with a dynamic window size, while the other two configurations employ SW with fixed window sizes of 365 days (1 year - SW1Y) and 730 days (2 years - SW2Y) for training. The step size and forecasting horizon for all three configurations is 7 days, with a forecasting frequency of one day, that is, for each iteration seven daily forecasts are made.

Table 1. Backtesting configurations.

3.5 Model Selection

For model selection a grid search was conducted on a subset of predefined hyper-parameters. These values were identified through the exploratory analysis of the dataset, outlined in Sects. 4.1 and 4.2. Table 2 provides the search space for the models. Root Mean Squared Error (RMSE) employed as the evaluation metric.

Table 2. Hyperparameter Optimization

The hyper-parameter optimization was carried out on the validation set and only the best values found are utilized in the final models. We performed hypre-parameter optimization considering each one of the backtesting configurations (EW, SW1Y and SW2Y). Only the “best” models are evaluated on the test sets.

3.6 Evaluation Metrics

The best models, as identified during hyper-parameter optimization, are applied to the test set and their results are evaluated using three different metrics: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE). These are given by Eqs. (1), (2) and, (3), respectively. For all the aforementioned equations, \(y_i\) represents the actual value (target) and \(\hat{y}_i\) accounts for the forecast value for the \(i^{th}\) value of the time-series, considering that it has a total of n values under evaluation.

$$\begin{aligned} RMSE &= \sqrt{\frac{1}{n} \sum _{i=1}^{n} (y_i - \hat{y}_i)^2}\end{aligned}$$
(1)
$$\begin{aligned} MAE &= \frac{1}{n} \sum _{i=1}^{n} |y_i - \hat{y}_i|\end{aligned}$$
(2)
$$\begin{aligned} MAPE &= \frac{1}{n} \sum _{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100 \end{aligned}$$
(3)

We use different metrics given that none of them provide a perfect and unbiased evaluation [8, 13]. The RMSE metric, for instance is often regarded as sensitive to outliers [2]. All metrics are minimization measures, which means that the lower the value of the metric, the better is the model.

4 Results

This section presents the results of the proposed evaluation. Sections 4.1 and 4.2 detail the dataset analysis. Sections 4.3 and 4.4 cover the selection of the “best” model for each backtesting window, considering the grid-search procedure. Finally, in Sect. 4.5, a comparison between the statistical models proposed in this study with those from machine learning from Vieira et al. [12] is carried out.

Fig. 2.
figure 2

Multiplicative decomposition. Time Series, Trend, Seasonality, and Noise.

4.1 Time Series Decomposition

The decomposition of the water demand time series from Guaratuba was quite similar for both additive and multiplicative methods. However, the multiplicative decomposition yielded slightly better results and is utilized in this analysis. Figure 2 shows the plot of the time series data from 2016 to 2018, highlighting trend, seasonality, and noise, from top to bottom in the figure.

The trend component shows some growth over across subsequent years, but the most prominent feature in the plots is the seasonal component. Using the median seasonality obtained from the decomposition, it is apparent that peak demand occurs on the first day of the year. Therefore, to address the non-stationarity observed in the trend and seasonal components, a first-order differencing (I(d) = 1) is a reasonable choice and should be among the evaluated hyper-parameters tested in both the ARIMA and SARIMA models.

The irregular component (noise) can help understand how external factors influence the behavior of the series. As depicted in Fig. 2, shortly after the beginning of the years 2017 and 2018, there is a decline in the volume of produced water, followed by sudden growth before stabilization, indicating an anomalous event. This, however, corresponds to the carnival season, which usually lasts 4 to 7 days in Brazil and has a significant impact on water demand (tourist influx).

4.2 ACF and PACF

Both ACF and PACF functions are visual tools used to assess the linear dependence of a time series on its past values (lags). Figure 3 displays the ACF and PACF plots, where each lag is represented by a vertical line with a marker. Values outside the shaded light blue area are statistically significant and indicate potential correlation between a data point and its lagged values

Fig. 3.
figure 3

Plots for ACF and PACF for the time series considering the training set.

The ACF indicates a significant correlation between the current value of the time series and its values up to lag 38, a term that might be appropriate for the MA(q) in ARIMA and SARIMA models. Furthermore, the presence of a 7-day seasonality is evident in the data. This seasonality can be incorporated into the SARIMA model by specifying a seasonal period of 7 in the s hyperparameter.

The PACF isolates the influence of individual past values on the current one, excluding the effects of intermediate lags. As shown in Fig. 3, it suggests significant partial correlations at lags 1 to 3 (correlation of 3), lags 7 and 8 or lags 10 and 11 (correlation of 2), and lags 20 to 27 (correlation of 8).

Note that the hyper-parameter values defined for the ARIMA and SARIMA methods’ search space were previously presented in Table 2.

4.3 Model Selection

After completing the grid search w.r.t. hyper-parameters, the best models, coupled with the corresponding window type were identified.

Here, we would like to emphasize the importance of proper hyper-parameter selection for both ARIMA and SARIMA, in contrast with previous studies, highlighted in Sect. 2. For that, we report not only the best hyper-parameter combination obtained from the validation set, but also the worst and overall average across all combinations explored within the search procedure. These results are provided in Table 3, considering the average of the three evaluation metrics and corresponding standard deviation (\(\sigma \)). All values are w.r.t. the validation set.

Table 3. Results of the evaluation metrics for all the methods on the validation set.

The ARIMA (8,0,38) model in the context of a 2 year Sliding Window (SW2Y) exhibited the best overall results, achieving an RMSE of 1425.35 with a standard deviation of 918.88. The worst results were obtained with SARIMA models (2,1,3),(2,1,4,7) and (3,1,3),(3,1,1,7), using Expanding Window (EW) and 1 year Sliding Window (SW1Y), respectively. The highest overall RMSE was 2053.38, a value 44% higher/worse than that of the best model.

The selection of hyper-parameters significantly impacts the performance of time series models like ARIMA and SARIMA. Choosing sub-optimal hyper-parameters can lead to poor predictive performance, which could bias comparisons with other methods, such as those from the machine learning toolbox. Therefore, we argue that it is crucial to conduct a comprehensive analysis and tuning of parameters to ensure accurate and reliable results.

It is important to note that, at this point, the test data has not been seen by the models yet. The validation data was used solely to select the best hyper-parameters, which will be incorporated into the ARIMA and SARIMA models and evaluated on the test data, as presented in the next section.

4.4 Model Evaluation

All models with their optimal hyper-parameters were evaluated on the test data w.r.t. RMSE, MAE, and MAPE. Table 4 presents the metrics for each one of the models on the test set, with their standard deviations (\(\sigma \)). Additionally, the best results obtained for each window evaluation methodology (EW, SW1Y, and SW2Y) from the study conducted by Vieira et al. [12] are also included. These will be be discussed in more detail in Sect. 4.5. It is important to emphasize that we fully reproduced the experiments from [12], attaining the same results from the authors. Therefore, all methods were evaluated equally.

Table 4. Results of the evaluation metrics for the best methods regarding the test set.

The best evaluation metrics for each model tested in this study (ARIMA and SARIMA) are highlighted in bold in Table 4. Analysing only the statistical models, the ARIMA model with a 2 year Sliding Window configuration (SW2Y) obtained the best overall performance, with an MAE of 1, 419.40. This result indicates that the daily water demand prediction error is, on average, 1, 419.40m\(^3\), which represents the difference between the actual and predicted values (to place this result in perspective, in the test set, values of produced water range from  5, 000 to 26, 000 cubic meters). For the SARIMA method, the best results were obtained using the EW configuration, achieving an RMSE of 1, 870.03.

A more detailed analysis involves observing the performance of the best ARIMA and SARIMA models across different months, as depicted in Fig. 4. The box-plot indicates that for both models, the best performance was observed in September, while the worst results were attained in December. Indeed, December is the month with greater variability in the data (justified by tourism from the holidays), which poses difficulties to the models.

Fig. 4.
figure 4

Monthly absolute error for the best ARIMA/SARIMA models w.r.t. the test.

Fig. 5.
figure 5

Forecasts and target variable on test set.

An overview of the forecasts made by the ARIMA-SW2Y and SARIMA-EW (the best models from Table 4) and the actual observed time series (target) is provided by Fig. 5. Relevant months in terms of water demand forecasting are zoomed in to highlight the performance of both models in Fig. 6. In addition to the best and worst months in terms of water demand forecasts previously discussed, January and March are emphasized due to their peculiar behavior compared to the other months of the year, as discussed next.

January and December are vacation months for most people in Brazil, which could account for the distinct trends observed in the graphs, posing challenges for effective predictions by the models. In early March, the Carnival holiday can extend up to a week, leading to increased travel to the city and higher water demand, possibly explaining the spike in the graph. Conversely, September is not a holiday season, resulting in fewer people traveling to the coast and less variation in water demand. The “stable” data observed during this month may contribute to the better performance of both models.

Fig. 6.
figure 6

Forecasts and target variable on test set for selected months.

4.5 Comparison with Related Study

As previously mentioned, the work from Vieira et al. [12] serves as a baseline for this study. Both studies had the same primary objective and employed identical backtesting configurations, with a 7-day prediction horizon and daily frequency for forecasts. Therefore, both studies can be fairly compared between each other.

Results to support such a comparison were already presented in Table 4, where the models from the present study are placed in perspective w.r.t. the best model from the work of Vieira et al., for each backtesting configuration.

In the EW configuration, MLP outperformed both time series methods. For the SW1Y training window size, kNN exhibited superior performance in previous research but was surpassed by the ARIMA model in this study. Similarly, in the SW2Y training window size configuration, the ARIMA model also outperformed the machine learning methods under evaluation by Vieira and colleagues.

While previous research indicated superior results with machine learning methods using the EW approach, the current study demonstrates that the statistical models performed better with the SW method. One plausible explanation for this is that as time progresses, older data lose significance and no longer affects current forecasts as much. This might arise due to population increase and/or greater tourism influx in the recent years, in comparison to previous ones.

In summary, both the MLP with the EW and the ARIMA with the SW produced comparable results. This study underscores the importance of effectively exploring the full capabilities of statistical methods in time series problems, conducting thorough data analysis, and optimizing their hyperparameters. In this work, relying solely on historical data, the statistical model (which is univariate) even achieved better RMSE and MAPE metrics than the machine learning models, which also incorporated calendar and meteorological variables.

5 Conclusions

This research presented a case study conducted in Guaratuba - PR - Brazil, aimed at forecasting its daily water demand over a 7-day horizon using statistical methods. In a previous investigation [12], the authors pursued the same objective with a regression oriented approach employing machine learning models. Both approaches employed backtesting for model validation and assessment.

A comparison between the results obtained from the two studies revealed similar outcomes, with machine learning methods performing better when coupled with an expanding window approach. Conversely, statistical models performed better with sliding windows. Overall, the MLP models with Expanding Window (MLP-EW) and ARIMA with a 2 year Sliding Window training window size (ARIMA-SW2Y) achieved the best performances, with ARIMA-SW2Y showing an advantage in RMSE and MAPE metrics.

This study demonstrated the utility of statistical models, specifically ARIMA and SARIMA, in forecasting urban water demand, even over a 7-day horizon. In general, there is a widely held view among researchers that statistical models are particularly suitable for short-term forecasts in urban water demand (UWD), typically one day ahead. In our case, we extended this to a 7-day horizon and achieved results comparable to those obtained using machine learning.

Based on our study, we can conclude that traditional ARIMA and SARIMA models continue to serve as a robust baseline for comparison with more complex machine learning models. However, achieving an appropriate model requires comprehensive data analysis, proper tuning of hyper-parameters, and a fair backtesting approach. In our work, we identified that many studies do not conduct such rigorous analyses when using statistical methods as baselines, potentially resulting in unfair comparisons and sub-optimal model performance.

It is important to note that this study addresses a specific problem, and the findings may not be applicable or extendable to other contexts without further evaluations. Further work is in progress, in order to evaluate Long Short-Term Memory (LSTM) models in this context.