key: cord-0936631-m8cm978p authors: Sharbaf, Mohammad Ali Mohammad Jafar; Abedini, Mohammad Javad title: Implementation of supervised principal component analysis for global sensitivity analysis of models with correlated inputs date: 2022-01-25 journal: Stoch Environ Res Risk Assess DOI: 10.1007/s00477-021-02158-y sha: ed436168689c8ccbf023ca10d7ce4cd6b3e099d5 doc_id: 936631 cord_uid: m8cm978p Global Sensitivity Analysis (GSA) plays a significant role in quantifying the tangible impact of model inputs on the uncertainty of the response variable. As GSA results are strongly affected by correlated inputs, several studies have considered this issue, but most of them are computationally expensive, labor-intensive, and difficult to implement. Accordingly, this paper puts forward a novel regression-based strategy built on Supervised Principal Component Analysis (SPCA), benefiting from the Reproducing Kernel Hilbert Space. Indeed, by conducting one kind of variance-based sensitivity analysis, a renowned method exclusively customized for models with orthogonal inputs, on the SPCA regression, the impact of the correlation structure of the input variables is taken into account. The ability of the suggested technique is evaluated with five test cases as well as three hydrologic and hydraulic models, and the results are compared with those obtained from the correlation ratio method, taken as a benchmark solution; the latter is a robust but quite complicated approach in terms of programming. It is found that the proposed method satisfactorily identifies the sensitivity ordering of model inputs. Furthermore, it is proved in this study that the performance of the proposed approach is also supported by the total contribution index in the derived covariance decomposition equation. Moreover, compared with the correlation ratio method, the proposed method is found to be computationally efficient and easy to implement.
Overall, the proposed scheme is appropriate for high-dimensional, strongly nonlinear, or computationally expensive models with correlated inputs, for which the coefficient of determination between the original model and the regression-based SPCA model is larger than 0.33. The identification, quantification, and propagation of uncertainty are considered to be three necessary tasks in model building and evaluation. As the modeling and simulation of water resources and environmental projects become more advanced and computationally expensive, the quantification and propagation of uncertainty in parameter estimation and model structure identification become quite time-consuming and labor-intensive for high-dimensional models. In a sense, in a typical modeling and simulation exercise with a highly nonlinear model structure, a small error in the input variables might result in a large uncertainty (or error) in the model output (Helton et al. 2005; Iman et al. 2002). Thus, it is advantageous to investigate how the uncertainty in the output of a model can be apportioned to the uncertainty in the model inputs. Global Sensitivity Analysis (GSA) is devised to address these issues in model building, execution, and evaluation. GSA is utilized in many applications in water resources engineering and environmental studies (Ciriello et al. 2013; Zheng and Wang 2015; Razavi and Gupta 2016b), such as model calibration, uncertainty reduction of model output, model validation, identification of redundant parameters, and decision-making processes that seek to explore which model inputs have the most influential impact on output uncertainty (Crosetto and Tarantola 2001). Due to the numerous applications, including the ones cited above, GSA can be effectively utilized to conduct preliminary exploratory data analysis when building the corresponding model.
As an example, in a typical rainfall-runoff process, being able to implement GSA in order to build a parsimonious model structure would not only help eliminate redundant parameters, but also help reduce the uncertainty in model output, which in turn leads to savings in time and capital. This task is considered an inevitable part of any effort to convert rainfall into runoff in an efficient way. There is a wide range of GSA methods in the literature based on different philosophies and theoretical definitions of sensitivity measures, such as regression techniques, Morris, variance-based methods, the Variogram Analysis of Response Surface (VARS) method, regional SA, and density-based methods. The details of these approaches can be found in the relevant literature (Saltelli et al. 2004; Borgonovo 2007; Razavi and Gupta 2016a). However, the cited approaches suffer from a fundamental problem regarding the correlation structure of the input variables. Nowadays, non-orthogonal model inputs can be considered the rule rather than the exception in GSA, and it is apparent that if model-factor dependence is belittled or ignored, the GSA outcome will be fallacious. Therefore, GSA results should be considered with caution whenever there is a correlation between certain parameters that is not accounted for during GSA implementation. Accordingly, several scientists have attempted to present or develop GSA methods in light of dependent factors. As an example, a few studies applied parametric and/or nonparametric methods to decompose the variance of the response variable with regard to correlated model inputs (Saltelli et al. 2001; Xu and Gertner 2008; Da Veiga et al. 2009; Chastaing et al. 2012; Li et al. 2010; Xu 2013; Zhang et al. 2015; Mara et al.
2015), whereas some other investigators extended the analytical formulations for non-orthogonal input variables and the associated numerical approaches to calculate the variance-based sensitivity measure and/or more advanced Sobol sensitivity indices (Mara and Tarantola 2012; Kucherenko et al. 2012, 2017; Wang et al. 2018). As the computational budgets associated with most of the above approaches are quite expensive for high-dimensional problems and/or more complex model structures, the development of more innovative and efficient approaches for non-orthogonal inputs is strongly recommended for practical applications (Ge and Menendez 2017; Zhou et al. 2019; Do and Razavi 2020; Lamboni and Kucherenko 2021). Meanwhile, one should also acknowledge how problematic it is for engineers and other investigators with rudimentary knowledge of GSA to use the above methods on a routine basis, given their complicated procedures and concepts. Taking the above elaboration into account, this paper intends to couple the classical variance-based SA, a class of methods customized for independent inputs, with Supervised Principal Component Analysis (SPCA), based on the Hilbert-Schmidt Independence Criterion (HSIC), to address issues concerning computational effort and complexity. To the best of the authors' knowledge, this is the first time that SPCA is implemented in the GSA arena to find the degree of importance of various correlated input variables as they affect the output response. Furthermore, in the conventional literature, a number of classical variance-based approaches (e.g., classical Sobol and/or FAST as numerical approaches) have been proposed to delineate the degree of importance for models with independent input variables. Here, the authors devised an algorithm whereby SPCA is coupled with the classical variance-based SA to obtain the degree of importance for models with correlated inputs.
In other words, the classical variance-based approach, which has been extensively used in the case of orthogonal model inputs, can be used for models with correlated inputs if it is coupled with SPCA. This coupling exercise dramatically reduces the sample size imposed by conventional approaches such as the correlation ratio method. As a result, simplicity and efficiency are considered the two major advantages of the proposed method. Indeed, implementation of this kind of variance-based SA on a regression model derived from SPCA leads to a semi-analytical method, expressed in terms of the components of a dominant eigenvector, for estimating the first-order sensitivity measure with no additional computational budget for a system that contains both correlated and orthogonal inputs. The rest of the paper is organized as follows: first, an overview of GSA based on the variance-based approach is presented in the next section. Then, the paper gives a detailed account of SPCA, followed by the proposed method along with a proof of its validity. The subsequent section is devoted to assessing the performance of the proposed method via five test functions as well as three more realistic models in the hydrologic and hydraulic domains, for which the results of all test cases are compared with the correlation ratio method based on McKay's scheme. Finally, a summary and conclusions are presented in the last section. The methodology presented in this work is intended to couple SPCA with the classical variance-based SA, which is limited to independent inputs. It is important to make this coupling process quite clear. In essence, the classical variance-based methodology (e.g., Sobol and/or FAST) is limited to transfer functions for which the input variables are independent. On the other hand, SPCA considers the impact of correlation among the input variables and a dependent variable using regression-based analysis.
Would it be possible to effectively benefit from the advantage of SPCA and combine it with this particular variance-based SA to contrive a better tool for prioritizing the input variables when they are correlated? With this question in mind, a concise account of the basic concept of variance-based SA is provided first, followed by a discussion detailing the theoretical background of SPCA to determine how this coupling exercise can be implemented in practice. For a computer model with $p$ input variables given by $Y = f(X)$, $X = (X_1, X_2, \ldots, X_p)$, if the input variables are random and free to vary over their whole bound of uncertainty,^1 then the output becomes a random variable and the uncertainty of the model output can be quantified by $V(Y)$. For such a model, the decomposition of the total unconditional variance of the output variable can be written as (Saltelli et al. 2004):

$$V(Y) = V[E(Y \mid X_i)] + E[V(Y \mid X_i)] \qquad (1)$$

Here $V$ conveys the notion of variance, and $E$ represents the expectation operator. Also, $V[E(Y \mid X_i)]$ and $E[V(Y \mid X_i)]$ are called the main and residual effects, respectively (Saltelli et al. 2004). Thus, to quantify the achievable reduction in the variance of the model output, the first-order sensitivity index is computed as the ratio of the main-effect variance (i.e., $V_i = V[E(Y \mid X_i)]$), a variance of a conditional expectation, to the variance of the model output (Saltelli et al. 2004):

$$S_i = \frac{V[E(Y \mid X_i)]}{V(Y)} \qquad (2)$$

It is worth noting that $S_i$ varies between zero and one, i.e., $S_i \in [0, 1]$. In addition, the sensitivity index of a higher-order term, taking into account the interaction of model factors such as $X_i$ and $X_j$ on model output uncertainty, can be written as (Saltelli et al. 2004):

$$S_{ij} = \frac{V[E(Y \mid X_i, X_j)] - V_i - V_j}{V(Y)} \qquad (3)$$

Like $S_i$, $S_{ij}$ varies in the range of zero to one. Saltelli et al. (2004, p. 112) can be consulted for further details regarding the computation of higher-order interactions.
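One simple way to estimate the first-order index of Eq. (2) — not the Sobol/FAST estimators referenced later, but a direct conditioning approach — is to bin the samples on $X_i$, average $Y$ within each bin to approximate $E(Y \mid X_i)$, and take the variance of those conditional means. A minimal pure-Python sketch on a toy additive model (the model $Y = X_1 + 2X_2$, the sample size, and the bin count are all illustrative assumptions):

```python
import random, statistics

random.seed(0)

def model(x1, x2):
    return x1 + 2.0 * x2  # toy additive model; coefficients are arbitrary

n, bins = 200_000, 50
x1 = [random.random() for _ in range(n)]
x2 = [random.random() for _ in range(n)]
y = [model(a, b) for a, b in zip(x1, x2)]
vy = statistics.pvariance(y)

def first_order(xs):
    # approximate E(Y | X_i) by the mean of Y in equal-width bins of X_i,
    # then S_i = V[E(Y | X_i)] / V(Y)
    sums, counts = [0.0] * bins, [0] * bins
    for xi, yi in zip(xs, y):
        b = min(int(xi * bins), bins - 1)
        sums[b] += yi
        counts[b] += 1
    cond_means = [s / c for s, c in zip(sums, counts) if c > 0]
    return statistics.pvariance(cond_means) / vy

s1, s2 = first_order(x1), first_order(x2)
print(round(s1, 2), round(s2, 2))  # analytic values: 0.2 and 0.8
```

For this additive model the two indices sum to one, since there are no interaction terms for Eq. (3) to capture.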
Under the independence assumption, the variance-based sensitivity indices can be estimated numerically via the two most popular techniques, namely Sobol (1993) and FAST (Cukier et al. 1973). If a mathematical model under consideration is assumed to be approximately linear due to the relatively linear impact of the model factors on the output, the given model, i.e., $Y = f(X_1, X_2, \ldots, X_p)$, can be simplified by eliminating the nonlinear or interaction terms as follows (Xu and Gertner 2008):

$$Y = a_0 + \sum_{i=1}^{p} a_i X_i \qquad (4)$$

Later on, we will elucidate how the original nonlinear function can be converted to Eq. (4). Hence, only the main effect of each model input is considered in the corresponding regression model. When the inputs of this linear model are mutually independent, it can be proved that the main effects in reference to the variance decomposition satisfy the following identity (Saltelli et al. 2008):

$$\sum_{i=1}^{p} \beta_i^2 = \sum_{i=1}^{p} \left(\frac{a_i\,\sigma_{X_i}}{\sigma_Y}\right)^2 = 1 \qquad (5)$$

in which $\beta_i = a_i\,\sigma_{X_i}/\sigma_Y$. The $\beta$'s are standardized regression coefficients (SRCs). Actually, in the case of orthogonal model factors, the SRC is considered a method of sensitivity analysis for the model inputs, and in accordance with the identity of Eq. (5), the squared SRCs describe the contribution of each factor to the total variance (Saltelli et al. 2008). To evaluate the performance of the SRCs, it is necessary to appreciate the connection between the SRCs and the measure of goodness of fit, the so-called model coefficient of determination $R^2$ (Saltelli et al. 2004):

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} \qquad (6)$$

where $Y_i$ is the output of the original model, i.e., $Y = f(X_1, X_2, \ldots, X_p)$, and $\hat{Y}_i$ stands for the estimate of $Y_i$ using the regression model, i.e., Eq. (4). This coefficient is an attempt to recognize how well the model under study can be approximated by a regression in the form of Eq. (4) (Weisberg 2005). When the model is linear, using the SRCs as a GSA method can exactly quantify the amount of response-surface variation explained by each model input.
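The SRC identity of Eq. (5) can be checked numerically on a small linear model with independent inputs; the coefficients and standard deviations below are arbitrary illustrative choices:

```python
import random, statistics

random.seed(1)

# linear model Y = 3*X1 + 1*X2 with independent zero-mean normal inputs
a = [3.0, 1.0]
sig = [1.0, 2.0]
n = 100_000
cols = [[random.gauss(0.0, s) for _ in range(n)] for s in sig]
y = [a[0] * u + a[1] * v for u, v in zip(*cols)]

sigma_y = statistics.pstdev(y)
src = [ai * si / sigma_y for ai, si in zip(a, sig)]  # beta_i = a_i * sigma_Xi / sigma_Y

# for an exactly linear model with independent inputs, sum(beta_i^2) = 1 (Eq. 5);
# analytic shares here: beta_1^2 = 9/13, beta_2^2 = 4/13
print([round(b * b, 3) for b in src], round(sum(b * b for b in src), 3))
```

The squared SRCs are exactly the first-order indices of Eq. (2) in this linear, independent-input setting.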
^1 The knowledge about the range of uncertainty of model inputs can be obtained from various sources such as the one elaborated by Saltelli et al. (2004), which reads as: "measurements, expert opinion, physical bounds or an analogy with factors for similar species/compounds. We may additionally have information (e.g., via observation) on the joint probability distribution of the factors". The SPCA, pioneered by Barshan et al. (2011), laid the foundation for this approach in the field of supervised methods. Prior to the cited contribution, a considerable number of dimensionality reduction techniques based on supervised methods could only consider similarities and differences for classification purposes. This property of the cited supervised methods is in contrast with Barshan et al.'s approach, which also examines the quantitative value of the target variables. As a result, it is applicable to both classification and regression problems. Their ideas can be considered a paradigm shift for the future activities of researchers carrying out prediction based on regression approaches. Their studies refer to the work of Gretton et al. (2005), who proposed an independence criterion in Reproducing Kernel Hilbert Spaces (RKHS). This criterion measures the dependency between two random variables according to the Hilbert-Schmidt Independence Criterion (HSIC). In reference to HSIC, the two random variables are independent if and only if the value of this statistic (i.e., HSIC) is zero. A theoretical expression for HSIC exists, but it is impractical in actual settings. Consequently, the empirical estimate of HSIC proposed by Gretton et al. (2005), as a practical criterion, can be implemented to check the dependence or independence of two random variables with a finite number of observations.
Thus, this criterion, i.e., the empirical estimate of HSIC for a series of $n$ observations $Z := \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, is as follows (Gretton et al. 2005):

$$\mathrm{HSIC}(Z; \mathcal{F}, \mathcal{G}) = \frac{1}{(n-1)^2}\,\mathrm{tr}(KHLH) \qquad (7)$$

where $\mathcal{F}$ is an RKHS^2 in which for each point $x \in \mathcal{X}$ there exists an element $\phi(x) \in \mathcal{F}$ such that $\langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{F}} = k(x_i, x_j)$, and $k$ is a positive definite kernel, i.e., for $n \geq 1$ and $b_1, b_2, \ldots, b_n \in \mathbb{R}$ (Hein and Bousquet 2004):

$$\sum_{i=1}^{n} \sum_{j=1}^{n} b_i\,b_j\,k(x_i, x_j) \geq 0 \qquad (8)$$

Likewise, $\mathcal{G}$ is an RKHS in which for each point $y \in \mathcal{Y}$ there is an element $\omega(y) \in \mathcal{G}$ such that $\langle \omega(y_i), \omega(y_j) \rangle_{\mathcal{G}} = l(y_i, y_j)$, and $l$ is a positive definite kernel. It is necessary to recall that $\mathcal{F}$ and $\mathcal{G}$ have to be separable; indeed, they must have complete orthonormal systems. Briefly, $K, L, H \in \mathbb{R}^{n \times n}$, $K_{ij} := k(x_i, x_j)$, $L_{ij} := l(y_i, y_j)$, and $H := I - \frac{1}{n}ee^T$ (the centering matrix, with $e$ a vector of ones). Finally, tr stands for the trace of a matrix. In SPCA, the subspace spanned by the new features is chosen such that the principal components of the input variables with maximum dependency on the response surface are retained, as measured by the empirical HSIC. Actually, for a model with $p$ standardized inputs $\tilde{X} = (\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_p)$ and $l$ standardized outputs $\tilde{Y} = (\tilde{Y}_1, \tilde{Y}_2, \ldots, \tilde{Y}_l)$, to maximize the dependency between the projected data ($\tilde{Z} = \tilde{X}U$)^3 and the response surface ($\tilde{Y}$), it is necessary to maximize $\mathrm{tr}(KHLH)$. This can be justified by referring to the empirical HSIC cited above (Barshan et al. 2011). To maximize $\mathrm{tr}(KHLH)$, $\tilde{Z}\tilde{Z}^T$ is substituted for the kernel matrix $K$ to get (Barshan et al. 2011):

$$\mathrm{tr}(KHLH) = \mathrm{tr}(\tilde{X}UU^T\tilde{X}^T HLH) = \mathrm{tr}(U^T \tilde{X}^T H L H \tilde{X} U) \qquad (9)$$

where the last step follows from the cyclic property of the trace (Strang 2016).^4 Here $L$ is a kernel matrix of $\tilde{Y}$ (e.g., $\tilde{Y}\tilde{Y}^T$), and $H = I - \frac{1}{n}ee^T$. Ultimately, the optimization of the HSIC objective function is accompanied by the following constraint (Barshan et al.
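The empirical estimate in Eq. (7) is straightforward to compute directly. A pure-Python sketch with linear kernels on scalar samples (the sample size, seed, and toy dependence structure are illustrative assumptions; with linear kernels the statistic essentially reduces to a squared sample covariance, but the trace form is computed to mirror the definition):

```python
import random

random.seed(2)

def hsic_linear(x, y):
    # empirical HSIC of Eq. (7), tr(KHLH)/(n-1)^2, with linear kernels
    # k(a, b) = a*b and l(a, b) = a*b on scalar samples
    n = len(x)
    K = [[a * b for b in x] for a in x]
    L = [[a * b for b in y] for a in y]

    def double_center(M):  # computes H M H without forming H explicitly
        m = len(M)
        row = [sum(r) / m for r in M]
        col = [sum(M[i][j] for i in range(m)) / m for j in range(m)]
        tot = sum(row) / m
        return [[M[i][j] - row[i] - col[j] + tot for j in range(m)] for i in range(m)]

    Kc = double_center(K)
    # tr(KHLH) = tr((HKH) L) by the cyclic property of the trace
    tr = sum(Kc[i][j] * L[j][i] for i in range(n) for j in range(n))
    return tr / (n - 1) ** 2

n = 400
x = [random.gauss(0, 1) for _ in range(n)]
y_dep = [xi + 0.1 * random.gauss(0, 1) for xi in x]  # strongly dependent on x
y_ind = [random.gauss(0, 1) for _ in range(n)]       # independent of x
print(hsic_linear(x, y_dep), hsic_linear(x, y_ind))
```

The dependent pair yields a value near 1 here, while the independent pair yields a value near zero, consistent with HSIC as a dependence measure.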
2011):

$$\arg\max_{U}\ \mathrm{tr}(U^T \tilde{X}^T H L H \tilde{X} U) \quad \text{subject to} \quad U^T U = I \qquad (10)$$

The optimal solution of this optimization problem is given by the unit eigenvectors of the real and symmetric matrix $Q = \tilde{X}^T H L H \tilde{X}$. Taking the components of the unit eigenvectors as the decision variables, the optimal solution is $U = [U_1, U_2, \ldots, U_m]$ corresponding to $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_m$, which are selected among the $p$ eigenvalues. Here, $m$ is the dimension of the eigenspace (Barshan et al. 2011). The beauty of the above mathematical argument can be justified as follows: if the kernel $L$ is equal to the identity matrix $I$, the matrix $\tilde{X}^T H L H \tilde{X}$ is equivalent to the covariance matrix of $\tilde{X}$, since $H^T = H$; i.e., SPCA reduces to the PCA method (Barshan et al. 2011). It is worth noting that principal component analysis can be used for different purposes. For example, Martin-Barreiro et al. (2021) applied modern PCA techniques, i.e., disjoint and functional principal component analysis, to classify the South American countries according to the infections and deaths caused by COVID-19. In addition, Ramirez-Figueroa et al. (2021) suggested a novel approach to disjoint principal component analysis based on particle swarm optimization. In their proposed method, the predictor variables that contribute most to each principal component are selected, thereby addressing a central problem in multivariate statistics. As mentioned in the preceding section, the novelty of this study lies in implementing the classical variance-based SA (applicable under the assumption of independent inputs) on the regression model extracted from SPCA for models with correlated inputs.

^2 The necessary background on this concept is provided by Gretton et al. (2005).
^3 Matrix U is a modal matrix consisting of m orthogonal unit eigenvectors that map the data sets to a new space in which the features are uncorrelated.
^4 $\mathrm{tr}(A_{n \times m} B_{m \times n}) = \mathrm{tr}(B_{m \times n} A_{n \times m})$
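When $L = \tilde{Y}\tilde{Y}^T$ with a single output and the data are centered, $Q$ is rank one: $Q = vv^T$ with $v_i = \sum_k \tilde{x}_{ki}\tilde{y}_k$, so its dominant unit eigenvector can be obtained without a general eigensolver. A pure-Python sketch (the two-input toy model and its 0.8 input correlation are illustrative assumptions):

```python
import math, random, statistics

random.seed(3)

# two correlated inputs and a linear toy output
n = 5_000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + 0.6 * random.gauss(0, 1) for a in x1]  # corr(x1, x2) = 0.8
y = [2 * a + b for a, b in zip(x1, x2)]

def standardize(v):
    m, s = statistics.mean(v), statistics.pstdev(v)
    return [(t - m) / s for t in v]

xt = [standardize(x1), standardize(x2)]
yt = standardize(y)

# with L = y~ y~^T and centered data, Q = X~^T H L H X~ = v v^T where
# v_i = sum_k x~_ki y~_k, so the only nonzero eigenvalue is ||v||^2 and
# its unit eigenvector is v / ||v||
v = [sum(a * b for a, b in zip(col, yt)) for col in xt]
norm = math.sqrt(sum(c * c for c in v))
u1 = [c / norm for c in v]   # dominant unit eigenvector of Q
lam1 = norm ** 2             # dominant (only nonzero) eigenvalue
print([round(c, 3) for c in u1])
```

For this toy model, the components work out to roughly (0.73, 0.68), i.e., proportional to the correlations of each input with the output.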
In subsequent subsections, the derivation of the regression-based SPCA and the details of how to couple it with a typical variance-based SA will be provided. Considering the road map offered above, consider the regression model cited in Eq. (4); as mentioned, this linear regression model is obtained in reference to the approximately linear impact of the inputs on the response variable, i.e., $Y = f(X_1, X_2, \ldots, X_p)$. To use this regression model, samples of the model inputs are required first. For this purpose, Latin Hypercube Sampling (LHS) can be used to generate various realizations of the input variables based on their probability density functions^5 for both correlated and uncorrelated model inputs (Iman and Conover 1982). This method is also recognized as an inverse Nataf transformation (Nataf 1962). Then, the original model is run using these samples, and a one-dimensional response surface (Y) is generated. As soon as the multi-input single-output realizations are generated, the regression model stipulated in Eq. (4) can be produced. When the independence assumption is violated, i.e., the model inputs are correlated, ordinary least squares gives a poor estimate of the parameters of this equation, i.e., the $a_i$; thus, performing SA on this linear model would produce an unjustified solution. Although the Principal Component Regression (PCR) method can overcome this problem, coupling it with a classical variance-based approach (e.g., Sobol and/or FAST) has a drawback that will be highlighted later in this study. As a result, SPCA using a linear kernel $L$ on the standardized response variable $\tilde{Y}$ (i.e., $\tilde{Y}\tilde{Y}^T$) was chosen to estimate the coefficients of a simple regression model. For this purpose, $\tilde{Y}$ is expressed as a linear combination of new features $Z_j$, as stipulated in Eq. (15):

$$\tilde{Y} = \sum_{j=1}^{m} \beta_j Z_j \qquad (15)$$

in which the $Z_j$ $(j = 1, 2, \ldots, m;\ m < p)$ are a set of new features taking into account the impact of the dependent variable $\tilde{Y}$.
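The LHS step mentioned above can be sketched for uncorrelated uniform inputs; the correlation-inducing step (the Iman-Conover rank rearrangement / inverse Nataf transformation) is omitted here, and the helper name `lhs_uniform` is a hypothetical one introduced for illustration:

```python
import random

random.seed(5)

def lhs_uniform(n, p):
    # one LHS design on [0, 1)^p: per dimension, draw one point from each of
    # the n equal-probability strata, then randomly permute the strata so the
    # pairing across dimensions is random
    cols = []
    for _ in range(p):
        perm = list(range(n))
        random.shuffle(perm)
        cols.append([(k + random.random()) / n for k in perm])
    return list(zip(*cols))  # n sample points

pts = lhs_uniform(10, 2)
# each dimension is guaranteed to cover every stratum [k/10, (k+1)/10) exactly once
for d in range(2):
    print(sorted(int(pt[d] * 10) for pt in pts))
```

Non-uniform marginals are obtained by pushing each coordinate through the inverse CDF of the desired distribution.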
In addition, $m$ is the number of supervised principal components. The procedure for deriving Eq. (15) is explained in appendix A. To express $Y$ as a linear combination of the $X_i$, a linear transformation is used to convert $Z_j$ to $X_i$ as follows (Barshan et al. 2011):

$$Z_j = \sum_{i=1}^{p} u_{ij}\,\tilde{X}_i \qquad (16)$$

Via eigenvalue analysis, it is proved in appendix B that after replacing matrix $L$ with $\tilde{Y}\tilde{Y}^T$ in matrix $Q$, the resulting matrix has only one nonzero eigenvalue ($\lambda_1$). In appendix B, this unique eigenvalue and the corresponding eigenvector of matrix $Q$ are rationalized in a meaningful manner. Thus, the associated unit eigenvector, $U_1 = [u_{11}, u_{21}, \ldots, u_{p1}]^T$, is chosen for projecting the explanatory variables $X$. As the corresponding new feature ($Z_1$) has maximum linear dependency on $\tilde{Y}$, the other new features can be eliminated from the model due to their zero correlation with the output, which is proved in appendix B. Meanwhile, in linear regression, when the model variables are uncorrelated with zero means, the regression coefficients of Eq. (15) can be estimated as follows (Bedford 1998; Xu and Gertner 2008):

$$\beta_j = \frac{\mathrm{COV}(\tilde{Y}, Z_j)}{V(Z_j)} \qquad (17)$$

As proved in appendix B, $\mathrm{COV}(\tilde{Y}, Z_j) = 0$ for $j \neq 1$; thus $\hat{\beta}_j = 0$ for $j \neq 1$. Consequently, the regression-based SPCA model simplifies to:

$$\tilde{Y} = \beta_1 Z_1 \qquad (18)$$

After replacing $Z_1$ and $\beta_1$ with their equivalents [i.e., Eq. (16) and Eq. (17)] in Eq. (18), the regression-based SPCA model can be expressed in terms of the standardized variables as:

$$\tilde{Y} = \beta_1 \sum_{i=1}^{p} u_{i1}\,\tilde{X}_i \qquad (19)$$

where the $u_{i1}$ are the entries of the eigenvector $U_1$ corresponding to the maximum eigenvalue $\lambda_1$. As a result, Eq. (19) can be expressed in terms of the original variables $Y$ and $X_i$ as:

$$\frac{Y - \mu_Y}{\sigma_Y} = \beta_1 \sum_{i=1}^{p} u_{i1}\,\frac{X_i - \mu_{X_i}}{\sigma_{X_i}} \qquad (20)$$

The above equation can be rewritten as:

$$Y = a_0 + \sum_{i=1}^{p} a_i X_i, \qquad a_i = \frac{\beta_1\,u_{i1}\,\sigma_Y}{\sigma_{X_i}} \qquad (21)$$

Hence, Eq. (21) is considered the regression-based SPCA model for $Y = f(X)$. At this stage, the classical variance-based SA coupled with SPCA can be effectively utilized to investigate the impact of $X_i$ on $Y$.
Before mentioning this task, it might help to summarize the main point of the above mathematical manipulations for practical purposes. Suppose we are given a functional relationship in which the predictor variables are correlated. We then have three options for investigating the impact of the various predictor variables on the output. According to the first option, the correlation amongst the inputs might be ignored, and the classical variance-based SA (applicable under the assumption of independent inputs, e.g., Sobol and/or FAST) can be implemented directly on the original model with correlated inputs. The most important variables can then be reported; however, the results are not justified and cannot be defended. According to the second option, methods from the literature such as the developed variance-based SA (derived under the assumption of correlated model inputs, e.g., the correlation ratio method) can be applied directly to the original model, which leads to acceptable results for recognizing the importance of correlated inputs; however, these methods may involve complex procedures, because re-coding their algorithms may not be straightforward (Saltelli et al. 2004), or they may be rather time-consuming in the case of high-dimensional and/or computationally expensive models^6 (Ge and Menendez 2017). According to the third option, the regression-based SPCA can be implemented to produce Eq. (21), and the classical variance-based SA can then be imposed on this regression^7 model to obtain meaningful and justifiable results for prioritizing the correlated inputs with respect to the uncertainty of the model output. In fact, in the case of a model with correlated inputs, we reuse the classical variance-based SA method, which was proposed for models with independent inputs, but we conduct it on the regression-based SPCA model instead of the original model. We argue and prove that prioritization in reference to the third option gives rise to acceptable results for models with correlated inputs.
In the next section, we show this coupling to be more straightforward and computationally efficient compared with the developed variance-based methods. Following the third option, estimation of the sensitivity measures based on Eq. (5) using the regression-based SPCA model can be written as:

$$S_i = \frac{a_i^2\,V(X_i)}{V(Y)_{\mathrm{unc}}} = \frac{\beta_1^2\,u_{i1}^2\,\sigma_Y^2}{\beta_1^2\,\sigma_Y^2 \sum_{j=1}^{p} u_{j1}^2} \qquad (22)$$

where $V(Y)_{\mathrm{unc}}$ stands for the estimate of $V(Y)$ without considering the correlation among the inputs. Since $U^T U = I$, the constraint $U_1^T U_1 = 1$ is satisfied due to the orthonormal nature of the eigenvectors; thus:

$$S_i = \frac{u_{i1}^2}{U_1^T U_1} \qquad (23)$$

After proper simplification, we have:

$$S_i = u_{i1}^2 \qquad (24)$$

This equation demonstrates the simplicity of the proposed method^8 for a linear or nonlinear function with correlated input variables. The simplicity refers to the fact that the sensitivity measures $S_i$ are equal to the square of the $i$th component of the unit eigenvector corresponding to the dominant eigenvalue (i.e., the first new feature). It is worth noting that the classical Sobol or FAST, as numerical methods of classical variance-based SA (applicable under the assumption of independent model inputs), can also be conducted on Eq. (21), and the results of this coupling can be examined numerically.

^6 Generally, the researcher is advised to prioritize the correlated variables only when necessary, as the convergence of sensitivity estimates is much slower for correlated inputs (Saltelli et al. 2004).
^7 "The regression algorithm returns a regression meta-model, whereby the output Y is described in terms of a linear combination of the input factors" (Saltelli et al. 2004). Indeed, in a meta-modeling approach, instead of the original system, the effective model maps inputs to output; the importance of the model inputs is then examined using the constructed surrogate model. Conversely, in a classical approach, the original system and the corresponding model output values are directly used to estimate the sensitivity indices (Li et al. 2010).
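The collapse from Eq. (22) to Eq. (24) can be verified numerically: fitting the SPCA regression and applying the ratio-form index to its coefficients gives exactly the squared eigenvector components, since the factor $\beta_1^2$ cancels. A pure-Python sketch (the toy model, sample size, and correlation are illustrative assumptions):

```python
import math, random, statistics

random.seed(4)

# correlated inputs and a linear toy model
n = 20_000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.6 * a + 0.8 * random.gauss(0, 1) for a in x1]  # corr(x1, x2) = 0.6
y = [a + 0.5 * b for a, b in zip(x1, x2)]

def standardize(v):
    m, s = statistics.mean(v), statistics.pstdev(v)
    return [(t - m) / s for t in v]

xt, yt = [standardize(x1), standardize(x2)], standardize(y)

# dominant unit eigenvector of the rank-one Q: u_i1 proportional to cov(y~, x~_i)
v = [sum(a * b for a, b in zip(col, yt)) / n for col in xt]
norm = math.sqrt(sum(c * c for c in v))
u1 = [c / norm for c in v]

# fit the SPCA regression y~ = b1*z1, form a_i = b1*u_i1, and apply the
# ratio-form index of Eq. (22): the b1 factor cancels, leaving S_i = u_i1^2
z1 = [sum(u * col[k] for u, col in zip(u1, xt)) for k in range(n)]
b1 = sum(a * b for a, b in zip(yt, z1)) / sum(a * a for a in z1)
a_coef = [b1 * u for u in u1]
tot = sum(c * c for c in a_coef)
S_num = [c * c / tot for c in a_coef]   # Eq. (22) on standardized variables
S_ana = [u * u for u in u1]             # Eq. (24)
print(all(abs(p - q) < 1e-9 for p, q in zip(S_num, S_ana)))
```

The agreement is exact up to floating-point rounding, which is the sense in which the method is "semi-analytical": no extra model runs are needed beyond those used to build the regression.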
It is speculated that there is only a small difference between the numerical results and the analytical solution emerging from Eq. (24). At this stage, a serious question may arise: while the classical variance-based SA (e.g., Sobol and/or FAST) cannot be implemented on correlated data emerging from a linear or nonlinear model, how and why can such a tool be safely implemented on the regression-based SPCA model for evaluating the importance of correlated predictor variables? The following mathematical elaboration is intended to address this question. To this end, it was first shown in appendix B that there is a relationship between the first eigenvalue of matrix $Q$ assuming $L = \tilde{Y}\tilde{Y}^T$, the components $u_{i1}$ of the unit eigenvector associated with the first eigenvalue, and $\mathrm{COV}(\tilde{Y}, \tilde{X}_i)$, as follows:

$$u_{i1} = \frac{\mathrm{COV}(\tilde{Y}, \tilde{X}_i)}{\sqrt{\lambda_1}} \qquad (25)$$

Equation (25) is substituted into Eq. (24). As a result, the first-order sensitivity measure can also be computed as follows:

$$S_i = u_{i1}^2 = \frac{\mathrm{COV}^2(\tilde{Y}, \tilde{X}_i)}{\lambda_1} \qquad (26)$$

Equation (26) can be restated as:

$$S_i = \left(\frac{b_i\,\sigma_{X_i}}{\sigma_Y}\right)^2, \qquad b_i = \frac{\sigma_Y\,\mathrm{COV}(\tilde{Y}, \tilde{X}_i)}{\sigma_{X_i}\sqrt{\lambda_1}} \qquad (27)$$

In what follows, we prove that the coefficient $b_i$ is approximately equivalent to the corresponding coefficient obtained from the regression-based SPCA model, i.e., $a_i$. It is therefore practical to start with the coefficients of the regression-based SPCA model in reference to Eq. (21):

$$a_i = \frac{\beta_1\,u_{i1}\,\sigma_Y}{\sigma_{X_i}} \qquad (28)$$

The coefficient $a_i$ can be simplified via the following mathematical manipulations. After substituting the standardized variables into $Z_1$, $\mathrm{COV}(\tilde{Y}, Z_1)$ can be written as:

$$\mathrm{COV}(\tilde{Y}, Z_1) = \sum_{i=1}^{p} u_{i1}\,\mathrm{COV}(\tilde{Y}, \tilde{X}_i) \qquad (29)$$

Following Eq. (25), $\mathrm{COV}(\tilde{Y}, Z_1)$ becomes:

$$\mathrm{COV}(\tilde{Y}, Z_1) = \frac{1}{\sqrt{\lambda_1}} \sum_{i=1}^{p} \mathrm{COV}^2(\tilde{Y}, \tilde{X}_i) \qquad (30)$$

As proved in appendix B, the dominant eigenvalue $\lambda_1$ is:

$$\lambda_1 = \sum_{i=1}^{p} \mathrm{COV}^2(\tilde{Y}, \tilde{X}_i) \qquad (31)$$

After combining Eqs. (30) and (31), $\mathrm{COV}(\tilde{Y}, Z_1)$ can be written as:

$$\mathrm{COV}(\tilde{Y}, Z_1) = \sqrt{\lambda_1} \qquad (32)$$

Now, to further simplify the coefficients of the regression-based SPCA model, $V(Z_1)$ must be computed. For this purpose, considering Eq. (18) once again and taking the variance of both sides results in:

$$V(\tilde{Y}) = \beta_1^2\,V(Z_1) \qquad (33)$$

Since $\tilde{Y}$ is a standardized variable, $V(\tilde{Y}) = 1$. Therefore, by substituting Eq. (32) into Eq. (33) [via $\beta_1 = \mathrm{COV}(\tilde{Y}, Z_1)/V(Z_1)$], we have:

$$V(Z_1) = \lambda_1 \qquad (34)$$

Equation (34) holds exactly for linear models. However, as one departs from linearity, $V(Z_1)$ departs from the dominant eigenvalue $\lambda_1$; as a result, $V(Z_1) \approx \lambda_1$. After replacing $u_{i1}$ and $\mathrm{COV}(\tilde{Y}, Z_1)$ with their equivalents, i.e., Eq. (25) and Eq. (32), respectively, in Eq. (28), the coefficients $a_i$ simplify to the following relation:

$$a_i = \frac{\mathrm{COV}(\tilde{Y}, \tilde{X}_i)\,\sigma_Y}{V(Z_1)\,\sigma_{X_i}} \qquad (35)$$

Based on Eq. (34) and the subsequent comment, the coefficients $a_i$ can be further simplified to:

$$a_i \approx \frac{\mathrm{COV}(\tilde{Y}, \tilde{X}_i)\,\sigma_Y}{\lambda_1\,\sigma_{X_i}} \qquad (36)$$

Therefore, in reference to the above equation, after implementing the classical variance-based SA on the regression-based SPCA model, the first-order sensitivity measure can be written as:

$$S_i = \frac{a_i^2\,V(X_i)}{\sum_{j=1}^{p} a_j^2\,V(X_j)} = \frac{\mathrm{COV}^2(\tilde{Y}, \tilde{X}_i)}{\sum_{j=1}^{p} \mathrm{COV}^2(\tilde{Y}, \tilde{X}_j)} \qquad (37)$$

Finally:

$$S_i = u_{i1}^2 \qquad (38)$$

Moreover, in appendix C, the first-order sensitivity measure [i.e., Eq. (38)] is related to two sources of variability, namely the variability due to each variable itself and the variability due to co-variation between the variable under consideration and the other predictor variables, as stipulated below:

$$\mathrm{COV}(\tilde{Y}, \tilde{X}_i) = \tilde{a}_i\,V(\tilde{X}_i) + \sum_{j \neq i} \tilde{a}_j\,\mathrm{COV}(\tilde{X}_i, \tilde{X}_j), \qquad \tilde{a}_j = \beta_1 u_{j1} \qquad (39)$$

It is worth noting that the derived covariance decomposition is in line with the covariance decomposition first proposed by Li et al. (2010). In short, referring to the question raised in the theoretical background section, the above mathematical manipulations imply that while the classical variance-based SA (derived under the assumption of independent factors) cannot be used directly on the original model with correlated inputs, it is possible to effectively benefit from the advantage of SPCA and combine it with this type of variance-based SA to devise a better tool for prioritizing the input variables when they are correlated. To dig further into the proposed method, the following algorithm documents the coupling of the classical variance-based SA with the regression-based SPCA in more detail.
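The two-source split of Eq. (39) is a direct consequence of the bilinearity of covariance, so it also holds exactly for sample covariances. A quick numerical check on a two-input linear model (the coefficients and dependence structure are arbitrary illustrative choices):

```python
import random, statistics

random.seed(7)

# linear model Y = a1*X1 + a2*X2 with correlated inputs
a1, a2 = 2.0, 1.0
n = 50_000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * u + random.gauss(0, 1) for u in x1]
y = [a1 * u + a2 * v for u, v in zip(x1, x2)]

def cov(u, v):
    mu, mv = statistics.mean(u), statistics.mean(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v)) / len(u)

# Eq. (39)-style split: cov(Y, X1) = a1*V(X1) + a2*cov(X1, X2),
# i.e., the variable's own contribution plus the co-variation contribution
lhs = cov(y, x1)
rhs = a1 * cov(x1, x1) + a2 * cov(x1, x2)
print(lhs, rhs)
```

The two sides agree to floating-point precision; for uncorrelated inputs the co-variation term vanishes and only the variable's own contribution remains.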
It is worth noting that if we have a system for which the distributions of the model inputs and their correlation structure are given (i.e., a system with simulated data), the first three steps of the algorithm are omitted and one starts from the next step. After proposing the above approach, an important question can be raised by the reader: is it possible to conduct the classical variance-based SA (applicable under the assumption of independent inputs, e.g., Sobol and/or FAST) on PCR (a special form of SPCA) to differentiate between the important and irrelevant variables in models with correlated inputs? It depends on the nature of the problem at hand. The correlation of the new predictor variables with the output, or the lack of it, can be the cause of success or failure. The process for addressing this issue is very similar to what was done while coupling the classical variance-based SA with the regression-based SPCA model. In order to keep the integrity of the material in place, interested readers are referred to appendix D for further details regarding the coefficients in Eq. (40). In summary, the PCR model can be expressed as a linear combination of the original input variables as follows:

$$Y = c_0 + \sum_{i=1}^{p} c_i X_i \qquad (40)$$

Thus, after coupling the variance-based SA (under the assumption of independent inputs) with PCR, the first-order sensitivity measure can be written as:

$$S_i = \frac{c_i^2\,V(X_i)}{\sum_{j=1}^{p} c_j^2\,V(X_j)} \qquad (41)$$

After some manipulation, the above equation is simplified as:

$$S_i = \frac{\left(\sum_{j=1}^{m} b_j\,u_{ij}\right)^2}{\sum_{k=1}^{p}\left(\sum_{j=1}^{m} b_j\,u_{kj}\right)^2} \qquad (42)$$

in which the $b_j$ are the regression coefficients of the $m$ retained principal components and the $u_{ij}$ are the entries of the corresponding eigenvectors. In subsequent sections, a few numerical test cases are devised to either confirm or refute the validity of the above coupling exercise. The procedure of coupling the variance-based SA with PCR follows an algorithm analogous to algorithm 1, whose final step (step 11) reports the importance of the correlated inputs based on the sensitivity measures of Eq. (42) computed in step 10. As with algorithm 1, the first three steps are ignored for a system with simulated data.
To better appreciate the proposed method, i.e., coupling the classical variance-based SA with the regression-based SPCA model, and to compare it with the variance-based PCR method, the flowchart in Fig. 1 can be used. In this section, the proposed method based on coupling the classical variance-based SA (applicable under the assumption of independent inputs) with either regression-based SPCA or PCR is applied to five test cases as well as three simple hydrologic and hydraulic models to evaluate the effectiveness of the corresponding modified model. From the theoretical background section, it is clear that implementation of either PCR or regression-based SPCA calls for the generation of realizations of the various input variables. This task, i.e., the generation of correlated realizations, is performed via SIMLAB 2.2 (The SAMO team, 2003). As the governing regression model (e.g., regression-based SPCA) is linear, the number of realizations needed to build this regression can be kept quite small (Song et al. 2015), e.g., 500 samples for low-dimensional models and 2000 samples in the case of high dimensionality. Meanwhile, a few test cases were run with different sample sizes to examine the impact of the number of realizations on the results of the proposed scheme. It should be remembered that after conducting eigenvalue analysis on the matrix Q extracted from the SPCA method, this matrix has only one dominant eigenvalue and associated unit eigenvector. When it comes to PCR, however, there is more than one dominant eigenvalue; for this reason, the impact of a different number of components on model performance was investigated. It is worth noting that if all components are included in the regression model, the PCR coefficients are equivalent to those of ordinary least squares (Jolliffe 1986).
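Generation of correlated realizations with prescribed Spearman rank correlations can be emulated in a few lines. SIMLAB itself uses Iman-Conover rank-correlation induction; the sketch below is a Gaussian-copula stand-in that produces uniform marginals (any marginal inverse CDF can be applied afterwards):

```python
import numpy as np

def correlated_samples(rank_corr, n, seed=0):
    """Gaussian-copula stand-in for SIMLAB's correlated sampling.

    SIMLAB 2.2 uses Iman-Conover rank-correlation induction; this sketch
    only approximates that behaviour.
    """
    rng = np.random.default_rng(seed)
    # Target Spearman rho -> Pearson rho of the underlying Gaussian
    # (this conversion is exact for the Gaussian copula).
    pearson = 2 * np.sin(np.pi * np.asarray(rank_corr) / 6)
    z = rng.multivariate_normal(np.zeros(len(pearson)), pearson, size=n)
    # rank-transform each column to uniforms on (0, 1)
    return (np.argsort(np.argsort(z, axis=0), axis=0) + 0.5) / n
```

Because the output columns are rank-based uniforms, their Pearson correlation approximates the requested Spearman correlation of the sample.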
Although we analytically touched on the validity of the proposed scheme, the correlation ratio method, implemented on the original model, was also used as a benchmark solution to further evaluate the hypothesis incorporated into the various test cases. This benchmark, which is based on McKay's approach (McKay 1997; Saltelli et al. 2001), is recognized as a valid approach due to its nonparametric nature (Xu and Gertner 2008). For this reason, this benchmark solution, which is extracted from SIMLAB 2.2, is suitable for nonlinear models even in the presence of strong nonlinear effects. However, as the method uses replicated LHS, a large sample size is required to achieve suitable accuracy. In this work, like Xu and Gertner (2008), we use 100 replications, each with a sample size of 500 (a total of 50,000 model runs), for most of the test functions, and 150,000 samples (150 replications, each with a sample size of 1000) for the Sobol G function with 35 correlated input variables. Finally, the negative impact of ignoring the correlation between inputs is examined by performing the Sobol numerical variance-based SA (which assumes the model inputs are independent) on all test functions. It is helpful to note that the proposed method was written in the R programming environment. The first test case, which had also been used by Li et al. (2010), is a simple and additive model with five inputs: In this case, the marginal distribution of each input variable is normal with a mean of 0.5 and a unit standard deviation, i.e., X_i ~ N(0.5, 1). The Spearman Rank Correlation Coefficient (SRCC) matrix of the input variables is defined as follows: Clearly, in the case of independent inputs, each variable would have the same impact on model output uncertainty due to their equal SRCCs.
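McKay's correlation ratio η² = Var(E[Y|X_i])/Var(Y), which underlies the benchmark, can be estimated crudely by conditioning on quantile bins. The sketch below is a simplified stand-in for the replicated-LHS estimator used by SIMLAB, adequate for smooth, continuous inputs:

```python
import numpy as np

def correlation_ratio(xi, y, bins=20):
    """Binned estimate of McKay's correlation ratio Var(E[Y|X_i]) / Var(Y).

    Simplified stand-in: the paper's benchmark uses replicated LHS,
    not binning.
    """
    # quantile bin edges so each bin holds roughly the same number of points
    edges = np.quantile(xi, np.linspace(0, 1, bins + 1))
    idx = np.clip(np.digitize(xi, edges[1:-1]), 0, bins - 1)
    cond_means = np.array([y[idx == k].mean() for k in range(bins)])
    weights = np.bincount(idx, minlength=bins) / len(y)
    # variance of the conditional means, normalized by the total variance
    return np.sum(weights * (cond_means - y.mean())**2) / y.var()
```

For an influential input the ratio approaches one; for an irrelevant input it hovers near (bins − 1)/n, which is why the benchmark needs large replicated samples for accuracy.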
Taking the correlation structure of the input variables into account, each variable has a different impact. Table 1 summarizes the results of sensitivity analysis using several approaches: the Sobol method implemented on the original model, the proposed approach, three flavors of variance-based PCR, the benchmark solution implemented on the original model, and the approximate equivalent of the proposed method (S_tc) obtained from the covariance decomposition implemented on the regression-based SPCA model. In reference to the numerical values of S_i for the various approaches and input variables in Table 1, after clustering the input variables into three clusters, the benchmark solution places X_1, X_2 in cluster (1), X_3 in cluster (2), and X_4, X_5 in cluster (3). It is evident that the ordering in the proposed approach and its equivalent (S_tc) is consistent with the benchmark solution. However, neither Sobol implemented on the original model nor the three flavors of variance-based PCR managed to reproduce the benchmark solution. The change in numerical values of S_i from one approach to another is tied to the approach itself, and one should not expect similar values to emerge from each approach. Indeed, for sensitivity analysis purposes, it is the detection of the important input variables and their rankings under each approach that matters. Figure 2 investigates the impact of the number of realizations on the sensitivity measures of the proposed method. As the figure shows, the number of realizations has minimal effect on S_i (i.e., on the aforementioned clustering), and one can safely choose a small number of realizations (e.g., 500) for sensitivity analysis. Like the preceding test case, the second test case is also adopted from Li et al. (2010) as follows: Except for the equation, all other conditions of this case are very similar to the first test case.
The results of the estimated sensitivity measures based on the methods cited in Test case 1 are summarized in Table 2. According to this table, the rankings of the model variables based on the proposed method are very similar to those of the correlation ratio scheme. Furthermore, the importance of the inputs in reference to the proposed method is the same as that obtained from S_tc. Likewise, the performance of coupling PCR with the classical variance-based SA in recognizing the sensitivity ordering of the input variables agrees with the benchmark. In this test case, it seems the correlation structure has a minimal impact on prioritizing the input variables; the model structure, however, changes the destiny of the variables with regard to their importance. To evaluate the performance of the proposed method when the transfer function is nonlinear with correlated inputs, the simplest nonlinear model with three inputs, first proposed by Xu and Gertner (2008), is considered in this study: The probability density function of each input is uniform, i.e., X_i ~ U(1, 10), and the SRCC matrix of the model inputs is: Table 3 shows the results of SA for the methods cited earlier. As the table shows, the most important input variable according to the Sobol scheme is X_3. However, both the proposed scheme and the benchmark solution found the most important variable to be X_1. Moreover, the result is quite consistent with S_tc. By contrast, conducting the classical variance-based approach on PCR using either one or two components cannot capture the degree of importance of the input variables. The next test function, a typical version of the portfolio model, is nonlinear with four variables and the following equation: This relatively strong test case with respect to the degree of nonlinearity, as it relates to interactions among the input variables, is examined in the following three scenarios. The first scenario was utilized by Kucherenko et al.
(2012). The marginal distributions of the variables are normal, i.e., X_i ~ N(μ_i, σ_i) with μ = (0, 0, 250, 400) and σ = (4, 2, 200, 300), and the Spearman rank correlations among the factors are ρ^s_12 = 0.8 and ρ^s_34 = 0.8, with the SRCC for the remaining pairs assumed to be zero. The importance of each input based on the mentioned approaches is quantified and documented in Table 4. As Table 4 clearly demonstrates, the suggested technique, the results summarized in the last column (S_tc), and the yardstick solution are in good agreement. This can be attributed to the fact that all of these schemes handle the impact of the correlation structure among the input variables effectively. Furthermore, this ranking can be achieved using PCR based on two components. In contrast, performing the classical variance-based SA on PCR using three components cannot differentiate between the importance of X_1 and X_2. Similar to test case 2, adding the correlation structure does not change the rank of the inputs, although with respect to the correlation among inputs, X_2 becomes slightly less important than X_1. In the second scenario, except for the correlation structure, all conditions are the same as in the preceding scenario. The correlation structure is ρ^s_14 = 0.8 and ρ^s_34 = 0.3, with the remaining pairs assumed to be independent. Figure 3 shows the scatter plot matrix for this test case. The results of SA are summarized in Table 5. In reference to the content of Table 5, the coupling of PCR with the classical variance-based SA cannot successfully delineate the most important variables and their proper rankings. However, the proposed method managed to capture the benchmark solution, and the results are quite compatible with the S_tc values summarized in the last column of the table, albeit with a rather small multiple R-squared of 0.52.
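The first scenario of this portfolio test case can be reproduced roughly in Python. The model form Y = X_1 X_3 + X_2 X_4 is the usual statement of Kucherenko et al.'s portfolio model and is assumed here, since the paper's equation is not reproduced in this excerpt; for simplicity, the Spearman rank correlations are treated as Pearson correlations of the joint normal:

```python
import numpy as np

# Portfolio model, commonly written Y = X1*X3 + X2*X4 (assumed form).
rng = np.random.default_rng(1)
mu = np.array([0.0, 0.0, 250.0, 400.0])
sd = np.array([4.0, 2.0, 200.0, 300.0])
corr = np.eye(4)
corr[0, 1] = corr[1, 0] = 0.8   # rank correlation used as Pearson here
corr[2, 3] = corr[3, 2] = 0.8
cov = corr * np.outer(sd, sd)

X = rng.multivariate_normal(mu, cov, size=100_000)
Y = X[:, 0] * X[:, 2] + X[:, 1] * X[:, 3]
```

Under this joint normal, E[Y] = μ₁μ₃ + μ₂μ₄ = 0, so the products of the low-mean pair (X_1, X_2) with the high-variance pair (X_3, X_4) drive the output variance, which is why the correlation structure matters for the ranking.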
The low value of the coefficient of determination stems from the nonlinear relationship between the inputs and the output as well as the interaction among variables. The scatterplot depicted in Fig. 3 clearly demonstrates this moderate trend in the data, which leads to the low value of R^2. The third scenario was proposed by Ge and Menendez (2017), who assumed that the PDF associated with each random variable has a different form. In other words, in this case, the distributions of X_1, X_2, X_3, and X_4 are, respectively, normal N(0, 1), gamma Γ(2, 1), uniform U(0, 1), and lognormal logN(0, 1). The rank correlation coefficients are the same as in the preceding scenario, i.e., ρ^s_14 = 0.8, ρ^s_34 = 0.3, and ρ^s_ij = 0 otherwise. The first-order sensitivity measures based on the adopted approaches can be seen in Table 6. Concerning the different governing PDF of each predictor variable, as Table 6 demonstrates, the proposed method managed to mimic the variation of the sensitivity measures documented by the benchmark solution as well as S_tc, while conducting the classical variance-based SA on PCR failed to reproduce the proper ranking of the model inputs. As an additional, more comprehensive example, the next test case is designed to evaluate the ability of the proposed method. The variables incorporated into this test case have standard uniform marginal distributions, and the parameter d_i is always nonnegative. Equation (47) is able to accommodate interaction. The Sobol G function, a strongly nonlinear and non-monotonic model, can be generated by replacing X_i with |2T_i − 1|. The test case considered in this study has a lower degree of nonlinearity and is not as complex as the Sobol G function. In the Sobol G function, increasing d_i reduces the importance of the corresponding variable (Saltelli et al. 2008). This mathematical feature applies to the proposed test function as well.
In this test case, we assume the F-function comprises 35 variables, whose associated coefficients are listed in Table 7. As the content of this table shows, the importance of the orthogonal inputs based on their associated coefficients (d_i) falls into four groups. Indeed, here we want to examine the ability of the proposed method to classify inputs most of which are correlated. For this reason, the SRCCs between strongly influential variables and non-influential inputs, i.e., (X_1, X_35), (X_2, X_34), (X_3, X_33), (X_4, X_32), and (X_5, X_31), are set to 0.8. In addition, it is assumed that the SRCCs between moderately influential variables and non-influential inputs, i.e., (X_6, X_30), (X_7, X_29), and (X_8, X_28), are 0.7. Furthermore, the correlation between the pairs (X_11, X_12), (X_12, X_13), (X_13, X_14), …, and (X_24, X_25) is 0.2. Given this correlation structure and other pertinent information, the estimated first-order sensitivity measures are shown in Figs. 4-8 for all schemes considered in this study. The variance and coefficient of determination for the numbers of components in the PCR method, i.e., 17 to 28, are documented in Tables 8 and 9. Figure 4 shows that both the proposed scheme and the S_tc approach can reproduce the grouped ranking stipulated by the benchmark solution. In other words, the proposed method can successfully delineate the most important variables ([X_1, X_2, …, X_5] and [X_31, X_32, …, X_35]), the moderately important variables ([X_6, X_7, …, X_11] and [X_28, X_29, X_30]), the weakly important variables ([X_12, X_13, …, X_25]), and the non-influential variables ([X_26, X_27]). It is worth mentioning that the coefficient of determination between the original model and the regression-based SPCA model is 0.33. This implies that the proposed method can be used for quite strongly nonlinear models.
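One common form of the Sobol G function, with coefficients d_i playing the same damping role as in the paper's F-function, is sketched below (written with the standard |4X_i − 2| kernel; the text's |2T_i − 1| substitution yields the same family up to scaling). Increasing d_i visibly reduces the variance contributed by X_i:

```python
import numpy as np

def g_function(X, d):
    """Sobol G function: prod_i (|4*X_i - 2| + d_i) / (1 + d_i).

    X has shape (n, p) with U(0, 1) entries; larger d_i makes X_i less
    influential. One common form, assumed comparable to the paper's F-function.
    """
    return np.prod((np.abs(4 * X - 2) + d) / (1 + d), axis=-1)
```

Each factor has unit mean and variance (1/3)/(1 + d_i)², so the grouped importance structure of Table 7 can be engineered directly through the d_i values.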
In addition, as the number of samples used by the proposed method (2000) is low in comparison with the benchmark solution (150,000), the suggested technique can be considered computationally efficient. If one ignores the correlation structure inherent in the input variables, the conventional Sobol scheme fails to highlight the importance of the last eight variables. The correlation structure inherent in the input variables, however, leads both the proposed scheme and the benchmark solution to assign an appropriate grouped ranking to the input variables. Upon coupling the classical variance-based SA with PCR (see Figs. 5, 6, 7, 8), the first-order sensitivity measures of the last eight variables become more distinct, but the coupling exercise did not manage to regenerate the ranking of those variables obtained by the benchmark solution. In this part, three practical test cases are considered. Two of them are based on nonlinear hydrologic models using real observed data, one a low-dimensional model and the other a high-dimensional one. In addition, the last test case considers a hydraulic model that simulates flood inundation using simulated data. The details of these models are discussed below. When runoff starts, the basin characteristics control the pattern of the flood and affect the flood peaks (Benson 1964); for this reason, using such a model is reasonable. This model is a real application, but with low dimension, in operational hydrology. In this model, real data are used to calibrate a nonlinear model with 7 decision variables. Based on the observed and measured data of watershed characteristics in 58 watersheds in Indiana and an independent estimate of the 10-year instantaneous peak discharge, Q_10, using historic data for each watershed, the parameters of the power model were estimated by McCuen and Snyder (1986).
Then, the probability density functions of the model inputs were chosen based on these observed data (i.e., 58 instances) via the EasyFit software. The basin characteristics and their probability density functions are described in Table 10. The correlation structure of the model inputs based on the SRCC measure, using the observed data, is as follows: After implementing the adopted approaches, including Sobol on the original model, the proposed approach, several flavors of variance-based PCR, and the benchmark solution, the results are summarized in Tables 11 and 12. As can be seen in Table 11, the most influential parameters are H_r, P_i, and C based on the proposed method. Nevertheless, it is difficult to differentiate between variables such as P_i and C when it comes to the sensitivity measures. In addition, the parameters (D), (S, L_c), and (A) fall into the third, fourth, and fifth groups in terms of their influence, respectively. This classification, borne out by the proposed approach, is very consistent with the correlation ratio method and S_tc. In coupling the classical variance-based SA with PCR, the number of retained components appears influential in delineating the important parameters; in our case, beyond two principal components, the PCR approach is unable to reproduce the sensitivity ordering inherent in the model input variables. Interestingly, upon ignoring the correlation structure of the input variables, the Sobol approach, implemented on the original model, found the drainage area to be the most influential model input. Indeed, according to the Sobol approach, important variables are found to be unimportant (i.e., a type II error), while less influential parameters are considered quite important (i.e., a type I error). Figure 9 considers the impact of the number of realizations on the sensitivity measures of the proposed scheme.
As the figure clearly demonstrates, the number of realizations has minimal effect on S_i, and one can safely choose a small number of realizations (e.g., 500) for sensitivity analysis. The next model is similar to the previous real test function in that it relates the flood peak to hydrologic factors, but here, as a high-dimensional model, we want to examine the importance of 14 correlated factors on the uncertainty of the snowmelt flood as the response surface of the model. As the timing of the flood hydrograph is controlled by the basin characteristics, the observed data of 35 watersheds in Northern New Mexico and Colorado were extracted from the Benson report (1964). The main runoff mechanism in those watersheds is snowmelt. The nonlinear model can be stated as follows: The basin characteristics, along with the corresponding PDF for each feature derived from the observed data, are listed in Table 13. In addition, the correlation matrix of the model inputs based on the SRCC measure, using the observed data, is as follows: The first-order sensitivity measures based on the adopted approaches can be seen in Figs. 10 and 11. The variance and coefficient of determination for the numbers of components in the PCR method, i.e., 3 to 8, are documented in Table 14. As Fig. 10 shows, the proposed method is quite consistent with the benchmark solution and very similar to the approximate equivalent of the proposed method (S_tc). It is important to mention that the coefficient of determination between the original model and the regression-based SPCA model is 0.46. As can be seen, if the correlation structure of the model inputs is ignored and the Sobol approach is implemented on the original model, the results clearly exhibit a type I error: the unimportant variables N and R_a are deemed to have a significant impact.
In addition, the six flavors of variance-based PCR cannot capture the importance of the correlated inputs as well as the aforementioned methods do (see Fig. 11). Moreover, as the number of samples used by the proposed method (2000) is low in comparison with the benchmark solution (50,000), the suggested technique can be considered computationally efficient (Table 14). The last practical example is a simplified version of a diversion channel subjected to uniform flow under a flood inundation scenario. In order to protect the service road from flood inundation, a dyke is built between the diversion channel and the service road. This simple application, which is used as an instructive test in Iooss and Lemaître (2015) and Chastaing et al. (2012), simulates the height of water in the diversion channel with respect to the height of the dyke intended to protect the service road and the agricultural and industrial sites adjacent to the diversion channel bank from inundation. The model, which comprises the characteristics of a typical channel reach, can be written as (see Fig. 12): Upon assuming uniform flow in a wide rectangular channel, h can be derived as (De Rocquigny 2006): where S is the maximum overflow (in meters), a function of eight inputs. The symbols used in Eqs. (50) and (51) are defined in Table 15. Computation of the maximum overflow calls for monitoring these eight variables. As these field variables are highly corrupted by noise, and some of them also exhibit both spatial and temporal variability (e.g., Z_m and Z_v), the dependent variable S has to be considered a random variable in its own right. In Table 15, after defining each variable, the probability density function associated with each variable is also documented. Generally, the flow rate is highly correlated with channel roughness: as the channel roughness increases, the flow discharge decreases, and the Strickler coefficient is inversely proportional to channel roughness.
As a result, a correlation coefficient of 0.45 is assumed for subsequent computations. Furthermore, Z_v and Z_m can be correlated with a correlation coefficient of 0.3. From an educational point of view, the other parameters are considered uncorrelated. After implementing the adopted approaches on the governing equations, the first-order sensitivity measures are summarized in Table 16. As Table 16 illustrates, the results of the proposed method are consistent with the assessment reported by the benchmark and S_tc solutions. According to the table, H_d has an influential impact on the overflow depth, followed by Q and Z_v. In fact, in a practical setting, that would be the case as far as the protection of adjacent agricultural and industrial sites is concerned. In light of this, the height of the dyke, along with the maximum annual flow rate, should be chosen with great care in practice. If the classical variance-based SA is implemented on PCR, the methodology can single out the height of the dyke but leaves the maximum annual flow rate undetected. In addition, this coupling, compared with the proposed method, cannot effectively demonstrate the importance of the other variables. Once again, the above procedural scheme was also implemented for various numbers of realizations. As Fig. 13 shows, the number of realizations has minimal impact on the ranking process. GSA is nowadays considered a powerful approach for organizing the input variables in terms of their degree of importance, as they affect the dependent variable. Over the last two decades, one of the most challenging tasks in GSA has been to account for the impact of correlated input variables on the uncertainty of the model output. Recent scientific endeavors and the associated literature have tried to develop approaches to address this issue.
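Equations (50)-(51) correspond to the flood test case as it is usually stated in Iooss and Lemaître (2015), and the form below is assumed to match the paper's: the uniform-flow depth h from the Strickler relation in a wide rectangular channel, and the overflow S measured relative to the dyke crest H_d above the bank level C_b (the numerical values in the usage note are typical for this benchmark, not taken from the paper):

```python
import numpy as np

def overflow(Q, Ks, Zv, Zm, Hd, Cb, L, B):
    """Flood test case (assumed form of the paper's Eqs. (50)-(51)).

    Q: flow rate, Ks: Strickler coefficient, Zv/Zm: downstream/upstream
    riverbed levels, Hd: dyke height, Cb: bank level, L: reach length,
    B: channel width. Returns the maximum overflow S (negative = no overflow).
    """
    # uniform-flow water depth in a wide rectangular channel
    h = (Q / (B * Ks * np.sqrt((Zm - Zv) / L))) ** 0.6
    return Zv + h - Hd - Cb
```

For instance, `overflow(1013, 30, 50, 55, 3, 55.5, 5000, 300)` returns a negative value, i.e., the dyke is not overtopped for those (illustrative) median inputs.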
In light of this, when it comes to quantitative approaches (e.g., the correlation ratio method), the methodology usually carries a great deal of computational cost, particularly in the case of high dimensionality, and can be quite labor-intensive. For this reason, and due to the complexity of most of the developed approaches, the majority of research activities assume the decision variables to be independent for prioritization purposes. In this paper, an innovative methodology is developed whereby the conventional variance-based approach (under the assumption of orthogonal input variables) is coupled with a regression-based SPCA model to evaluate and assess the impact of correlated input variables. To evaluate the effectiveness of the proposed scheme, eight test cases are considered. In all test cases, the input variables carry common information content, leading to correlation among them. However, the model structure (i.e., linear versus nonlinear and/or low- versus high-dimensional) differs from one test case to another, and a variety of scenarios are assumed for the governing probability density functions. After implementing the proposed scheme on the various test cases, the following results are obtained: 1. The scheme managed to capture the ranking structure of the input variables at lower computational cost while remaining quite consistent with the results obtained from the benchmark solution. 2. The proposed scheme can mimic the variation among the first-order sensitivity measures, similar to the benchmark solution. This success can be attributed to the fact that the result of coupling the classical variance-based SA with regression-based SPCA is proven in this study to be approximately equivalent to the total contribution index in the covariance decomposition equation, i.e., S_tc, which can be related to the variance-covariance structure for comparison purposes.
In addition, this equivalence can be justified by the fact that the regression-based SPCA considers the impact of the output variable on the new features in the eigenspace. 3. In reference to test case 5, the proposed method can be recommended when the coefficient of determination is larger than 0.33. This implies the proposed method is appropriate for models whose degree of nonlinearity is more pronounced than that of moderately nonlinear models, i.e., those with R^2 ≥ 0.7. 4. To conclude, the advantage of the proposed method is that the developed methodology for models with correlated inputs is remarkably simple to implement and computationally efficient, particularly for high-dimensional problems, as compared with the benchmark solution. Indeed, the case of correlated inputs presents significant complexities for developed variance-based SA methods such as the benchmark solution, including the difficulty of re-coding the benchmark solution as well as a significant increase in computational cost (number of samples). The efficiency of the proposed method is most tangible for high-dimensional problems with correlated inputs. Considering the above advantages, possible implications of the current research can be stated as follows: • The proposed method can be implemented to reduce the uncertainty of the model output based on recognition of the most important correlated variables and their contributions to the uncertainty of the response surface. By eliminating or reducing the uncertainty of the most important correlated factors, and/or improving the characterization of factors that contribute significantly to the uncertainty of the output, the uncertainty of the output will be reduced. • Related to the first item, and to be more specific, data assimilation is nowadays considered the rule rather than the exception in rainfall-runoff modeling.
A variety of methods are available for data assimilation, namely input updating, parameter updating, state updating, output updating, model structure updating, etc. All these assimilation schemes are developed to achieve the best match between observations and simulations. In light of this, one is quite curious to see how, and in what respect, one has to invest to get the best possible result. Should one opt for, say, parameter, state, or output updating to obtain the best match with minimum investment in model implementation? One basic implication of the current study is to implement the proposed methodology to prioritize the nature of data assimilation for a better result, particularly when inputs, parameters, and states are correlated. • Research is underway to implement the proposed methodology to design a typical rain gauge network to maximize its information content and/or reduce the variance of the estimated long-term mean over a block as large as the study area. Overall, the proposed scheme can be considered an enlightening approach for sensitivity analysis modelers. In contrast to the proposed method, coupling the variance-based approach (under the assumption of orthogonal input variables) with PCR was found, on most of the test functions, to give rise to solutions that are incompatible with the benchmark solution, the correlation ratio method. On the other hand, based on the second test function and the first scenario of the fourth test function, it seems that if the results of Sobol, conducted directly on the original model, and of the benchmark solution are approximately the same, then this coupling may appear somewhat successful. In general, contrary to the coupling of the classical variance-based SA with the regression-based SPCA, the coupling of the classical variance-based SA with PCR [Eq.
(42)] cannot accommodate the impact of the covariance structure of the model inputs on the uncertainty of the response surface. Apart from the mentioned advantages, it is necessary to touch on some disadvantages of the proposed method. As a drawback, the proposed approach cannot replicate acceptable results for more complex models. This implies that the approach cannot reliably differentiate and rank the input variables when the coefficient of determination between the original model and the regression-based SPCA model is very small. In such scenarios, it is recommended to use the benchmark solution or the developed approaches cited in the literature (e.g., Ge and Menendez 2017). In addition, the proposed method is not appropriate when higher-order correlation exists among the model inputs. In such instances, it is necessary to modify the proposed approach to handle complex dependency or higher-order correlation among the model inputs. In the future, it is recommended to modify and implement the proposed methodology on more complex transfer functions and to monitor the model efficiency in comparison with the developed approaches in the literature, such as the correlation ratio method. In other words, since our proposed method couples the classical variance-based SA with a regression-based SPCA model that cannot represent interaction terms in its equation, higher-order interaction impacts cannot be quantified. For this reason, future research could focus on a modified SPCA model that includes interaction and nonlinear terms, and on coupling this nonlinear SPCA model with the classical variance-based SA. Furthermore, independent component analysis (ICA) is considered an important tool for handling higher-order correlation.
Therefore, we recommend deriving an algorithm for supervised independent component analysis (SICA), similar to SPCA, and coupling it with the classical variance-based SA to obtain a new GSA method for dependent inputs that accounts for the effect of higher-order correlation on the uncertainty of the model output. The linear regression model, which can be used in place of the original model Y = f(X_1, X_2, …, X_p), can be written as follows: In the SPCA approach, it is recommended to standardize the original variables (Y, X_i). Subsequently, the functional relationship between the standardized variables (Ỹ, X̃_i) is: In brief, we consider X̃ a standardized random row vector comprising the p standardized random variables X̃_i, and h a column vector of p scalar values h_i; the dimensions of X̃ and h are thus 1 × p and p × 1, respectively. This means: Since a matrix can be multiplied by the identity matrix without any change in the result, we have: and since U^T U = U U^T = I for the p × p orthogonal matrix U (Jolliffe 1986), the products X̃U in the equation above are the new features. In other words, Z = X̃U, or (Barshan et al. 2011): [Z_1, Z_2, …, Z_p] = [X̃_1, X̃_2, …, X̃_p] U, (59) where U is the p × p matrix with entries u_ij whose j-th column is the eigenvector U_j. Z is a random row vector consisting of the p new features Z_i, and U^T h is a new vector b containing p values, i.e.: As a result: Thus Ỹ = b_0 + Zb, where b_0 = h_0. Since the new features are uncorrelated, we can use variable reduction and define the above model on a reduced set of features. Now consider a vector W and the matrix Ã = WW^T. For this matrix, ‖W‖^2 can be claimed to be an eigenvalue of Ã corresponding to the eigenvector W.
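The rank-one structure claimed for Q can be verified numerically on synthetic data: since Q is built from the single vector q = X̃^T Ỹ, its only nonzero eigenvalue should be ‖q‖² and the corresponding unit eigenvector should be q/‖q‖:

```python
import numpy as np

# Numerical check of the appendix claim that Q = q q^T with q = X~^T y~ is
# rank one. Synthetic data; coefficients are arbitrary.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=500)

Xs = (X - X.mean(0)) / X.std(0)      # standardize, as the appendix recommends
ys = (y - y.mean()) / y.std()

q = Xs.T @ ys
Q = np.outer(q, q)                   # p x p rank-one matrix
vals, vecs = np.linalg.eigh(Q)       # eigenvalues in ascending order
```

Here `vals[-1]` matches `q @ q` and `vecs[:, -1]` matches ±q/‖q‖, while the remaining eigenvalues vanish to numerical precision, in agreement with the derivation.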
This can be proved as follows. After replacing $U$ with $W$ and $\tilde{A}$ with $WW^{T}$ in the eigenvalue relation $\tilde{A}U = \lambda U$, we have:

$$\tilde{A}W = WW^{T}W = W\,\|W\|^{2}$$

In summary, based on the above equation, $\|W\|^{2}$ is an eigenvalue of $\tilde{A}$ with corresponding eigenvector $W$. As for the remaining eigenvalues and their corresponding eigenvectors, say $V$: as these eigenvectors are perpendicular to $W$ (i.e., $W^{T}V = 0$), one can show that

$$\tilde{A}V = WW^{T}V = W\,(W^{T}V) = 0$$

This implies that the remaining eigenvalues are zero. Based on the above theory, since the matrix $Q$ is the multiplication of a vector, i.e., $\tilde{X}^{T}\tilde{Y}$ (with $p$ entries), by its transpose, its dominant eigenvalue can be written as:

$$\lambda_1 = \big\|\tilde{X}^{T}\tilde{Y}\big\|^{2} = \sum_{i=1}^{p}\mathrm{COV}\big(\tilde{Y}, \tilde{X}_i\big)^{2}$$

and the other eigenvalues are zero. In addition, the eigenvector of this matrix corresponding to the dominant eigenvalue $\lambda_1$ is $\tilde{X}^{T}\tilde{Y}$, whose components are $\mathrm{COV}(\tilde{Y}, \tilde{X}_i)$. As in subsequent applications we will be using the standard (unit) eigenvector, i.e., $U_1^{T}U_1 = 1$, each component of the dominant eigenvector is divided by the magnitude of the eigenvector to make it standard:

$$u_{i1} = \frac{\mathrm{COV}\big(\tilde{Y}, \tilde{X}_i\big)}{\sqrt{\lambda_1}} \qquad (82)$$

After using Eq. (78), the eigenvector $u_{i1}$ can be simplified further. Because $\tilde{Y}$ and $\tilde{X}_i$ are standardized values of the original variables, the covariance of the standardized variables equals the correlation of the original ones, so $u_{i1}$ can be expressed in terms of the original variables $(Y, X_i)$ as:

$$u_{i1} = \frac{\rho(Y, X_i)}{\sqrt{\lambda_1}}$$

In addition, in the following we prove that if $L$ is replaced by $\tilde{Y}\tilde{Y}^{T}$ in the matrix $Q = \tilde{X}^{T}HLH\tilde{X}$, the correlation between the standardized output and all new features except the dominant one (i.e., $Z_j$, $j \ne 1$) is zero. For this purpose, we substitute the standardized variables into $Z_j$; thus $\mathrm{COV}(\tilde{Y}, Z_j)$, $j = 2, 3, \ldots, m$, can be expanded as:

$$\begin{aligned}
\mathrm{COV}\big(\tilde{Y}, Z_j\big)
&= \mathrm{COV}\big(\tilde{Y},\, u_{1j}\tilde{X}_1 + u_{2j}\tilde{X}_2 + \cdots + u_{pj}\tilde{X}_p\big)\\
&= u_{1j}\,\mathrm{COV}\big(\tilde{Y}, \tilde{X}_1\big) + u_{2j}\,\mathrm{COV}\big(\tilde{Y}, \tilde{X}_2\big) + \cdots + u_{pj}\,\mathrm{COV}\big(\tilde{Y}, \tilde{X}_p\big)\\
&= \big\langle U_j,\ \mathrm{COV}\big(\tilde{Y}, \tilde{X}_i\big)\big\rangle, \qquad i = 1, \ldots, p,\ \ j = 2, \ldots, m \qquad (84)
\end{aligned}$$

where $\langle\cdot,\cdot\rangle$ denotes the inner product of two vectors. Based on Eq. (82), Eq. (84) can be written as:

$$\mathrm{COV}\big(\tilde{Y}, Z_j\big) = \Big\langle U_j,\ \lambda_1^{0.5}\,\frac{\mathrm{COV}\big(\tilde{Y}, \tilde{X}_i\big)}{\lambda_1^{0.5}}\Big\rangle = \lambda_1^{0.5}\,\langle U_j, U_1\rangle, \qquad j \ne 1 \qquad (85)$$

Since the eigenvector $U_1$ is perpendicular to the other eigenvectors ($U_j$, $j \ne 1$), $\langle U_j, U_1\rangle = 0$; thus $\mathrm{COV}(\tilde{Y}, Z_j) = 0$ for $j \ne 1$. This implies that only the first new feature has maximum linear dependency on the output. Since in PCA $V(Z'_j) = \lambda_j$, the PCR is:

$$\tilde{Y} = \sum_{j=1}^{m} b_j Z'_j, \qquad b_j = \frac{\mathrm{COV}\big(\tilde{Y}, Z'_j\big)}{\lambda_j}$$

References

- Supervised principal component analysis: visualization, classification and regression on subspaces and submanifolds
- Sensitivity indices for (tree)-dependent variables
- Factor affecting the occurrences of floods in the southwest
- A new uncertainty importance measure
- Generalized Hoeffding-Sobol decomposition for dependent variables: application to sensitivity analysis
- Polynomial chaos expansion for global sensitivity analysis applied to a model of radionuclide migration in a randomly heterogeneous aquifer
- Uncertainty and sensitivity analysis: tools for GIS-based model implementation
- Study of the sensitivity of coupled reaction systems to uncertainties in rate coefficients. I. Theory
- Local polynomial estimation for sensitivity analysis on models with correlated inputs
- La maîtrise des incertitudes dans un contexte industriel, 1ère partie : une approche méthodologique globale basée sur des exemples
- Correlation effects? A major but often neglected component in sensitivity and uncertainty analysis
- Extending Morris method for qualitative global sensitivity analysis of models with dependent inputs
- Measuring statistical dependence with Hilbert-Schmidt norms
- Kernels, associated structures and generalizations
- A comparison of uncertainty and sensitivity analysis results obtained with random and Latin hypercube sampling
- A distribution-free approach to inducing rank correlation among input variables
- Assessing hurricane effects. Part 1. Sensitivity analysis
- Uncertainty management in simulation optimization of complex systems: algorithms and applications
- Sobol' indices for problems defined in non-rectangular domains
- Estimation of global sensitivity indices for models with dependent variables
- Multivariate sensitivity analysis and derivative-based global sensitivity measures with dependent variables
- Global sensitivity analysis for systems with independent and/or correlated inputs
- Uncertainty analysis using evidence theory: confronting level-1 and level-2 approaches with data availability and computational constraints
- Variance-based sensitivity indices for models with dependent inputs
- Non-parametric methods for global sensitivity analysis of model output with dependent inputs
- Disjoint and functional principal component analysis for infected cases and deaths due to COVID-19 in South American countries with sensor-related data
- Hydrologic modeling: statistical methods and applications
- Nonparametric variance-based methods of assessing uncertainty importance
- Détermination des distributions dont les marges sont données
- A new principal component analysis by particle swarm optimization with an environmental application for data science
- A new framework for comprehensive, robust, and efficient global sensitivity analysis: 1
- A new framework for comprehensive, robust, and efficient global sensitivity analysis: 2. Application
- Model-free importance indicators for dependent input
- Global sensitivity analysis: the primer
- Sensitivity analysis in practice: a guide to assessing scientific models
- Global sensitivity analysis for high-dimensional problems: how to objectively group factors and measure robustness and convergence while reducing computational cost
- Sensitivity analysis for non-linear mathematical models
- Sensitivity estimates for nonlinear mathematical models
- Global sensitivity analysis in hydrological modeling: review of concepts, methods, theoretical framework, and applications
- Copula-based decomposition approach for the derivative-based sensitivity of variance contributions with dependent variables
- Decoupling correlated and uncorrelated parametric uncertainty contributions for nonlinear models
- Uncertainty and sensitivity analysis for models with correlated parameters
- A new framework of variance based global sensitivity analysis for models with correlated inputs
- Spatiotemporal pattern of the global sensitivity of the reference evapotranspiration to climatic variables in recent five decades over China
- Distance correlation-based method for global sensitivity analysis of models with dependent inputs

Acknowledgements We would like to thank Prof. G. Strang from MIT for his enlightening remark on the symmetric matrix and the uniqueness of the dominant eigenvalue of matrix Q documented in Appendix B. Furthermore, the extensive implementation of SIMLAB 2.2 in this research is greatly acknowledged.

After replacing $L$ with $\tilde{Y}\tilde{Y}^{T}$ in the matrix $Q = \tilde{X}^{T}HLH\tilde{X}$, we have:

$$Q = \tilde{X}^{T}H\tilde{Y}\tilde{Y}^{T}H\tilde{X}$$

Since $\tilde{X}$ and $\tilde{Y}$ are standardized (zero mean) and the centering matrix $H$ is symmetric, i.e., $H = H^{T}$, we have (Barshan et al. 2011):

$$H\tilde{X} = \tilde{X}$$

and

$$H\tilde{Y} = \tilde{Y}$$

As a result, the matrix $Q$ is simplified as follows:

$$Q = \tilde{X}^{T}\tilde{Y}\tilde{Y}^{T}\tilde{X} \qquad (67)$$

The terms $\tilde{X}^{T}\tilde{Y}$ and $\tilde{Y}^{T}\tilde{X}$ are a column vector and a row vector, respectively, whose entries are $\mathrm{COV}(\tilde{Y}, \tilde{X}_i)$:

$$\tilde{X}^{T}\tilde{Y} = \big[\mathrm{COV}\big(\tilde{Y}, \tilde{X}_1\big), \ldots, \mathrm{COV}\big(\tilde{Y}, \tilde{X}_p\big)\big]^{T} \qquad (68)$$

In reference to Eq. (68), Eq. (67) implies that the matrix $Q$ can be considered the multiplication of a vector, i.e., $\tilde{X}^{T}_{p\times n}\tilde{Y}_{n\times 1}$ (with $p$ entries), by its transpose. Generally, the multiplication of a vector (of dimension $p \times 1$) by its transpose always results in a symmetric matrix of rank one. In the following, the eigenvalues and eigenvectors of this matrix are further discussed (personal communication, Strang 2021). For this reason, consider the vector $W$ and the matrix $\tilde{A} = WW^{T}$ introduced earlier.

For a linear model such as:

$$Y = a_0 + \sum_{i=1}^{p} a_i X_i$$

the variance of the model output with correlated inputs can be computed as:

$$V(Y) = \sum_{i=1}^{p} a_i^{2}\,V(X_i) + 2\sum_{i=1}^{p-1}\sum_{j=i+1}^{p} a_i a_j\,\mathrm{COV}(X_i, X_j)$$

Based on the following bilinearity property,

$$\mathrm{COV}\Big(X, \sum_i X_i\Big) = \sum_i \mathrm{COV}(X, X_i)$$

where $X$ and $X_i$ are random variables, $V(Y)$ can be rewritten as:

$$V(Y) = \mathrm{COV}(Y, Y) = \mathrm{COV}\Big(Y,\ \sum_{i=1}^{p} a_i X_i\Big)$$

Finally, $V(Y)$ is decomposed by the following equation:

$$V(Y) = \sum_{i=1}^{p} \mathrm{COV}(Y,\, a_i X_i)$$

One possible interpretation of the above equation is that the impact of each variable on the uncertainty of the model output is related both to the variance of each factor and to its correlation structure with the other variables; that is, the variance of the linear model output is partitioned among all the terms $\mathrm{COV}(Y, a_i X_i)$. In reference to Eqs. (90) and (91), each term can in turn be expressed as:

$$\mathrm{COV}(Y,\, a_i X_i) = V(a_i X_i) + \mathrm{COV}\Big(a_i X_i,\ \sum_{j=1,\, j\ne i}^{p} a_j X_j\Big)$$

It is worth mentioning that $\mathrm{COV}(Y, a_i X_i)$ is the total contribution of the random variable $X_i$, composed of the variance structure $V(a_i X_i)$ and the covariance structure $\mathrm{COV}\big(a_i X_i, \sum_{j\ne i} a_j X_j\big)$.

In this appendix, a rationale is offered to prove Eq. (40) in the text. After regressing $\tilde{Y}$ on the new uncorrelated features based on PCR, we have (Jolliffe 1986):

$$\tilde{Y} = \sum_{j=1}^{m} b_j Z'_j$$

where $Z'_j$ is a new feature on the basis of PCR.
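The covariance decomposition of the linear model's variance can be verified exactly from a covariance matrix, with no sampling required. In the sketch below, the coefficients and the input covariance matrix are arbitrary illustrative choices:

```python
import numpy as np

# Illustrative linear model Y = a0 + sum_i a_i X_i with correlated inputs;
# the coefficients and covariance matrix are arbitrary assumptions.
a = np.array([1.5, -2.0, 0.7])
cov = np.array([[1.0, 0.6, 0.2],
                [0.6, 1.0, -0.3],
                [0.2, -0.3, 1.0]])

# Population variance of the output: V(Y) = a^T C a.
VY = a @ cov @ a

# Total contribution of each input: COV(Y, a_i X_i) = a_i * sum_j a_j C_ij.
total = np.array([a[i] * (cov[i] @ a) for i in range(len(a))])

# The total contributions partition V(Y) exactly.
assert np.isclose(total.sum(), VY)

# Each contribution splits into a variance part and a covariance part.
for i in range(len(a)):
    var_part = a[i] ** 2 * cov[i, i]                      # V(a_i X_i)
    cov_part = sum(a[i] * a[j] * cov[i, j]
                   for j in range(len(a)) if j != i)      # COV(a_i X_i, rest)
    assert np.isclose(total[i], var_part + cov_part)

print("covariance decomposition verified")
```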
As mentioned in the main text, the coefficients of a linear regression whose inputs are uncorrelated with zero means can be estimated by Eq. (17). Thus:

$$b_j = \frac{\mathrm{COV}\big(\tilde{Y}, Z'_j\big)}{V\big(Z'_j\big)} = \frac{\mathrm{COV}\big(\tilde{Y}, Z'_j\big)}{\lambda_j}$$

After some manipulation, the PCR can then be expressed in terms of the original variables $(Y, X_i)$.
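A minimal numerical sketch of this estimator (with a hypothetical correlated-input linear model; all numbers below are illustrative assumptions) shows that, because the principal-component features are mutually uncorrelated with zero means, the one-at-a-time coefficients $\mathrm{COV}(\tilde{Y}, Z'_j)/V(Z'_j)$ coincide with the joint least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical linear model with correlated Gaussian inputs (illustrative).
n, p = 4000, 3
cov = np.array([[1.0, 0.4, 0.1],
                [0.4, 1.0, 0.5],
                [0.1, 0.5, 1.0]])
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
Y = X @ np.array([1.0, -0.5, 2.0]) + 0.1 * rng.standard_normal(n)

# Standardize the original variables, as recommended in the SPCA approach.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
Ys = (Y - Y.mean()) / Y.std()

# Principal-component features Z = X~ U: mutually uncorrelated, zero mean.
_, _, Vt = np.linalg.svd(Xs, full_matrices=False)
Z = Xs @ Vt.T

# One-at-a-time coefficients b_j = COV(Y~, Z_j) / V(Z_j) ...
beta_marginal = (Z.T @ Ys) / (Z ** 2).sum(axis=0)
# ... agree with the joint least-squares fit because the Z_j are uncorrelated.
beta_joint, *_ = np.linalg.lstsq(Z, Ys, rcond=None)

assert np.allclose(beta_marginal, beta_joint)
print("marginal and joint PCR coefficients agree")
```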