A proposed validation framework for expert elicited Bayesian Networks

Jegar Pitchforth*, Kerrie Mengersen

Queensland University of Technology

Abstract

The popularity of Bayesian Network modelling of complex domains using expert elicitation has raised questions of how one might validate such a model given that no objective dataset exists for the model. Past attempts at delineating a set of tests for establishing confidence in an entirely expert-elicited model have focused on single types of validity stemming from individual sources of uncertainty within the model. This paper seeks to extend the frameworks proposed by earlier researchers by drawing upon other disciplines where measuring latent variables is also an issue. We demonstrate that even in cases where no data exist at all there is a broad range of validity tests that can be used to establish confidence in the validity of a Bayesian Belief Network.

Keywords: expert, validation, Bayesian network, sensitivity

*Corresponding author. Phone: +61 403 961 878. Email: jegar.pitchforth@qut.edu.au. Lvl 11, 126 Margaret St., Brisbane 4000.

Preprint submitted to Expert Systems with Applications, May 26, 2012

1. Introduction

Bayesian Networks (BNs) are an increasingly popular tool for modelling complex systems, particularly in the absence of easily accessed data. A BN describes the joint probability distribution of a network of factors using a Directed Acyclic Graph (Pearl, 1988). Factors that influence the likelihood of the outcome node being in any given state are represented as nodes on the graph. If the state of one model factor influences the state of another, a directional arc is drawn between the two nodes representing these factors in the model. The combination of the nodes and their relationships is the BN structure.
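For concreteness, the structural ideas above can be expressed in code. The sketch below uses the open-source pgmpy library; this is an editorial illustration rather than tooling used in the paper, the node names merely echo the passenger processing example introduced later, and the class name varies across pgmpy versions.

```python
# A BN structure is a Directed Acyclic Graph: nodes are model factors and
# each directed arc records that the parent influences the child.
from pgmpy.models import BayesianNetwork  # DiscreteBayesianNetwork in pgmpy >= 1.0

# Each (parent, child) tuple is one directional arc.
structure = BayesianNetwork([
    ("FlightsArriving", "PassengerVolume"),
    ("PassengerVolume", "QueueLength"),
    ("StaffOnDuty", "QueueLength"),
    ("QueueLength", "ProcessingTime"),
])

print(sorted(structure.nodes()))  # the model factors
print(sorted(structure.edges()))  # the arcs between them
```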
Each node in the graph can adopt any one of a finite set of states. For example, a factor representing magnitude could be classified as 'high' or 'low'. While nodes do not strictly have to be discretised, the practice is far more common than not due to its computational convenience, and as such we do not discuss models that include non-discretised nodes in this paper. Finally, each node and relationship between nodes is quantified according to the likelihood of the node adopting a given state. In the case of input nodes these probabilities are seen as unconditional, whereas nodes internal to the model are dependent upon the states of the preceding nodes. The strength and direction of the relationship between model factors is defined in the conditional probability table (CPT) associated with the child node.

BNs are often created through a process of expert elicitation, in which experts are asked to create a complex systems model by giving their opinions on the model structure, discretisation, and parameterisation. The validity of these models is generally tested through one of two procedures: by comparing the model predictions to data available for the subject matter, or by asking the experts who contributed to the model creation to comment on its accuracy. This paper argues that these tests are limited in their ability to accurately test the validity of BNs, and presents a framework for more thorough validity testing. The work presented here stems from questions raised during the creation of a BN from expert elicitation to model the inbound passenger processing time at Australian airports. The network was elicited in collaboration with managerial and operational experts from the Australian Customs and Border Protection Service (ACBPS) for the purpose of gaining more informative reporting of key performance indicators. In particular, the modelling of critical infrastructure underlined the importance of establishing that both experts and modellers have confidence in the final model produced.

The paper is structured as follows. First, the concept of validation as it applies to BNs is introduced in section 1.1. Second, the sources of confidence in BN validity are discussed, including network structure, discretisation, and parameterisation, in section 1.2. Third, prior approaches to validating latent and expert elicited scales and models are introduced in section 2, drawing from psychometrics, system dynamics and other BN research. These principles are then applied to BNs in section 3, with examples from the airport inbound passenger processing model.

1.1. Confidence in Bayesian Belief Network validity

Model validity is often conceptualised as a simple test of a model's fit with a set of data. However, validity is a much broader construct: in essence, validity is the ability of a model to describe the system that it is intended to describe, both in the output and in the mechanism by which that output is generated. In this paper we consider this broader definition of validity.

The need for an explicit set of validity tests for BNs, over and above comparisons with data, is clear. In current practice, where data are available on the phenomenon of interest, these data may be used to validate model predictions. Several tests of this nature exist, such as a variety of Normalized Maximum Likelihood model selection criteria (Silander et al., 2009).
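As an illustration of such data-based validation, the sketch below scores two rival structures against a toy dataset. Since we are not aware of an off-the-shelf NML implementation in pgmpy, a BIC score stands in for the criteria cited above; the data and structures are invented for illustration.

```python
# Score-based model selection: given data, compare how well rival
# structures explain it. Higher (less negative) scores are better.
import pandas as pd
from pgmpy.estimators import BicScore
from pgmpy.models import BayesianNetwork

# Toy observations of two binary factors (illustrative only).
data = pd.DataFrame({
    "QueueLength":    ["long", "long", "short", "short", "long", "short"],
    "ProcessingTime": ["slow", "slow", "fast", "fast", "fast", "fast"],
})

candidate = BayesianNetwork([("QueueLength", "ProcessingTime")])
rival = BayesianNetwork([("ProcessingTime", "QueueLength")])

scorer = BicScore(data)
# The score difference is the evidence the data give one structure
# over the other.
print(scorer.score(candidate), scorer.score(rival))
```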
However, a common reason for using BN models is a lack of available data. Examples of phenomena for which data are scarce include population characteristics in many developing countries (Shakoor et al., 1997), global epidemiological phenomena (Masoli et al., 2004), organised crime (Sobel and Osoba, 2009), conservation (Johnson, 2009) and biosecurity risk analysis (Barrett et al., 2010). In such cases, expert opinion can be elicited to create a Bayesian Belief Network (BBN). A common technique for validating BBNs based on expert opinion in the absence of data is simply to ask the experts whether they agree with the model structure, discretisation, and parameterisation (see Korb and Nicholson (2010) for an excellent overview of BN applications and methods). This simple test is necessary, but not sufficient, to independently verify the validity of a complex model. Even where data are available, model fit is only a part of the model's overall validity. These considerations lead to this paper's proposition of a general validity framework for BNs.

1.2. Sources of confidence in Bayesian Network validity

In order to approach a validation framework for BNs, a short discussion of the background assumptions of this framework is required. First, we assume there exists a latent, unobservable 'true' model (or set of acceptable 'true' models) for the phenomenon of interest against which the expert elicited model can be compared. Second, for the purposes of the validity framework presented in this paper, we consider a BN model to consist of four elements: model structure (section 1.2.1), node discretisation (section 1.2.2), discrete state parameterisation (section 1.2.3), and model behaviour (section 1.2.4). Each of these elements has been raised as a source of uncertainty in BN modelling. We provide a discussion of each element and consider the importance of validity within each model element, and within the model as a whole. The model elements are summarised in figure 1.

[Figure 1: Sources of confidence in Bayesian Network validity]

1.2.1. Structure

There are a number of questions to address when creating the structure of a BN. The first is the appropriate number of nodes to include, which is a question of the modelling domain, level and scope. It is widely acknowledged that networks with a large number of nodes can easily become computationally intractable, as can networks with a large number of arcs between nodes (Koller and Pfeffer, 1997). The BN creator should ensure that the model is neither too simple nor too complex in its explanation of the system.

1.2.2. Discretisation

The discretisation process allows us to model systems probabilistically by taking continuous factors and assigning them intervals, ordinal states or categories, then modelling over the discrete domain. Uusitalo (2007) pointed out that such discretisation, where it is necessary for the model, is a major disadvantage of BN modelling, and Myllymaki et al. (2002) outline how the process has the potential to destroy useful information. Given the information loss inherent in the discretisation process, ensuring that the states are a valid interpretation of the state space of the node is critical for a defensible network.

1.2.3. Parameterisation

Parameterisation refers to adding the values elicited from experts to the belief network (Woodberry et al., 2005).
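To make the product of parameterisation concrete, the sketch below encodes hypothetical elicited probabilities as a conditional probability table in pgmpy. The numbers are placeholders, not values elicited from the ACBPS experts.

```python
# Elicited beliefs enter the network as the child node's CPT:
# one column of probabilities per combination of parent states.
from pgmpy.factors.discrete import TabularCPD

cpd = TabularCPD(
    variable="ProcessingTime", variable_card=2,
    values=[[0.9, 0.3],   # P(fast | QueueLength = short), P(fast | long)
            [0.1, 0.7]],  # P(slow | QueueLength = short), P(slow | long)
    evidence=["QueueLength"], evidence_card=[2],
    state_names={"ProcessingTime": ["fast", "slow"],
                 "QueueLength": ["short", "long"]},
)
print(cpd)  # renders the table, one column per parent state
```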
Much work has been conducted on controlling this stage of the process (Renooij, 2001), but little has been written about how to validate expert responses post-elicitation.

1.2.4. Model Behaviour

Finally, the behaviour of the model can be seen as the joint likelihood of the entire network as well as its sub-networks and relationships; hence confidence in model behaviour is founded upon the validity of the other three dimensions of the model. It is important to note that in the case of BNs, we are not only interested in whether the model can tell us what a system is doing under certain conditions, but also in the factors and relationships that bring about this behaviour. This makes the problem of validating the model incredibly complex when attempted wholesale, and justifies the need for partitioning the dimensions of uncertainty for BNs. As such, it is recommended that the structure, discretisation and parameterisation are tested for validity before any model behaviour tests are run.

2. Previous approaches to validity

2.1. Psychometrics

The discipline of psychometrics arose as a counterpart to the field of psychology, which at its foundation attempts to measure latent, unobserved, 'true' variables such as intelligence. Due to this rich tradition, the foundations of measurement validation in psychometrics are particularly solid, and serve as a useful base from which to begin discussion of a similar framework for BNs. Psychometrics first identified four types of validity (Cronbach and Meehl, 1955); more recent research has reclassified and added dimensions of validity to establish a full validation framework (Trochim, 2001). Based on the framework depicted in figure 2, a psychometric test can pass all these tests of validity to varying degrees, providing a multidimensional measure of how well a particular test measures a latent variable. In psychometric testing there are seven commonly tested dimensions of validity: nomological validity, face validity, content validity, concurrent validity, predictive validity, convergent validity, and discriminant validity.

[Figure 2: The psychometric validity testing framework, adapted from Trochim (2001).]

In psychometrics, before any other tests of validity can be undertaken, the nomological validity of the validity domain should be established. High nomological validity indicates that the measurement sits well within current academic thought on the subject. Face validity refers to the heuristic interpretation of a measure as a valid representation of the underlying psychometric construct. Content validity describes both the inclusion of all variables believed to be within a domain and the relevance of the factors included in the scale. Concurrent validity refers to the behaviour of a measurement scale; specifically, that the measure varies at the same point in time as another theoretically related measure taken on the same sample. Convergent validity refers to the criterion that scores on the measure to be validated (e.g. intelligence) should match scores on another, theoretically related measure (e.g. school grades) in the same sample. Finally, discriminant validity refers to the criterion that scores on the measure to be validated should be different from scores on tests that measure constructs that are theoretically unrelated.
While this is a useful paradigm upon which to base our exploration, the differences between judging the validity of a complex model and judging the validity of a score on a single construct are significant enough to necessitate further exploration of other approaches.

The parameterisation process is the most similar to the psychometric discipline, as the parameters can be treated as scores denoting a given belief about the behaviour of that node. Using this approach, we can use the extensive literature on psychometrics and group behaviour to help validate the parameters we elicit from our experts.

2.2. System Dynamics

In his review of system dynamics validation tests, Barlas (1996) describes a series of eight tests to validate system dynamics models: parameter confirmation, dimensional consistency, modified behaviour prediction, Turing tests, qualitative features analysis, extreme conditions testing, behaviour sensitivity tests and structure confirmation. Each of the tests can be classified in terms of the psychometric validity framework, but can also be directly applied to specific sources of BN model uncertainty. For example, parameter confirmation can be seen as a special test of concurrent validity applied specifically to model parameterisation. The tests introduced by Barlas (1996) are described in more depth in the following section, with specific reference to BN modelling.

2.3. Machine Learning

It is worth mentioning the significant research that has been conducted in the field of machine learning, particularly regarding content validity of the network structure. Machine learning researchers often use BNs and Bayesian Belief Networks to discover true networks using full datasets (Heckerman et al. (1995) is a strong and widely cited example of this method). While this work is outside the scope of this paper, it is worth mentioning due to the minimalist approach used by machine learning researchers. In particular, the discipline is concerned with finding methods of excluding as many nodes and relationships from a BN as possible without losing explanatory power.

2.4. Bayesian Network specific tests

There are very few validity tests specific to BN modelling, but the few that exist are commonly used. Pollino et al. (2007) refer to the concepts of 'sensitivity to findings' and 'sensitivity to parameters' as methods of testing the predictive validity of expert-elicited networks. Other tests that have been introduced, such as d-separation analysis (Geiger et al., 1990) and causal independence-based tests (Cheng et al., 1997), are structural tests only, and are often used to establish internal consistency, which is more elegantly defined as a reliability criterion.

2.5. Problem Statement

Unlike areas in which objective data are available, BNs built from expert elicitation cannot be validated using complete test datasets. As such, the concept of validity is not absolute but a question of additive strength. Often we cannot say whether a test has been conclusively passed or not, only take the weight of evidence over all the tests that have been applied. With this in mind we can begin to move toward a framework for validating all sources of uncertainty within the BN.
While there are some tests introduced in previous research, these only test individual aspects of the network and can often only reflect the reliability rather than the validity of the model. For BNs based either entirely upon expert elicitation, or upon a combination of data and expert elicitation, to be judged as valid assessments of the knowledge around a domain, a more comprehensive and robust framework of validity measures needs to be established.

3. A validity testing framework for expert-elicited Bayesian Networks

The prior approaches to test and model validation are discussed and related to BNs in the following section, with examples from the airport inbound passenger processing network. When applying this validity testing framework to BNs, model structure, node discretisation, and overall model behaviour must be considered in addition to parameterisation. For this reason, in the following framework we consider the seven types of validity from psychometrics (including their special tests from the system dynamics and BN modelling disciplines), and their application to the four sources of BN model uncertainty.

3.1. Nomological validity

In terms of an expert elicited BN, building nomological validity means establishing confidence that the model domain fits within a wider domain as established by the literature. For example, the passenger processing BN for ACBPS should sit within the literature on airport terminals, wayfinding and security, as well as other types of complex systems models and spatio-temporal modelling methods. If this test cannot be passed by the network, an argument must be made for why this model sits outside all current known research. This is very unusual, but may occur in fields such as advanced physics, where new information is regularly shifting the entire paradigm of the discipline. If this is the case, there may be an argument for a network having low nomological validity. Nomological validity is generally applied to the whole domain, but the nomological map serves as a reference for finding appropriate comparison models in later tests of specific sources of uncertainty. Given the power of nomological validity to place the research in a wider context, we begin the validation process with the questions:

• Can we establish that the BN model fits within an appropriate context in the literature?

• Which themes and ideas are nomologically adjacent to the BN model, and which are nomologically distant?

3.2. Face validity

Face validity is one of the most commonly used tests for expert-elicited BNs. For example, we can look at our passenger processing BN and check that baggage delivery time is part of the model and that it is related to the time spent picking up baggage to approximately the right level. However, despite the ease of establishing face validity, it is considered the weakest form of validity within the psychometric framework. One of the primary dangers in establishing face validity is criterion contamination, an issue that arises when the test dataset is the same as the validation set (Darkes et al., 1998). In our case, we might ask our set of experts whether they think the network looks the same as expected. Unsurprisingly, there are very few cases where the experts disagree with their own judgment. A more robust way of establishing face validity would be to split the population of experts into test and validation groups, and ask the validation group only about the face validity of the network (Johnson et al., 2010); a minimal sketch of such a split follows.
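The split itself can be as simple as an unbiased random partition of the expert pool; in the sketch below the roster and group sizes are invented.

```python
# Randomly partition the expert pool so that the validation group has
# played no part in building the model, avoiding criterion contamination.
import random

experts = ["expert_%d" % i for i in range(1, 11)]  # illustrative roster
random.seed(42)  # make the split reproducible
random.shuffle(experts)

elicitation_group = experts[:7]  # these experts build the model
validation_group = experts[7:]   # only these judge its face validity
print(validation_group)
```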
In cases where few experts are available, we can undertake a number of other strategies normally used for elicitation, such as using different experts for different parts of the BN, asking experts to assess their answers from a rival's perspective, asking experts whether the model is applicable outside their domain, and many others (Low Choy et al., 2009; James et al., 2010). In addition, often the entire model is tested at once (Korb and Nicholson, 2010). In order to learn as much as possible about the model through the validation process, it is worthwhile to assess the face validity of the structure (including sub-networks), discretisation and parameterisation independently. We therefore suggest the second set of questions in this validation stage:

• Does the model structure (the number of nodes, node labels and arcs between them) look the same as the experts and/or literature predict?

• Is each node of the network discretised into sets that reflect expert knowledge?

• Are the parameters of each node similar to what the experts would expect?

3.3. Content Validity

To test for content validity of the structure we can check that all noted factors and relationships from the literature are included in the model, and discover which relationships are novel to the BN model. For example, in the passenger processing BN we could ensure that all the factors considered to be important by the regulating bodies are included. To check the content validity of the discretisation of nodes within the model, we can ensure that all intervals implicated in the literature are included in the network. For example, if we were to discover that a node is generally classified at three levels in the literature, then a node with binary states would have low content validity. From a system dynamics perspective, Barlas (1996) describes a dimensional consistency test which, when applied to a BN paradigm, could be defined as ensuring that all possible states of the node are included in the discrete states. For example, if a node were to include binary states of above twelve people and below twelve people, then the node would lack dimensional consistency, as the possibility of there being exactly twelve people has been excluded (a mechanical check of this criterion is sketched after the questions below). Finally, the content validity of the parameterisation can be checked by comparing expert elicited probabilities and relationships to analogous relationships in the literature. If parameters in the expert elicited model are significantly different, an argument should be made for the difference. To assess the content validity of a BN model, the following questions are suggested:

• Does the model structure contain all and only the factors and relationships relevant to the model output?

• Does each node of the network contain all and only the relevant states the node can possibly adopt?

• Are the discrete states of the nodes dimensionally consistent?

• Do the parameters of the input nodes and CPTs reflect all the known possibilities from expert knowledge and domain literature?
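The dimensional consistency criterion lends itself to a mechanical check. Below is a plain-Python sketch for interval-valued states, treating each state as a half-open interval; the bounds are illustrative.

```python
# Dimensional consistency (Barlas, 1996): the discrete states must cover
# the node's whole domain with no gaps and no overlaps.
def dimensionally_consistent(intervals, lower, upper):
    """intervals: (low, high) pairs, each state covering [low, high)."""
    intervals = sorted(intervals)
    if intervals[0][0] != lower or intervals[-1][1] != upper:
        return False  # the domain edges are not covered
    # Adjacent states must abut exactly: no gap, no overlap.
    return all(a[1] == b[0] for a, b in zip(intervals, intervals[1:]))

# 'below twelve' / 'above twelve' excludes exactly twelve: inconsistent.
print(dimensionally_consistent([(0, 12), (13, 100)], 0, 100))  # False
print(dimensionally_consistent([(0, 12), (12, 100)], 0, 100))  # True
```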
3.4. Concurrent Validity

In the context of BNs, concurrent validity can refer to the possibility that a network or section of a network behaves identically to a section of another network, preferably one driven by data. While this seems improbable, the nature of BN modelling lends itself well to concurrent validity. For example, the passenger processing BN shares some sub-networks and nodes with the customer satisfaction model for the same airport. In their introduction to Object Oriented Bayesian Networks, Koller and Pfeffer (1997) describe the technique as a way of capitalising on this high concurrent validity by building networks from instances, or nodes representing sub-networks, that can be easily transposed to other networks. This method allows large and highly complex BNs to be built without the researcher repeating modelling work performed by other researchers in the same domain. To test the concurrent validity of the structure of a BN, we can check other networks in related domains for sub-networks that are similar to sub-networks in the network of interest. A model with high concurrent validity would have sub-networks in common with networks that are theoretically related, with the same number of nodes and relationships, and with the relationships in the same direction. Similarly, when similar sub-networks from theoretically related networks are identified, we can judge the validity of the discretisation of nodes and their parameterisation against the intervals of nodes and probabilities supplied in the comparison network. In the Barlas (1996) review of system dynamics tests, the application of concurrent validity criteria specifically to the parameters of the model factors is known as 'parameter confirmation'. Given these approaches, the following questions are suggested as tests of a BN's concurrent validity:

• Does the model structure or its sub-networks act identically to a network or sub-network modelling a theoretically related construct?

• In identical sub-networks, are the included factors discretised in the same way as in the comparison model?

• Do the parameters of the input nodes and CPTs in the network of interest match the parameters of the sub-network in the comparison model?

3.5. Convergent Validity

Convergent and discriminant validity are usually considered together, as they both reflect the relationship the BN has with other models. Convergent validity in BNs refers to how similar the model structure, discretisation, and parameterisation are to other models that are intended to describe a similar system. For example, we would expect our passenger processing BN to look similar to a network describing the processing of cargo at a seaport. The selection of comparison models is dependent upon the literature and knowledge of the domain at hand, but the original nomological map created in the first step of validation can be used as a reference for which sources may be of use. In particular, the comparison model for establishing convergent validity should be taken from an area as nomologically proximal as possible. In practice this could mean using a comparison model drawn from another complex systems discipline applied to the same domain, or alternatively using a BN drawn from a theoretically similar domain. One rough structural indicator of such similarity is sketched below.
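Assuming node labels in the two models can be matched up, the proportion of shared directed arcs gives a crude similarity score; this Jaccard-style measure is our illustration, not an established statistic, and the models are invented.

```python
# Jaccard similarity on directed arcs as a rough convergent-validity
# indicator between a model and a nomologically proximal comparison.
def arc_overlap(edges_a, edges_b):
    a, b = set(edges_a), set(edges_b)
    return len(a & b) / len(a | b)

passenger_model = [("Volume", "Queue"), ("Staff", "Queue"),
                   ("Queue", "ProcessTime")]
cargo_model = [("Volume", "Queue"), ("Staff", "Queue"),
               ("Queue", "ClearanceTime")]

print(arc_overlap(passenger_model, cargo_model))  # 0.5
```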
As with the other types of validity, we can test the expert elicited BN regarding the convergent and discriminant validity of the structure, discretisation and parameterisation in isolation, using the following questions:

• How similar is the model structure to other models that are nomologically proximal?

• How similar is the discretisation of each node to the discretisation of nodes that are nomologically proximal, independent of their network domain?

• Are the parameters of nodes that have analogues in comparison models assigned similar conditional probabilities?

3.6. Discriminant Validity

The counterpart to convergent validity is discriminant validity, defined in this framework as the degree to which a model is different from models that should be describing a different system. For example, we would expect our passenger processing BN to look different from a model describing students' progression through school. As in the case of convergent validity, the comparison model can be chosen using the nomological map as a reference guide for useful sources. The ideal method for establishing good discriminant validity would be to select models from nomologically distal disciplines and work toward the construct of interest. Given that convergent validity has already been established, the ideal model would be one that is similar in most respects to the convergent comparison model, but dissimilar in all respects to the discriminant comparison model, which would be drawn from an area of research very close to the convergent validity comparison model.

A system dynamics test of experts' judgement of the discriminant validity of any source of uncertainty in a BN model is known as a Simulation Turing test (Schruben, 1980). The test requires many versions of the model to be shown to the expert, only one of which is the expert-elicited model in every respect. Experts can be asked to choose the correct structure, discretisation or parameterisation from either a set of models or through binary choice experiments in which every model is compared to every other model. As in the case of face validity, the Turing test is ideally carried out on a separate set of experts from the set that originally created the model, to avoid criterion contamination. The fewer the differences between the final model chosen and the expert-elicited network, the higher the discriminant validity of that source of uncertainty. For this framework, the following questions are suggested as tests of the discriminant validity of the BN model, with a sketch of the Turing test set-up following the questions:

• How different is the model structure from other models that are nomologically distal?

• How different is the discretisation of each node from the discretisation of nodes that are nomologically distal, independent of their network domain?

• Are the parameters of nodes in the comparison models that have oppositional definitions to the node in question parameterised differently?

• When presented with a range of plausible models, can experts choose the 'correct' model or set of models?
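A sketch of such a set-up, generating distractor structures by randomly perturbing the arcs of the elicited model; cycle and duplicate-arc checks are omitted for brevity, and all names are illustrative.

```python
# Simulation Turing test (Schruben, 1980): validators are shown the
# elicited structure among perturbed distractors, in random order.
import random

def perturb(edges, nodes, rng):
    """Return a copy of `edges` with one arc dropped and one added."""
    edges = list(edges)
    edges.pop(rng.randrange(len(edges)))       # drop a random arc
    edges.append(tuple(rng.sample(nodes, 2)))  # add a random arc
    return edges

rng = random.Random(1)
nodes = ["Volume", "Staff", "Queue", "ProcessTime"]
elicited = [("Volume", "Queue"), ("Staff", "Queue"),
            ("Queue", "ProcessTime")]

candidates = [elicited] + [perturb(elicited, nodes, rng) for _ in range(4)]
rng.shuffle(candidates)  # validators must not know which is the original
for c in candidates:
    print(c)
```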
3.7. Predictive Validity

In BNs, predictive validity can be considered to encompass both the model behaviour and the model output. This is the type of validity covered by traditional model and data fitting techniques. When applying predictive validity tests within a complex systems, and specifically a BN, paradigm, the comparison model can be an alternative hypothesised model rather than a data-driven model. Such hypothesised models could be elicited using a number of techniques, such as case studies or formal walkthroughs (Barlas, 1996; Pollino et al., 2007). Luu et al. (2009) used case studies to formulate alternative hypothetical networks against which to compare the predictive validity of their BN model. While they did not specifically apply the tests presented in this paper, their work represents one of the few papers to attempt to establish confidence in the predictive validity of an expert-elicited BN.

Half of the special tests of system dynamics model validity presented by Barlas (1996) refer to the predictive validity of the model, in that they test the model behaviour specifically. Of particular relevance to establishing confidence in the predictive validity of BNs are behaviour sensitivity tests, qualitative features analysis and extreme conditions tests. When applied within a BN paradigm, the behaviour sensitivity test can be applied to the model structure and parameters by determining to which factors and relationships the model is sensitive, and comparing this to hypothetical models or alternative empirical models. The terms 'sensitivity to parameters' and 'sensitivity to findings' are used by Pollino et al. (2007) to describe the application of behaviour sensitivity tests to the parameters and model behaviour specifically; however, it should be noted that this test can be just as easily applied to the structure and discretisation of nodes in the model as well. These tests are commonly used, and various versions of them can be executed using the GeNIe 2.0 (DSL, 2007), Hugin Expert (Andersen et al., 1989) or Netica (Norsys, 2007) software packages, among others.

Qualitative features analysis (Carson and Flood, 1990) is a case of predictive validity testing where behaviour in a hypothetical model is compared to the behaviour of individual pairs of nodes, sub-networks and the entire model. As in the other cases of predictive validity testing, the hypothetical models can be arrived at through a number of formal strategies; however in this case, we are interested in the comparison of simulation output rather than the comparison of model features directly. It is for this reason that model behaviour is outlined as the fourth source of model uncertainty. While this area is the product of the uncertainty of its component features, predictive validity requires that model behaviour be simulated from the model for tests to occur. For this reason, predictive validity should be the final type of validity to be tested.

Finally, the extreme conditions test can be seen as a special case of qualitative features analysis, as it sets the hypothetical model to extreme conditions where the behaviour of the model is more predictable (Forrester and Senge, 1980). For example, if the number of passengers is set to 0, then the model should reflect that there is a probability of 1 that 0 passengers are processed within the time range of interest.
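This check can be run end to end on a toy model, as in the sketch below; the structure, states and probabilities are invented solely to show the mechanics of clamping an input to its extreme state and inspecting the output posterior.

```python
# Extreme conditions test: with zero passengers, the model should put
# probability 1 on zero passengers being processed.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

model = BayesianNetwork([("Passengers", "Processed")])
model.add_cpds(
    TabularCPD("Passengers", 2, [[0.5], [0.5]],
               state_names={"Passengers": ["zero", "some"]}),
    TabularCPD("Processed", 2,
               [[1.0, 0.2],   # P(zero processed | zero, some passengers)
                [0.0, 0.8]],  # P(some processed | zero, some passengers)
               evidence=["Passengers"], evidence_card=[2],
               state_names={"Processed": ["zero", "some"],
                            "Passengers": ["zero", "some"]}),
)
assert model.check_model()

# Clamp the input node to its extreme state and query the output node.
posterior = VariableElimination(model).query(
    variables=["Processed"], evidence={"Passengers": "zero"})
print(posterior)  # all probability mass should sit on Processed = 'zero'
```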
The direct extreme conditions test examines the behaviour of individual pairs of nodes and sub-networks under such extreme conditions, while the indirect extreme conditions test examines the behaviour of the entire network against such hypotheses. The range of tests available to establish confidence in the predictive validity of a model is notable considering the issue at hand, namely that true objective data on the model are not available, and suggests that the lack of available data does not preclude predictive validity testing, as hypothesis-driven models can be used in place of data-driven models. From examination of the various techniques associated with assessing predictive validity, we arrive at the following set of questions:

• Is the model behaviour predictive of the behaviour of the system being modelled?

• Once simulations have been run, are the output states of individual nodes predictive of aspects in the comparison models?

• Is the model sensitive to any particular findings or parameters to which the system would also be sensitive?

• Are there qualitative features of the model behaviour that can be observed in the system being modelled?

• Does the model, including its component relationships, predict extreme model behaviour under extreme conditions?

4. Conclusions and Recommendations

In this paper we have outlined a broad range of conceptual tests that can be applied to validate BNs. These validity tests incorporate standard model-data fit comparisons, but expand the construct of validity to the broader definition of whether or not a model describes the system it is intended to describe, and produces the output it is intended to produce. Many of these validity tests can be used where no objective data exist.

By combining existing research on BN validation with validation tests from psychometrics as well as alternative complex systems disciplines, this paper introduces a starting point for discussing a framework for building confidence in the validity of BNs. The presented framework is not intended to be comprehensive; instead, the aim is to establish that the validity of a BN can be tested, and should be tested, independent of the model's fit to available data or expert confirmation. Disciplines such as psychometrics, with a history of measuring latent constructs, can provide a useful perspective on the problem. The framework presents a sequence of steps that can be followed to establish confidence in model validity, beginning with creating a nomological map of the literature surrounding the domain, then gradually building confidence in six types of model validity, using both general and specific tests.

The application of this framework to the BN developed in conjunction with ACBPS will, to our knowledge, be a novel practical demonstration of such an approach to BN validation. The framework presented in this paper is intended to be domain-general, and there would be great value in establishing the versatility of the tests by applying them to complex models in other domains. Future work will extend to formalising and quantifying many of the tests in the context of BN modelling, and to obtaining perspectives on model validity from other disciplines that deal with unobserved variables and complex systems.

5. References

S.K. Andersen, K.G. Olesen, F.V. Jensen, and F. Jensen. Hugin - a shell
for building Bayesian belief universes for expert systems. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pages 1080–1085, United States of America, 1989. MIT Press.

Y. Barlas. Formal aspects of model validity and validation in system dynamics. System Dynamics Review, 12(3):183–210, 1996.

S. Barrett, P. Whittle, K. Mengersen, and R. Stoklosa. Biosecurity threats: the design of surveillance systems, based on power and risk. Environmental and Ecological Statistics, 17:503–519, 2010.

E.R. Carson and R.L. Flood. Model validation: philosophy, methodology and examples. Transactions of the Institute of Measurement and Control, 12:178–185, 1990.

J. Cheng, D.A. Bell, and W. Liu. An algorithm for Bayesian belief network construction from data. In Proceedings of the Conference on Artificial Intelligence and Statistics, pages 83–90, United States, 1997.

L.J. Cronbach and P.E. Meehl. Construct validity in psychological tests. Psychological Bulletin, 52(4):281–302, 1955.

J. Darkes, P.E. Greenbaum, and M.S. Goldman. Sensation seeking-disinhibition and alcohol use: Exploring issues of criterion contamination. Psychological Assessment, 10:71–76, 1998.

DSL. GeNIe and SMILE, 2007. Bayesian Network modelling software package and decision platform.

J.W. Forrester and P.M. Senge. Tests for building confidence in system dynamics models. TIMS Studies in the Management Sciences, 14:209–228, 1980.

D. Geiger, T. Verma, and J. Pearl. d-separation: From theorems to algorithms. In Proceedings of the Fifth Annual Conference on Uncertainty in Artificial Intelligence, UAI '89, pages 139–148, The Netherlands, 1990. North-Holland Publishing Co.

D. Heckerman, D. Geiger, and D.M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197–243, 1995.

A. James, S. Low Choy, and K. Mengersen. Elicitator: An expert elicitation tool for regression in ecology. Environmental Modelling and Software, 25:129–145, 2010.

S. Johnson. Integrated Bayesian network frameworks for modelling complex ecological issues. PhD thesis, Queensland University of Technology, Australia, 2009.

S. Johnson, F. Harding, G. Hamilton, and K. Mengersen. An integrated Bayesian network approach to Lyngbya majuscula bloom initiation. Marine Environmental Research, 69:27–37, 2010.

D. Koller and A. Pfeffer. Object-oriented Bayesian networks. In Proceedings of the Thirteenth Annual Conference on Uncertainty in Artificial Intelligence, pages 302–313, United States, 1997.

K.B. Korb and A.E. Nicholson. Bayesian Artificial Intelligence. CRC Press, United Kingdom, 2010.

S. Low Choy, R. O'Leary, and K. Mengersen. Elicitation by design in ecology: using expert opinion to inform priors for Bayesian statistical models. Ecology, 90:265–277, 2009.

V. Luu, S. Kim, N. Tuan, and S. Ogunlana. Quantifying schedule risk in construction projects using Bayesian belief networks. International Journal of Project Management, 27:39–50, 2009.

M. Masoli, D. Fabian, S. Holt, and R. Beasley. The global burden of asthma: executive summary of the GINA Dissemination Committee report. Allergy, 59(5):469–478, 2004.

P. Myllymaki, T. Silander, H. Tirri, and P. Uronen. B-Course: A web-based tool for Bayesian and causal data analysis.
International Journal on Artificial Intelligence Tools, 11(3):369–387, 2002.

Norsys. Netica, 2007. Proprietary Bayesian Network modelling software package.

J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

C.A. Pollino, O. Woodberry, A. Nicholson, K. Korb, and B.T. Hart. Parameterisation and evaluation of a Bayesian network for use in an ecological risk assessment. Environmental Modelling and Software, 22:1140–1152, 2007.

S. Renooij. Probability elicitation for belief networks: issues to consider. The Knowledge Engineering Review, 16(3):255–269, 2001.

L.W. Schruben. Establishing the credibility of simulations. Simulation, 34:101–105, 1980.

O. Shakoor, R.B. Taylor, and R.H. Behrens. Assessment of the incidence of substandard drugs in developing countries. Tropical Medicine and International Health, 2(9):839–845, 1997.

T. Silander, T. Roos, and P. Myllymaki. Locally minimax optimal predictive modeling with Bayesian networks. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, JMLR Workshop and Conference Proceedings, pages 504–511, United States, 2009.

R.S. Sobel and B.J. Osoba. Youth gangs as pseudo-governments: implications for violent crime. Southern Economic Journal, 75(4):996–1018, 2009.

W.M. Trochim. Research Methods Knowledge Base, 2001. URL http://www.socialresearchmethods.net/kb/index.htm.

L. Uusitalo. Advantages and challenges of Bayesian networks in environmental modelling. Ecological Modelling, 203(3), 2007.

O. Woodberry, A. Nicholson, K. Korb, and C. Pollino. Parameterising Bayesian networks. In Geoffrey Webb and Xinghuo Yu, editors, AI 2004: Advances in Artificial Intelligence, volume 3339 of Lecture Notes in Computer Science, pages 711–745. Springer Berlin / Heidelberg, 2005.