key: cord-0060394-hp1zda0e
authors: Budde, Carlos E.; Hartmanns, Arnd
title: Replicating Restart with Prolonged Retrials: An Experimental Report
date: 2021-02-26
journal: Tools and Algorithms for the Construction and Analysis of Systems
DOI: 10.1007/978-3-030-72013-1_21
sha: a10e61394970dfe41af49cadcdf0fe8488f1ca44
doc_id: 60394
cord_uid: hp1zda0e

Statistical model checking uses Monte Carlo simulation to analyse stochastic formal models. It avoids state space explosion, but requires rare event simulation techniques to efficiently estimate very low probabilities. One such technique is Restart. Villén-Altamirano recently showed, by way of a theoretical study and an ad-hoc implementation, that a generalisation of Restart to prolonged retrials offers improved performance. In this paper, we demonstrate our independent replication of the original experimental results. We implemented Restart with prolonged retrials in the FIG and modes tools, and apply them to the models used originally. To do so, we had to resolve ambiguities in the original work, and refine our setup multiple times. We ultimately confirm the previous results, but our experience also highlights the need for precise documentation of experiments to enable replicability in computer science.

Authors are listed alphabetically. This work was supported by NWO via project no. 15474 (SEQUOIA) and VENI grant no. 639.021.754.

In stochastic timed systems, the time between faults, customer interarrival times, transmission delays, or exponential backoff wait times follow (continuous) probability distributions. Probabilistic model checking [3] can compute dependability metrics like reliability and availability in the Markovian case. To evade state space explosion and to evaluate non-Markovian systems, statistical model checking (SMC [2]) has become a popular alternative. At its core, SMC is Monte Carlo simulation for formal models. It faces a runtime explosion when estimating the probability p of a rare event with a sufficiently low error, e.g. an error of ±10^-10 for p ≈ 10^-9 (i.e. a relative error of 0.1). Rare event simulation (RES) techniques [17] address this problem. They can broadly be categorised into importance sampling and importance splitting: the former changes the probability distributions, while the latter changes the simulation algorithm, to make the rare event more likely. Both techniques then compensate for these changes in the statistical evaluation. RES has garnered the interest of mathematicians and computer scientists alike. The scientific outcomes range from theoretical studies of a RES technique's limit behaviour and optimality [8, 14, 16] over experimental validation in Matlab studies or ad-hoc implementations [10, 11, 19] and reports using larger case studies [5, 12, 18] to automated tools [4, 6, 15, 18] that accept a loss of optimality in exchange for practicality. Two recent papers showed theoretically [21] and empirically [19] that prolonging retrials in the Restart importance splitting technique [22] reduces the required number of samples for the same error, with optimal runtime around prolonging by 1 to 2 levels. The models and parameters used in [19] are described in supplementary material [20], but the implementation is not publicly available.
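To put this runtime explosion into perspective, the following back-of-the-envelope Python sketch (an illustration of ours, not taken from the paper) computes the number of crude Monte Carlo runs needed for a given relative error, assuming i.i.d. Bernoulli samples and a normal approximation; for p ≈ 10^-9 and a relative error of 0.1 at 95 % confidence, this is on the order of 4 · 10^11 runs.

    from math import ceil

    # Back-of-the-envelope sample size for crude Monte Carlo: i.i.d. Bernoulli(p)
    # runs, normal approximation, confidence interval half-width
    # z * sqrt(p * (1 - p) / n) required to be <= rel_err * p.
    def crude_mc_samples(p, rel_err, z=1.96):   # z = 1.96 for ~95 % confidence
        return ceil(z * z * (1.0 - p) / (rel_err * rel_err * p))

    print(crude_mc_samples(1e-9, 0.1))          # about 3.8e11 runs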
In this paper, we demonstrate our replication of the results of [19, 21], where replication "means that an independent group can obtain the same result using artifacts which they develop completely independently" in the ACM terminology [1]. To this end, we implemented Restart with prolonged retrials (Restart-P) in the FIG rare event simulator [4] and the modes statistical model checker [7] of the Modest Toolset [13]. We recreated the models in the IOSA and Modest languages, and ran experiments following the original setup. Our experiments confirm the behaviour and performance improvements of Restart-P reported in [19, 21]. However, we encountered ambiguities in the textual and pictorial descriptions of Restart-P and the experimental setup in the original papers, some of which we could only resolve with input from the author of [19, 21]. Different parts of our work thus reside on different levels between replication and reproduction (which "means that an independent group can obtain the same result using the author's own artifacts" [1]). Throughout the paper, we document where we achieved fully independent replication, where information from private communication was needed, and where we ultimately had to resort to requesting and inspecting the source code of the original implementation. The contribution of this paper is thus threefold: (1) we provide pseudocode for Restart-P in Sect. 2 that clarifies the technical details w.r.t. [19, 21], (2) we make Restart-P publicly available as part of the FIG and modes tools, and (3) we independently replicate the original experiments, documenting the ambiguities we had to resolve along the way.

Let a stochastic timed discrete-event model be given as a tuple ⟨S, s_0, step, F⟩ of a set of states S, an initial state s_0 ∈ S, a function step: S → [0, ∞) × S, where step(s) samples a random path from s to the next event and returns a pair ⟨t, s'⟩ of its duration and next state, and a subset of rare event states F ⊆ S. A simulation run is a sequence of states obtained by repeatedly applying step. Models with general probability distributions encode their memory in the states. Importance splitting uses an importance function f_I: S → [0, ∞) indicating "how close" a state is to the rare event. Partition the range of f_I into k + 1 nonempty intervals to obtain a level function f_L: S → { 0, ..., k } that maps each state s to the index of the interval containing f_I(s). For simplicity, assume f_I(s_0) = 0 and step(s) = ⟨t, s'⟩ ⇒ f_L(s') ≤ f_L(s) + 1 (a step moves up by at most one level). Then thresholds T_i of f_I are defined so that each set C_i = { s ∈ S | f_I(s) ≥ T_i } is associated with the states of level i and above.

Algorithm 1 (excerpt; steady-state simulation with Restart-P_j):
    Input: model ⟨S, s_0, step, F⟩, f_L, f_S, prolongation depth j, max. sim. time T_max
    t_F := 0, list ξ := {| ⟨s_0, 0, 0, 0⟩ |}   // state, time, creation level, last-split level
    while ξ ≠ ∅ do                             // run all trials to end
      ⟨s, t, create, split⟩ := ξ.get-remove()  // data of current trial
      while t < T_max do
        ⟨t', s'⟩ := step(s)                    // simulate to next change in state
        ...

The functions f_I, f_L, and f_S (the latter assigning a splitting factor f_S(l) to each level l) are specified by experts or derived automatically [6]. Importance splitting with Restart starts a run (the main trial) from s_0 that, whenever it moves up from s in current level l − 1 to s' in level l, spawns f_S(l) − 1 new child runs (retrials of level l) from s'. Retrials end when they move down below their creation level. The trials' weights in probability estimation are appropriately reduced to compensate. Restart with prolonged retrials of depth j, denoted Restart-P_j, is defined as follows in [21] (shortened and adapted to our notation): In Restart-P_j, each of the retrials of level i finishes when it leaves set C_{i−j}; that is, it continues until it down-crosses the threshold i − j.
If one of these trials again up-crosses the threshold where it was generated (or any other between i − j + 1 and i), a new set of retrials is not performed. If j ≥ i, the retrials are cut when they reach the threshold 0. The main trial, which continues after leaving set C_{i−j}, potentially leads to new sets of retrials if it up-crosses threshold T_i after having left set C_{i−j}. If the main trial reaches the threshold 0, it collects the weight of all the retrials (which have been cut at that threshold), and thus new sets of retrials of level 1 are performed the next time the main trial up-crosses threshold T_1. In addition, [21, Fig. 1] graphically illustrates the behaviour of Restart-P_1. The original Restart [22] is Restart-P_0. The above textual description clearly conveys the core idea of Restart-P, but we found it to omit three technical details:
- The condition for when an up-going retrial spawns new retrials is more complex than with Restart. We became aware of this when comparing the textual description with the graphical depiction in [21, Fig. 1]. In fact, we need to keep track of the last level at which a trial was split, and decrement that value when the trial moves more than j levels below it. (Independent replication.)
- The definitions in [19, 21] do not include 0 in the range of values for i in C_i and T_i. Our definitions would associate T_0 with states s where f_I(s) = 0. Implemented in FIG, this led to increasing underestimation as the prolongation depth j increased. Only once we interpreted threshold 0 to refer to level 0 (i.e. states s where f_L(s) = 0) did we obtain consistent estimations across different j. The correctness of this interpretation was confirmed by the author of [19, 21] in private communication. (Semi-independent replication.)
- When a trial reaches, or spends time in, a state in F, we must weight this event's influence on the statistical estimate by a factor of 1/(f_S(1) · ... · f_S(f_L(s))) following [20], which needs adjustment when j > 0. We finally requested the source code for the original experiments and found that f_L(s) must be replaced by the level on which the current trial was last split, i.e. the value must not change when moving down ≤ j levels. (Resembles a reproduction.)
We make these details explicit in Algorithm 1, for the case of estimating the long-run average time spent in F (i.e. steady-state simulation). FIG evolved as described above and is thus mostly a replication. modes was extended with prolongations later, using a recursive formulation of the algorithm gleaned from the original code. It thus lacks the complete independence of a replication as per [1].
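Putting the definition of Restart-P_j and the three clarifications above together, the following Python sketch shows our reading of Algorithm 1 for steady-state estimation. It is a simplified illustration rather than the FIG or modes code: the interfaces step, f_L, f_S and in_F are hypothetical placeholders, steps are assumed to move up by at most one level, and "threshold 0" is treated as level 0 as discussed above.

    import random

    # Sketch of Restart-P_j steady-state estimation: long-run fraction of time
    # spent in F, weighted by the splitting factors up to each trial's
    # last-split level. step, f_L, f_S and in_F are placeholder interfaces.
    def restart_p(step, f_L, f_S, in_F, s0, j, T_max, rng=None):
        rng = rng or random.Random()
        t_F = 0.0                              # weighted time accumulated in F
        # pending trials: (state, time, creation level, last-split level)
        pending = [(s0, 0.0, 0, 0)]
        while pending:                         # run all trials to their end
            s, t, create, split = pending.pop()
            while t < T_max:
                dt, s_next = step(s, rng)      # simulate to next change in state
                if in_F(s):                    # weight sojourn in F by 1 / product
                    w = 1.0                    # of f_S up to the last-split level
                    for l in range(1, split + 1):
                        w /= f_S(l)
                    t_F += w * min(dt, T_max - t)
                t += dt
                lvl_old, lvl_new = f_L(s), f_L(s_next)
                s = s_next
                if t >= T_max:
                    break
                if lvl_new < lvl_old:
                    # a retrial is cut when it falls below its creation level - j
                    # (reading "threshold 0" as level 0)
                    if create > 0 and lvl_new < max(create - j, 1):
                        break
                    # surviving trials forget splits now more than j levels above
                    while split > 0 and lvl_new < max(split - j, 1):
                        split -= 1
                elif lvl_new > lvl_old and lvl_new > split:
                    # up-crossing a threshold not covered by a prolonged window:
                    # spawn f_S(lvl_new) - 1 retrials of level lvl_new
                    split = lvl_new
                    for _ in range(f_S(lvl_new) - 1):
                        pending.append((s, t, lvl_new, lvl_new))
        return t_F / T_max                     # steady-state estimate

With j = 0 this reduces to the original Restart: split then tracks the current level on the way down, so every up-crossing spawns retrials and the weight is determined by the current level, as in [22]. For the queueing models below, step would draw the exponential (or Erlang/hyperexponential) arrival and service times, and in_F(s) would, for instance, check whether the last queue holds at least L packets.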
Table 2 in [21] provides steady-state estimates, numbers of samples, and runtimes obtained using Restart-P_j on a Jackson (i.e. Markov) 2-tandem queueing network for j ∈ { 0, ..., 4 }. The same data is given in [19] for j ∈ { 0, ..., 2 } on a similar system with three queues and a seven-node network, in Jackson and non-Jackson (using Erlang and hyperexponential distributions) variants. The original articles and extra material [20] describe the models and the experimental setup:
- The set F is characterised. E.g. for the 3-tandem network, it contains the states where the third queue has ≥ L = 30 (Jackson) or ≥ 14 packets (non-Jackson).
- All probability distributions and the f_I, f_L, and f_S functions are characterised.
- T_max time values for the steady-state simulations are specified for all models.
- The statistical evaluation aims for a relative error of 0.1 with 95 % confidence (except for the tandem queue, where the error is 0.005); Restart-P runs are performed sequentially until the half-width of the 95 % confidence interval is below 10 % (resp. 0.5 %) of the current estimate. (Note that this guarantees the requested confidence only asymptotically for decreasing width [9].)
In our replication attempt, we had to resolve the following unspecified aspects:
- The queue capacities C > L are not documented, but influence the estimate: for C close to L, the steady-state probability is underestimated. We settled for C = 20 · L in FIG's IOSA models (replication); the influence of C − L rapidly diminishes beyond small values. Later, from inspecting the original source code, we found that the queues are practically unbounded (implemented as 32-bit integer counters), which we reproduce in the Modest models for modes.

The original experiments were realised in a single file of C code that represents both the algorithm and the models, specialised to queueing models with transition probabilities and service rates specified in constant arrays. In fact, the code we received implemented the 2-tandem queueing network only. We extended this code with a compile-time choice among the models described in [20], and fixed a few small bugs. We thus have four sets of results to compare, shown in Table 1: the original numbers given in [19, 21], plus those from our new executions of the adapted code, modes, and FIG. In the table, time is in seconds, p̂ is the estimate, p is the true steady-state probability where it can be derived, and n is the number of samples needed by the statistical evaluation. The adapted code and FIG ran on an Intel Xeon E5-2630 v3 (2.4-3.2 GHz) system, and modes ran on a Core i7-4790 (3.6-4.0 GHz, 4 physical/8 logical cores) system. The adapted code and FIG are single-threaded, whereas modes used 7 simulation threads. The adapted code is tailor-made for these models, while FIG has to encode them in the more general IOSA framework, making it slower; modes in turn profits from a special-case implementation for CTMCs to speed up the Markovian cases. Comparing runtimes between tools is thus of limited use. The estimates are the centers of the confidence intervals returned by the tools, with confidence and relative width as described above. Each ⟨estimate, n, time⟩ triple was selected from 5 tool executions by picking the one with the median runtime. We underline the best runtimes among the values for j. However, the wide confidence intervals (except for the 2-tandem network), the small number of executions, and the in-principle unsound stopping criterion that we reproduce from the original experiments mean that the results, including the best values of j, vary a lot for different random seeds. The original experimental setup is thus insufficient for drawing conclusions about the precise tradeoffs between specific values of j, but may at most expose an overall trend. Nevertheless, our estimates are mostly within the margin of error around the original or true results. We confirm the main experimental conclusion of [19, 21]: as j increases, n decreases, but from some point on (mostly j > 1 or 2) the runtime increases, due to the overhead of more retrials surviving longer. For the non-Jackson triple tandem network, none of our results matches the numbers of [19]. Since the original code, albeit adapted, agrees with FIG and modes rather than with the original results, we suspect an error in [19] or [20] w.r.t. this one model.
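The sequential stopping criterion reproduced from the original setup corresponds, in our reading, to a loop of the following shape; this is a sketch under the assumption of independent Restart-P_j estimates and a normal-approximation confidence interval, not the original code, and run_once is a placeholder for one tool execution.

    import statistics
    from math import sqrt

    # Keep collecting independent Restart-P_j estimates until the 95 % confidence
    # interval's half-width falls below rel_width times the running mean. As
    # noted above, this is only asymptotically sound for decreasing width [9].
    def run_until_relative_width(run_once, rel_width=0.1, z=1.96, min_runs=3):
        samples = []
        while True:
            samples.append(run_once())           # one Restart-P_j estimate
            n = len(samples)
            if n < min_runs:
                continue
            mean = statistics.fmean(samples)
            half = z * statistics.stdev(samples) / sqrt(n)
            if mean > 0 and half <= rel_width * mean:
                return mean, half, n             # estimate, half-width, #samples

For the 2-tandem queueing network, rel_width would be 0.005 instead of 0.1, matching the stricter error bound described above.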
We demonstrated the extension of the FIG and modes rare event simulation tools to support prolonged retrials in Restart importance splitting. These implementations and experiments were the outcome of an exercise in independently replicating, from a computer science perspective, experimental research originally performed in mathematics. We confirm the key findings of the earlier work. At the same time, we document several issues, small but critical technical details of the algorithm and experimental setup, where the publicly available information was insufficient for a completely independent replication. We noticed in particular that replicating randomised/statistical algorithms poses a special challenge, since small errors may result in subtle mis-estimations that are often drowned in the overall statistical error. In the end, all issues could be resolved thanks to the exceptional support, responsiveness, and openness of the original author, José Villén-Altamirano, whom we thank earnestly. However, such support cannot be expected for experimental work in general, in particular where temporary staff like Ph.D. students, who eventually graduate and move to new institutions or industry, perform the bulk of the experiments. This paper thus also highlights the need for computer science and the formal verification community to continue their push for artifact evaluation and archived, publicly available reproduction packages. A reproduction package for our experiments is archived at DOI 10.6084/m9.figshare.12269462.

References
[1] ACM: Artifact review and badging
[2] A survey of statistical model checking
[3] Model checking probabilistic systems
[4] FIG: The finite improbability generator
[5] Rare event simulation for non-Markovian repairable fault trees
[6] Automated compositional importance splitting
[7] A statistical model checker for nondeterminism and rare events
[8] Importance sampling for non-Markovian tandem queues using subsolutions
[9] On the asymptotic theory of fixed-width sequential confidence intervals for the mean
[10] Splitting for rare event simulation: A large deviation approach to design and analysis
[11] On the importance function in splitting simulation
[12] The 2019 comparison of tools for the analysis of quantitative formal models
[13] The Modest Toolset: An integrated environment for quantitative modelling and verification
[14] Large deviations for weighted empirical measures arising in importance sampling
[15] Plasma Lab: A modular statistical model checking platform
[16] Path-ZVA: General, efficient, and automated importance sampling for highly reliable Markovian systems
[17] Introduction to rare event simulation
[18] Rare event simulation for dynamic fault trees
[19] RESTART vs Splitting: A comparative study
[20] Simulation details of the paper "RESTART vs Splitting: A comparative study"
[21] An improved variant of the rare event simulation method RESTART using prolonged retrials
[22] RESTART: a method for accelerating rare event simulations