UIUCDCS-R-80-1042 
 
 UILU-ENG 80 1742 
 
 CONSTRUCTION OF A FAULT-TOLERANT REAL-TIME SOFTWARE SYSTEM 
 
 by 
 
 Anthony Y. Wei 
 Roy H. Campbell 
 
 December 1980 
 
 DEPARTMENT OF COMPUTER SCIENCE 
 
 UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 
 
 URBANA, ILLINOIS 61801 
 
 Research supported in part by NASA Project NSG 1471 
 
ABSTRACT 
 
 This paper presents an approach to the construction of a fault- 
 tolerant real-time software system with high reliability. Top-down modular 
 design and program development by stepwise refinement are used for the con- 
 struction of fault-tolerant systems. The recovery block scheme, deadline mech- 
 anism, and a majority voting unit are employed to achieve the desired relia- 
 bility for a real-time process, a segment or refinement. Criteria identifying 
 the appropriate fault-tolerant schemes for a segment or refinement are 
 described. A mathematical model estimates the effectiveness of this approach 
 using a linear approximation technique. The model supports the application of 
 modular design and stepwise refinement to fault-tolerance in construction of 
 reliable real-time software. 
 
 1 Introduction
 
 Reliable software has traditionally been obtained through fault- 
 avoidance techniques. Efforts to achieve this goal include top-down modular 
 design [1], structured programming [2], walkthroughs, testing techniques [3], 
 design of proper programming tools, and program correctness proofs [4]. 
 Despite these approaches, software systems cannot be guaranteed to be fault- 
 free. Fault-tolerance [5, 6] is proposed to complement fault-avoidance and 
 further improve the reliability of software systems. Fault-tolerance does not
 eliminate the need for reliable components: fault-avoidance techniques are
 still required when building fault-tolerance into reliable systems.
 
 A software system can provide continuous and trustworthy service by 
 tolerating certain faults and detecting and recovering from the errors. Such 
 activity is particularly valuable in real-time systems such as air traffic 
 control systems where the delays caused by the interruption are unacceptable. 
 
 This paper presents an approach to the construction of a fault- 
 tolerant real-time software system with high reliability. Top-down modular 
 design and program development by stepwise refinement [7] are used for the 
 construction of fault-tolerant systems. The recovery block scheme [5, 6], 
 deadline mechanism [8], and a majority voting unit are employed to achieve the 
 desired reliability for a real-time process, a segment or refinement. Cri- 
 teria identifying the appropriate fault-tolerant schemes for a refinement or 
 module are described. A mathematical model estimates the effectiveness of 
 this approach using a linear approximation technique. The model supports the 
 application of modular design and stepwise refinement to fault-tolerance in 
 construction of reliable real-time software. 
 
 The recovery block scheme and the deadline mechanism are briefly 
 described in section 2. An effort to incorporate these recovery mechanisms 
 into an existing language as a tool for fault-tolerant real-time programming 
 is introduced in section 3. In section 4 the construction of a fault-tolerant 
 real-time software system using the tools provided is presented. A simplified 
 mathematical model and an analysis for the effectiveness of the approach are 
 also discussed in section 4. 
 
 2 Error Recovery Mechanisms in Real-Time Systems
 
 Graceful degradation is a desirable attribute of a real-time system. 
 The system should not simply halt operation in the event of some 
 internal/external errors. Two types of errors in real-time software systems 
 are considered in this paper: logical errors and timing errors. Logical
 errors include design residual errors and run-time errors not detected in the 
 testing process. Timing errors only occur in real-time systems where timely 
 execution of tasks is required, especially for some time-critical systems. A 
 timing error occurs if a real-time system violates the timing constraints 
 specified for that system. The recovery block scheme [5, 6] has been proposed 
 to handle the logical errors and the deadline mechanism [8] is a solution to 
 the timing error problems. 
 
 2.1 Recovery Blocks
 
 The recovery block scheme [5, 6] has been proposed as a mechanism to 
 support fault-tolerant software. A recovery block consists of an acceptance 
 test, a primary, and a collection of alternates. The acceptance test is an
 assertion about the results computed by the primary or the alternates. If the 
 acceptance test is false after the primary execution, an alternate is exe- 
 cuted. Errors detected implicitly by the hardware or software during the exe- 
 cution of the primary or an alternate automatically set the acceptance test to 
 false. If all alternates of a recovery block fail to satisfy the acceptance 
 test, a software error is raised. 
 
 The idea of providing many alternates is very similar to providing 
 stand-by sparing circuits in hardware. However, the redundancy required in 
 software is not merely the replication of a program but redundancy of design. 
 The alternates may perform the same function as the primary but use different 
 algorithms. During the execution of the primary or an alternate, the hardware 
 automatically saves the "old" contents of modified global variables in a
 recovery cache [9]. If the primary or an alternate fails, the recovery
 mechanism restores those old contents from the recovery cache and returns the
 program to a recovery point, which is the state before entering the recovery
 block. Another alternate, if any, is then initiated. 
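
 The control flow just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Path Pascal implementation: a dictionary stands in for the global variables, a deep copy stands in for the hardware recovery cache, and the names recovery_block, fast_sort, simple_sort and sorted_and_complete are hypothetical.

```python
import copy

class RecoveryBlockError(Exception):
    """Raised when the primary and every alternate fail the acceptance test."""

def recovery_block(state, acceptance_test, routines):
    """Run routines (primary first) until one passes the acceptance test.

    `state` is a dict standing in for the global variables; each routine
    mutates it in place. Returns the name of the routine that was accepted.
    """
    checkpoint = copy.deepcopy(state)          # recovery point on entry
    for routine in routines:
        try:
            routine(state)
            if acceptance_test(state):
                return routine.__name__        # accepted: leave the block
        except Exception:
            pass                               # implicit error = failed test
        state.clear()                          # roll back to the recovery point
        state.update(copy.deepcopy(checkpoint))
    raise RecoveryBlockError("all alternates failed")

def fast_sort(state):
    # Deliberately faulty primary (assumption for the demo): drops an element.
    state["data"] = sorted(state["data"])[1:]

def simple_sort(state):
    # Alternate with an independent design: plain bubble sort.
    data = list(state["data"])
    for i in range(len(data)):
        for j in range(len(data) - 1 - i):
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
    state["data"] = data

def sorted_and_complete(state):
    # Acceptance test: the result is ordered and no element was lost.
    d = state["data"]
    return len(d) == state["n"] and all(a <= b for a, b in zip(d, d[1:]))

state = {"data": [3, 1, 2], "n": 3}
used = recovery_block(state, sorted_and_complete, [fast_sort, simple_sort])
```

 In this run the primary fails the acceptance test, the state is restored, and the alternate is accepted. Copying the whole state is far more expensive than a recovery cache, which saves only the variables that are actually modified.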
 
 Recovery blocks do provide alternative algorithms to cope with general 
 logical errors but can not handle timing errors properly. Recovery blocks are 
 insensitive to the passage of time and cannot be used to ensure that a 
 response is generated by a specific deadline. (That is, it is impossible to 
 recover from a missed deadline by resetting the system clock and executing an 
 alternate.) We need another mechanism to handle the timing errors in real-time
 systems.
 
 2.2 Deadline Mechanism
 
 The deadline mechanism, although similar to the recovery block scheme, 
 performs a different function. The deadline mechanism includes scheduling 
 mechanisms to ensure timely responses. 
 
 The deadline mechanism associates two algorithms with each particular 
 service component. The primary algorithm produces a better quality service 
 than the alternate. The alternate is a simpler, deterministic algorithm which
 always produces an acceptable result in a known, fixed length of time. The 
 deadline mechanism provides fault-tolerance in a real-time system by ensuring 
 that either the primary or the alternate will complete before the deadline. A 
 set of simulations demonstrating the feasibility of fault-tolerance in real- 
 time systems is described in [8]. 
 
 Reliable scheduling of primaries or alternates to meet real-time con- 
 straints requires the calculation of the execution time of each algorithm. 
 Accurate determination of execution times is critical to system performance 
 and reliability. Since upper bounds on the execution times for the alternate 
 algorithms are assumed, the deadline mechanism has been designed so that the 
 scheduler reserves a time for execution of the alternate as requests arrive. 
 Primaries are scheduled in any remaining time which is called slack time. 
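
 The reservation idea can be illustrated for a single request with a small Python sketch. It is a simplification made for illustration only: time is an abstract integer (for example milliseconds), the primary is assumed to run in the slack time ahead of the reserved alternate interval, and the function name serve_request and its parameters are hypothetical.

```python
def serve_request(deadline, alternate_wcet, primary_time, primary_ok=True):
    """Decide which result is delivered for one request.

    All arguments are durations in the same unit, measured from the
    arrival of the request. `primary_time` is how long the primary would
    run (or when it fails); `alternate_wcet` is the assumed upper bound
    on the execution time of the alternate.
    Returns (which algorithm answered, completion time).
    """
    if alternate_wcet > deadline:
        raise ValueError("alternate cannot meet the deadline")
    slack = deadline - alternate_wcet      # time available to the primary
    if primary_ok and primary_time <= slack:
        return "primary", primary_time
    # The primary failed, or the timer expired at the end of the slack
    # time: abandon it and run the alternate in the reserved interval.
    abandoned_at = min(primary_time, slack)
    return "alternate", abandoned_at + alternate_wcet

on_time = serve_request(1000, 200, 500)          # primary finishes in the slack
overrun = serve_request(1000, 200, 900)          # primary would miss the deadline
crashed = serve_request(1000, 200, 300, False)   # primary fails at t = 300
```

 In all three cases a result is delivered no later than the deadline, which is the guarantee the mechanism is intended to give as long as the bound on the alternate holds.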
 
 3 Programming Tools for Real-Time Systems
 
 Reliability can be improved by the development of high level program- 
 ming languages to support real-time systems [10]. HAL/S [11] has been used to 
 program the Space Shuttle control system. Modula [12], Concurrent Pascal 
 [13], and Path Pascal [14, 15] are all efforts in this direction. The Ada
 [16] language may also be used for real-time programming but is as yet
 unavailable. These languages all offer sequential features that support reliable
 coding, such as modularity, data abstraction, and control structures. Concurrent
 features (called tasks or processes) are also provided in these high level 
 languages to facilitate systems programming. A process or task is an indepen- 
 dent execution sequence. Processes or tasks can interact and are coordinated 
 by performing operations on shared variables through some high level synchron- 
 ization mechanisms (e.g., Monitors [17] and Path Expressions [18]). These 
 languages also permit I/O device programming using hardware dependent
 features. However, few include fault-tolerant mechanisms. Here, we briefly 
 describe our efforts to incorporate the error recovery mechanisms mentioned in 
 the previous section into the Path Pascal language. 
 
 3.1 Recovery Blocks in Path Pascal
 
 The syntax of recovery blocks [6] as incorporated in Path Pascal is 
 shown below: 
 
     ensure <acceptance test>
         by <procedure name>         "primary"
         else by <procedure name>    "1st alternate"
         ...
         else by <procedure name>    "nth alternate"
         else error;
 
 The acceptance test is a Boolean expression which is evaluated after
 the execution of the primary. If the result is true, the statement following 
 the recovery block is executed. However, should the result be false, the 
 state of the computation is restored to that at entry to the recovery block 
 and the first alternate is tried and so on. If all the alternates fail to 
 produce acceptable results, then this is regarded as a failure of the entire 
 recovery block. Since recovery blocks may be nested, recovery actions for a 
 failed recovery block must be undertaken by an enclosing recovery block if 
 any. 
 
 The recovery cache mechanism has been implemented in the Path Pascal 
 run-time system. A recovery cache is basically organized as a stack and is 
 used to store the address and the old content of global variables modified 
 during the execution of a recovery block. Upon a failure of the acceptance 
 test, the modified global variables will be restored to the state prior to 
 entering the recovery block. Each process has its own recovery cache. The 
 storage for recovery cache is allocated when a process is instantiated. 
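
 A stack-organized recovery cache of this kind can be sketched as follows. The Python class below is a hypothetical illustration, not the run-time system's code: variable names stand in for addresses, and the methods enter, write, restore and commit are assumed names for entering a recovery block, assigning a global variable, failing the acceptance test, and passing it.

```python
class RecoveryCache:
    """Stack of (name, old value) entries, partitioned by recovery point."""

    _ABSENT = object()   # marks a variable that did not exist on entry

    def __init__(self, variables):
        self.variables = variables   # dict standing in for global storage
        self.stack = []              # saved (name, old value) pairs
        self.marks = []              # stack height at each recovery point

    def enter(self):
        """Establish a recovery point on entry to a recovery block."""
        self.marks.append(len(self.stack))

    def write(self, name, value):
        """Assign a variable, saving its old content on first modification."""
        if self.marks:
            base = self.marks[-1]
            if all(n != name for n, _ in self.stack[base:]):
                old = self.variables.get(name, self._ABSENT)
                self.stack.append((name, old))
        self.variables[name] = value

    def restore(self):
        """Undo every write since the innermost recovery point."""
        base = self.marks[-1]
        while len(self.stack) > base:
            name, old = self.stack.pop()
            if old is self._ABSENT:
                self.variables.pop(name, None)
            else:
                self.variables[name] = old

    def commit(self):
        """Leave the innermost block after a passed acceptance test."""
        base = self.marks.pop()
        if not self.marks:
            del self.stack[base:]    # outermost block: discard saved values
        # Nested block: entries are kept so that the enclosing recovery
        # block can still roll back past this point.

g = {"x": 1, "y": 2}
cache = RecoveryCache(g)
cache.enter()                  # outer recovery block
cache.write("x", 10)
cache.enter()                  # nested recovery block
cache.write("x", 20)
cache.write("y", 30)
cache.restore()                # inner acceptance test failed
after_inner_restore = dict(g)
cache.write("y", 5)            # an inner alternate runs
cache.commit()                 # inner acceptance test passed
cache.restore()                # outer acceptance test failed
after_outer_restore = dict(g)
```

 Because the entries of a committed nested block stay on the stack, a later failure of the enclosing block still restores the state that held on entry to that enclosing block.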
 
 The domino effect, first described in [6], is an uncontrolled propagation
 of state restoration among interacting processes. A domino-free system
 can be achieved by an approach similar to the "programmer-transparent
 coordination of recovering parallel processes" presented in [19]. By introducing some
 additional recovery points into the process that receives messages from other
 processes, the domino effect can be eliminated.
 
 3.2 Deadline Mechanism in Path Pascal
 
 Path Pascal has been extended to include the fault-tolerant deadline 
 mechanism. The extensions should not be considered as a language design, 
 rather as experimental features which have allowed the programming of example 
 reliable real-time systems. 
 
 The language construct for the fault-tolerant deadline mechanism 
 incorporated into Path Pascal is termed a deadline process. Each deadline
 process provides fault-tolerant deadline service. The repeat statement in 
 Path Pascal has been augmented to have an optional phrase every to specify the 
 request period (time between successive requests) for periodic processes. A 
 within statement must be included in a deadline process to specify the 
 response period (time within which the system must respond to a request). 
 Within the deadline process one service statement appears, which specifies the 
 primary and alternate algorithms that are to be used to satisfy the request. 
 
 The syntax and semantics of the deadline mechanism in Path Pascal can
 be best explained by a simple example as follows: 
 
     deadline process positionupdate;

       procedure compute;
       begin
         (* primary algorithm *)
       end;

       procedure approximate;
       begin
         (* alternative algorithm *)
       end;

     begin (* positionupdate *)
       repeat every twoseconds
         within onesecond do
           service compute
           else approximate;
       until navigation_terminated;
     end;
 
 In the example above, compute is a primary and approximate an alternate
 to do the position update. Deadline process "positionupdate" receives
 requests at a rate of one every two seconds. The deadline process must pro- 
 duce a response within one second upon request. The execution of the service 
 statement sets a timer which is used to detect timing errors in the primary. 
 The timer allows sufficient time for the processor to execute the alternate. 
 The results from the primary will be preferred to the alternate's if the pri- 
 mary completes successfully. 
 
 The deadline of a process is calculated from the response period when
 a service statement is executed. A request is generated by the system clock
 for periodic processes based on their request periods. Non-periodic processes
 use the minimum value among request periods as a specification for all
 requests: the "every" phrase in the repeat statement is replaced by "at
 least". Synchronization primitives are used to generate a request for
 non-periodic processes. If a request arrives too soon, it is delayed until the
 correct request time calculated according to the request period.
 
 Alternates are scheduled according to the rate-monotonic priority 
 assignment [20] using the response period as a static priority, with small 
 values having high priority. At any given instant the alternate with the 
 highest priority (smallest response period) executes. This scheme requires 
 preemption of alternates. Such a system is timely if processor utilization is
 less than ln 2 (approx. 0.69) [20]. The primaries are scheduled in
 earliest-deadline-first order. Within a request period, the alternate is executed
 first. This is called the first-chance scheduling algorithm. Another optimal
 scheduling algorithm, which maximizes the number of primaries executed for
 a set of periodic processes, has been developed and is reported in [21].
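
 The utilization condition can be checked with a short calculation. The Python sketch below uses the least upper bound n(2^(1/n) - 1) given in [20] for n periodic tasks, which decreases toward ln 2 as n grows; the test is sufficient, not necessary. The execution times in the example are hypothetical and are not taken from the simulation in [22].

```python
import math

def rm_bound(n):
    """Least upper bound on utilization for n tasks under rate-monotonic
    priority assignment [20]; tends to ln 2 as n grows."""
    return n * (2 ** (1.0 / n) - 1)

def passes_rm_test(tasks):
    """tasks: list of (execution_time, period) pairs in the same time unit.

    Returns (True if the sufficient utilization test holds, utilization).
    """
    utilization = sum(c / t for c, t in tasks)
    return utilization <= rm_bound(len(tasks)), utilization

# Hypothetical alternates: (worst-case execution time, period) in ms.
alternates = [(10, 125), (10, 125), (50, 500), (100, 1000), (200, 1000)]
ok, u = passes_rm_test(alternates)
limit = math.log(2)   # the asymptotic bound quoted in the text
```

 For this task set the utilization is 0.56, below the five-task bound of about 0.74 and also below ln 2, so the alternates would all meet their deadlines under the stated assumptions.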
 
 The recovery caches of the recovery block are also used for the dead- 
 line mechanism and can be shared. Recovery cache algorithms have been modi- 
 fied to hold results from the alternate until the primary either fails or com- 
 pletes successfully in the deadline mechanism. 
 
 A compiler option also provides an estimate of the execution time for 
 compiled functions, procedures and processes. The compiler checks the execu- 
 tion time for the code(s) included in the "within" statement and verifies that 
 it is less than the corresponding response period. A global verification of 
 all deadline processes meeting deadlines is also made by the compiler. 
 
 A satellite on-board computer system simulation using deadlines in 
 Path Pascal has been reported in [22]. The results show the potential of the 
 deadline mechanism to describe a time-critical real-time system. 
 
 4 Top-Down Modular Design
 
 In this section, a top-down modular design of a real-time system is 
 presented. A mathematical model for periodic real-time systems estimates the 
 effectiveness of the approach. 
 
 4.1 Concurrent Processes
 
 Real-time programs often monitor and control several external
 processes. This leads to decomposition of a real-time system into a disjoint 
 set of processes. This disjoint set of processes may be expressed as operating 
 in parallel. The description of a real-time system as distinct concurrent 
 processes may be simpler than as a single integrated sequential task.
 Concurrent features in Path Pascal support such decomposition. Furthermore,
 timing constraints are assumed to be imposed upon each process. A timing constraint
 specifies how often the process should run (request period) and how soon the 
 process should respond (response period). The augmented options in the 
 "repeat" statement and the "within" statement in Path Pascal can be used to 
 achieve this goal. 
 
 For simplicity of analysis, it is assumed in this paper that a real- 
 time software system can be expressed by a set of periodic processes and each 
 process's request period is a multiple of the next smallest request period. 
 To ensure the timely responses of the whole real-time system, the deadline 
 mechanism is applied to each periodic process. The "service" statement embed- 
 ded in a "deadline process" in Path Pascal can be employed to perform this 
 function. 
 
 
 Now suppose the underlying real-time software system consists of $M$
 ($> 0$) periodic processes with request periods $t_1, t_2, \ldots, t_M$, respectively.
 Assume $t_1 \le t_2 \le \cdots \le t_M$ and $t_{i+1}$ is a multiple of $t_i$ for
 $i = 1, 2, \ldots, M-1$. Let

 \[ n_i = \frac{t_M}{t_i}, \qquad i = 1, 2, \ldots, M, \]

 and suppose $f_i$ is the failure probability of process $i$ during one request
 period $t_i$. Note that the failure here includes the failures caused by logical
 errors or timing errors. The failure probability of the total system during
 one request period $t_M$, denoted by $F$, can be approximated as follows:

 During the period of $t_M$, process $i$ will be executed $n_i$ times. The
 reliability of process $i$ during $t_M$ will be $(1-f_i)^{n_i}$. For the system
 consisting of $M$ processes, the reliability during $t_M$ is then
 $\prod_{i=1}^{M} (1-f_i)^{n_i}$.

 So the failure probability $F$ will be

 \[
 \begin{aligned}
 F &= 1 - \prod_{i=1}^{M} (1-f_i)^{n_i} \\
   &= 1 - (1 - n_1 f_1 + \cdots)(1 - n_2 f_2 + \cdots) \cdots (1 - n_M f_M + \cdots) \\
   &= n_1 f_1 + n_2 f_2 + \cdots + n_M f_M \pm \text{some higher order terms}
 \end{aligned}
 \]

 Since $f_i$ is very small (i.e., $0 < f_i \ll 1$), the higher order terms can
 be ignored. Then we have

 \[ F \approx \sum_{i=1}^{M} n_i f_i \qquad (1) \]
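
 The quality of the linear approximation (1) is easy to check numerically. In the Python sketch below the frequency counts are those of the five processes in Table 1, while the per-period failure probabilities of 10^-6 are hypothetical values chosen only for illustration.

```python
def system_failure_exact(n, f):
    """F = 1 - prod (1 - f_i)^(n_i) over all processes."""
    reliability = 1.0
    for n_i, f_i in zip(n, f):
        reliability *= (1.0 - f_i) ** n_i
    return 1.0 - reliability

def system_failure_linear(n, f):
    """Linear approximation (1): F is about sum n_i * f_i."""
    return sum(n_i * f_i for n_i, f_i in zip(n, f))

n = [8, 8, 2, 1, 1]     # frequency counts n_i from Table 1
f = [1e-6] * 5          # hypothetical process failure probabilities

exact = system_failure_exact(n, f)
linear = system_failure_linear(n, f)
gap = linear - exact    # the neglected higher order terms
```

 For these values the linear estimate is 2 x 10^-5 and exceeds the exact value by roughly 2 x 10^-10, so the approximation errs slightly on the pessimistic side.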
 
 4.2 Stepwise Refinement
 
 The code in a process can be considered as a sequential program. The 
 stepwise refinement techniques now can be applied to the construction of a 
 real-time process. Each process in the underlying real-time system can be 
 further decomposed to a set of segments in a structured way. Three basic
 structures are used extensively for the refinement, i.e., sequence,
 if-then-else, and while loop. Fig. 1 shows their basic flow structures.
 
 Fig. 1 Basic structures for refinement (sequence, if-then-else, while loop)

 Note that either □ or ○ represents a segment. Assume process $i$ contains
 $N_i$ segments of which each has failure probability $f_{ij}$, $j = 1, 2, \ldots, N_i$.
 Let $e_{ij}$ be the execution probability associated with segment $j$ in
 process $i$. ($e_{ij}$ represents the conditional probability that segment $j$ will be
 executed during the execution of process $i$.) The failure probability of
 process $i$, $f_i$, can then be approximated as:

 \[ f_i \approx \sum_{j=1}^{N_i} e_{ij} f_{ij} \qquad (2) \]
 
 The formal proof can be found in [23] Appendix B, though the notations 
 used in [23] are different. Here we briefly sketch the proof by illustrating 
 a simple example of a real-time process shown in Fig. 2. 
 
 Fig. 2 Real-time process i - a simple example 
 
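 In the example above, there are 3 possible paths between the beginning
 and the end of the process (within one request period). Let $\pi_{mn}$ be the
 probability that segment $n$ will be executed given the fact that segment $m$ has been
 executed. Each segment execution probability ($e_{ij}$) and path execution
 probability can then be expressed in terms of the $\pi$'s. In Fig. 2,

 \[
 \begin{aligned}
 e_{i1} &= \pi_{81} = 1 \\
 e_{i2} &= \pi_{12} e_{i1} = \pi_{12} \\
        &\;\;\vdots \\
 e_{i8} &= \pi_{12}\pi_{23}\pi_{35}\pi_{57}\pi_{78} + \pi_{12}\pi_{23}\pi_{35}\pi_{58} + \pi_{12}\pi_{24}\pi_{46}\pi_{68} = 1
 \end{aligned}
 \qquad (3)
 \]

 and

 \[
 \begin{aligned}
 \Pr(\text{path 1}) &= \pi_{12}\pi_{23}\pi_{35}\pi_{57}\pi_{78} \\
 \Pr(\text{path 2}) &= \pi_{12}\pi_{23}\pi_{35}\pi_{58} \\
 \Pr(\text{path 3}) &= \pi_{12}\pi_{24}\pi_{46}\pi_{68}
 \end{aligned}
 \qquad (4)
 \]

 Let $\Pr(\text{failure} \mid \text{path } j)$ be the failure probability during the execution
 of path $j$. For path 1 we have

 \[
 \begin{aligned}
 \Pr(\text{failure} \mid \text{path 1}) &= f_{i1} + (1-f_{i1})f_{i2} + (1-f_{i1})(1-f_{i2})f_{i3} + \cdots \\
   &\quad + (1-f_{i1})(1-f_{i2})(1-f_{i3})(1-f_{i5})(1-f_{i7})f_{i8} \\
   &= f_{i1} + f_{i2} + f_{i3} + f_{i5} + f_{i7} + f_{i8} \pm \text{some higher order terms}
 \end{aligned}
 \]

 A linear approximation results in

 \[
 \begin{aligned}
 \Pr(\text{failure} \mid \text{path 1}) &\approx f_{i1} + f_{i2} + f_{i3} + f_{i5} + f_{i7} + f_{i8} \\
 \text{Similarly,}\quad \Pr(\text{failure} \mid \text{path 2}) &\approx f_{i1} + f_{i2} + f_{i3} + f_{i5} + f_{i8} \\
 \text{and}\quad \Pr(\text{failure} \mid \text{path 3}) &\approx f_{i1} + f_{i2} + f_{i4} + f_{i6} + f_{i8}
 \end{aligned}
 \qquad (5)
 \]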
 
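 Since

 \[ f_i = \sum_{j=1}^{3} \Pr(\text{failure} \mid \text{path } j)\,\Pr(\text{path } j), \qquad (6) \]

 substitute (4) and (5) into (6) and group the terms for each $f_{ij}$. If
 we compare the coefficients of the $f_{ij}$'s with (3), we then have

 \[ f_i \approx \sum_{j=1}^{N_i} e_{ij} f_{ij}, \qquad \text{where } N_i = 8. \]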
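
 The path argument can also be checked numerically. The Python sketch below encodes the three paths of the example (1-2-3-5-7-8, 1-2-3-5-8 and 1-2-4-6-8) as a transition table. The branch probabilities and the segment failure probabilities of 10^-4 are hypothetical values chosen for illustration; they do not come from Fig. 2.

```python
# Hypothetical transition probabilities pi_mn for the example process.
PI = {
    (1, 2): 1.0,
    (2, 3): 0.7, (2, 4): 0.3,
    (3, 5): 1.0,
    (5, 7): 0.4, (5, 8): 0.6,
    (7, 8): 1.0,
    (4, 6): 1.0,
    (6, 8): 1.0,
}

def paths(node=1, end=8):
    """Yield (list of segments, path probability) for every path to `end`."""
    if node == end:
        yield [node], 1.0
        return
    for (m, n), p in PI.items():
        if m == node:
            for rest, q in paths(n, end):
                yield [node] + rest, p * q

def execution_probabilities():
    """e_ij: sum of the probabilities of the paths that contain segment j."""
    e = {}
    for path, pr in paths():
        for seg in path:
            e[seg] = e.get(seg, 0.0) + pr
    return e

def failure_exact(f):
    """Sum over paths of Pr(path) * Pr(failure | path), without linearizing."""
    total = 0.0
    for path, pr in paths():
        ok = 1.0
        for seg in path:
            ok *= 1.0 - f[seg]
        total += pr * (1.0 - ok)
    return total

def failure_linear(f):
    """Approximation (2): f_i is about sum over j of e_ij * f_ij."""
    e = execution_probabilities()
    return sum(e[seg] * f[seg] for seg in e)

f = {seg: 1e-4 for seg in range(1, 9)}   # hypothetical segment failure rates
e = execution_probabilities()
exact, linear = failure_exact(f), failure_linear(f)
```

 With these values the linear estimate is 5.28 x 10^-4 and differs from the exact sum over paths only in the neglected second-order terms, which are of the order of 10^-7.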
 
 Combining (1) and (2), in general, we have

 \[ F \approx \sum_{i=1}^{M} n_i \sum_{j=1}^{N_i} e_{ij} f_{ij} \]

 Let $g_{ij} = n_i e_{ij}$; we then obtain

 \[ F \approx \sum_{i=1}^{M} \sum_{j=1}^{N_i} g_{ij} f_{ij} \qquad (7) \]
 
 Now consider the case where the recovery block scheme is applied to segment $j$ in
 process $i$. Assume there are 2 alternates in the recovery block and the failure
 probability of the primary or alternates is denoted by $f_{ijk}$ ($k = 1, 2, 3$). Let
 $p_{ij}$ be the failure probability of the corresponding acceptance test (i.e., the
 probability that the acceptance test itself contains some logical errors).
 The flow of a recovery block suggests that either the acceptance test failure
 or the exhaustion of the alternates would cause a failure of the segment. So
 we have

 \[ f_{ij} = p_{ij} + f_{ij1}(1-p_{ij})p_{ij} + f_{ij1}f_{ij2}(1-p_{ij})^2 p_{ij} + f_{ij1}f_{ij2}f_{ij3}(1-p_{ij})^3 \qquad (8) \]
 
 This is different from the "transition model for an application 
 routine" developed in [24] where transitions of a recovery block during some 
 arbitrary interval are considered. Formula (8) implies that $f_{ij}$ is limited by
 the value of $p_{ij}$: it can never be smaller. If the acceptance test can be so
 designed that $p_{ij}$ is very close to zero, then the failure probability of the
 segment is approximated by the product of the failure probabilities of the
 primary and the alternates. Such an acceptance test can be obtained if the
 inverse of the segment function exists or the results from the segment can be
 clearly predicted. However, that kind of acceptance test is very hard to
 achieve in many cases. Another approach is proposed to replace the recovery
 block scheme.
 
 Consider a two-out-of-three voting unit which has three alternates and 
 one voting component. The alternates are very similar to those in recovery 
 blocks except they must perform exactly the same functions (using different 
 algorithms). The voting component can be much simpler than the acceptance 
 test in recovery blocks. The logic of the voting component compares the 
 results produced by three alternates and makes a majority vote. The major 
 difference between the recovery block scheme and the majority voting unit is
 that once the acceptance test is passed, no further alternate will be tried in the
 former scheme, but all alternates must be executed in the latter case.
 
 A possible syntax for the majority voting unit is as follows: 
 
     majority vote from
             <alternate 1>
         and <alternate 2>
         and <alternate 3>;
 
 One possible implementation holds the results from each alternate in a
 recovery cache. Global variables are updated only if at least two alternates
 produce results that agree when their recovery caches are compared (paying
 careful attention to the comparison of floating point variables).
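
 A two-out-of-three voter along these lines can be sketched in Python. This is a simplified illustration, not the proposed implementation: the alternates return their results directly instead of writing to recovery caches, floating point results are compared with a tolerance, and the names agree and majority_vote as well as the tolerance values are assumptions.

```python
import math

def agree(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """Compare two results; floats are compared with a tolerance."""
    if isinstance(a, float) or isinstance(b, float):
        return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
    return a == b

def majority_vote(alternates, *args):
    """Run every alternate and return a result that at least two agree on."""
    results = []
    for alt in alternates:
        try:
            results.append((True, alt(*args)))
        except Exception:
            results.append((False, None))   # a crashed alternate casts no vote
    for i in range(len(results)):
        for j in range(i + 1, len(results)):
            ok_i, r_i = results[i]
            ok_j, r_j = results[j]
            if ok_i and ok_j and agree(r_i, r_j):
                return r_i
    raise RuntimeError("no two alternates agree")

def newton_sqrt(x):
    """Square root by Newton iteration (an independent design)."""
    guess = x if x > 1.0 else 1.0
    for _ in range(60):
        guess = 0.5 * (guess + x / guess)
    return guess

def faulty_sqrt(x):
    """Deliberately wrong alternate used to exercise the voter."""
    return x / 2.0

root = majority_vote([faulty_sqrt, math.sqrt, newton_sqrt], 2.0)
```

 Unlike a recovery block, all three alternates run on every call, so the cost is always three executions, but no application-specific acceptance test is needed.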
 
 Suppose this mechanism is applied to segment $j$ in process $i$. Since
 the logic of the voting component is so simple, its failure probability is
 assumed to be zero. Using the same notations employed in recovery blocks for
 the failure probability of the alternates, we then have

 \[ f_{ij} = f_{ij1}f_{ij2}f_{ij3} + (1-f_{ij1})f_{ij2}f_{ij3} + (1-f_{ij2})f_{ij1}f_{ij3} + (1-f_{ij3})f_{ij1}f_{ij2} \qquad (9) \]
 
 By comparing (8) and (9), we can conclude that if the acceptance test 
 of a recovery block cannot be designed with confidence then a majority voting 
 unit should be used, provided that alternates perform the same functions. 
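
 The comparison can be made concrete by evaluating both formulas for sample values. In the Python sketch below, recovery_block_failure follows (8) read as: the acceptance test fails on the primary, or it fails on a later alternate after correctly rejecting the earlier ones, or all three routines fail. The function voting_failure follows (9). The failure probabilities of 10^-3 and the acceptance test values are hypothetical.

```python
def recovery_block_failure(f1, f2, f3, p):
    """Formula (8): primary plus two alternates, acceptance test failure p."""
    q = 1.0 - p
    return p + f1 * q * p + f1 * f2 * q ** 2 * p + f1 * f2 * f3 * q ** 3

def voting_failure(f1, f2, f3):
    """Formula (9): at least two of the three alternates fail."""
    return (f1 * f2 * f3
            + (1 - f1) * f2 * f3
            + (1 - f2) * f1 * f3
            + (1 - f3) * f1 * f2)

f = 1e-3                                                   # hypothetical
perfect_test = recovery_block_failure(f, f, f, 0.0)        # about 1e-9
imperfect_test = recovery_block_failure(f, f, f, 1e-4)     # about 1e-4
voter = voting_failure(f, f, f)                            # about 3e-6
```

 With a perfect acceptance test the recovery block is the better choice, but with these values an acceptance test failure probability of only 10^-4 already makes it worse than the voting unit.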
 
 4.3 Analysis of the Model
 
 The model shows where the critical paths are. The larger the value of
 $g_{ij}$ in (7), the more critical the software segment is. In other words, a
 segment with high execution probability residing in a frequently executed
 process should include more fault-tolerance.
 
 A simple analysis for the effectiveness of the approach can be made by
 exercising the model. The example real-time software system which controls
 the attitude of a satellite is taken from [22]. The request period ($t_i$),
 frequency count ($n_i$), number of segments ($N_i$), approximated segment execution
 probabilities ($e_{ij}$'s) and $g_{ij}$'s for each process are shown in Table 1. Each
 segment consists of no more than 10 Path Pascal statements.
 
 Process  Name                              t_i      n_i  N_i  e_ij        g_ij
 -------  --------------------------------  -------  ---  ---  ----------  ----
 1        Read Gyro                         125 ms   8    1    1.0         8.0
 2        Inertial Wheel                    125 ms   8    1    1.0         8.0
 3        Gyro Process                      0.5 sec  2    21   6 of 0.4    0.8
                                                               2 of 0.5    1.0
                                                               13 of 1.0   2.0
 4        Read Sensors                      1 sec    1    1    1.0         1.0
 5        Attitude Determination & control  1 sec    1    33   6 of 0.25   0.25
                                                               13 of 0.5   0.5
                                                               14 of 1.0   1.0

 Table 1.
 
 Without loss of generality, indices of the segments are assumed to be
 in increasing order of the values of the $e_{ij}$'s. Substituting the figures in
 Table 1 into (7) we have

 \[
 \begin{aligned}
 F \approx{} & 8 f_{11} + 8 f_{21} + 0.8 \sum_{j=1}^{6} f_{3j} + \sum_{j=7}^{8} f_{3j} + 2 \sum_{j=9}^{21} f_{3j} + f_{41} \\
   & + 0.25 \sum_{j=1}^{6} f_{5j} + 0.5 \sum_{j=7}^{19} f_{5j} + \sum_{j=20}^{33} f_{5j}
 \end{aligned}
 \]

 If the system failure probability is required to be $10^{-6}$ (which is
 approximately equivalent to MTBF = 11 days), then each $f_{ij}$ needs to be in the
 order of $10^{-7}$ or $10^{-8}$. By applying recovery blocks and the two-out-of-three
 majority voting units properly to the segments, the failure probabilities of
 the primaries or alternates can be constructed in the order of $10^{-3}$ or $10^{-4}$.
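
 The arithmetic behind these orders of magnitude can be reproduced from Table 1. The Python sketch below assumes a required system failure probability of 10^-6 per one-second period (the value consistent with the quoted MTBF of about 11 days) and, as a simplifying assumption of ours, that every segment has the same failure probability.

```python
# Table 1: process name -> (n_i, [(number of segments, e_ij), ...])
TABLE_1 = {
    "Read Gyro": (8, [(1, 1.0)]),
    "Inertial Wheel": (8, [(1, 1.0)]),
    "Gyro Process": (2, [(6, 0.4), (2, 0.5), (13, 1.0)]),
    "Read Sensors": (1, [(1, 1.0)]),
    "Attitude Determination & control": (1, [(6, 0.25), (13, 0.5), (14, 1.0)]),
}

def total_weight(table):
    """Sum of g_ij = n_i * e_ij over every segment of every process."""
    return sum(n_i * count * e
               for n_i, groups in table.values()
               for count, e in groups)

weight = total_weight(TABLE_1)        # coefficient sum in (7)
target_F = 1e-6                       # assumed required failure probability
uniform_f = target_F / weight         # per-segment budget if all f_ij are equal
period_seconds = 1.0                  # largest request period t_M in Table 1
mtbf_days = period_seconds / target_F / 86400.0
```

 The coefficients sum to 71.8, so a uniform budget is about 1.4 x 10^-8 per segment, and one failure per 10^6 one-second periods corresponds to roughly 11.6 days.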
 
 5 Conclusions
 
 Fault-tolerance complements the fault-avoidance approach and further 
 improves system reliability. In this paper, a construction of a fault-
 tolerant real-time software system by top-down modular design and stepwise 
 refinement techniques is presented. Various programming tools for fault- 
 tolerance in real-time systems are discussed. A simplified mathematical model 
 to estimate the effectiveness of the approach is also described. Though accu- 
 rate evaluation of the reliability of a software component is difficult, the 
 model supports the application of modular design and stepwise refinement to 
 fault-tolerance in construction of reliable real-time software. The model 
 suggests that large, highly reliable systems can be built from segments of 
 software that are as reliable as current software engineering primitives 
 (testing, verification, etc.) permit. 
 
 6 Acknowledgements
 
 We would like to thank Professor G. Belford (University of Illinois) 
 for her helpful suggestions during the preparation of this paper. The finan- 
 cial support from NASA (Grant NSG 1471) is also acknowledged. 
 
 References 
 
 [1] Parnas, D., "On the Criteria to be Used in Decomposing Systems into
 Modules," CACM, Vol. 15, No. 12, pp. 1053-1058, December, 1972.

 [2] Dijkstra, E. W., "Notes on Structured Programming," EWD 249, Technical U.
 Eindhoven, The Netherlands, 1969.

 [3] Goodenough, J. B. and S. L. Gerhart, "Toward a Theory of Test Data
 Selection," IEEE Transactions on Software Engineering, Vol. 1, No. 3, pp.
 156-173, 1975.

 [4] Hantler, S. L., and J. C. King, "An Introduction to Proving the
 Correctness of Programs," ACM Computing Surveys, Vol. 8, No. 4, pp. 391-407,
 Dec. 1976.

 [5] Horning, J. J., H. C. Lauer, P. M. Melliar-Smith and B. Randell, "A
 Program Structure for Error Detection and Recovery," in Proc. Conf.
 Operating Systems: Theoretical and Practical Aspects, IRIA, 1974, pp. 177-193.
 (Reprinted in Lecture Notes in Computer Science, Vol. 16, Springer-Verlag,
 New York.)

 [6] Randell, B., "System Structure for Software Fault Tolerance," IEEE Trans.
 on Software Engineering, Vol. SE-1, No. 2, 1975, pp. 220-232.

 [7] Wirth, N., "Program Development by Stepwise Refinement," CACM, Vol. 14,
 No. 4, pp. 221-227, April, 1971.

 [8] Campbell, R. H., K. H. Horton and G. G. Belford, "Simulations of a
 Fault-Tolerant Deadline Mechanism," Proc. of the 9th International Conf.
 on Fault-Tolerant Computing, Madison, Wisconsin, June, 1979.

 [9] Anderson, T., and R. Kerr, "Recovery Blocks in Action: A System
 Supporting High Reliability," 2nd International Conference on Software
 Engineering, pp. 447-457, October, 1976.

 [10] Wirth, N., "Toward a Discipline of Real-Time Programming," CACM, Vol. 20,
 No. 8, pp. 577-583, August, 1977.

 [11] Intermetrics Inc., HAL/S Manual, 1975.

 [12] Wirth, N., "Modula: A Language for Modular Multiprogramming,"
 Software-Practice and Experience, 7, pp. 3-84, 1977.

 [13] Brinch Hansen, P., The Architecture of Concurrent Programs,
 Prentice-Hall, Englewood Cliffs, New Jersey, 1977.

 [14] Campbell, R. H. and R. B. Kolstad, "Path Expressions in Pascal," Fourth
 International Conference on Software Engineering, Munich, Germany, Sept.
 1979.

 [15] Campbell, R. H. and R. B. Kolstad, "Practical Applications of Path Pascal
 to Systems Programming," ACM79, Detroit, 1979.
 
 
 [16] Wegner, P., "Programming with ADA: An Introduction by Means of Graduated
 Examples," SIGPLAN Notices, Vol. 14, No. 12, pp. 2-46, Dec. 1979.

 [17] Hoare, C. A. R., "Monitors: An Operating System Structuring Concept,"
 CACM, Vol. 17, No. 10, pp. 549-557, October, 1974.

 [18] Campbell, R. H., "Path Expressions: A Technique for Specifying Process
 Synchronization," Ph.D. Thesis, The University of Newcastle Upon Tyne,
 August, 1976.

 [19] Kim, K. H., "An Approach to Programmer-Transparent Coordination of
 Recovering Parallel Processes and Its Efficient Implementation Rules," Proc.
 of the 1978 International Conference on Parallel Processing, pp. 58-68,
 1978.

 [20] Liu, C. L. and J. W. Layland, "Scheduling Algorithms for Multiprogramming
 in a Hard-Real-Time Environment," JACM, Vol. 20, No. 1, 1973, pp. 46-61.

 [21] Liestman, A. L. and R. H. Campbell, "A Fault-tolerant Scheduling
 Problem," Tech. Report UIUCDCS-R-80-1010, Dept. of Computer Science,
 University of Illinois, Feb. 1980.

 [22] Wei, A. Y., K. Hiraishi, R. Cheng, R. H. Campbell, "Application of the
 Fault-Tolerant Deadline Mechanism to a Satellite On-Board Computer
 System," Proc. of the 10th International Conf. on Fault-Tolerant Computing,
 October, 1980.

 [23] Gannon, T. F., and S. D. Shapiro, "An Optimal Approach to Fault Tolerant
 Software Systems Design," IEEE Trans. on Software Engineering, Vol. SE-4,
 No. 5, 1978, pp. 390-409.

 [24] Hecht, H., "Fault-Tolerant Software for Real-Time Applications," Computing
 Surveys, Vol. 8, No. 4, 1976, pp. 391-407.
 
 BIBLIOGRAPHIC DATA SHEET

 1. Report No.: R-80-1042
 3. Recipient's Accession No.:
 4. Title and Subtitle: CONSTRUCTION OF A FAULT-TOLERANT REAL-TIME SOFTWARE SYSTEM
 5. Report Date: December 1980
 6.
 7. Author(s): Anthony Y. Wei and Roy H. Campbell
 8. Performing Organization Rept. No.: R-80-1042
 9. Performing Organization Name and Address: Department of Computer Science,
    University of Illinois, Urbana, IL 61801
 10. Project/Task/Work Unit No.:
 11. Contract/Grant No.: NASA NSG 1471
 12. Sponsoring Organization Name and Address: NASA Langley Research Center,
     Hampton, VA 23665
 13. Type of Report & Period Covered: technical
 14.
 15. Supplementary Notes:

 17. Key Words and Document Analysis. 17a. Descriptors

 fault-tolerance
 recovery block
 deadline mechanism
 majority vote
 stepwise refinement
 Path Pascal

 17b. Identifiers/Open-Ended Terms
 17c. COSATI Field/Group
 18. Availability Statement
 19. Security Class (This Report): UNCLASSIFIED
 20. Security Class (This Page): UNCLASSIFIED
 21. No. of Pages: 24
 22. Price

 FORM NTIS-35 (10-70)    USCOMM-DC 40329-P71