UIUCDCS-R-80-1042
UILU-ENG 80 1742

CONSTRUCTION OF A FAULT-TOLERANT REAL-TIME SOFTWARE SYSTEM

by

Anthony Y. Wei
Roy H. Campbell

December 1980

DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
URBANA, ILLINOIS 61801

Research supported in part by NASA Project NSG 1471

ABSTRACT

This paper presents an approach to the construction of a fault-tolerant real-time software system with high reliability. Top-down modular design and program development by stepwise refinement are used for the construction of fault-tolerant systems. The recovery block scheme, deadline mechanism, and a majority voting unit are employed to achieve the desired reliability for a real-time process, a segment or refinement. Criteria identifying the appropriate fault-tolerant schemes for a segment or refinement are described. A mathematical model estimates the effectiveness of this approach using a linear approximation technique. The model supports the application of modular design and stepwise refinement to fault-tolerance in the construction of reliable real-time software.

1 Introduction

Reliable software has traditionally been obtained through fault-avoidance techniques.
Efforts to achieve this goal include top-down modular design [1], structured programming [2], walkthroughs, testing techniques [3], design of proper programming tools, and program correctness proofs [4]. Despite these approaches, software systems cannot be guaranteed to be fault-free. Fault-tolerance [5, 6] is proposed to complement fault-avoidance and further improve the reliability of software systems. Fault-tolerance does not eliminate the need for reliable components: fault-avoidance techniques are still required when building fault-tolerance into reliable systems. A software system can provide continuous and trustworthy service by tolerating certain faults and by detecting and recovering from errors. Such activity is particularly valuable in real-time systems, such as air traffic control systems, where the delays caused by an interruption of service are unacceptable.

This paper presents an approach to the construction of a fault-tolerant real-time software system with high reliability. Top-down modular design and program development by stepwise refinement [7] are used for the construction of fault-tolerant systems. The recovery block scheme [5, 6], deadline mechanism [8], and a majority voting unit are employed to achieve the desired reliability for a real-time process, a segment or refinement. Criteria identifying the appropriate fault-tolerant schemes for a refinement or module are described. A mathematical model estimates the effectiveness of this approach using a linear approximation technique. The model supports the application of modular design and stepwise refinement to fault-tolerance in the construction of reliable real-time software.

The recovery block scheme and the deadline mechanism are briefly described in section 2. An effort to incorporate these recovery mechanisms into an existing language as a tool for fault-tolerant real-time programming is introduced in section 3.
In section 4 the construction of a fault-tolerant real-time software system using the tools provided is presented. A simplified mathematical model and an analysis of the effectiveness of the approach are also discussed in section 4.

2 Error Recovery Mechanisms in Real-Time Systems

Graceful degradation is a desirable attribute of a real-time system. The system should not simply halt operation in the event of an internal or external error. Two types of errors in real-time software systems are considered in this paper: logical errors and timing errors. Logical errors include residual design errors and run-time errors not detected in the testing process. Timing errors occur only in real-time systems where timely execution of tasks is required, especially in time-critical systems. A timing error occurs if a real-time system violates the timing constraints specified for that system. The recovery block scheme [5, 6] has been proposed to handle logical errors, and the deadline mechanism [8] is a solution to the timing error problem.

2.1 Recovery Blocks

The recovery block scheme [5, 6] has been proposed as a mechanism to support fault-tolerant software. A recovery block consists of an acceptance test, a primary and a collection of alternates. The acceptance test is an assertion about the results computed by the primary or the alternates. If the acceptance test is false after the primary has executed, an alternate is executed. Errors detected implicitly by the hardware or software during the execution of the primary or an alternate automatically set the acceptance test to false. If all alternates of a recovery block fail to satisfy the acceptance test, a software error is raised.

The idea of providing many alternates is very similar to providing stand-by sparing circuits in hardware. However, the redundancy required in software is not merely the replication of a program but redundancy of design.
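The control flow of a recovery block described above can be sketched in Python. This is only an illustration under stated assumptions, not the mechanism of [5, 6]: the names `recovery_block`, `RecoveryBlockError`, `fast_sort`, `insertion_sort` and `is_sorted` are hypothetical, and a deep copy of a state dictionary stands in for whatever support returns the program to its state at entry.

```python
import copy


class RecoveryBlockError(Exception):
    """Raised when the primary and every alternate fail the acceptance test."""


def recovery_block(state, acceptance_test, primary, *alternates):
    """Run the primary, then each alternate, until one passes the test.

    `state` is a dict standing in for the global variables. Each routine
    works on its own copy, so a failed attempt leaves `state` untouched.
    """
    for routine in (primary, *alternates):
        trial = copy.deepcopy(state)      # every attempt starts from the entry state
        try:
            routine(trial)
            accepted = acceptance_test(trial)
        except Exception:
            accepted = False              # a detected error counts as a failed test
        if accepted:
            state.clear()
            state.update(trial)           # commit the accepted results
            return routine.__name__
    raise RecoveryBlockError("all alternates exhausted")


def fast_sort(s):
    """Hypothetical faulty primary: it forgets to sort."""
    s["data"] = list(s["data"])


def insertion_sort(s):
    """Hypothetical alternate using a different algorithm."""
    out = []
    for x in s["data"]:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    s["data"] = out


def is_sorted(s):
    """Acceptance test: an assertion about the computed result."""
    d = s["data"]
    return all(a <= b for a, b in zip(d, d[1:]))


state = {"data": [3, 1, 2]}
used = recovery_block(state, is_sorted, fast_sort, insertion_sort)
print(used, state["data"])
```

In this sketch the acceptance test rejects the unsorted result of the primary, so the alternate supplies the committed result. The alternate is a different design for the same function, not a copy of the primary.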
The alternates may perform the same function as the primary but use different algorithms.

During the execution of the primary or an alternate, the hardware automatically saves the "old" contents of modified global variables in a recovery cache [9]. If the primary or an alternate fails, the recovery mechanism restores those old contents from the recovery cache and returns the program to a recovery point, which is the state before entering the recovery block. Another alternate, if any, is then initiated.

Recovery blocks provide alternative algorithms to cope with general logical errors but cannot handle timing errors properly. Recovery blocks are insensitive to the passage of time and cannot be used to ensure that a response is generated by a specific deadline. (That is, it is impossible to recover from a missed deadline by resetting the system clock and executing an alternate.) We need another mechanism to handle the timing errors in real-time systems.

2.2 Deadline Mechanism

The deadline mechanism, although similar to the recovery block scheme, performs a different function. The deadline mechanism includes scheduling mechanisms to ensure timely responses. It associates two algorithms with each particular service component. The primary algorithm produces a better quality service than the alternate. The alternate is a simpler, deterministic algorithm which always produces an acceptable result in a known, fixed length of time. The deadline mechanism provides fault-tolerance in a real-time system by ensuring that either the primary or the alternate will complete before the deadline. A set of simulations demonstrating the feasibility of fault-tolerance in real-time systems is described in [8].

Reliable scheduling of primaries or alternates to meet real-time constraints requires the calculation of the execution time of each algorithm. Accurate determination of execution times is critical to system performance and reliability.
Since upper bounds on the execution times of the alternate algorithms are assumed, the deadline mechanism has been designed so that the scheduler reserves time for the execution of the alternate as requests arrive. Primaries are scheduled in any remaining time, which is called slack time.

3 Programming Tools for Real-Time Systems

Reliability can be improved by the development of high level programming languages to support real-time systems [10]. HAL/S [11] has been used to program the Space Shuttle control system. Modula [12], Concurrent Pascal [13], and Path Pascal [14, 15] are all efforts in this direction. The Ada [16] language may also be used for real-time programming but is not yet available. These languages all offer sequential features that support reliable coding, such as modularity, data abstraction, and control structures. Concurrent features (called tasks or processes) are also provided in these high level languages to facilitate systems programming. A process or task is an independent execution sequence. Processes or tasks can interact and are coordinated by performing operations on shared variables through high level synchronization mechanisms (e.g., Monitors [17] and Path Expressions [18]). These languages also permit I/O device programming using hardware dependent features. However, few include fault-tolerant mechanisms. Here, we briefly describe our efforts to incorporate the error recovery mechanisms mentioned in the previous section into the Path Pascal language.

3.1 Recovery Blocks in Path Pascal

The syntax of recovery blocks [6] as incorporated in Path Pascal is shown below:

    ensure "acceptance test"
    by "primary"
    else by "1st alternate"
    ...
    else by "nth alternate"
    else error;

The acceptance test is a Boolean expression which is evaluated after the execution of the primary. If the result is true, the statement following the recovery block is executed.
However, should the result be false, the state of the computation is restored to that at entry to the recovery block and the first alternate is tried, and so on. If all the alternates fail to produce acceptable results, then this is regarded as a failure of the entire recovery block. Since recovery blocks may be nested, recovery actions for a failed recovery block must be undertaken by an enclosing recovery block, if any.

The recovery cache mechanism has been implemented in the Path Pascal run-time system. A recovery cache is basically organized as a stack and is used to store the address and the old content of each global variable modified during the execution of a recovery block. Upon a failure of the acceptance test, the modified global variables are restored to the state prior to entering the recovery block. Each process has its own recovery cache. The storage for the recovery cache is allocated when a process is instantiated.

The domino effect, first described in [6], is an uncontrolled propagation of state restoration among interacting processes. A domino-free system can be achieved by an approach similar to the "programmer-transparent coordination of recovering parallel processes" presented in [19]. By introducing some additional recovery points into the process that receives messages from other processes, the domino effect can be eliminated.

3.2 Deadline Mechanism in Path Pascal

Path Pascal has been extended to include the fault-tolerant deadline mechanism. The extensions should not be considered as a language design, but rather as experimental features which have allowed the programming of example reliable real-time systems.

The language construct for the fault-tolerant deadline mechanism incorporated into Path Pascal is termed a deadline process. Each deadline process provides fault-tolerant deadline service.
The repeat statement in Path Pascal has been augmented with an optional phrase, every, to specify the request period (the time between successive requests) for periodic processes. A within statement must be included in a deadline process to specify the response period (the time within which the system must respond to a request). Within the deadline process one service statement appears, which specifies the primary and alternate algorithms that are to be used to satisfy the request. The syntax and semantics of the deadline mechanism in Path Pascal are best explained by a simple example:

    deadline process positionupdate;

      procedure compute;
      begin
        (* primary algorithm *)
      end;

      procedure approximate;
      begin
        (* alternative algorithm *)
      end;

    begin (* positionupdate *)
      repeat every twoseconds
        within onesecond do
          service compute else approximate;
      until navigation_terminated;
    end;

In the example above, compute is the primary and approximate the alternate for the position update. Deadline process "positionupdate" receives requests at a rate of one every two seconds. The deadline process must produce a response within one second of a request. The execution of the service statement sets a timer which is used to detect timing errors in the primary. The timer allows sufficient time for the processor to execute the alternate. The results from the primary are preferred to the alternate's if the primary completes successfully.

The deadline of a process is calculated from the response period when a service statement is executed. For periodic processes, requests are generated by the system clock according to their request periods. Non-periodic processes use the minimum interval between requests as the request period specification for all requests. For such processes the "every" phrase in the repeat statement is replaced by "at least". Synchronization primitives are used to generate a request for non-periodic processes.
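The behaviour of the service statement described above can be illustrated with a small Python model. It is a sketch under several assumptions and not the Path Pascal run-time: the function `serve` and its parameters are hypothetical, execution times are supplied as numbers rather than measured, and the timer is modelled as the response period minus the known upper bound of the alternate.

```python
def serve(response_period, alternate_bound, primary_time, primary_ok=True):
    """Decide which result a deadline process delivers for one request.

    response_period : time allowed between a request and its response
    alternate_bound : known upper bound on the alternate's execution time
    primary_time    : time the primary would need for this request
    primary_ok      : False models a logical error detected in the primary
    Returns (name of the algorithm whose result is used, elapsed time).
    """
    if alternate_bound > response_period:
        raise ValueError("alternate cannot meet the deadline")
    # The timer leaves enough time to run the alternate before the deadline.
    timer = response_period - alternate_bound
    if primary_ok and primary_time <= timer:
        return "primary", primary_time
    # Timing error or detected failure: the primary is abandoned no later
    # than the timer expiry, and the alternate is executed instead.
    elapsed = min(primary_time, timer) + alternate_bound
    return "alternate", elapsed


# One-second response period, as in the positionupdate example.
print(serve(1.0, 0.25, 0.5))
print(serve(1.0, 0.25, 0.9))
```

Under these assumptions the first request is answered by the primary, while the second primary overruns the timer and the alternate's result is delivered exactly at the deadline. In both cases a response is produced within the response period, which is the guarantee the deadline mechanism is meant to provide.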
If a request arrives too soon, it is delayed until the correct request time, calculated according to the request period.

Alternates are scheduled according to the rate-monotonic priority assignment [20], using the response period as a static priority, with small values having high priority. At any given instant the alternate with the highest priority (smallest response period) executes. This scheme requires preemption of alternates. Such a system is timely if processor utilization is less than ln 2 (approx. 0.69) [20]. The primaries are scheduled in earliest-deadline-first order. Within a request period, the alternate is executed first. This is called the first-chance scheduling algorithm. Another, optimal, scheduling algorithm, which maximizes the number of primaries executed for a set of periodic processes, is developed and reported in [21].

The recovery caches of the recovery block are also used for the deadline mechanism and can be shared. Recovery cache algorithms have been modified to hold results from the alternate until the primary either fails or completes successfully in the deadline mechanism.

A compiler option also provides an estimate of the execution time for compiled functions, procedures and processes. The compiler checks the execution time for the code included in the "within" statement and verifies that it is less than the corresponding response period. A global verification that all deadline processes meet their deadlines is also made by the compiler.

A satellite on-board computer system simulation using deadlines in Path Pascal has been reported in [22]. The results show the potential of the deadline mechanism to describe a time-critical real-time system.

4 Top-Down Modular Design

In this section, a top-down modular design of a real-time system is presented. A mathematical model for periodic real-time systems estimates the effectiveness of the approach.

4.1 Concurrent Processes
Real-time programs often monitor and control several external processes. This leads to the decomposition of a real-time system into a disjoint set of processes, which may be expressed as operating in parallel. The description of a real-time system as distinct concurrent processes may be simpler than as a single integrated sequential task. Concurrent features in Path Pascal support such decomposition. Furthermore, timing constraints are assumed to be imposed upon each process. A timing constraint specifies how often the process should run (request period) and how soon the process should respond (response period). The augmented options in the "repeat" statement and the "within" statement in Path Pascal can be used to achieve this goal.

For simplicity of analysis, it is assumed in this paper that a real-time software system can be expressed by a set of periodic processes and that each process's request period is a multiple of the next smallest request period. To ensure the timely responses of the whole real-time system, the deadline mechanism is applied to each periodic process. The "service" statement embedded in a "deadline process" in Path Pascal can be employed to perform this function.

Now suppose the underlying real-time software system consists of $M$ ($> 0$) periodic processes with request periods $t_1, t_2, \ldots, t_M$, respectively. Assume $t_1 \le t_2 \le \cdots \le t_M$ and that $t_{i+1}$ is a multiple of $t_i$ for $i = 1, 2, \ldots, M-1$. Let

\[ n_i = \frac{t_M}{t_i}, \qquad i = 1, 2, \ldots, M \]

and suppose $f_i$ is the failure probability of process $i$ during one request period $t_i$. Note that the failure here includes the failures caused by logical errors or timing errors. The failure probability of the total system during one request period $t_M$, denoted by $F$, can be approximated as follows.

During the period $t_M$, process $i$ will be executed $n_i$ times. The reliability of process $i$ during $t_M$ will be $(1 - f_i)^{n_i}$. For the system consisting of $M$ processes, the reliability during $t_M$ is then $\prod_{i=1}^{M} (1 - f_i)^{n_i}$. So the failure probability $F$ will be

\[ F = 1 - \prod_{i=1}^{M} (1 - f_i)^{n_i} \]
\[ \phantom{F} = 1 - (1 - n_1 f_1 + \cdots)(1 - n_2 f_2 + \cdots) \cdots (1 - n_M f_M + \cdots) \]
\[ \phantom{F} = n_1 f_1 + n_2 f_2 + \cdots + n_M f_M \pm \text{some higher order terms} \]

Since $f_i$ is very small (i.e., $0 < f_i \ll 1$), the higher order terms can be ignored. Then we have

\[ F \approx \sum_{i=1}^{M} n_i f_i \tag{1} \]

4.2 Stepwise Refinement

The code in a process can be considered as a sequential program. The stepwise refinement techniques can now be applied to the construction of a real-time process. Each process in the underlying real-time system can be further decomposed into a set of segments in a structured way. Three basic structures are used extensively for the refinement, i.e., sequence, if-then-else, and while loop. Fig. 1 shows their basic flow structures.

Fig. 1 Basic structures for refinement (sequence, if-then-else, while loop)

Note that either □ or ○ represents a segment. Assume process $i$ contains $N_i$ segments, each of which has failure probability $f_{ij}$, $j = 1, 2, \ldots, N_i$. Let $e_{ij}$ be the execution probability associated with segment $j$ in process $i$. ($e_{ij}$ represents the conditional probability that segment $j$ will be executed during the execution of process $i$.) The failure probability of process $i$, $f_i$, can then be approximated as:

\[ f_i \approx \sum_{j=1}^{N_i} e_{ij} f_{ij} \tag{2} \]

The formal proof can be found in [23], Appendix B, though the notation used in [23] is different. Here we briefly sketch the proof by illustrating a simple example of a real-time process, shown in Fig. 2.

Fig. 2 Real-time process i - a simple example

In the example above, there are 3 possible paths between the beginning and the end of the process (within one request period). Let $\pi_{mn}$ be the probability that segment $n$ will be executed given that segment $m$ has been executed. Each segment execution probability ($e_{ij}$) and path execution probability can then be expressed in terms of the $\pi$'s. In Fig. 2,

\[ e_{i1} = \pi_{81} = 1 \]
\[ e_{i2} = \pi_{12} e_{i1} = \pi_{12} \]
\[ \vdots \]
\[ e_{i8} = \pi_{12}\pi_{23}\pi_{35}\pi_{57}\pi_{78} + \pi_{12}\pi_{23}\pi_{35}\pi_{58} + \pi_{12}\pi_{24}\pi_{46}\pi_{68} = 1 \tag{3} \]

and

\[ \Pr(\text{path 1}) = \pi_{12}\pi_{23}\pi_{35}\pi_{57}\pi_{78} \]
\[ \Pr(\text{path 2}) = \pi_{12}\pi_{23}\pi_{35}\pi_{58} \]
\[ \Pr(\text{path 3}) = \pi_{12}\pi_{24}\pi_{46}\pi_{68} \tag{4} \]

Let $\Pr(\text{failure} \mid \text{path } j)$ be the failure probability during the execution of path $j$. For path 1 we have

\[ \Pr(\text{failure} \mid \text{path 1}) = f_{i1} + (1 - f_{i1}) f_{i2} + (1 - f_{i1})(1 - f_{i2}) f_{i3} + \cdots + (1 - f_{i1})(1 - f_{i2})(1 - f_{i3})(1 - f_{i5})(1 - f_{i7}) f_{i8} \]
\[ \phantom{\Pr(\text{failure} \mid \text{path 1})} = f_{i1} + f_{i2} + f_{i3} + f_{i5} + f_{i7} + f_{i8} \pm \text{some higher order terms} \]

A linear approximation results in

\[ \Pr(\text{failure} \mid \text{path 1}) \approx f_{i1} + f_{i2} + f_{i3} + f_{i5} + f_{i7} + f_{i8} \]

Similarly,

\[ \Pr(\text{failure} \mid \text{path 2}) \approx f_{i1} + f_{i2} + f_{i3} + f_{i5} + f_{i8} \]

and

\[ \Pr(\text{failure} \mid \text{path 3}) \approx f_{i1} + f_{i2} + f_{i4} + f_{i6} + f_{i8} \tag{5} \]

Since

\[ f_i = \sum_{j=1}^{3} \Pr(\text{failure} \mid \text{path } j) \cdot \Pr(\text{path } j) \tag{6} \]

we substitute (4) and (5) into (6) and group the terms for each $f_{ij}$. Comparing the coefficients of the $f_{ij}$'s with (3), we then have

\[ f_i \approx \sum_{j=1}^{N_i} e_{ij} f_{ij}, \qquad \text{where } N_i = 8. \]

Combining (1) and (2), in general, we have

\[ F \approx \sum_{i=1}^{M} n_i \sum_{j=1}^{N_i} e_{ij} f_{ij} \]

Let $g_{ij} = n_i e_{ij}$; we then obtain

\[ F \approx \sum_{i=1}^{M} \sum_{j=1}^{N_i} g_{ij} f_{ij} \tag{7} \]

Now consider the case in which the recovery block scheme is applied to segment $j$ in process $i$. Assume there are 2 alternates in the recovery block, and that the failure probability of the primary or alternates is denoted by $f_{ijk}$ ($k = 1, 2, 3$). Let $p_{ij}$ be the failure probability of the corresponding acceptance test (i.e., the probability that the acceptance test itself contains some logical errors). The flow of a recovery block suggests that either a failure of the acceptance test or the exhaustion of the alternates would cause a failure of the segment. So we have

\[ f_{ij} = p_{ij} + f_{ij1}(1 - p_{ij})\,p_{ij} + f_{ij1} f_{ij2} (1 - p_{ij})^2\,p_{ij} + f_{ij1} f_{ij2} f_{ij3} (1 - p_{ij})^3 \tag{8} \]

This is different from the "transition model for an application routine" developed in [24], where transitions of a recovery block during some arbitrary interval are considered.

Formula (8) implies that $f_{ij}$ is bounded below by the value of $p_{ij}$.
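The bound implied by formula (8) can be checked numerically. The Python sketch below evaluates (8) term by term; the function name `recovery_block_failure` and the sample probabilities are assumptions chosen for illustration, and the loop simply generalises the four terms of (8) to any number of routines.

```python
def recovery_block_failure(p, f):
    """Segment failure probability according to formula (8).

    p : failure probability of the acceptance test (p_ij)
    f : failure probabilities [f_ij1, f_ij2, ...] of the primary
        followed by the alternates
    """
    total = 0.0
    all_failed_so_far = 1.0            # product f_ij1 * ... * f_ijk
    for k, fk in enumerate(f):
        # The first k routines failed and were correctly rejected,
        # then the acceptance test itself fails.
        total += all_failed_so_far * (1 - p) ** k * p
        all_failed_so_far *= fk
    # Every routine failed and every rejection was correct:
    # the alternates are exhausted.
    total += all_failed_so_far * (1 - p) ** len(f)
    return total


# Illustrative values only: a primary and two alternates at 1e-3 each.
routines = [1e-3, 1e-3, 1e-3]
for p in (1e-4, 1e-6, 0.0):
    print(p, recovery_block_failure(p, routines))
```

With these sample values the result never falls below the acceptance test's own failure probability, and only when that probability is set to zero does the segment failure drop to the product of the three routine failure probabilities.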
If the acceptance test can be designed so that $p_{ij}$ is very close to zero, then the failure probability of the segment is approximated by the product of the failure probabilities of the primary and the alternates. Such an acceptance test can be obtained if the inverse of the segment function exists or the results from the segment can be clearly predicted. However, that kind of acceptance test is very hard to achieve in many cases. Another approach is therefore proposed as a replacement for the recovery block scheme.

Consider a two-out-of-three voting unit which has three alternates and one voting component. The alternates are very similar to those in recovery blocks except that they must perform exactly the same function (using different algorithms). The voting component can be much simpler than the acceptance test in recovery blocks. The logic of the voting component compares the results produced by the three alternates and takes a majority vote. The major difference between the recovery block scheme and the majority voting unit is that once the acceptance test is passed, no further alternate is tried in the former scheme, whereas all alternates must be executed in the latter. A possible syntax for the majority voting unit is as follows:

    majority vote from "1st alternate"
    and "2nd alternate"
    and "3rd alternate";

One possible implementation holds the results from each alternate in a recovery cache. Global variables are updated only if any two alternates produce agreeing results, which is determined by comparing recovery caches (paying careful attention to the comparison of floating point variables).

Suppose this mechanism is applied to segment $j$ in process $i$. Since the logic of the voting component is so simple, its failure probability is assumed to be zero.
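A two-out-of-three voting component of the kind described above might look like the following Python sketch. It is illustrative only: the function name `majority_vote` and the tolerance parameters are assumptions, scalar results stand in for the contents of the recovery caches, and `math.isclose` is used as one way to handle the comparison of floating point values.

```python
import math


def majority_vote(results, rel_tol=1e-9, abs_tol=0.0):
    """Return a result on which at least two of the three alternates agree.

    Floating point results are compared with a tolerance instead of `==`.
    Raises ValueError when no two results agree.
    """
    if len(results) != 3:
        raise ValueError("a two-out-of-three unit needs exactly three results")
    a, b, c = results
    for x, y in ((a, b), (a, c), (b, c)):
        if math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol):
            return x
    raise ValueError("no majority: all three results disagree")


# Two alternates agree up to rounding error; the third is faulty.
print(majority_vote([0.1 + 0.2, 0.3, 7.0]))
```

The comparison logic is a handful of lines and needs no knowledge of what the alternates compute, which is the reason the text treats the voting component as far simpler than an acceptance test.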
Using the same notation as for recovery blocks for the failure probabilities of the alternates, we then have

\[ f_{ij} = f_{ij1} f_{ij2} f_{ij3} + (1 - f_{ij1}) f_{ij2} f_{ij3} + (1 - f_{ij2}) f_{ij1} f_{ij3} + (1 - f_{ij3}) f_{ij1} f_{ij2} \tag{9} \]

By comparing (8) and (9), we can conclude that if the acceptance test of a recovery block cannot be designed with confidence, then a majority voting unit should be used, provided that the alternates perform the same function.

4.3 Analysis of the Model

The model shows where the critical paths are. The larger the value of $g_{ij}$ in (7), the more critical the software segment is. In other words, a segment with a high execution probability residing in a frequently executed process should include more fault-tolerance.

A simple analysis of the effectiveness of the approach can be made by exercising the model. The example real-time software system, which controls the attitude of a satellite, is taken from [22]. The request period ($t_i$), frequency count ($n_i$), number of segments ($N_i$), approximated segment execution probabilities ($e_{ij}$'s) and $g_{ij}$'s for each process are shown in Table 1. Each segment consists of no more than 10 Path Pascal statements.

    Process  Name                              t_i      n_i  N_i  e_ij        g_ij
    1        Read Gyro                         125 ms   8    1    1.0         8.0
    2        Inertial Wheel                    125 ms   8    1    1.0         8.0
    3        Gyro Process                      0.5 sec  2    21   6 of 0.4    0.8
                                                                  2 of 0.5    1.0
                                                                  13 of 1.0   2.0
    4        Read Sensors                      1 sec    1    1    1.0         1.0
    5        Attitude Determination & Control  1 sec    1    33   6 of 0.25   0.25
                                                                  13 of 0.5   0.5
                                                                  14 of 1.0   1.0

    Table 1.

Without loss of generality, the segments are assumed to be indexed in increasing order of the values of $e_{ij}$. Substituting the figures in Table 1 into (7) we have

\[ F \approx 8 f_{11} + 8 f_{21} + 0.8 \sum_{j=1}^{6} f_{3j} + \sum_{j=7}^{8} f_{3j} + 2 \sum_{j=9}^{21} f_{3j} + f_{41} + 0.25 \sum_{j=1}^{6} f_{5j} + 0.5 \sum_{j=7}^{19} f_{5j} + \sum_{j=20}^{33} f_{5j} \]

If the system failure probability is required to be $10^{-6}$ (which is approximately equivalent to MTBF ≈ 11 days), then each $f_{ij}$ needs to be in the order of $10^{-7}$ or $10^{-8}$.
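The requirement above can be reproduced from Table 1 with a short Python calculation. This is a sketch under a simplifying assumption that is not made in the text: every segment is given the same failure probability. The names `TABLE_1`, `total_weight`, `linear_failure` and `system_failure` are hypothetical, and `system_failure` uses the unlinearised product from the derivation of (1) so that the size of the ignored higher order terms can be seen.

```python
# (n_i, [(number of segments, e_ij), ...]) for processes 1..5, read from Table 1
TABLE_1 = [
    (8, [(1, 1.0)]),                           # Read Gyro
    (8, [(1, 1.0)]),                           # Inertial Wheel
    (2, [(6, 0.4), (2, 0.5), (13, 1.0)]),      # Gyro Process
    (1, [(1, 1.0)]),                           # Read Sensors
    (1, [(6, 0.25), (13, 0.5), (14, 1.0)]),    # Attitude Determination & Control
]


def total_weight(table):
    """Sum of g_ij = n_i * e_ij over all segments of all processes."""
    return sum(n * count * e for n, groups in table for count, e in groups)


def linear_failure(table, f):
    """Formula (7) when every segment has the same failure probability f."""
    return total_weight(table) * f


def system_failure(table, f):
    """Unlinearised form 1 - prod_i (1 - f_i)^n_i, with f_i taken from (2)."""
    reliability = 1.0
    for n, groups in table:
        f_i = sum(count * e * f for count, e in groups)
        reliability *= (1.0 - f_i) ** n
    return 1.0 - reliability


target = 1e-6                                  # required F per one-second period
f_needed = target / total_weight(TABLE_1)      # uniform segment failure probability
print(total_weight(TABLE_1), f_needed)
print(linear_failure(TABLE_1, f_needed), system_failure(TABLE_1, f_needed))
```

The weights in Table 1 sum to about 71.8, so under the uniform assumption each segment needs a failure probability slightly above $10^{-8}$, in line with the order of magnitude stated above. The linear and unlinearised values agree closely at this scale, which is what justifies dropping the higher order terms.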
By applying recovery blocks and two-out-of-three majority voting units properly to the segments, the primaries and alternates themselves need only have failure probabilities in the order of $10^{-3}$ or $10^{-4}$.

5 Conclusions

Fault-tolerance complements the fault-avoidance approach and further improves system reliability. In this paper, the construction of a fault-tolerant real-time software system by top-down modular design and stepwise refinement techniques is presented. Various programming tools for fault-tolerance in real-time systems are discussed. A simplified mathematical model to estimate the effectiveness of the approach is also described. Though accurate evaluation of the reliability of a software component is difficult, the model supports the application of modular design and stepwise refinement to fault-tolerance in the construction of reliable real-time software. The model suggests that large, highly reliable systems can be built from segments of software that are as reliable as current software engineering primitives (testing, verification, etc.) permit.

6 Acknowledgements

We would like to thank Professor G. Belford (University of Illinois) for her helpful suggestions during the preparation of this paper. The financial support from NASA (Grant NSG 1471) is also acknowledged.

References

[1] Parnas, D., "On the Criteria to be Used in Decomposing Systems into Modules," CACM, Vol. 15, No. 12, pp. 1053-1058, December, 1972.
[2] Dijkstra, E. W., "Notes on Structured Programming," EWD 249, Technical U. Eindhoven, The Netherlands, 1969.
[3] Goodenough, J. B. and S. L. Gerhart, "Toward a Theory of Test Data Selection," IEEE Transactions on Software Engineering, Vol. 1, No. 3, pp. 156-173, 1975.
[4] Hantler, S. L., and J. C. King, "An Introduction to Proving the Correctness of Programs," ACM Computing Surveys, Vol. 8, No. 4, pp. 391-407, Dec. 1976.
[5] Horning, J. J., H. C. Lauer, P. M. Melliar-Smith and B.
Randell, "A Program Structure for Error Detection and Recovery," in Proc. Conf. Operating Systems: Theoretical and Practical Aspects, IRIA, 1974, pp. 177-193. (Reprinted in Lecture Notes in Computer Science, Vol. 16, Springer-Verlag, New York.)
[6] Randell, B., "System Structure for Software Fault Tolerance," IEEE Trans. on Software Engineering, Vol. SE-1, No. 2, 1975, pp. 220-232.
[7] Wirth, N., "Program Development by Stepwise Refinement," CACM, Vol. 14, No. 4, pp. 221-227, April, 1971.
[8] Campbell, R. H., K. H. Horton and G. G. Belford, "Simulations of a Fault-Tolerant Deadline Mechanism," Proc. of the 9th International Conf. on Fault-Tolerant Computing, Madison, Wisconsin, June, 1979.
[9] Anderson, T., and R. Kerr, "Recovery Blocks in Action: A system supporting high reliability," 2nd International Conference on Software Engineering, pp. 447-457, October, 1976.
[10] Wirth, N., "Toward a Discipline of Real-Time Programming," CACM, Vol. 20, No. 8, pp. 577-583, August, 1977.
[11] Intermetrics Inc., HAL/S Manual, 1975.
[12] Wirth, N., "Modula: A Language for Modular Multiprogramming," Software-Practice and Experience, 7, pp. 3-84, 1977.
[13] Brinch Hansen, P., The Architecture of Concurrent Programs, Prentice-Hall, Englewood Cliffs, New Jersey, 1977.
[14] Campbell, R. H. and R. B. Kolstad, "Path Expressions in Pascal," Fourth International Conference on Software Engineering, Munich, Germany, Sept. 1979.
[15] Campbell, R. H. and R. B. Kolstad, "Practical Applications of Path Pascal to Systems Programming," ACM79, Detroit, 1979.
[16] Wegner, P., "Programming with ADA: An Introduction by Means of Graduated Examples," SIGPLAN NOTICES, Vol. 14, No. 12, pp. 2-46, Dec. 1979.
[17] Hoare, C. A. R., "Monitors: An Operating System Structuring Concept," CACM, Vol. 17, No. 10, pp. 549-557, October, 1974.
[18] Campbell, R. H., "Path Expressions: A Technique for Specifying Process Synchronization," Ph. D.
Thesis, The University of Newcastle Upon Tyne, August, 1976.
[19] Kim, K. H., "An Approach to Programmer-Transparent Coordination of Recovering Parallel Processes and Its Efficient Implementation Rules," Proc. of the 1978 International Conference on Parallel Processing, pp. 58-68, 1978.
[20] Liu, C. L. and J. W. Layland, "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment," JACM, Vol. 20, No. 1, 1973, pp. 46-61.
[21] Liestman, A. L. and R. H. Campbell, "A Fault-tolerant Scheduling Problem," Tech. Report UIUCDCS-R-80-1010, Dept. of Computer Science, University of Illinois, Feb. 1980.
[22] Wei, A. Y., K. Hiraishi, R. Cheng, R. H. Campbell, "Application of the Fault-Tolerant Deadline Mechanism to a Satellite On-Board Computer System," Proc. of the 10th International Conf. on Fault-Tolerant Computing, October, 1980.
[23] Gannon, T. F., and S. D. Shapiro, "An Optimal Approach to Fault Tolerant Software Systems Design," IEEE Trans. on Software Engineering, Vol. SE-4, No. 5, 1978, pp. 390-409.
[24] Hecht, H., "Fault-Tolerant Software for Real-Time Applications," Computing Surveys, Vol. 8, No. 4, 1976, pp. 391-407.

BIBLIOGRAPHIC DATA SHEET

Report No.: UIUCDCS-R-80-1042
Title and Subtitle: CONSTRUCTION OF A FAULT-TOLERANT REAL-TIME SOFTWARE SYSTEM
Report Date: December 1980
Author(s): Anthony Y. Wei and Roy H. Campbell
Performing Organization Rept. No.: R-80-1042
Performing Organization Name and Address: Department of Computer Science, University of Illinois, Urbana, IL 61801
Contract/Grant No.: NASA NSG 1471
Sponsoring Organization Name and Address: NASA Langley Research Center, Hampton, VA 23665
Type of Report & Period Covered: technical
Key Words and Document Analysis, Descriptors: fault-tolerance, recovery block, deadline mechanism, majority vote, stepwise refinement, Path Pascal
Security Class (This Report): UNCLASSIFIED
Security Class (This Page): UNCLASSIFIED
No. of Pages: 24