 
 UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 
 
 URBANA, ILLINOIS 61801 
 
 CAC Document Number 181 
 CCTC-WAD Document Number 6501 
 
 Research in 
 
 Network Data Management and 
 
 Resource Sharing 
 
 The Effect of Backup Strategy on 
 Data Base Availability 
 
 February 1, 1976
 
Digitized by the Internet Archive 
 
 in 2012 with funding from 
 
 University of Illinois Urbana-Champaign 
 
 http://archive.org/details/researchinnetwor181belf 
 
CAC Document Number 181 
 CCTC-WAD Document Number 6501 
 
 Research in 
 Network Data Management and 
 Resource Sharing 
 
 The Effect of Backup Strategy on 
 Data Base Availability 
 
 by 
 
 Geneva G. Belford 
 Paul M. Schwartz 
 Suzanne Sluizer 
 
 Prepared for the 
 
 Command and Control Technical Center 
 
 WWMCCS ADP Directorate 
 
 of the 
 
 Defense Communication Agency 
 
 Washington, D.C. 
 
 under contract 
 DCA100-75-C-0021 
 
 Center for Advanced Computation 
 University of Illinois at Urbana-Champaign 
 Urbana, Illinois 61801 
 
 February 1, 1976 
 
 Approved for release 
 
 Peter A. Alsberg, Principal Investigator
 
TABLE OF CONTENTS

 Executive Summary
     The problem
     The model
     Conclusions

 Introduction
     File allocation
     Network reliability modeling
     Modeling computer system reliability
     Modeling backup and recovery strategies
     The present work

 The Model
     Overview
     Parameters
     Single-copy availability
     Discussion of the parameter k
     Availabilities for two backup strategies

 Experiments and Discussion
     Remote journaling
     Frequently updated remote journal
     Running spares
     Effect of varying Y

 Conclusions

 References

 Appendix 1: Extension of Chu's Formula

 Appendix 2: Stochastic Considerations
     Basic assumptions
     Probability that backup fails before master is ready for use
     Probability that backup site fails before its copy is ready

 Appendix 3: Time to Process the Audit Trail
     Elementary analysis
     Analysis using queueing theory

 Appendix 4: Sensitivity to Parameter Values
 
Executive Summary 
 
 The problem. The availability of a data base may be simply
 defined as the fraction of time that the data is available to users.
 Many things can cause a data base to become unavailable in a network
 setting. If the data base is stored at the same location as the user,
 the system through which the data must be accessed may fail, or the device 
 on which the data base is resident may crash. If the data base is located 
 at a remote site on the network, the remote site or system may fail, the 
 network may partition so that the remote site cannot be reached, or some 
 local failure may make the network inaccessible to the user. 
 
 In most of these cases, availability can be considerably improved 
 if a backup copy of the data base exists. If copies of the data base 
 exist at two sites in the network, the danger of losing access because 
 of network partitioning or site failure is reduced. Furthermore, if a 
 local device holding all or part of the data base crashes, data may be 
 destroyed. It is likely to be much faster (as well as more reliable) to 
 ready a locally archived backup copy of the data for usage than to try to 
 recover the lost or degraded data from audit trails, etc. 
 
 How much the existence of a backup copy improves availability 
 depends on a number of factors. For example: 
 
 1) How available is the backup copy? (Is it stored on disk for
 immediate access? If it is stored on tapes, a sizeable delay
 may be incurred while the tapes are located, mounted, and loaded onto a
 rapid-access device.)
 
 2) How up-to-date is the backup copy? (Are all updates 
 applied to the backup copy as rapidly as possible? Is there a long 
 backlog of updates that must be processed before the data base is really 
 ready for use?) 
 
3) How often is the site (or device) holding the data base 
 likely to fail? (If failures are infrequent, the backup copy may provide 
 little improvement in availability.) 
 
 Even small improvements in availability can, of course, be 
 important. Availability can be over 0.99 and still be disastrously low 
 if, say, the data is unavailable for one 24-hour period during a year 
 and that period happens to be during a crisis. It is important, therefore, 
 to understand thoroughly how availability is affected by the factors 
 discussed in the preceding paragraph, and hence by the strategy used for 
 backing up a data base. 
 
 The model. We have developed simple algebraic formulas for
 availability as a function of the factors listed above. Additional 
 parameters are incorporated to model the delay incurred in initiating 
 the process of readying the backup copy, the rate at which updates are 
 generated, and the rate at which updates are processed. We have assumed 
 the existence of a single backup copy, and have studied the improvement 
 in availability that the existence of a backup provides over single-copy 
 availability. The formulas are kept simple by using average values for 
 parameters that are actually random variables. For example, we use the 
 "mean time between failures" in the availability formula, while system 
 failure is actually a random process. In appendix 2, we look into the 
 validity of this simplification and conclude that its effect on computed
 availabilities is, in most realistic situations, to make them appear 
 only slightly larger than they actually would be. 
 
 Conclusions. One main conclusion from studying the model is
 that a backup copy can improve the availability of a data base by as
 much as 5 to 10 percent. To put this result into more concrete terms,
 suppose that a single copy is likely to be down for two hours per day
 (availability = .917). A 5 percent improvement would produce an
 availability of .963, or a reduction of probable down time to about 54
 minutes per day.
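
 The arithmetic in this example can be checked with a few lines of
 Python. This is only an illustrative sketch; the function name is ours
 and does not come from the model itself.

```python
def daily_downtime_minutes(availability: float) -> float:
    """Expected minutes per day during which the data base is unavailable."""
    return (1.0 - availability) * 24 * 60


# A single copy that is down two hours per day is up 22 hours out of 24.
single_copy = 22 / 24

# A 5 percent (relative) improvement in availability.
improved = single_copy * 1.05

print(f"single copy: A = {single_copy:.4f}, "
      f"down {daily_downtime_minutes(single_copy):.0f} min/day")
print(f"with backup: A = {improved:.4f}, "
      f"down {daily_downtime_minutes(improved):.0f} min/day")  # roughly 54 minutes
```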
 
 A second important conclusion is that if the backup copy is 
 readily accessible and kept reasonably up to date, the availability is 
 very close to 1. On the other hand, if the backup copy is stored on 
 tape, so that it is relatively out of date and locating it is a time- 
 consuming process, availability may be little better than was provided 
 by a single copy. (This is because one can probably repair the original 
 system about as rapidly as one can ready the backup.) Indeed, a backup 
 of this sort tends to be mainly useful for recovery from some accident 
 which destroys data in the original data base. 
 
 In this study, we have necessarily restricted ourselves to 
 trying to answer a few specific questions and to computing availabil- 
 ities for only a limited number, or range, of parameter values. However, 
 the formulas we have developed - and, even more, the simple, straight- 
 forward approach which yielded those formulas - have applicability in a 
 wide variety of settings. The most important aspect of this work is 
 not the particular numbers or formulas obtained but the tools developed 
 for studying availability in general. With little additional effort, 
 these tools can be used to provide answers to other questions regarding 
 the effect of backup strategy on availability. 
 
Introduction 
 
 We here use the term availability to mean the fraction of time
 that a data base is available to respond to user requests or queries. 
 In any setting, and particularly in a network, availability is a function 
 of the reliability (or availability) of many components - host computers, 
 network communications lines, etc. - as well as of strategies for backup 
 and recovery. In this section we first discuss some of the past modeling 
 research that has yielded results relevant to database availability, and 
 then introduce the line of work which we have pursued. 
 
 File allocation. One of the factors to be taken into account
 in distributing copies of a file to various network sites is the number 
 of copies needed for an acceptable degree of availability. Chu [1973] 
 takes account of this factor in the following way. First, he defines 
 the availability of a piece of equipment (e.g., communication line or
 computer) as

 $$\text{Availability} = \frac{F}{F + X},$$
 
 where F is the mean time between failures and X is the mean time to 
 repair. Then, assuming 
 
 1) all computers in the network have identical availability A, 
 
 2) all communication channels have identical availability c, and 
 
 3) the network is completely connected; 
 
 Chu obtains the following formula for the availability of the jth file:

 $$A\left(1 - (1 - Ac)^{r_j}\right),$$

 where $r_j$ is the number of copies of the jth file in the network. Once A
 and c are known, it is a simple matter to choose $r_j$ so as to bring the
 availability of a remote copy up to a satisfactory level. Overall
 availability, however, is bounded by the factor A, the availability of the
 requesting computer, which is apparently assumed not to possess a copy
 of the file.
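
 As a rough illustration of how the number of copies might be chosen from
 Chu's formula, consider the sketch below. The function names, the search
 limit, and the sample values A = 0.95 and c = 0.99 are our own
 assumptions, chosen only to show the bound imposed by the factor A.

```python
from typing import Optional


def chu_file_availability(A: float, c: float, r: int) -> float:
    """Chu's formula A * (1 - (1 - A*c)**r) for a file with r copies.

    A is the (common) computer availability and c the (common)
    communication channel availability.
    """
    return A * (1.0 - (1.0 - A * c) ** r)


def copies_needed(A: float, c: float, target: float,
                  max_copies: int = 50) -> Optional[int]:
    """Smallest r meeting the target, or None if no r up to max_copies does.

    max_copies is an arbitrary search limit (an assumption of this sketch).
    Because the availability never exceeds A, targets above A are unreachable.
    """
    for r in range(1, max_copies + 1):
        if chu_file_availability(A, c, r) >= target:
            return r
    return None


if __name__ == "__main__":
    for target in (0.89, 0.94, 0.96):
        print(target, copies_needed(0.95, 0.99, target))
```

 With these sample values, a target above A = 0.95 cannot be met by any
 number of copies, which is the bound noted above.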
 
 Although Chu's model, with its assumption of complete homo- 
 geneity of network components, may seem oversimplified, an analogous 
 analysis can be readily carried out in the heterogeneous case to yield 
 only slightly more complex expressions. (See appendix 1.) Notice, 
 however, that this model presents another problem. It implicitly assumes 
 that the files are static, or are simultaneously kept up to date by some 
 trouble-free process. In fact, the development of algorithms to keep 
 segments of a data base identical (or nearly so) is a topic of current 
 research. (See the chapter on Automated Backup in CAC Doc. No. 162, 
 JTSA Doc. No. 5509.) 
 
 Network reliability modeling. Another simplification in Chu's
 model is the assumption that a direct communication line connects every 
 pair of sites. This assumption allows Chu to use a single parameter to 
 describe availability of a link from one site to another. In a general 
 network, this availability will depend in a complex way upon network 
 topology. Several alternate paths may exist between two given sites. 
 Each of these paths may involve more than one "hop" and so more than one 
 piece of subnet hardware. Indeed, in the ARPA network it has been found 
 that the failure rate for IMP's is about the same as that for communica- 
 tion channels, and that IMP failures therefore have the more drastic 
 effect on communications reliability [Frank, Kahn, and Kleinrock, 1972]. 
 Graph theoretical techniques for computing availability from component
 reliabilities are, however, well known. The paper by Frank et al.
 contains a brief review of these techniques. No great difficulty is
 envisioned in applying them to any given network (such as the WIN) to obtain
 host availabilities. These may then be used in the formula given in
 appendix 1 to obtain rough estimates of file (or data base) availability.
 
 Modeling computer system reliability. Another parameter in
 Chu's model that requires more detailed analysis for complete understanding 
 is computer availability. One source of information on computer avail- 
 ability is direct system measurement. On a lower level, however, 
 failures can be modeled to yield, in addition to overall figures on 
 expected system reliability, useful insights into repair and backup 
 strategies. 
 
 Borgerson and Freitas [1975] recently published a fairly 
 detailed stochastic model for computer system failure. Their model is 
 based on four distinct causes of crashes and their interrelationships. 
 Their ultimate result is a formula giving the probability density for 
 the event that the system crashes due to a failure. For our availability 
 analysis, however, there seems to be little need to include this level 
 of detail; we are simply concerned with failure rate - a measurable 
 quantity. 
 
 Modeling backup and recovery strategies. The discussion above
 has been limited to availability questions involving network and site 
 reliabilities. On a lower level, the data base itself may "crash" or 
 may acquire errors. It is important that strategies for returning a 
 data base to its correct state be devised and studied. 
 
 A recent paper [Chandy et al., 1975] provides models for
 rollback and recovery strategies. These strategies run as follows. At
 certain points in time (checkpoints), a copy of the data is made and
 stored. A listing of subsequent data updates (i.e., an audit trail) is
 then kept. When the master data base fails, it may then be recovered by
 beginning with the old copy from the checkpoint and using the audit
 trail to bring it up to date. Chandy et al. use queueing theory to
 model the processing of the audit trail. From the expected time to
 complete this process, they can compute the total recovery time. The
 length of the audit trail, and hence the time to recover, is a function
 of the time interval between checkpoints. Optimization of availability
 with respect to intercheckpoint time can then be carried out. Models of
 some complexity are developed which take into consideration the
 possibility of errors during recovery and the possibility of a transaction
 arrival rate which varies in a cyclic manner (as opposed to being
 constant). The results appear to be very useful for developing insights
 into recovery strategies, particularly for single-site systems. In a
 network environment, however, it may be reasonable to assume that the
 backup copy is stored remotely. In this case it does not make sense to
 assume that the data is always restored from the backup, because of the
 long time required to transfer a data base through the network. The
 strategy then is to transfer the queries to the available copy.
 
 The present work. In this note we attempt to quantify the
 improvement in data base availability which can be achieved by storing a 
 backup copy at one (or more) remote sites in a network and transferring 
 usage to the backup when the master fails. We also discuss the practi- 
 cality of certain alternative management strategies. 
 
 To simplify the analysis, we will not consider various possible 
 causes of data base failure, but will assume that the data is available 
 when the host computer is running and is available (if remote) by way of 
 the network. We will therefore not be considering a detailed analysis of 
 the type of Borgerson and Freitas, nor will we be concerned with network 
 reliability modeling. Host failures are so much more common than
 communications link failures that the latter can be neglected in our simple model.
 
Furthermore, we will not take into account scheduled down 
 time of the host computer, on the assumption that if down time is scheduled, 
 transfer to a backup copy is automatic and immediate, and leads to no loss 
 in availability. The very existence of a backup copy at an alternate 
 network site will of course improve availability considerably over the 
 case where only one site has a copy. Indeed, Chu's model (or a simple 
 modification of it) can be used to determine the improvement in availability 
 due to multiple copies when all copies are equally usable. Since some 
 readers may find this question of interest, we have included a discussion 
 of it in appendix 1. 
 
The Model 
 
 Overview. The process we are modeling may be described as
 follows. Several sites in a network possess copies of a data base. One
 of these copies is designated as the master copy. The others are
 referred to as spares or backups. All queries for the data base are sent
 to the master site (i.e., the site holding the master copy). Updates are
 applied to the master copy as soon as possible after they are generated, 
 so that the master copy is kept up to date. Two basic strategies for 
 updating the spares are encompassed by our model: 
 
 1) Running spares. Spares are updated almost as rapidly as
 the master.

 2) Remote journaling. Up-to-date copies of the data base are
 periodically sent to the backup sites for storage. In between
 this periodic journaling, updates are logged in an
 audit trail for application to a spare if and when one is
 needed.
 
 Occasionally the master copy becomes unavailable. We assume 
 this is caused by a failure of the host possessing the master copy and 
 not by, say, communication line failure. When the master site fails, 
 some sort of communication among sites takes place to determine which of 
 the spares should take over the responsibility of being the new master. 
 The length of the time interval from when the old master fails to when the new 
 master is decided upon is assumed to be a fixed constant. 
 
 Once a new master site has been selected, the spare copy at 
 that site must be readied to receive queries. This process of getting the 
 backup ready may involve time-consuming operations such as loading the data 
 from tape and processing the audit trail of updates which have not yet 
 
been applied to the backup copy. How close to "ready" the backups should 
 be kept is another strategy question which may be studied by our model. 
 
 As soon as the old master fails, the process of repairing it 
 begins. After the host has been repaired, the data base itself must be 
 readied. A backlog of updates has been accumulating while the master 
 was being repaired, and these updates must be applied to the data. 
 Thus, after a certain time lapse, the old master (i.e., the primary master) 
 is again ready to receive queries. The question is, should we immediately 
 reinstate the old master, or should we continue to send queries to the 
 new master until it fails? With our model we can study the impact on 
 availability of how we answer this question. There may, however, be 
 other issues involved. For example, most of the queries to the data base 
 may originate at the primary master site. In this case there are cost 
 and/or response advantages to be gained by transferring usage back to the 
 primary site as soon as possible. 
 
 For simplicity, we have described the process we are modeling in 
 fairly specific terms. It should be noted, however, that little change is 
 needed to model other, similar processes. For example, the backup and 
 master copies may be located at the same site. And the failures we are 
 concerned with may be the crashing (with accompanying data destruction) 
 of the device holding the master copy. In this case there are no network 
 messages to transfer usage to a remote site, nor need we worry about 
 repairing the host. But the need to get the backup ready by loading the 
 copy and then bringing it up to date remains the same. Only trivial 
 changes in the availability formulas we have derived will allow us to 
 study this sort of closely related process. 
 
 
 Parameters. The parameters in our model are as follows:

 F = mean time between computer failures, assumed to be the same
     for all host computers.
 X = expected time to repair computer.
 L = expected time to load the data base copy at the remote site.
 Y = time that the audit trail of updates has been growing (i.e.,
     time since the copy was correct).
 k = the ratio of update arrival rate to update processing rate.*
 D = time delay between when the master fails and when the remote
     site determines this fact and starts to get its copy ready
     for use.

 Single-copy availability. First, consider the case where there
 is a single copy of the data base. The availability of this copy is then

 $$A_0 = \frac{F}{F + X + kX}.$$

 This is the usual formula for availability (mean time between failures
 divided by mean time between failures plus mean time to recover). The
 mean time to recover includes repair time X plus the time kX to process
 the updates accumulated while repairs were made. (This formula for
 recovery time is that used by Chandy et al. [1975].) There is a
 question as to whether the term kX should be included here, since the
 site is technically "up" after time X. But in a network setting, it
 does seem appropriate to assume that updates initiated at remote sites
 are being logged somewhere, so that there does exist an update list to
 be processed. In addition, we are interested primarily in comparing $A_0$
 with availabilities computed for multi-copy strategies, where the copies
 are assumed to be up to date.

 * The parameter k is referred to in the literature as a "compression
 factor" [Chandy et al., 1975]. This is not to be confused with the
 usual data compression factor which indicates by how much data is
 compressed for storage or transfer.
 
 Discussion of the parameter k. The rationale for using the
 formula kY for the time to process an audit trail that has been
 accumulating for a time period of length Y is as follows. Suppose u is the
 rate of arrival of updates, and b is the rate of processing them. Then
 during time Y a total of uY updates have accumulated and it takes time
 uY/b = kY to process these. (We have defined k = u/b.) However, the
 system cannot really be said to be caught up after this much time,
 since more updates were accumulating while the backlog was being processed.
 Let us define T as the catch-up time, or time for the system to catch
 up after a backlog of updates has accumulated. The determination of an
 appropriate expression for T turns out to be a nontrivial problem.
 This problem is examined in detail in appendix 3. We find there that
 for a reasonable range of values of k, 2kY may be a more appropriate
 expression for T than is kY.
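
 To see why kY underestimates the catch-up time, one may treat updates as
 a continuous flow arriving at rate u and being processed at rate b. This
 deterministic "fluid" view is our own simplifying assumption and is not
 the queueing analysis of appendix 3, so it should not be expected to
 reproduce the 2kY estimate quoted above. Under it, the backlog kY takes
 time kY to clear, the updates arriving meanwhile take a further k(kY),
 and so on, giving the geometric series kY(1 + k + k^2 + ...) = kY/(1 - k).

```python
def catch_up_time_fluid(u: float, b: float, Y: float) -> float:
    """Catch-up time under a deterministic fluid approximation.

    u: update arrival rate, b: update processing rate (requires u < b),
    Y: time over which the backlog accumulated.
    """
    if u >= b:
        raise ValueError("the backlog never clears unless u < b")
    k = u / b
    return k * Y / (1.0 - k)


def catch_up_time_series(u: float, b: float, Y: float, terms: int = 200) -> float:
    """Same quantity as a partial sum of the series kY + k^2 Y + k^3 Y + ..."""
    k = u / b
    return sum(Y * k ** n for n in range(1, terms + 1))


if __name__ == "__main__":
    # Hypothetical rates: one update arrives for every ten that can be processed.
    print(catch_up_time_fluid(1.0, 10.0, 5.0))   # slightly more than kY = 0.5
    print(catch_up_time_series(1.0, 10.0, 5.0))
```

 For small k the correction factor 1/(1 - k) is close to 1, so in this
 approximation kY is nearly right when updates are sparse.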
 
 In the remainder of this note, however, we will consider k as 
 an effective proportionality constant, defined by the assumption that kY 
 is the time to catch up after updates have been accumulating for time Y. 
 The reader should keep in mind that then k is not equal to u/b but is 
 somewhat larger, perhaps by as much as a factor of 2 or more. It is, 
 of course, possible for a site to obtain an effective k by measurement. 
 T can be measured as the length of time between the time when processing
 of the update backlog begins and when the update queue is first noted to
 be empty. An average over several observations of T/Y should yield an
 acceptable value for the effective k.
 
 
 Availabilities for two backup strategies. We shall consider
 two strategies for transferring usage back and forth between master 
 copy and backup copy. Strategy 1 runs as follows. After the master copy 
 is determined to have failed, the remote copy is then brought up (after 
 a time lapse of D + L + kY) and usage is transferred to it. Meanwhile 
 the old master is being repaired. Queries and updates are sent to the 
 new master, however, until it fails, at which time the process repeats: 
 another "new" master is identified and activated. (This may or may not be 
 the "old" master.) Since the remote site may have been up for some time 
 since its last failure, one might think that, after the new master site is 
 identified, time until failure is only F/2. This is only true, however, 
 if the time between failures is always precisely F. If, as we are assuming, 
 failures form a Poisson process (i.e., occur randomly) it may be shown 
 that the expected time until failure is not F/2 but F. (See, for example, 
 [Kleinrock, 1975, pp. 169-174].) This result, known in renewal theory 
 as the "paradox of residual life", may be explained intuitively as occurring 
 because the old master has a higher probability of failing during a rel- 
 atively long inter-failure period at the new site. 
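
 The residual-life result can be illustrated numerically. The simulation
 below is a sketch of our own (function name, seed, horizon, and sample
 counts are arbitrary choices): it inspects a failure history at random
 instants and averages the time remaining until the next failure, first
 for exponentially distributed inter-failure times (a Poisson process)
 and then for perfectly regular failures spaced exactly F apart.

```python
import bisect
import random


def mean_time_to_next_failure(draw_interval, horizon, samples, rng):
    """Average time from a random inspection instant to the next failure.

    draw_interval: callable returning the next inter-failure time.
    horizon: inspection instants are drawn uniformly from [0, horizon].
    """
    failure_times = []
    t = 0.0
    while t <= horizon:            # the last failure lies beyond the horizon
        t += draw_interval()
        failure_times.append(t)

    total = 0.0
    for _ in range(samples):
        s = rng.uniform(0.0, horizon)
        next_failure = failure_times[bisect.bisect_right(failure_times, s)]
        total += next_failure - s
    return total / samples


F = 10.0                           # mean time between failures (hours)
rng = random.Random(1976)          # fixed seed so the run is repeatable

poisson = mean_time_to_next_failure(
    lambda: rng.expovariate(1.0 / F), 1e6, 20000, rng)
regular = mean_time_to_next_failure(lambda: F, 1e6, 20000, rng)

print(f"Poisson failures: mean time to next failure = {poisson:.2f} (close to F)")
print(f"Regular failures: mean time to next failure = {regular:.2f} (close to F/2)")
```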
 
 Strategy 1 is diagrammed in figure 1. Looking at the diagram
 and ignoring the initial time period, one can see that the fraction of
 time some copy of the data base is available is

 $$A_1 = \frac{F - L - kY}{F + D}.$$

 The quantity $A_1$ is then the data base availability under strategy 1.
 
 Notice also that an obvious built-in assumption can be read from
 the figure:

 $$D + L + kY < X + kX. \tag{1}$$

 If this inequality is not satisfied, it theoretically does not pay to
 store a remote copy, since the master is expected to be repaired and
 updated before the remote copy can be activated.

 [Figure 1. Diagram of strategy 1. Time line showing MASTER UP, the repair and update interval X + kX, the activation interval D + L + kY, and COPY (NEW MASTER) UP.]

 [Figure 2. Diagram of strategy 2. Time line showing MASTER UP, the repair and update interval X + kX, the activation interval D + L + kY, COPY UP, and MASTER UP again.]

 
 Strategy 2 is to immediately replace the copy by the old
 master as soon as the latter has been brought back up. This scheme is
 diagrammed in figure 2. Again, inequality (1) must hold in order for
 the diagram to be meaningful, and the availability formula can be read
 from the diagram:

 $$A_2 = 1 - \frac{D + L + kY}{F + X + kX}.$$

 By looking at the ratio $A_2/A_1$, one can easily show that as long as D is
 small compared to the other parameters (a realistic assumption) $A_2$ is
 always greater than $A_1$. That is, strategy 2 is the better strategy, as
 one might intuitively infer from comparison of figures 1 and 2. In the
 following sections we will therefore restrict consideration to strategy 2.
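
 The three availability formulas are simple enough to evaluate directly.
 The sketch below is illustrative only: the function names are ours, the
 strategy 2 formula is taken in the form 1 - (D + L + kY)/(F + X + kX),
 and the sample parameter values are chosen merely to show the ordering
 of the three availabilities when inequality (1) holds.

```python
def single_copy_availability(F, X, k):
    """A_0 = F / (F + X + kX): one copy, recovered by repair plus update replay."""
    return F / (F + X + k * X)


def strategy1_availability(F, L, Y, k, D):
    """A_1 = (F - L - kY) / (F + D): the backup stays master until it fails."""
    return (F - L - k * Y) / (F + D)


def strategy2_availability(F, X, L, Y, k, D):
    """A_2 = 1 - (D + L + kY) / (F + X + kX): old master reinstated once ready."""
    return 1.0 - (D + L + k * Y) / (F + X + k * X)


def backup_pays_off(X, L, Y, k, D):
    """Inequality (1): the backup can be readied before the master recovers."""
    return D + L + k * Y < X + k * X


if __name__ == "__main__":
    # Sample values in hours (assumed for illustration).
    F, X, L, Y, k, D = 20.0, 1.0, 0.5, 1.0, 0.01, 0.01
    print("inequality (1) holds:", backup_pays_off(X, L, Y, k, D))
    print("A0 =", round(single_copy_availability(F, X, k), 4))
    print("A1 =", round(strategy1_availability(F, L, Y, k, D), 4))
    print("A2 =", round(strategy2_availability(F, X, L, Y, k, D), 4))
```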
 
 There are two additional assumptions which must be made in 
 order for our model of either strategy to be valid. One assumption is 
 that D + L + kY is sufficiently small compared to F that there is 
 little likelihood of a failure of the remote host during the recovery pro- 
 cess. In addition, we assume that there is a negligible probability that 
 the copy may fail before the master is again ready. If either of these 
 assumptions is false, availability will generally be less than what 
 we compute from our model. A probabilistic analysis of these assumptions 
 is contained in appendix 2. Notice that strategy 2 is a two-copy 
 strategy. Transfer of usage back and forth between the primary site and a 
 single backup is specifically modeled. In strategy 1, however, after 
 the backup fails, usage may be transferred to a third copy instead of to the 
 copy at the primary site. Thus in this strategy, even if the new master
 is likely to fail before the old one is again ready, the model is not
 invalidated as long as a second backup is available.
 
 In the experiments to be discussed in the next section, we 
 have ignored probabilistic considerations. The reader should simply 
 keep in mind that availabilities are always slightly less than we compute 
 there. The quantities that we investigate are: 
 
 1) $A_2$, the availability under strategy 2, and

 2) I, the improvement in availability due to the existence of a
 backup copy. That is,

 $$I = \frac{A_2 - A_0}{A_0} = \frac{X - D - L + k(X - Y)}{F}.$$

 
Experiments and Discussion 
 
 Remote journaling. In order to model a remote journaling
 process, we assume that the parameter Y is large; for simplicity we 
 assume that it is equal to F. Thus we are essentially assuming that, 
 whenever the master comes up after a failure, a copy of the up-to-date 
 data base is shipped off to any remote site which contains a copy of the 
 data base. (Or that the remote data base, having been used as a master 
 copy while the master was down, already possesses an up-to-date copy at 
 this time.) 
 
 It is interesting to note that journaling remotely by shipping 
 the data base over the network is not feasible on a regular basis. For 
 example, consider a data base of $4 \times 10^7$ bytes (roughly FORSTAT size).
 At a network throughput of 15 kilobits per second (faster than normal 
 for the ARPANET) , it would take approximately 6 hours to ship a data 
 base of this size. Daily backup by, say, sending tapes by courier 
 would, however, be feasible in many situations. 
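
 The shipping-time estimate is easy to reproduce. In the sketch below
 the data base size of 4 x 10^7 bytes is an assumption on our part (it is
 the size consistent with the quoted figure of roughly 6 hours at 15
 kilobits per second), and the function name is illustrative.

```python
def transfer_hours(size_bytes: float, throughput_bits_per_second: float) -> float:
    """Hours needed to ship a data base at a sustained network throughput."""
    seconds = size_bytes * 8 / throughput_bits_per_second
    return seconds / 3600


# Assumed size: 4 x 10^7 bytes; throughput: 15 kilobits per second.
hours = transfer_hours(4e7, 15_000)
print(f"{hours:.1f} hours")   # on the order of 6 hours
```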
 
 The data copy at the remote site will be generally assumed to 
 be on tape. The value L = 0.5 hr. has been assumed in the computations 
 since it is approximately the time to read two to three tapes. The 
 parameter D is probably on the order of one or two seconds, but we have 
 taken it to be .01 hr. as an absolute upper bound. X = 1 hr. seems to 
 be a reasonable value for repair time. With these parameters, we get 
 the following formula for improvement I in availability as a function of
 F and k:

 $$I = \frac{A_2 - A_0}{A_0} = \frac{0.49 + k(1 - F)}{F}.$$

 It is difficult to estimate what a reasonable value of k should be. In
 a similar analysis, Chandy et al. [1975] suggest that k should be 0.1 or
 less. Clearly the value will depend on the usage pattern for the data
 base; we have already discussed how it may be measured for a real system. 
 However, notice that, with k = 0.1, inequality (1) states that 
 
 .51 + 0.1F < 1.1. 
 Hence for this large a k the time to process the audit trail is so long 
 that, without taking into account stochastic considerations, the master 
 is able to get ready before the backup copy whenever F > 5.9 hrs. This 
 is an unreasonably low value. Furthermore, we show in appendix 2 that 
 for these values of D, L and k, and for all values of F (with Y=F) , there 
 is a better than 10 percent chance that the backup site fails before its 
 copy can be gotten ready. In short, we are unlikely to adopt a remote 
 journaling strategy in these circumstances. 
 
 To get a feel for the value of remote journaling in a case when
 it may be practical, we therefore take k = .01; i.e., we assume that there
 are few updates. Inequality (1) then restricts the model to F < 50. A
 graph of I vs. F in this case may be seen in figure 3. Values of $A_0$
 have also been plotted in the figure for reference. Notice that for
 reasonable values of F the improvement in availability is less than 5
 percent. If $A_0$ is low, this may not be enough to make remote journaling
 worthwhile. Throughout most of the range of F values, however, $A_0$ is
 very close to 1. One then cannot look at the improvement I independently
 of the associated value of $A_0$, since a small I may lead to a sizable
 decrease in the total time the data base will be unavailable. For
 example, consider the situation when F = 20 hours. I is only .015, but
 $A_0$ is .9519, which means that $A_2$ is .9662. Thus, the fraction of the
 time that the data base is unavailable decreases from 0.048 to 0.034.
 This translates into a nonnegligible decrease in downtime from 35 hrs./month
 to about 24 hrs./month.
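
 The figures in this example, and the limits on F implied by inequality
 (1), can be recomputed as follows. This is a sketch under stated
 assumptions: the function names are ours, Y is set equal to F as in the
 remote journaling case, and a month is taken to be 720 hours.

```python
def improvement(F, X, L, Y, k, D):
    """I = (A_2 - A_0) / A_0 = (X - D - L + k(X - Y)) / F."""
    return (X - D - L + k * (X - Y)) / F


def largest_valid_F(X, L, k, D):
    """With Y = F, inequality (1) reads D + L + kF < X + kX; solve for F."""
    return (X + k * X - D - L) / k


X, L, D = 1.0, 0.5, 0.01          # hours, as assumed in the text
HOURS_PER_MONTH = 720             # assumption: 30-day month

print("model valid for F <", largest_valid_F(X, L, 0.1, D), "when k = 0.1")
print("model valid for F <", largest_valid_F(X, L, 0.01, D), "when k = 0.01")

F, k = 20.0, 0.01
I = improvement(F, X, L, F, k, D)
A0 = F / (F + X + k * X)
A2 = A0 * (1.0 + I)
print(f"I = {I:.3f}, A0 = {A0:.4f}, A2 = {A2:.4f}")
print(f"downtime: {(1 - A0) * HOURS_PER_MONTH:.1f} -> "
      f"{(1 - A2) * HOURS_PER_MONTH:.1f} hrs./month")
```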
 
 [Figure 3. Single-site availability $A_0$ and fractional improvement I through use of strategy 2, plotted against F (hrs.). Parameters are k = 0.01, D = 0.01 hr., X = 1 hr., L = 0.5 hr., and Y = F.]

 
 As a final comment on the remote journaling strategy described
 here, we note that availability may actually decrease as F increases.
 For example, suppose X = 2, k = 0.25, L = 0.5 and D = 0. Then $A_2$ = .7692
 for F = 4 and $A_2$ = .7647 when F = 6. Differentiating $A_2$ (for Y = F)
 with respect to F shows that this decrease will occur whenever

 $$ k(k + 1)X > D + L. $$

 Intuitively, this phenomenon occurs because for large k the effect of
 the lengthening audit trail to be processed outweighs that of the
 increasing reliability of the host computer.
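 The differentiation step can be written out explicitly. This is a sketch that takes $A_2 = 1 - (D + L + kY)/(F + X + kX)$, the form used elsewhere in this section, and sets Y = F:

```latex
\[
A_2\Big|_{Y=F} = 1 - \frac{D + L + kF}{F + X + kX},
\qquad
\frac{dA_2}{dF} = \frac{(D + L) - k(1 + k)X}{(F + X + kX)^{2}} .
\]
```

 The derivative is negative exactly when k(k + 1)X > D + L. For the example values (X = 2, k = 0.25, L = 0.5, D = 0) the numerator is 0.5 - 0.625 = -0.125, which agrees with the drop from .7692 to .7647.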
 
 Frequently updated remote journal. Clearly, there may be
 problems with the remote journaling strategy described in the last 
 section because of the need to process an extremely long audit trail. 
 Suppose, then, that we drop the assumption that Y = F and assume instead 
 that the remote copy is periodically brought up to date. As an example, 
 we might assume this updating to take place every two hours. Thus on 
 the average the audit trail has been growing for 1 hour when the remote 
 copy is activated. With all other parameters as specified for figure 3, 
 but with Y = 1, 
 
 I = .49/F. 
 This result is independent of k (because of the cancelling of the kX and 
 kY terms), as long as k and F are such that the model is valid. The 
 improvement is little different from what it was in the Y = F case. 
 However, in appendix 2 we show that by decreasing Y we can considerably 
 reduce the likelihood that the backup site fails before its copy is 
 readied. Hence true availability will improve more than our simple 
 formula indicates. 
 
 
We do not include a graph of the result for Y = 1, since it 
 would be almost identical to figure 3. To see why this should be so, 
 consider more closely the formula for I. 
 
 $$ I = \frac{A_2 - A_0}{A_0} = \frac{X + kX - D - L - kY}{F}. $$

 As long as k is small (or when X = Y as above) it is clear that

 $$ I \approx \frac{X - L}{F}. $$
 
 Running spares. Here we assume that the backup copy is stored
 on disk for virtually instantaneous access and is kept almost up to
 date. Reasonable parameters for this case might be L = 0, Y = .1 hr.,
 and (for comparison with the results above) X = 1 hr., k = .01. Then we
 have

 $$ I = \frac{0.999}{F}. $$

 We will not bother to graph this; this curve is again similar to that in
 figure 3, only now the values of I are approximately doubled. In this case,
 improvements of 5 to 10 percent are seen for F between 10 and 20 -
 certainly enough to make the strategy worthwhile. In fact, what happens
 in this case is that, under our assumptions, availabilities are brought
 up to very nearly unity. To see this, note that

 $$ A_2 = 1 - \frac{.01 + kY}{F + (1 + k)X} $$

 and for our example kY = 0.001. Increasing k will cause somewhat smaller
 values of $A_2$, but $A_2$ will be over 99 percent for a wide range of reasonable
 parameter values.
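 A small sketch makes the "very nearly unity" claim concrete for the running-spares parameters quoted above. The function names and the particular F values sampled are illustrative assumptions.

```python
def backup_availability(F, X, k, D, L, Y):
    """A_2 = 1 - (D + L + kY) / (F + X + kX)."""
    return 1.0 - (D + L + k * Y) / (F + X + k * X)


def fractional_improvement(F, X, k, D, L, Y):
    """I = (X + kX - D - L - kY) / F."""
    return (X + k * X - D - L - k * Y) / F


# Running-spares parameters from the text: L = 0, Y = 0.1 hr., X = 1 hr.,
# k = 0.01, with D = 0.01 hr. as in figure 3.
X, k, D, L, Y = 1.0, 0.01, 0.01, 0.0, 0.1

running_spares = {
    F: (backup_availability(F, X, k, D, L, Y),
        fractional_improvement(F, X, k, D, L, Y))
    for F in (10.0, 15.0, 20.0)  # sampled values, chosen for illustration
}

for F, (a2, improvement) in running_spares.items():
    print(f"F = {F:4.0f} hr.  A_2 = {a2:.5f}  I = {improvement:.4f}")
```

 For each sampled F the improvement equals 0.999/F and $A_2$ stays above 0.99.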
 
 
 Effect of varying Y. We have looked at three separate cases
 which differ from one another in large part in the widely differing
 values for the parameter Y. To better understand the effect of this
 parameter, we select typical values of the other parameters (X = 1,
 L = 0.5, D = 0.01, F = 20) and consider $A_2$ as a function of Y for
 several different values of k. When k = .01, we have

 $$ A_2 = 1 - \frac{0.51 + 0.01Y}{21.01}. $$

 The small coefficient of Y in this case makes the effect of Y minimal.
 As Y ranges between 0 and 20, $A_2$ decreases linearly from 0.976 to 0.966.
 Now suppose that k is increased to 0.05. In this case as Y goes from
 0 to 20, $A_2$ decreases from 0.976 to 0.953. These are not very dramatic
 changes, although they will (as we noted above) be more impressive when
 translated into decreases in downtime. To a large extent, therefore,
 what makes the "running-spares" approach particularly worthwhile is not
 the small value of Y but the instantaneous access ($L \approx 0$).
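 The linear dependence on Y can be made explicit. As a sketch, separating the Y term in $A_2 = 1 - (D + L + kY)/(F + (1 + k)X)$ gives:

```latex
\[
A_2(Y) = 1 - \frac{D + L}{F + (1 + k)X} - \frac{k}{F + (1 + k)X}\,Y ,
\qquad
\frac{\partial A_2}{\partial Y} = -\frac{k}{F + (1 + k)X} .
\]
```

 With k = .01, F = 20 and X = 1 the slope is -0.01/21.01, or about -0.00048 per hour, so a 20-hour change in Y moves $A_2$ by only about 0.0095. This is the drop from 0.976 to 0.966 quoted for that case.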
 
 
Conclusions 
 
 We have presented here a model for data availability which,
 while superficial, does seem to reflect the realities of various
 strategies for backup. We have seen that remote journaling, in the sense of
 storing a copy in archival storage (e.g. tape) at a remote site, leads 
 to availability improvement of at best 5 percent, which may be inadequate 
 if single-copy availability is low. On the other hand, the running 
 spares strategy, in which the remote copy is nearly up to date and 
 almost immediately accessible, brings availability up to over 99 percent 
 and appears to be worthwhile. It should be noted, however, that the 
 running spares strategy is bound to be relatively expensive. Furthermore, 
 before this strategy can be effectively used, many of the problems of 
 multi-copy management must be solved. For example, updating must be 
 synchronized in order to maintain consistency between the master and 
 backup copies. 
 
 One final point should be made. In a sense, the gross availability 
 of a data base is too vague a statistic. Suppose the availability is, 
 say, 23/24. This might mean that approximately every 12 hours the data 
 becomes unavailable for about a half hour. Or it might mean that once a 
 month the data base disappears for more than a day. In a crisis, a 
 half-hour delay in obtaining data might be tolerable but a one-day delay 
 would not. The availability, then, must be looked at in conjunction 
 with F, the mean time between failures. 
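 The two readings of an availability of 23/24 can be reproduced with a one-line relation: if one outage occurs per failure cycle, its expected length is (1 - A) times the cycle length. The sketch below assumes a 30-day month; the function name is illustrative.

```python
def outage_length(availability, cycle_hours):
    """Expected downtime per cycle when one outage occurs in each cycle."""
    return (1.0 - availability) * cycle_hours


A = 23.0 / 24.0

# Same gross availability, very different failure patterns.
frequent_short = outage_length(A, 12.0)        # a failure about every 12 hours
rare_long = outage_length(A, 30.0 * 24.0)      # a failure about once a month

print(f"every 12 hours: down {frequent_short:.2f} hr. each time")
print(f"once a month:   down {rare_long:.1f} hr. each time")
```

 The first case gives a half-hour outage and the second an outage of 30 hours, i.e. more than a day.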
 
 
References 
 
 Abramowitz, M. and Stegun, I.A. (editors)
 1964 Handbook of Mathematical Functions, National Bureau
 of Standards Applied Math. Series No. 55.

 Barlow, R.E. and Proschan, F.
 1965 Mathematical Theory of Reliability, Wiley-Interscience.

 Borgerson, B.R. and Freitas, R.F.
 1975 "A Reliability Model for Gracefully Degrading and
 Standby-Sparing Systems," IEEE Trans. Computers, C-24,
 pp. 517-525.

 Chandy, K.M.; Browne, J.C.; Dissly, C.W.; and Uhrig, W.R.
 1975 "Analytic Models for Rollback and Recovery Strategies
 in Data Base Systems," IEEE Trans. Software Engineering,
 SE-1, pp. 100-110.

 Chu, W.W.
 1973 "Optimal File Allocation in a Computer Network," in
 Computer-Communications Networks, N. Abramson and
 F. Kuo, eds., Prentice-Hall, pp. 82-94.

 Frank, H.; Kahn, R.E.; and Kleinrock, L.
 1972 "Computer Communication Network Design - Experience with
 Theory and Practice," Proc. AFIPS National Computer
 Conference, AFIPS Press, Montvale, N.J., pp. 255-270.

 Kleinrock, L.
 1975 Queueing Systems. Vol. 1: Theory, Wiley-Interscience.

 Reynolds, C.H. and Van Kinsbergen, J.E.
 1975 "Tracking Reliability and Availability," Datamation,
 November 1975, pp. 106-116.
 
 
Appendix 1 
 
 Extension of Chu's Formula 
 
 In this appendix, we take up the question posed earlier as to
 data base availability when no time is lost in transferring usage (e.g.,
 when downtime is scheduled at one site and transfer of usage to another
 site is prearranged). As we remarked earlier, a good way to study this
 problem would be through an extension of Chu's model [Chu, 1973].
 Suppose n sites (all remote) have a copy of the data base, and that the
 availability of the ith site is $a_i$. (In general, this availability
 can be computed as the product of the availability of the ith host system
 times the availability of a communication link from the local site
 to site i.) Then the probability that site i is not available is
 $(1 - a_i)$, and the probability that none of the n sites is available is

 $$ U = (1 - a_1)(1 - a_2)(1 - a_3) \cdots (1 - a_n). $$

 Hence the probability that at least one site is available is given by

 $$ A_s = 1 - U. $$
 
 (Unlike Chu, we assume that we have no problem getting access to the 
 network and so do not include Chu's factor for local host availability.) 
 
 To see how this gross availability is increased by the existence 
 of multiple copies, consider the following examples. 
 
 1. Suppose all of the $a_i$'s are equal to 0.8. Then for n = 1,
 $A_s$ = 0.8; but for n = 2, $A_s$ = 0.96; and for n = 3, $A_s$ = 0.992.

 2. Suppose that the data base is at site 1 where its availability
 is only 0.5. By placing a copy at a second site with
 availability 0.7, the overall availability becomes $A_s$ = 0.85.
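 Both examples follow directly from $A_s = 1 - U$. A minimal sketch, in which the function name is an illustrative assumption and the site availabilities are taken to be independent, as the product formula implies:

```python
from math import prod


def multi_copy_availability(site_availabilities):
    """A_s = 1 - U, where U is the product of (1 - a_i) over all sites."""
    unavailable = prod(1.0 - a for a in site_availabilities)
    return 1.0 - unavailable


# Example 1: identical sites with a_i = 0.8.
identical = [multi_copy_availability([0.8] * n) for n in (1, 2, 3)]

# Example 2: a weak primary (0.5) plus one copy at a site with 0.7.
mixed = multi_copy_availability([0.5, 0.7])

print([round(a, 3) for a in identical], round(mixed, 2))
```

 Each added copy multiplies the unavailability U by another factor (1 - a_i), which is why even a second, mediocre site helps considerably.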
 
 
 Finally, notice that if a copy of the data base is held locally,
 the formulas for $A_s$ and U need not be changed at all. If we label the
 local site 1, then the value to be used for $a_1$ is the availability of
 the data base through the local system. The fact that $a_1$ does not
 involve network reliability as do the other $a_i$'s means that $a_1$ may be
 slightly larger, but otherwise the formulation is unaffected.
 
 
Appendix 2 
 
 Stochastic Considerations 
 
 Basic assumptions. In the text of this paper we have been
 working exclusively with mean (or expected) values of parameters such 
 as the time between site failures. As we indicated, however, site failure 
 is a random process; not all questions can be answered by looking just at 
 the mean time between failures. In particular, we have noted that our 
 simplistic approach will predict too high an availability if, for example, 
 hosts are likely to fail while the data base is being readied. In 
 this appendix, then, we deal with some of these probabilistic questions 
 in order to get a better understanding of the validity of the results 
 we have computed. 
 
 First, recall that we are assuming the occurrence of failure
 to be a Poisson process. Essentially, this means that we assume that
 the probability that a failure occurs in any time interval from $t_1$ to $t_2$
 is proportional to $t_2 - t_1$. Notice that if the constant of proportionality
 (which is just the failure rate) is 1/F, then the mean time between
 failures is F, as we assumed earlier. The basic Poisson hypothesis
 also implies that the process is memoryless. That is, the probability
 of a failure in any time interval is independent of whether or when any
 failures occurred in the past.

 One can then introduce a random variable Z giving the "time
 to failure" from an arbitrary starting point t = 0. The probability
 $P\{Z \le t\}$ that the machine fails before time t (i.e. in the time interval
 [0,t]) can be shown to be

 (A1) $$ P\{Z \le t\} = 1 - \exp(-t/F). $$
 
 
 Then the probability that the machine has not yet failed at time t is

 (A2) $$ P\{Z > t\} = \exp(-t/F). $$

 These simple formulas are adequate for computing most of the probabilities
 we are interested in.

 Probability that backup fails before master is ready for use.
 First, consider the following problem. What is the probability $P_f$ that
 the backup may fail before the master site is again ready? We will first
 assume that X + kX is a constant. (The case where repair is also treated
 as a probabilistic process will be discussed below.) $P_f$ is then calculated
 from equation (A1) with t = X + kX, the time required to repair and
 update the old master. We find that

 (A3) $$ P_f = 1 - \exp(-(X + kX)/F). $$

 For example, suppose that X = 2 hours and k = 0.1. Then

 $$ P_f = 1 - \exp(-2.2/F). $$

 Some values of this function are tabulated below. Values of F are given
 in hours. (Since only ratios are involved, the time units used do not
 matter as long as one is consistent.)
 
 F       8     12     16     24     32     40     48

 P_f    .24    .17    .13    .09    .07    .05    .04
 
 The reader may well be dismayed that even for F (the mean time between
 failures) as large as 48 hours, there is still a 4% chance that the
 backup will fail before the old master is ready. This would leave
 a gap in availability which is not accounted for in our simplistic
 model. On the other hand, if the expected time to ready the master
 is considerably smaller - say X = 1 hr., k = 0.1 - then $P_f$ is only
 0.09 for F = 12, 0.04 for F = 24, and 0.02 for F = 48. (Notice that
 if (X + kX)/F is small, $P_f$ is conveniently approximated by
 $P_f \approx (X + kX)/F$.)
 If the $P_f$ computed in any situation is large enough to seriously degrade
 availability, the solution is to provide a second backup, so that usage
 may be transferred to it if the first backup fails and the old master is
 not yet ready. This would, of course, not be worthwhile if the second
 backup requires so long to get ready that the old master will almost
 certainly be ready first.
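 The tabulated values and the small-ratio approximation can be reproduced from equation (A3). A minimal sketch; the function name is an illustrative assumption.

```python
import math


def backup_failure_probability(X, k, F):
    """(A3): probability that the backup fails within the time X + kX
    needed to repair and update the old master."""
    return 1.0 - math.exp(-(X + k * X) / F)


F_VALUES = (8, 12, 16, 24, 32, 40, 48)

# X = 2 hours, k = 0.1, as in the table.
slow_repair = {F: backup_failure_probability(2.0, 0.1, F) for F in F_VALUES}

# X = 1 hour, k = 0.1, the faster-repair case discussed in the text.
fast_repair = {F: backup_failure_probability(1.0, 0.1, F) for F in (12, 24, 48)}

for F in F_VALUES:
    linear = 2.2 / F  # small-ratio approximation (X + kX)/F
    print(f"F = {F:2d}  P_f = {slow_repair[F]:.2f}  approx = {linear:.2f}")
```

 The linear approximation slightly overstates $P_f$, and the gap shrinks as F grows.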
 
 We have investigated how these conclusions are affected by
 making the more realistic assumption that repair time is not a constant
 but also obeys some probability distribution. With this assumption, the
 probability $P_f$ can be shown to be given by

 (A4) $$ P_f = 1 - \int_{t=0}^{\infty} \exp[-(1 + k)t/F]\, W(t)\, dt, $$

 where W(t) is the assumed probability density function for repair time.
 
 The only real difficulty in making this assumption is the 
 apparent lack of the raw data which is needed before one can choose 
 (and statistically validate) a W. From personal reports and from one 
 study in the literature [Reynolds and Van Kinsbergen, 1975], we have put 
 together the following general description of how repair time is distributed, 
 at least for some systems. 
 
 1. The probability of repair within 15 minutes is essentially 
 negligible. 
 
 2. The probability of repair within a half hour is something like 
 0.3 to 0.4. 
 
 3. In the vicinity of t = 0.5 hr. the probability density curve 
 rises sharply to its peak, so that the likelihood of repair 
 within 45 minutes is between 0.8 and 0.9. 
 
 
Notice that it should be relatively simple to obtain a good description 
 of this sort for any particular system. All that is needed is a log of 
 repair times. 
 
 There are two known probability distributions which have the
 right sort of shape to fit our general description of repair time.
 These are the Beta distribution [Abramowitz and Stegun, 1964; p. 930]
 and the Weibull distribution [Barlow and Proschan, 1965; ch. 2]. Both
 of these have two parameters - $(\alpha, \beta)$ in standard notation - which can be
 used to adjust their precise shape. They differ in that the Beta
 distribution has W(t) = 0 for $t \ge 1$, while for the Weibull distribution W(t)
 approaches zero exponentially as $t \to \infty$. Graphs of three such density
 functions which seem to describe repair time well are shown in figures 4
 and 5. Figure 4 shows the density function for the Beta distribution
 with $\alpha = 7$, $\beta = 5$. Figure 5 shows two Weibull distributions; the solid
 curve corresponds to $(\alpha, \beta) = (6, 4)$ and the dashed curve to $(\alpha, \beta) = (4, 3)$.
 Note that the scale on the horizontal axis can be adjusted to fit longer
 (or shorter) expected repair times; i.e., "1" can be assumed an arbitrary
 time unit.
 
 $P_f$ was computed from equation (A4) for 80 different combinations
 of distribution type (Beta or Weibull) and values of the parameters
 $\alpha$, $\beta$, F, and k. We observed that in no case did the value calculated
 differ by more than $6 \times 10^{-4}$ from that calculated (using the appropriate
 mean value X) from equation (A3). Since this discrepancy is of roughly
 the same magnitude as the truncation error in numerically computing the
 integral in (A4), we actually did not discover any difference between
 results computed from the two formulas. We conclude, therefore, that it
 is probably valid for all practical purposes to ignore the distribution
 of repair times and simply use the mean repair time X in equation (A3)
 to compute $P_f$.
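 The comparison of (A4) with (A3) can be repeated for one case with a few lines of numerical integration. This is only a sketch under stated assumptions: the Weibull density is written in the common shape/scale form, which need not match the $(\alpha, \beta)$ convention of the report; the shape 4, scale 0.8 hr. and the F and k values are illustrative choices; and (A4) is taken as one minus the integral so that it agrees with (A3) for a fixed repair time.

```python
import math


def weibull_pdf(t, shape, scale):
    """Weibull density in shape/scale form (an assumed parameterization)."""
    if t <= 0.0:
        return 0.0
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))


def simpson(f, lo, hi, n=2000):
    """Composite Simpson rule on [lo, hi] with an even number n of panels."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for j in range(1, n):
        total += (4 if j % 2 else 2) * f(lo + j * h)
    return total * h / 3.0


def p_f_distributed(F, k, shape, scale):
    """(A4) with W(t) a Weibull density; the tail beyond 5*scale is negligible."""
    rate = (1.0 + k) / F
    integral = simpson(
        lambda t: math.exp(-rate * t) * weibull_pdf(t, shape, scale),
        0.0, 5.0 * scale)
    return 1.0 - integral


def p_f_mean(F, k, shape, scale):
    """(A3) evaluated at the mean repair time X of the same Weibull law."""
    mean_repair = scale * math.gamma(1.0 + 1.0 / shape)
    return 1.0 - math.exp(-(1.0 + k) * mean_repair / F)


shape, scale, k = 4.0, 0.8, 0.1  # illustrative values, not the report's
differences = {
    F: abs(p_f_distributed(F, k, shape, scale) - p_f_mean(F, k, shape, scale))
    for F in (12.0, 24.0, 48.0)
}
for F, diff in differences.items():
    print(f"F = {F:4.0f}  |A4 - A3| = {diff:.2e}")
```

 For these assumed parameters the two formulas differ by well under the $6 \times 10^{-4}$ bound quoted above, because the repair-time spread is small compared with F.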
 
 FIGURE 4
 BETA DENSITY
 (plot not reproduced; vertical axis: density)

 FIGURE 5
 WEIBULL DENSITIES
 (plot not reproduced; vertical axis: density)
 
 Probability that backup site fails before its copy is ready.
 Next, what is the probability $P_r$ that the backup site fails even before
 the backup copy can be gotten ready? This probability is given by

 $$ P_r = 1 - \exp(-(D + L + kY)/F). $$

 (Again we assume that D + L + kY is a constant.) In our analysis of
 the remote journaling strategy, we assumed that Y = F, the mean time between
 failures. Let us also assume the nominal values L = 0.5 hr., D = 0.01 hr.,
 and k = 0.1. With these values,

 $$ P_r = 1 - \exp(-(0.51 + 0.1F)/F). $$

 Sample values are tabulated below.
 
 F      12     16     24     36     48

 P_r    .13    .12    .11    .11    .10
 
 Notice that as F becomes large, $P_r$ approaches $1 - \exp(-k)$, which is
 approximately k when k is small. Unless k is very small,
 $P_r$ is certainly not negligible. And the effect of a failure before the
 copy is readied could be serious. Again, the existence of a second backup
 would help, since it will seldom happen that both backup sites will fail
 before their copies can be readied.

 In the discussion contained in the body of this report, we in
 fact concluded that for values of k as large as 0.1, a remote journaling
 strategy with no journaling taking place between failures is not worthwhile.
 We looked at the possibility that updating the remote journal more frequently
 might produce a more practical strategy. As the time, Y, that updates have
 been accumulating decreases, $P_r$ will also decrease. Suppose that F = 24 hrs.,
 L = 0.5 hr., D = 0.01 hr. and k = 0.1. Then $P_r$, as a function of Y, behaves
 as follows:
 
 Y       1      2      4      8      12     16     24

 P_r    .025   .029   .037   .053   .069   .084   .114
 
 
 Thus when Y is small compared to F, the likelihood that the backup site
 fails before its copy can be readied is probably within acceptable limits.

 Finally, consider the running spares strategy. In that case we
 assumed L = 0 and Y = 0.1. If we again take k = 0.1, we find that

 $$ P_r = 1 - \exp(-0.02/F). $$

 Here the probability that the backup site fails before the copy is ready is
 less than 0.2 percent as long as F is greater than ten hours. This small a
 figure will have a practically negligible effect on calculated availabilities.
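 All three cases of this subsection use the same expression for $P_r$, so they can be checked together. A minimal sketch; the function name is an illustrative assumption.

```python
import math


def copy_not_ready_failure_probability(D, L, k, Y, F):
    """P_r = 1 - exp(-(D + L + kY)/F): the backup site fails before its
    copy has been located, loaded and brought up to date."""
    return 1.0 - math.exp(-(D + L + k * Y) / F)


D, L, k = 0.01, 0.5, 0.1

# Remote journal never refreshed between failures: Y = F.
stale_journal = {
    F: copy_not_ready_failure_probability(D, L, k, F, F)
    for F in (12, 16, 24, 36, 48)
}

# Journal refreshed so that Y hours of updates are pending, with F = 24.
refreshed_journal = {
    Y: copy_not_ready_failure_probability(D, L, k, Y, 24.0)
    for Y in (1, 2, 4, 8, 12, 16, 24)
}

# Running spares: L = 0, Y = 0.1, so D + L + kY = 0.02.
running_spares_at_10 = copy_not_ready_failure_probability(D, 0.0, k, 0.1, 10.0)

print({F: round(p, 2) for F, p in stale_journal.items()})
print({Y: round(p, 3) for Y, p in refreshed_journal.items()})
print(round(running_spares_at_10, 4))
```

 With Y = F the exponent tends to k as F grows, so no amount of host reliability pushes $P_r$ much below 0.1 for k = 0.1; reducing Y is what helps.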
 
 
Appendix 3 
 
 Time to Process the Audit Trail 
 
 Elementary analysis. In this appendix we consider the question
 of how long it really takes to "catch up" when a backlog of updates has
 accumulated during the time a site is down. In the text, we have used
 Chandy's expression kY [Chandy et al., 1975], where Y is the length of
 time the updates have been accumulating, and k = u/b, u being the rate
 of arrival of updates and b being the average rate at which updates are
 processed. (We assume that k < 1.) However, it is clear that during the
 time interval kY more updates are accumulating, and it takes an additional
 time $k^2 Y$ to process these. Continuing to add on these correction terms,
 we generate the infinite series

 $$ (k + k^2 + k^3 + \cdots)Y $$

 as a better formula for the catch-up time $T_c$. Computing the sum, we get

 $$ T_c = kY/(1 - k). $$

 This is a slightly larger number than the first estimate kY, but the
 difference is a small percentage for the expected range of k values.
 (Chandy et al. state that "values of k of order 1/10 or less are to be
 expected.")
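 The size of the correction is easy to see numerically. The sketch below sums the series term by term and compares it with the closed form; the function names and the values k = 0.1, Y = 10 are illustrative assumptions.

```python
def catch_up_time_series(k, Y, terms):
    """Partial sum (k + k^2 + ... + k^terms) * Y of the correction series."""
    return sum(k ** m for m in range(1, terms + 1)) * Y


def catch_up_time_closed(k, Y):
    """Closed form kY / (1 - k), valid for 0 <= k < 1."""
    if not 0.0 <= k < 1.0:
        raise ValueError("the series converges only for 0 <= k < 1")
    return k * Y / (1.0 - k)


k, Y = 0.1, 10.0
first_estimate = k * Y
series_value = catch_up_time_series(k, Y, 20)
closed_value = catch_up_time_closed(k, Y)

# Relative amount by which the first estimate kY falls short.
relative_gap = (closed_value - first_estimate) / closed_value

print(first_estimate, series_value, closed_value, relative_gap)
```

 The shortfall of kY relative to the full sum is exactly k, i.e. 10 percent for k = 0.1 and less for smaller k.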
 
 Analysis using queueing theory. Even this analysis, however,
 appears to be too simplistic. The arrival and processing of updates
 should really be modeled by a single-server queueing system. If the
 arrivals form a Poisson process (arrival rate u) and processing time is
 exponentially distributed (with mean processing rate b), the queueing
 system is one which can be analyzed. In queueing theory, the quantities
 of interest are $P_n(t)$, the probability that there are n items (updates,
 in our case) in the queue at time t. After the system has been running
 for a while, the probabilities $P_n(t)$ will approach equilibrium (or steady
 state) values $p_n$. Of course, equilibrium is never actually reached if
 the initial distribution is not the equilibrium one. But it does make
 sense to describe the catch-up time $T_c$ as the time to reach approximate
 equilibrium.
 
 Fortunately, both the time-dependent and the equilibrium
 distributions that we need are available in the literature [Kleinrock,
 1975], so that we have at hand the information we need to investigate
 the approach to equilibrium.

 The equilibrium distribution is given by

 $$ p_n = (1 - k)k^n. $$

 The time-dependent distribution we are interested in corresponds to having
 some number i of updates in the queue at time t = 0. That is, we have
 the initial conditions

 $$ P_n(0) = 1 \text{ for } n = i; \qquad P_n(0) = 0 \text{ for } n \ne i. $$

 Kleinrock [1975; p. 77] gives the solution of this problem as

 $$ P_n(t) = \exp(-(u + b)t)\Big[ k^{(n-i)/2} I_{n-i}(at) + k^{(n-i-1)/2} I_{n+i+1}(at) + (1 - k)k^n \sum_{j=n+i+2}^{\infty} k^{-j/2} I_j(at) \Big], $$

 where $a = 2bk^{1/2} = 2(ub)^{1/2}$, and $I_j$ (standard notation) is the modified
 Bessel function of the first kind of order j.
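 The transient solution can be evaluated directly to watch the approach to equilibrium. The sketch below is illustrative only: `bessel_i` is a hand-written power-series helper for integer order (not a library routine), the infinite sum is truncated once its terms are negligible, a is taken as $2(ub)^{1/2}$ (the value for which $\exp(-(u + b)t)\exp(at)$ reduces to $\exp(-(u^{1/2} - b^{1/2})^2 t)$), and the rates u = 0.1, b = 1 and backlog i = 3 are assumed values.

```python
import math


def bessel_i(order, z, tol=1e-16):
    """Modified Bessel function I_order(z), integer order, by power series."""
    order = abs(order)  # I_{-j}(z) = I_j(z)
    term = (z / 2.0) ** order / math.factorial(order)
    total = term
    m = 1
    while True:
        term *= (z / 2.0) ** 2 / (m * (m + order))
        total += term
        if term <= tol * total:
            return total
        m += 1


def transient_probability(n, t, i, u, b, max_terms=120):
    """P_n(t) for a single-server queue holding i items at t = 0."""
    k = u / b
    z = 2.0 * math.sqrt(u * b) * t  # z = a * t
    head = k ** ((n - i) / 2.0) * bessel_i(n - i, z)
    head += k ** ((n - i - 1) / 2.0) * bessel_i(n + i + 1, z)
    tail = 0.0
    for j in range(n + i + 2, n + i + 2 + max_terms):
        term = k ** (-j / 2.0) * bessel_i(j, z)
        tail += term
        if term <= 1e-16 * tail:  # remaining terms are negligible
            break
    return math.exp(-(u + b) * t) * (head + (1.0 - k) * k ** n * tail)


u, b, i = 0.1, 1.0, 3  # assumed rates (k = 0.1) and initial backlog
for t in (0.0, 2.0, 5.0, 10.0, 30.0):
    p0 = transient_probability(0, t, i, u, b)
    print(f"t = {t:5.1f}  P_0(t) = {p0:.6f}  (equilibrium p_0 = {1 - u / b:.1f})")
```

 At t = 0 all probability sits at n = i, and $P_0(t)$ then climbs toward the equilibrium value 1 - k.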
 
 In order to study the approach to equilibrium, the following
 formulas for Bessel functions are needed:

 1) As $z \to \infty$, $I_j(z) \sim \{e^z/(2\pi z)^{1/2}\}\{1 - (4j^2 - 1)/(8z) + \cdots\}$.

 2) $\sum_{j=-\infty}^{\infty} k^{j/2} I_j(z) = \exp[(z/2)(k^{1/2} + k^{-1/2})]$.

 3) $I_{-j}(z) = I_j(z)$.

 4) $e^z = I_0(z) + 2I_1(z) + 2I_2(z) + \cdots$

 [Abramowitz and Stegun, 1964; p. 374 ff.]
 Using 2) and 3), we find that

 $$ \sum_{j=-\infty}^{\infty} k^{-j/2} I_j(at) = \exp(t(u + b)). $$

 Noting that the infinite summation in the expression for $P_n(t)$ contains
 only a portion of these terms, we use 4) to show that the summation over
 the negative powers is negligible, and use 1) to estimate the finite
 number of missing positive powers. That is, for large t we make the
 approximation

 $$ \sum_{j=n+i+2}^{\infty} k^{-j/2} I_j(at) \approx \exp(t(u + b)) - \frac{\exp(at)}{(2\pi at)^{1/2}} \sum_{j=0}^{n+i+1} k^{-j/2}\left\{1 - \frac{4j^2 - 1}{8at}\right\}. $$

 (The $j^2$ term in the asymptotic expression is needed since unfortunately
 the constant terms cancel in the ultimate expression for $P_0(t) - p_0$.)
 Furthermore, we notice that, when we substitute the asymptotic formulas
 for $I_j(at)$ into $P_n(t)$, we obtain products of exponentials

 $$ \exp(-(u + b)t)\exp(at) $$

 which simplify to

 $$ \exp(-(u^{1/2} - b^{1/2})^2 t). $$

 After considerable algebraic manipulation, we obtain

 $$ P_0(t) - p_0 \approx \frac{\exp(-(u^{1/2} - b^{1/2})^2 t)\; i k^{-i/2}}{2\pi^{1/2}\,(ub)^{3/4}\, t^{3/2}\,(1 - k^{-1/2})}. $$
 
 
 Recall that we have said that the catch-up time $T_c$ should be the time
 at which the $P_n(t)$, and in particular $P_0(t)$, are "close" to their
 equilibrium values. This would mean that the right side of the above
 expression should be small. However, notice that the exponential factor
 decreases rapidly with t, while the factor $i k^{-i/2}$ is very large if the
 queue is long at time t = 0. To a good approximation, then, we can
 assume that equilibrium is reached when these factors approximately
 cancel; that is, when

 $$ \exp(+(u^{1/2} - b^{1/2})^2 t) \approx i k^{-i/2}. $$

 Taking logarithms, we obtain the following formula for $T_c$:

 $$ T_c = \frac{-i \ln k + 2 \ln i}{2(u^{1/2} - b^{1/2})^2}. $$

 If the term $2 \ln i$ is neglected, this expression simplifies in an interesting
 way. Suppose updates have been accumulating for a time period of length Y.
 Then i = uY, and

 $$ T_c = \frac{-uY \ln k}{2b(k^{1/2} - 1)^2} = kY\left\{\frac{-\ln k}{2(1 - k^{1/2})^2}\right\}. $$

 The quantity in brackets has the curious property of lying very close to
 2.5 for k between 0.04 and 0.15. (For k = 0.01 it is still 2.84, although
 it grows rapidly thereafter as k decreases.) Notice also that adding on
 the term $2 \ln i$ will serve to increase the effective $T_c$. On the other hand,
 taking account of the terms in the denominator of the approximate expression
 for $P_0(t) - p_0$ (in particular the factor $t^{3/2}$) will serve to decrease
 it. It seems not unreasonable, then, to claim as we did early in the paper,
 that $T_c \approx 2kY$.
 
 It should be emphasized that the queueing theory analysis above
 depends strongly upon the assumption that updates are arriving randomly
 - i.e., that the arrivals form a Poisson process. If arrivals are instead
 bunched up during certain time periods, results may be quite different.
 For example, if a number of updates uY have accumulated and must be
 processed during a time when no new updates are arriving, then clearly
 uY/b = kY will be the correct expression for $T_c$. On the other hand,
 if the backlog of updates must be processed during a time period when a
 particularly large number of new updates are being entered, then $T_c$ will
 be greater than the queueing analysis has indicated.
 
 
Appendix 4 
 
 Sensitivity to Parameter Values 
 
 In any model, it is useful to determine how sensitive the output 
 values are to changes in the inputs. Obviously, the inputs are only 
 known approximately or are statistical averages. If the output changes 
 drastically for a small change in an input value, the model is rather 
 useless for predictive or decision purposes. Chandy et al. [1975] use 
 the elasticity E(f,y), essentially the "percentage change in f caused 
 by a percentage change in y", to investigate the sensitivity of a 
 function f with respect to a parameter y. Formally, E is defined by 
 
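 $$ E(f,y) = \frac{\partial f}{\partial y}\,\frac{y}{f}. $$

 We have investigated the elasticity of $U = 1 - A_2$ with respect
 to all of the input variables. (Working with U instead of $A_2$ simplifies
 the algebra without changing the conclusion.) We find that for all
 parameters

 $$ \left|\frac{\partial U}{\partial y}\,\frac{y}{U}\right| < 1. $$

 For example, taking y = k,

 $$ \frac{\partial U}{\partial k} = \frac{FY + XY - DX - LX}{(F + X + kX)^2}, $$

 and

 $$ \left|\frac{\partial U}{\partial k}\,\frac{k}{U}\right| = \left|\frac{k(FY + XY - DX - LX)}{(F + X + kX)(D + L + kY)}\right| = \frac{|kFY + kXY - \cdots|}{|kFY + kXY + \cdots|} < 1. $$

 And for y = Y,

 $$ \left|\frac{\partial U}{\partial Y}\,\frac{Y}{U}\right| = \frac{kY}{F + X + kX}\cdot\frac{F + X + kX}{D + L + kY} = \frac{kY}{D + L + kY} < 1. $$

 Similar computations show that the elasticities of U with respect to D,
 L, X, and F are all less than one. Elasticities of U are connected to
 those of $A_2$ through

 $$ \left|\frac{\partial A_2}{\partial y}\,\frac{y}{A_2}\right| = \left|\frac{\partial U}{\partial y}\,\frac{y}{A_2}\right| < \left|\frac{\partial U}{\partial y}\,\frac{y}{U}\right| $$

 as long as $A_2 > U$. We may conclude therefore that our model is stable,
 being relatively insensitive to small changes in parameter values.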
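 The elasticity bounds can be spot-checked numerically with central differences. This sketch assumes $U = (D + L + kY)/(F + X + kX)$ and the figure 3 parameter values with Y = F = 20; the function names and the step size are illustrative choices.

```python
def unavailability(F, X, k, D, L, Y):
    """U = 1 - A_2 = (D + L + kY) / (F + X + kX)."""
    return (D + L + k * Y) / (F + X + k * X)


def elasticity(params, name, rel_step=1e-6):
    """E(U, y) = (dU/dy)(y/U), with dU/dy from a central difference."""
    base = unavailability(**params)
    h = params[name] * rel_step
    up = dict(params)
    up[name] += h
    down = dict(params)
    down[name] -= h
    derivative = (unavailability(**up) - unavailability(**down)) / (2.0 * h)
    return derivative * params[name] / base


params = dict(F=20.0, X=1.0, k=0.01, D=0.01, L=0.5, Y=20.0)
elasticities = {name: elasticity(params, name) for name in params}

for name, value in elasticities.items():
    print(f"E(U, {name}) = {value:+.4f}")
```

 For these values every elasticity has magnitude below one; the largest is the one with respect to F, at about -0.95.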
 
 
 UNCLASSIFIED
 SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

 REPORT DOCUMENTATION PAGE
 (READ INSTRUCTIONS BEFORE COMPLETING FORM)

 1. REPORT NUMBER: CAC Document Number 181;
 CCTC-WAD Document Number 6501
 2. GOVT ACCESSION NO.:
 3. RECIPIENT'S CATALOG NUMBER:
 4. TITLE (and Subtitle): Research in Network Data Management and
 Resource Sharing - The Effect of Backup
 Strategy on Data Base Availability
 5. TYPE OF REPORT & PERIOD COVERED: Research Report
 6. PERFORMING ORG. REPORT NUMBER: CAC #181
 7. AUTHOR(s): Geneva G. Belford, Paul M. Schwartz, and Suzanne Sluizer
 8. CONTRACT OR GRANT NUMBER(s): DCA100-75-C-0021
 9. PERFORMING ORGANIZATION NAME AND ADDRESS:
 Center for Advanced Computation
 University of Illinois at Urbana-Champaign
 Urbana, Illinois 61801
 10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS:
 11. CONTROLLING OFFICE NAME AND ADDRESS:
 Command and Control Technical Center
 WWMCCS ADP Directorate
 11440 Isaac Newton Sq., No., Reston, Va. 22090
 12. REPORT DATE: February 1, 1976
 13. NUMBER OF PAGES: 44
 14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office):
 15. SECURITY CLASS. (of this report): UNCLASSIFIED
 15a. DECLASSIFICATION/DOWNGRADING SCHEDULE:
 16. DISTRIBUTION STATEMENT (of this Report):
 Copies may be obtained from the
 National Technical Information Service
 Springfield, Virginia 22151
 17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report):
 No restriction on distribution
 18. SUPPLEMENTARY NOTES: None
 19. KEY WORDS (Continue on reverse side if necessary and identify by block number):
 information system modeling
 data base availability
 data base backup
 distributed data management
 20. ABSTRACT (Continue on reverse side if necessary and identify by block number):
 Formulas are developed for the study of improvements in data base availability
 due to the existence of a backup copy located at an alternate site in a
 network. Several backup strategies are compared.

 DD FORM 1473, 1 JAN 73. EDITION OF 1 NOV 65 IS OBSOLETE
 
 BIBLIOGRAPHIC DATA SHEET

 1. Report No.: UIUC-CAC-DN-76-181
 3. Recipient's Accession No.:
 4. Title and Subtitle: Research in Network Data Management and
 Resource Sharing - The Effect of Backup
 Strategy on Data Base Availability
 5. Report Date: February 1, 1976
 7. Author(s): Geneva G. Belford, Paul M. Schwartz, and Suzanne Sluizer
 8. Performing Organization Rept. No.: CAC #181
 9. Performing Organization Name and Address:
 Center for Advanced Computation
 University of Illinois at Urbana-Champaign
 Urbana, Illinois 61801
 10. Project/Task/Work Unit No.:
 11. Contract/Grant No.: DCA 100-75-C-0021
 12. Sponsoring Organization Name and Address:
 Command and Control Technical Center
 WWMCCS ADP Directorate
 11440 Isaac Newton Square, North
 Reston, Virginia 22090
 13. Type of Report & Period Covered: Research
 14.
 15. Supplementary Notes:
 16. Abstracts:
 Formulas are developed for the study of improvements in data base availability
 due to the existence of a backup copy located at an alternate site in a network.
 Several backup strategies are compared.
 17. Key Words and Document Analysis. 17a. Descriptors:
 information system modeling
 data base availability
 data base backup
 distributed data management
 17b. Identifiers/Open-Ended Terms:
 17c. COSATI Field/Group:
 18. Availability Statement:
 No restriction on distribution
 Available from the National Technical
 Information Service, Springfield, VA 22151
 19. Security Class (This Report): UNCLASSIFIED
 20. Security Class (This Page): UNCLASSIFIED
 21. No. of Pages: 44
 22. Price:

 FORM NTIS-35 (REV. 3-72)
 USCOMM-DC 14952-P72