CAC Document Number 181
CCTC-WAD Document Number 6501

Research in Network Data Management and Resource Sharing

The Effect of Backup Strategy on Data Base Availability

by
Geneva G. Belford
Paul M. Schwartz
Suzanne Sluizer

Prepared for the Command and Control Technical Center, WWMCCS ADP Directorate of the Defense Communication Agency, Washington, D.C., under contract DCA100-75-C-0021

Center for Advanced Computation
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

February 1, 1976

Approved for release: Peter A. Alsberg, Principal Investigator

Digitized by the Internet Archive in 2012 with funding from University of Illinois Urbana-Champaign
http://archive.org/details/researchinnetwor181belf
TABLE OF CONTENTS

Executive Summary
    The problem
    The model
    Conclusions
Introduction
    File allocation
    Network reliability modeling
    Modeling computer system reliability
    Modeling backup and recovery strategies
    The present work
The Model
    Overview
    Parameters
    Single-copy availability
    Discussion of the parameter k
    Availabilities for two backup strategies
Experiments and Discussion
    Remote journaling
    Frequently updated remote journal
    Running spares
    Effect of varying Y
Conclusions
References
Appendix 1: Extension of Chu's Formula
Appendix 2: Stochastic Considerations
    Basic assumptions
    Probability that backup fails before master is ready for use
    Probability that backup site fails before its copy is ready
Appendix 3: Time to Process the Audit Trail
    Elementary analysis
    Analysis using queueing theory
Appendix 4: Sensitivity to Parameter Values

Executive Summary

The problem. The availability of a data base may be simply defined as the fraction of time that the data is available to users. Many things can cause a data base to become unavailable in a network setting. If the data base is stored at the same location as the user, the system through which the data must be accessed may fail, or the device on which the data base is resident may crash. If the data base is located at a remote site on the network, the remote site or system may fail, the network may partition so that the remote site cannot be reached, or some local failure may make the network inaccessible to the user.

In most of these cases, availability can be considerably improved if a backup copy of the data base exists. If copies of the data base exist at two sites in the network, the danger of losing access because of network partitioning or site failure is reduced. Furthermore, if a local device holding all or part of the data base crashes, data may be destroyed.
It is likely to be much faster (as well as more reliable) to ready a locally archived backup copy of the data for usage than to try to recover the lost or degraded data from audit trails, etc.

How much the existence of a backup copy improves availability depends on a number of factors. For example:

1) How available is the backup copy? (Is it stored on disk for immediate access? If it is stored on tapes, a sizeable delay may be incurred while the tapes are located, mounted, and loaded onto a rapid-access device.)
2) How up-to-date is the backup copy? (Are all updates applied to the backup copy as rapidly as possible? Is there a long backlog of updates that must be processed before the data base is really ready for use?)
3) How often is the site (or device) holding the data base likely to fail? (If failures are infrequent, the backup copy may provide little improvement in availability.)

Even small improvements in availability can, of course, be important. Availability can be over 0.99 and still be disastrously low if, say, the data is unavailable for one 24-hour period during a year and that period happens to be during a crisis. It is important, therefore, to understand thoroughly how availability is affected by the factors discussed in the preceding paragraph, and hence by the strategy used for backing up a data base.

The model. We have developed simple algebraic formulas for availability as a function of the factors listed above. Additional parameters are incorporated to model the delay incurred in initiating the process of readying the backup copy, the rate at which updates are generated, and the rate at which updates are processed. We have assumed the existence of a single backup copy, and have studied the improvement in availability that the existence of a backup provides over single-copy availability. The formulas are kept simple by using average values for parameters that are actually random variables.
For example, we use the "mean time between failures" in the availability formula, while system failure is actually a random process. In appendix 2, we look into the validity of this simplification and conclude that its effect on computed availabilities is, in most realistic situations, to make them appear only slightly larger than they actually would be.

Conclusions. One main conclusion from studying the model is that a backup copy can improve the availability of a data base by as much as 5 to 10 percent. To put this result into more concrete terms, suppose that a single copy is likely to be down for two hours per day (availability = .917). A 5 percent improvement would produce an availability of .963, or a reduction of probable down time to about 54 minutes per day.

A second important conclusion is that if the backup copy is readily accessible and kept reasonably up to date, the availability is very close to 1. On the other hand, if the backup copy is stored on tape, so that it is relatively out of date and locating it is a time-consuming process, availability may be little better than was provided by a single copy. (This is because one can probably repair the original system about as rapidly as one can ready the backup.) Indeed, a backup of this sort tends to be mainly useful for recovery from some accident which destroys data in the original data base.

In this study, we have necessarily restricted ourselves to trying to answer a few specific questions and to computing availabilities for only a limited number, or range, of parameter values. However, the formulas we have developed - and, even more, the simple, straightforward approach which yielded those formulas - have applicability in a wide variety of settings. The most important aspect of this work is not the particular numbers or formulas obtained but the tools developed for studying availability in general.
With little additional effort, these tools can be used to provide answers to other questions regarding the effect of backup strategy on availability.

Introduction

We here use the term availability to mean the fraction of time that a data base is available to respond to user requests or queries. In any setting, and particularly in a network, availability is a function of the reliability (or availability) of many components - host computers, network communications lines, etc. - as well as of strategies for backup and recovery. In this section we first discuss some of the past modeling research that has yielded results relevant to data base availability, and then introduce the line of work which we have pursued.

File allocation. One of the factors to be taken into account in distributing copies of a file to various network sites is the number of copies needed for an acceptable degree of availability. Chu [1973] takes account of this factor in the following way. First, he defines the availability of a piece of equipment (e.g., communication line or computer) as

$$\text{Availability} = \frac{F}{F + X},$$

where F is the mean time between failures and X is the mean time to repair. Then, assuming 1) all computers in the network have identical availability A, 2) all communication channels have identical availability c, and 3) the network is completely connected, Chu obtains the following formula for the availability of the jth file:

$$A\left(1 - (1 - Ac)^{r_j}\right),$$

where $r_j$ is the number of copies of the jth file in the network. Once A and c are known, it is a simple matter to choose $r_j$ so as to bring the availability of a remote copy up to a satisfactory level. Overall availability, however, is bounded by the factor A, the availability of the requesting computer, which is apparently assumed not to possess a copy of the file.
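As a rough illustration of how Chu's homogeneous-network formula might be used to choose the number of copies, the following Python sketch evaluates it directly. The function names and the sample values A = 0.95 and c = 0.99 are assumptions made for illustration only; they do not come from Chu's paper or from this report.

```python
from typing import Optional


def equipment_availability(mtbf: float, mttr: float) -> float:
    """Availability of one component: F / (F + X)."""
    return mtbf / (mtbf + mttr)


def chu_file_availability(a: float, c: float, r: int) -> float:
    """Chu's homogeneous-network estimate A * (1 - (1 - A*c)**r).

    a: availability of every computer (assumed identical)
    c: availability of every communication channel (assumed identical)
    r: number of copies of the file in the network
    """
    return a * (1.0 - (1.0 - a * c) ** r)


def copies_needed(a: float, c: float, target: float,
                  max_copies: int = 50) -> Optional[int]:
    """Smallest r whose availability reaches `target`, or None if no r
    up to max_copies does (the result can never exceed a)."""
    for r in range(1, max_copies + 1):
        if chu_file_availability(a, c, r) >= target:
            return r
    return None


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the formula.
    a, c = 0.95, 0.99
    for r in (1, 2, 3):
        print(r, round(chu_file_availability(a, c, r), 4))
    print(copies_needed(a, c, 0.949), copies_needed(a, c, 0.96))
```

With these sample values, a target above A = 0.95 cannot be met by any number of copies, which reflects the bound imposed by the requesting computer's own availability.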
Although Chu's model, with its assumption of complete homogeneity of network components, may seem oversimplified, an analogous analysis can be readily carried out in the heterogeneous case to yield only slightly more complex expressions. (See appendix 1.) Notice, however, that this model presents another problem. It implicitly assumes that the files are static, or are simultaneously kept up to date by some trouble-free process. In fact, the development of algorithms to keep segments of a data base identical (or nearly so) is a topic of current research. (See the chapter on Automated Backup in CAC Doc. No. 162, JTSA Doc. No. 5509.)

Network reliability modeling. Another simplification in Chu's model is the assumption that a direct communication line connects every pair of sites. This assumption allows Chu to use a single parameter to describe availability of a link from one site to another. In a general network, this availability will depend in a complex way upon network topology. Several alternate paths may exist between two given sites. Each of these paths may involve more than one "hop" and so more than one piece of subnet hardware. Indeed, in the ARPA network it has been found that the failure rate for IMP's is about the same as that for communication channels, and that IMP failures therefore have the more drastic effect on communications reliability [Frank, Kahn, and Kleinrock, 1972]. Graph theoretical techniques for computing availability from component reliabilities are, however, well known. The paper by Frank et al. contains a brief review of these techniques. No great difficulty is envisioned in applying them to any given network (such as the WIN) to obtain host availabilities. These may then be used in the formula given in appendix 1 to obtain rough estimates of file (or data base) availability.

Modeling computer system reliability.
Another parameter in Chu's model that requires more detailed analysis for complete understanding is computer availability. One source of information on computer availability is direct system measurement. On a lower level, however, failures can be modeled to yield, in addition to overall figures on expected system reliability, useful insights into repair and backup strategies. Borgerson and Freitas [1975] recently published a fairly detailed stochastic model for computer system failure. Their model is based on four distinct causes of crashes and their interrelationships. Their ultimate result is a formula giving the probability density for the event that the system crashes due to a failure. For our availability analysis, however, there seems to be little need to include this level of detail; we are simply concerned with failure rate - a measurable quantity.

Modeling backup and recovery strategies. The discussion above has been limited to availability questions involving network and site reliabilities. On a lower level, the data base itself may "crash" or may acquire errors. It is important that strategies for returning a data base to its correct state be devised and studied. A recent paper [Chandy et al., 1975] provides models for rollback and recovery strategies. These strategies run as follows. At certain points in time (checkpoints), a copy of the data is made and stored. A listing of subsequent data updates (i.e., an audit trail) is then kept. When the master data base fails, it may then be recovered by beginning with the old copy from the checkpoint and using the audit trail to bring it up to date. Chandy et al. use queueing theory to model the processing of the audit trail. From the expected time to complete this process, they can compute the total recovery time. The length of the audit trail, and hence the time to recover, is a function of the time interval between checkpoints.
Optimization of availability with respect to intercheckpoint time can then be carried out. Models of some complexity are developed which take into consideration the possibility of errors during recovery and the possibility of a transaction arrival rate which varies in a cyclic manner (as opposed to being constant). The results appear to be very useful for developing insights into recovery strategies, particularly for single-site systems. In a network environment, however, it may be reasonable to assume that the backup copy is stored remotely. In this case it does not make sense to assume that the data is always restored from the backup, because of the long time required to transfer a data base through the network. The strategy then is to transfer the queries to the available copy.

The present work. In this note we attempt to quantify the improvement in data base availability which can be achieved by storing a backup copy at one (or more) remote sites in a network and transferring usage to the backup when the master fails. We also discuss the practicality of certain alternative management strategies.

To simplify the analysis, we will not consider various possible causes of data base failure, but will assume that the data is available when the host computer is running and is available (if remote) by way of the network. We will therefore not be considering a detailed analysis of the type of Borgerson and Freitas, nor will we be concerned with network reliability modeling. Host failures are so much more common than communications link failures that the latter can be neglected in our simple model. Furthermore, we will not take into account scheduled down time of the host computer, on the assumption that if down time is scheduled, transfer to a backup copy is automatic and immediate, and leads to no loss in availability.
The very existence of a backup copy at an alternate network site will of course improve availability considerably over the case where only one site has a copy. Indeed, Chu's model (or a simple modification of it) can be used to determine the improvement in availability due to multiple copies when all copies are equally usable. Since some readers may find this question of interest, we have included a discussion of it in appendix 1.

The Model

Overview. The process we are modeling may be described as follows. Several sites in a network possess copies of a data base. One of these copies is designated as the master copy. The others are referred to as spares or backups. All queries for the data base are sent to the master site (i.e., the site holding the master copy). Updates are applied to the master copy as soon as possible after they are generated, so that the master copy is kept up to date. Two basic strategies for updating the spares are encompassed by our model:

1) Running spares. Spares are updated almost as rapidly as the master.
2) Remote journaling. Up-to-date copies of the data base are periodically sent to the backup sites for storage. In between this periodic journaling, updates are logged in an audit trail for application to a spare if and when one is needed.

Occasionally the master copy becomes unavailable. We assume this is caused by a failure of the host possessing the master copy and not by, say, communication line failure. When the master site fails, some sort of communication among sites takes place to determine which of the spares should take over the responsibility of being the new master. The length of the time interval from when the old master fails to when the new master is decided upon is assumed to be a fixed constant. Once a new master site has been selected, the spare copy at that site must be readied to receive queries.
This process of getting the backup ready may involve time-consuming operations such as loading the data from tape and processing the audit trail of updates which have not yet been applied to the backup copy. How close to "ready" the backups should be kept is another strategy question which may be studied by our model.

As soon as the old master fails, the process of repairing it begins. After the host has been repaired, the data base itself must be readied. A backlog of updates has been accumulating while the master was being repaired, and these updates must be applied to the data. Thus, after a certain time lapse, the old master (i.e., the primary master) is again ready to receive queries. The question is, should we immediately reinstate the old master, or should we continue to send queries to the new master until it fails? With our model we can study the impact on availability of how we answer this question. There may, however, be other issues involved. For example, most of the queries to the data base may originate at the primary master site. In this case there are cost and/or response advantages to be gained by transferring usage back to the primary site as soon as possible.

For simplicity, we have described the process we are modeling in fairly specific terms. It should be noted, however, that little change is needed to model other, similar processes. For example, the backup and master copies may be located at the same site, and the failures we are concerned with may be the crashing (with accompanying data destruction) of the device holding the master copy. In this case there are no network messages to transfer usage to a remote site, nor need we worry about repairing the host. But the need to get the backup ready by loading the copy and then bringing it up to date remains the same. Only trivial changes in the availability formulas we have derived will allow us to study this sort of closely related process.

Parameters.
The parameters in our model are as follows:

F = mean time between computer failures, assumed to be the same for all host computers.
X = expected time to repair computer.
L = expected time to load the data base copy at the remote site.
Y = time that the audit trail of updates has been growing (i.e., time since the copy was correct).
k = the ratio of update arrival rate to update processing rate.*
D = time delay between when the master fails and when the remote site determines this fact and starts to get its copy ready for use.

* The parameter k is referred to in the literature as a "compression factor" [Chandy et al., 1975]. This is not to be confused with the usual data compression factor which indicates by how much data is compressed for storage or transfer.

Single-copy availability. First, consider the case where there is a single copy of the data base. The availability of this copy is then

$$A_0 = \frac{F}{F + X + kX}.$$

This is the usual formula for availability (mean time between failures divided by mean time between failures plus mean time to recover). The mean time to recover includes repair time X plus the time kX to process the updates accumulated while repairs were made. (This formula for recovery time is that used by Chandy et al. [1975].) There is a question as to whether the term kX should be included here, since the site is technically "up" after time X. But in a network setting, it does seem appropriate to assume that updates initiated at remote sites are being logged somewhere, so that there does exist an update list to be processed. In addition, we are interested primarily in comparing $A_0$ with availabilities computed for multi-copy strategies, where the copies are assumed to be up to date.

Discussion of the parameter k. The rationale for using the formula kY for the time to process an audit trail that has been accumulating for a time period of length Y is as follows. Suppose u is the rate of arrival of updates, and b is the rate of processing them.
Then during time Y a total of uY updates have accumulated and it takes time uY/b = kY to process these. (We have defined k = u/b.) However, the system cannot really be said to be caught up after this much time, since more updates were accumulating while the backlog was being processed. Let us define T as the catch-up time, or time for the system to catch up after a backlog of updates has accumulated. The determination of an appropriate expression for T turns out to be a nontrivial problem. This problem is examined in detail in appendix 3. We find there that for a reasonable range of values of k, 2kY may be a more appropriate expression for T than is kY. In the remainder of this note, however, we will consider k as an effective proportionality constant, defined by the assumption that kY is the time to catch up after updates have been accumulating for time Y. The reader should keep in mind that k is then not equal to u/b but is somewhat larger, perhaps by as much as a factor of 2 or more.

It is, of course, possible for a site to obtain an effective k by measurement. A value of T can be measured as the length of time between the moment when processing of the update backlog begins and the moment when the update queue is first noted to be empty. An average over several observations of T/Y should yield an acceptable value for the effective k.

Availabilities for two backup strategies. We shall consider two strategies for transferring usage back and forth between master copy and backup copy. Strategy 1 runs as follows. After the master copy is determined to have failed, the remote copy is brought up (after a time lapse of D + L + kY) and usage is transferred to it. Meanwhile the old master is being repaired. Queries and updates are sent to the new master, however, until it fails, at which time the process repeats: another "new" master is identified and activated. (This may or may not be the "old" master.)
Since the remote site may have been up for some time since its last failure, one might think that, after the new master site is identified, time until failure is only F/2. This is only true, however, if the time between failures is always precisely F. If, as we are assuming, failures form a Poisson process (i.e., occur randomly) it may be shown that the expected time until failure is not F/2 but F. (See, for example, [Kleinrock, 1975, pp. 169-174].) This result, known in renewal theory as the "paradox of residual life", may be explained intuitively as occurring because the old master has a higher probability of failing during a relatively long inter-failure period at the new site.

Strategy 1 is diagrammed in figure 1. Looking at the diagram and ignoring the initial time period, one can see that the fraction of time some copy of the data base is available is

$$A_1 = \frac{F - L - kY}{F + D}.$$

The quantity $A_1$ is then the data base availability under strategy 1. Notice also that an obvious built-in assumption can be read from the figure:

$$D + L + kY < X + kX \tag{1}$$

If this inequality is not satisfied, it theoretically does not pay to store a remote copy, since the master is expected to be repaired and updated before the remote copy can be activated.

[Figure 1. Diagram of strategy 1. Timeline labels: MASTER UP; X + kX; D + L + kY; COPY (NEW MASTER) UP; TIME.]

[Figure 2. Diagram of strategy 2. Timeline labels: MASTER UP; X + kX; D + L + kY; COPY UP; MASTER UP; TIME.]

Strategy 2 is to immediately replace the copy by the old master as soon as the latter has been brought back up. This scheme is diagrammed in figure 2. Again, inequality (1) must hold in order for the diagram to be meaningful, and the availability formula can be read from the diagram:

$$A_2 = 1 - \frac{D + L + kY}{F + X + kX}.$$

By looking at the ratio $A_2/A_1$, one can easily show that as long as D is small compared to the other parameters (a realistic assumption) $A_2$ is always greater than $A_1$.
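The three availability expressions discussed so far (single copy, strategy 1, and strategy 2) can be compared numerically with a short Python sketch. The function names are hypothetical, and the sample values (F = 20, X = 1, L = 0.5, D = 0.01, Y = 1, all in hours, with k = 0.01) are borrowed from the experiments later in this note purely for illustration.

```python
def single_copy_availability(F: float, X: float, k: float) -> float:
    """A_0 = F / (F + X + kX): one copy, recovery takes X + kX."""
    return F / (F + X + k * X)


def strategy1_availability(F: float, D: float, L: float,
                           k: float, Y: float) -> float:
    """A_1 = (F - L - kY) / (F + D): stay on the new master until it fails."""
    return (F - L - k * Y) / (F + D)


def strategy2_availability(F: float, X: float, D: float, L: float,
                           k: float, Y: float) -> float:
    """A_2 = 1 - (D + L + kY) / (F + X + kX): return to the old master
    as soon as it has been repaired and brought up to date."""
    return 1.0 - (D + L + k * Y) / (F + X + k * X)


def backup_is_useful(X: float, D: float, L: float,
                     k: float, Y: float) -> bool:
    """Inequality (1): the backup must be ready before the master is."""
    return D + L + k * Y < X + k * X


if __name__ == "__main__":
    F, X, D, L, k, Y = 20.0, 1.0, 0.01, 0.5, 0.01, 1.0
    print("inequality (1) holds:", backup_is_useful(X, D, L, k, Y))
    print("A0 =", round(single_copy_availability(F, X, k), 4))
    print("A1 =", round(strategy1_availability(F, D, L, k, Y), 4))
    print("A2 =", round(strategy2_availability(F, X, D, L, k, Y), 4))
```

For these sample values the ordering comes out as $A_0 < A_1 < A_2$, in line with the comparison of the two strategies made in the text.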
Strategy 2 is therefore the better strategy, as one might intuitively infer from comparison of figures 1 and 2. In the following sections we will therefore restrict consideration to strategy 2.

There are two additional assumptions which must be made in order for our model of either strategy to be valid. One assumption is that D + L + kY is sufficiently small compared to F that there is little likelihood of a failure of the remote host during the recovery process. In addition, we assume that there is a negligible probability that the copy may fail before the master is again ready. If either of these assumptions is false, availability will generally be less than what we compute from our model. A probabilistic analysis of these assumptions is contained in appendix 2.

Notice that strategy 2 is a two-copy strategy. Transfer of usage back and forth between the primary site and a single backup is specifically modeled. In strategy 1, however, after the backup fails, usage may be transferred to a third copy instead of to the copy at the primary site. Thus in this strategy, even if the new master is likely to fail before the old one is again ready, the model is not invalidated as long as a second backup is available.

In the experiments to be discussed in the next section, we have ignored probabilistic considerations. The reader should simply keep in mind that availabilities are always slightly less than we compute there. The quantities that we investigate are:

1) $A_2$, the availability under strategy 2, and
2) I, the improvement in availability due to the existence of a backup copy. That is,

$$I = \frac{A_2 - A_0}{A_0} = \frac{X - D - L + k(X - Y)}{F}.$$

Experiments and Discussion

Remote journaling. In order to model a remote journaling process, we assume that the parameter Y is large; for simplicity we assume that it is equal to F.
Thus we are essentially assuming that, whenever the master comes up after a failure, a copy of the up-to-date data base is shipped off to any remote site which contains a copy of the data base. (Or that the remote data base, having been used as a master copy while the master was down, already possesses an up-to-date copy at this time.)

It is interesting to note that journaling remotely by shipping the data base over the network is not feasible on a regular basis. For example, consider a data base of $4 \times 10^7$ bytes (roughly FORSTAT size). At a network throughput of 15 kilobits per second (faster than normal for the ARPANET), it would take approximately 6 hours to ship a data base of this size. Daily backup by, say, sending tapes by courier would, however, be feasible in many situations.

The data copy at the remote site will be generally assumed to be on tape. The value L = 0.5 hr. has been assumed in the computations since it is approximately the time to read two to three tapes. The parameter D is probably on the order of one or two seconds, but we have taken it to be .01 hr. as an absolute upper bound. X = 1 hr. seems to be a reasonable value for repair time. With these parameters, we get the following formula for improvement I in availability as a function of F and k:

$$I = \frac{A_2 - A_0}{A_0} = \frac{0.49 + k(1 - F)}{F}.$$

It is difficult to estimate what a reasonable value of k should be. In a similar analysis, Chandy et al. [1975] suggest that k should be 0.1 or less. Clearly the value will depend on the usage pattern for the data base; we have already discussed how it may be measured for a real system. However, notice that, with k = 0.1, inequality (1) states that .51 + 0.1F < 1.1. Hence for this large a k the time to process the audit trail is so long that, without taking into account stochastic considerations, the master is able to get ready before the backup copy whenever F > 5.9 hrs. This is an unreasonably low value.
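The shipping-time estimate and the bound on F implied by inequality (1) when Y = F can be reproduced with the following Python sketch. The function names are hypothetical, and the data base size of 4 x 10^7 bytes is an assumption chosen to be consistent with the six-hour figure quoted above.

```python
def transfer_time_hours(size_bytes: float,
                        throughput_bits_per_sec: float) -> float:
    """Time to ship a data base over the network, in hours."""
    seconds = size_bytes * 8.0 / throughput_bits_per_sec
    return seconds / 3600.0


def max_f_for_useful_backup(X: float, D: float, L: float, k: float) -> float:
    """With Y = F, inequality (1) reads D + L + kF < X + kX.

    Solving for F gives the (strict) upper bound returned here; for
    larger F the master is expected to be ready before the backup.
    """
    return (X + k * X - D - L) / k


if __name__ == "__main__":
    # Assumed size 4e7 bytes at 15 kilobits per second.
    print(round(transfer_time_hours(4e7, 15_000), 2), "hours to ship")
    # Parameters from the text: X = 1, D = 0.01, L = 0.5 (hours), k = 0.1.
    print(round(max_f_for_useful_backup(1.0, 0.01, 0.5, 0.1), 2),
          "hours is the largest F for which the backup helps")
```

The second function makes explicit why a large k is so restrictive: the bound on F scales roughly as 1/k, so a tenfold reduction in k relaxes the bound by about a factor of ten.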
Furthermore, we show in appendix 2 that for these values of D, L and k (D = 0.01, L = 0.5, k = 0.1), and for all values of F (with Y = F), there is a better than 10 percent chance that the backup site fails before its copy can be gotten ready. In short, we are unlikely to adopt a remote journaling strategy in these circumstances.

To get a feel for the value of remote journaling in a case when it may be practical, we therefore take k = .01; i.e., we assume that there are few updates. Inequality (1) then restricts the model to F < 50. A graph of I vs. F in this case may be seen in figure 3. Values of $A_0$ have also been plotted in the figure for reference. Notice that for reasonable values of F the improvement in availability is less than 5 percent. If $A_0$ is low, this may not be enough to make remote journaling worthwhile. Throughout most of the range of F values, however, $A_0$ is very close to 1. One then cannot look at the improvement I independently of the associated value of $A_0$, since a small I may lead to a sizable decrease in the total time the data base will be unavailable. For example, consider the situation when F = 20 hours. I is only .015, but $A_0$ is .9519, which means that $A_2$ is .9662. Thus, the fraction of the time that the data base is unavailable decreases from 0.048 to 0.034. This translates into a nonnegligible decrease in downtime from 35 hrs./month to about 24 hrs./month.

[Figure 3. Single-site availability $A_0$ and fractional improvement I through use of strategy 2, plotted against F (hrs.). Parameters are k = 0.01, D = 0.01 hr., X = 1 hr., L = 0.5 hr., and Y = F.]

As a final comment on the remote journaling strategy described here, we note that availability may actually decrease as F increases. For example, suppose X = 2, k = 0.25, L = 0.5 and D = 0. Then $A_2$ = .7692 for F = 4 and $A_2$ = .7647 when F = 6. Differentiating $A_2$ (for Y = F) with respect to F shows that this decrease will occur whenever k(k + 1)X > D + L.
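The worked figures above for remote journaling (Y = F) can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and a 720-hour (30-day) month is assumed when converting unavailability into downtime.

```python
HOURS_PER_MONTH = 720.0  # assumption: 30-day month


def a0(F: float, X: float, k: float) -> float:
    """Single-copy availability F / (F + (1 + k) X)."""
    return F / (F + (1.0 + k) * X)


def a2_remote_journaling(F: float, X: float, D: float,
                         L: float, k: float) -> float:
    """Strategy-2 availability with the audit trail as old as F (Y = F)."""
    return 1.0 - (D + L + k * F) / (F + (1.0 + k) * X)


def monthly_downtime_hours(availability: float) -> float:
    """Expected hours per month during which no copy is available."""
    return (1.0 - availability) * HOURS_PER_MONTH


def availability_falls_with_f(X: float, D: float, L: float, k: float) -> bool:
    """Sign of dA_2/dF for Y = F: negative when k(k + 1)X > D + L."""
    return k * (k + 1.0) * X > D + L


if __name__ == "__main__":
    # Figure 3 parameters at F = 20 hours.
    single = a0(20.0, 1.0, 0.01)
    backed = a2_remote_journaling(20.0, 1.0, 0.01, 0.5, 0.01)
    print(round(single, 4), round(backed, 4))
    print(round(monthly_downtime_hours(single), 1),
          round(monthly_downtime_hours(backed), 1))
    # Example where availability decreases as F grows.
    for F in (4.0, 6.0):
        print(F, round(a2_remote_journaling(F, 2.0, 0.0, 0.5, 0.25), 4))
    print(availability_falls_with_f(2.0, 0.0, 0.5, 0.25))
```

The last function states the derivative condition directly, so a parameter set can be screened for this behavior without tabulating $A_2$ over a range of F.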
Intuitively, this decrease occurs because for large k the effect of the lengthening audit trail to be processed outweighs that of the increasing reliability of the host computer.

Frequently updated remote journal. Clearly, there may be problems with the remote journaling strategy described in the last section because of the need to process an extremely long audit trail. Suppose, then, that we drop the assumption that Y = F and assume instead that the remote copy is periodically brought up to date. As an example, we might assume this updating to take place every two hours. Thus on the average the audit trail has been growing for 1 hour when the remote copy is activated. With all other parameters as specified for figure 3, but with Y = 1, I = .49/F. This result is independent of k (because of the cancelling of the kX and kY terms), as long as k and F are such that the model is valid. The improvement is little different from what it was in the Y = F case. However, in appendix 2 we show that by decreasing Y we can considerably reduce the likelihood that the backup site fails before its copy is readied. Hence true availability will improve more than our simple formula indicates.

We do not include a graph of the result for Y = 1, since it would be almost identical to figure 3. To see why this should be so, consider more closely the formula for I:

$$I = \frac{A_2 - A_0}{A_0} = \frac{X + kX - D - L - kY}{F}.$$

As long as k is small (or when X = Y as above) it is clear that $I \approx (X - L)/F$.

Running spares. Here we assume that the backup copy is stored on disk for virtually instantaneous access and is kept almost up to date. Reasonable parameters for this case might be L = 0, Y = .1 hr., and (for comparison with the results above) X = 1 hr., k = .01. Then we have

$$I = \frac{0.999}{F}.$$

We will not bother to graph this; this curve is again similar to that in figure 3, only now the values of I are approximately doubled.
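The improvement formula can be evaluated side by side for the three cases considered so far. The Python sketch below uses hypothetical names and fixes F = 20 hours as an illustrative value; the remaining parameters are those quoted in the text (X = 1, D = 0.01, k = 0.01).

```python
def improvement(F: float, X: float, D: float, L: float,
                k: float, Y: float) -> float:
    """Fractional improvement I = (A_2 - A_0) / A_0
    = (X + kX - D - L - kY) / F."""
    return (X + k * X - D - L - k * Y) / F


def scenarios(F: float) -> dict:
    """Improvement I for the three backup cases at a given F (hours)."""
    common = dict(F=F, X=1.0, D=0.01, k=0.01)
    return {
        "remote journaling (Y = F, L = 0.5)":
            improvement(L=0.5, Y=F, **common),
        "frequently updated journal (Y = 1, L = 0.5)":
            improvement(L=0.5, Y=1.0, **common),
        "running spares (Y = 0.1, L = 0)":
            improvement(L=0.0, Y=0.1, **common),
    }


if __name__ == "__main__":
    for name, value in scenarios(20.0).items():
        print(f"{name}: I = {value:.4f}")
```

Comparing the second and third entries shows the approximate doubling noted above, and that it comes almost entirely from setting L to zero rather than from the smaller Y.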
With running spares, improvements of 5 to 10 percent are seen for F between 10 and 20 - certainly enough to make the strategy worthwhile. In fact, what happens in this case is that, under our assumptions, availabilities are brought up to very nearly unity. To see this, note that

\[ A_2 = 1 - \frac{0.01 + kY}{F + (1 + k)X} \]

and for our example kY = 0.001. Increasing k will cause somewhat smaller values of $A_2$, but $A_2$ will be over 99 percent for a wide range of reasonable parameter values.

Effect of varying Y. We have looked at three separate cases which differ from one another in large part in the widely differing values for the parameter Y. To better understand the effect of this parameter, we select typical values of the other parameters (X = 1, L = 0.5, D = 0.01, F = 20) and consider $A_2$ as a function of Y for several different values of k. When k = .01, we have

\[ A_2 = 1 - \frac{0.51 + 0.01Y}{21.01}. \]

The small coefficient of Y in this case makes the effect of Y minimal. As Y ranges between 0 and 20, $A_2$ decreases linearly from 0.976 to 0.966. Now suppose that k is increased to 0.05. In this case as Y goes from 0 to 20, $A_2$ decreases from 0.976 to 0.953. These are not very dramatic changes, although they will (as we noted above) be more impressive when translated into decreases in downtime. To a large extent, therefore, what makes the "running-spares" approach particularly worthwhile is not the small value of Y but the instantaneous access ($L \approx 0$).

Conclusions

We have presented here a model for data availability which, while superficial, does seem to reflect the realities of various strategies for backup. We have seen that remote journaling, in the sense of storing a copy in archival storage (e.g. tape) at a remote site, leads to availability improvement of at best 5 percent, which may be inadequate if single-copy availability is low.
On the other hand, the running spares strategy, in which the remote copy is nearly up to date and almost immediately accessible, brings availability up to over 99 percent and appears to be worthwhile. It should be noted, however, that the running spares strategy is bound to be relatively expensive. Furthermore, before this strategy can be effectively used, many of the problems of multi-copy management must be solved. For example, updating must be synchronized in order to maintain consistency between the master and backup copies.

One final point should be made. In a sense, the gross availability of a data base is too vague a statistic. Suppose the availability is, say, 23/24. This might mean that approximately every 12 hours the data becomes unavailable for about a half hour. Or it might mean that once a month the data base disappears for more than a day. In a crisis, a half-hour delay in obtaining data might be tolerable but a one-day delay would not. The availability, then, must be looked at in conjunction with F, the mean time between failures.

References

Abramowitz, M. and Stegun, I.A. (editors) 1964 Handbook of Mathematical Functions, National Bureau of Standards Applied Math. Series No. 55.

Barlow, R.E. and Proschan, F. 1965 Mathematical Theory of Reliability, Wiley-Interscience.

Borgerson, B.R. and Freitas, R.F. 1975 "A Reliability Model for Gracefully Degrading and Standby-Sparing Systems," IEEE Trans. Computers, C-24, pp. 517-525.

Chandy, K.M.; Browne, J.C.; Dissly, C.W.; and Uhrig, W.R. 1975 "Analytic Models for Rollback and Recovery Strategies in Data Base Systems," IEEE Trans. Software Engineering, SE-1, pp. 100-110.

Chu, W.W. 1973 "Optimal File Allocation in a Computer Network," in Computer-Communications Networks, N. Abramson and F. Kuo, eds., Prentice-Hall, pp. 82-94.

Frank, H.; Kahn, R.E.; and Kleinrock, L. 1972 "Computer Communication Network Design - Experience with Theory and Practice," Proc.
AFIPS National Computer Conference, AFIPS Press, Montvale, N.J., pp. 255-270.

Kleinrock, L. 1975 Queueing Systems. Vol. 1: Theory, Wiley-Interscience.

Reynolds, C.H., and Van Kinsbergen, J.E. 1975 "Tracking Reliability and Availability," Datamation, November 1975, pp. 106-116.

Appendix 1
Extension of Chu's Formula

In this appendix, we take up the question posed earlier as to data base availability when no time is lost in transferring usage (e.g., when downtime is scheduled at one site and transfer of usage to another site is prearranged). As we remarked earlier, a good way to study this problem would be through an extension of Chu's model [Chu, 1973]. Suppose n sites (all remote) have a copy of the data base, and that the availability of the ith site is $a_i$. (In general, this availability can be computed as the product of the availability of the ith host system times the availability of a communication link from the local site to site i.) Then the probability that site i is not available is $(1 - a_i)$, and the probability that none of the n sites is available is

\[ U = (1 - a_1)(1 - a_2)(1 - a_3) \cdots (1 - a_n). \]

Hence the probability that at least one site is available is given by

\[ A_s = 1 - U. \]

(Unlike Chu, we assume that we have no problem getting access to the network and so do not include Chu's factor for local host availability.) To see how this gross availability is increased by the existence of multiple copies, consider the following examples.

1. Suppose all of the $a_i$'s are equal to 0.8. Then for n = 1, $A_s$ = 0.8; but for n = 2, $A_s$ = 0.96; and for n = 3, $A_s$ = 0.992.

2. Suppose that the data base is at site 1 where its availability is only 0.5. By placing a copy at a second site with availability 0.7, the overall availability becomes $A_s$ = 0.85.

Finally, notice that if a copy of the data base is held locally, the formulas for $A_s$ and U need not be changed at all.
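The product formula for $A_s$ is easy to evaluate directly. The sketch below assumes, as the product form itself implies, that the sites fail independently of one another; the function name is illustrative.

```python
from math import prod


def gross_availability(site_availabilities):
    """A_s = 1 - U, with U the product of (1 - a_i) over all sites holding a copy."""
    unavailable = prod(1.0 - a for a in site_availabilities)
    return 1.0 - unavailable


if __name__ == "__main__":
    # Example 1: one, two and three copies at sites of availability 0.8.
    for n in (1, 2, 3):
        print(n, round(gross_availability([0.8] * n), 3))
    # Example 2: a site of availability 0.5 backed by one of availability 0.7.
    print(round(gross_availability([0.5, 0.7]), 2))
```

Each additional copy multiplies the unavailability U by another factor less than one, which is why the gain from a third copy is much smaller than the gain from a second.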
If we label the local site 1, then the value to be used for $a_1$ is the availability of the data base through the local system. The fact that $a_1$ does not involve network reliability as do the other $a_i$'s means that $a_1$ may be slightly larger, but otherwise the formulation is unaffected.

Appendix 2
Stochastic Considerations

Basic assumptions. In the text of this paper we have been working exclusively with mean (or expected) values of parameters such as the time between site failures. As we indicated, however, site failure is a random process; not all questions can be answered by looking just at the mean time between failures. In particular, we have noted that our simplistic approach will predict too high an availability if, for example, hosts are likely to fail while the data base is being readied. In this appendix, then, we deal with some of these probabilistic questions in order to get a better understanding of the validity of the results we have computed.

First, recall that we are assuming the occurrence of failure to be a Poisson process. Essentially, this means that we assume that the probability that a failure occurs in any short time interval from $t_1$ to $t_2$ is proportional to $t_2 - t_1$. Notice that if the constant of proportionality (which is just the failure rate) is 1/F, then the mean time between failures is F, as we assumed earlier. The basic Poisson hypothesis also implies that the process is memoryless. That is, the probability of a failure in any time interval is independent of whether or when any failures occurred in the past. One can then introduce a random variable Z giving the "time to failure" from an arbitrary starting point t = 0. The probability $P\{Z \le t\}$ that the machine fails before time t (i.e. in the time interval [0,t]) can be shown to be

\[ P\{Z \le t\} = 1 - \exp(-t/F). \tag{A1} \]

Then the probability that the machine has not yet failed at time t is

\[ P\{Z > t\} = \exp(-t/F). \tag{A2} \]
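Formulas (A1) and (A2), together with the memoryless property, can be illustrated numerically. The following is a minimal sketch with illustrative function names; the values of F, s and t are arbitrary choices, not taken from the report.

```python
import math


def prob_fail_by(t, F):
    """(A1): P{Z <= t} = 1 - exp(-t/F) for mean time between failures F."""
    return 1.0 - math.exp(-t / F)


def prob_survive(t, F):
    """(A2): P{Z > t} = exp(-t/F)."""
    return math.exp(-t / F)


def conditional_survival(s, t, F):
    """P{Z > s + t | Z > s}; memorylessness says this equals P{Z > t}."""
    return prob_survive(s + t, F) / prob_survive(s, F)


if __name__ == "__main__":
    F = 24.0  # arbitrary mean time between failures, in hours
    print(prob_fail_by(2.0, F))
    # Having already survived 10 hours does not change the outlook for the next 2.
    print(conditional_survival(10.0, 2.0, F), prob_survive(2.0, F))
```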
These simple formulas are adequate for computing most of the probabilities we are interested in.

Probability that backup fails before master is ready for use. First, consider the following problem. What is the probability $P_f$ that the backup may fail before the master site is again ready? We will first assume that X + kX is a constant. (The case where repair is also treated as a probabilistic process will be discussed below.) $P_f$ is then calculated from equation (A1) with t = X + kX, the time required to repair and update the old master. We find that

\[ P_f = 1 - \exp(-(X + kX)/F). \tag{A3} \]

For example, suppose that X = 2 hours and k = 0.1. Then $P_f = 1 - \exp(-2.2/F)$. Some values of this function are tabulated below. Values of F are given in hours. (Since only ratios are involved, the time units used do not matter as long as one is consistent.)

F (hrs.)    8    12    16    24    32    40    48
P_f        .24   .17   .13   .09   .07   .05   .04

The reader may well be dismayed that even for F (the mean time between failures) as large as 48 hours, there is still a 4% chance that the backup will fail before the old master is ready. This would leave a gap in availability which is not accounted for in our simplistic model. On the other hand, if the expected time to ready the master is considerably smaller - say X = 1 hr., k = 0.1 - then $P_f$ is only 0.09 for F = 12, 0.04 for F = 24, and 0.02 for F = 48. (Notice that if (X + kX)/F is small, $P_f$ is conveniently approximated by $P_f \approx (X + kX)/F$.)

If the $P_f$ computed in any situation is large enough to seriously degrade availability, the solution is to provide a second backup, so that usage may be transferred to it if the first backup fails and the old master is not yet ready. This would, of course, not be worthwhile if the second backup requires so long to get ready that the old master will almost certainly be ready first.
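The tabulated values of $P_f$ follow directly from (A3). The sketch below recomputes them and sets the first-order approximation (X + kX)/F beside the exact value; the function names are illustrative.

```python
import math


def backup_failure_prob(F, X, k):
    """(A3): P_f = 1 - exp(-(X + kX)/F)."""
    return 1.0 - math.exp(-(X + k * X) / F)


def backup_failure_prob_linear(F, X, k):
    """First-order approximation P_f ~ (X + kX)/F, useful when the ratio is small."""
    return (X + k * X) / F


if __name__ == "__main__":
    # X = 2 hours, k = 0.1, as in the table above.
    for F in (8, 12, 16, 24, 32, 40, 48):
        exact = backup_failure_prob(F, 2.0, 0.1)
        approx = backup_failure_prob_linear(F, 2.0, 0.1)
        print(F, round(exact, 2), round(approx, 2))
```

The linear form always overestimates slightly, since 1 - exp(-x) < x for x > 0, so it errs on the cautious side.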
We have investigated how these conclusions are affected by making the more realistic assumption that repair time is not a constant but also obeys some probability distribution. With this assumption, the probability $P_f$ can be shown to be given by

\[ P_f = 1 - \int_{t=0}^{\infty} \exp[-(1 + k)t/F]\, W(t)\, dt, \tag{A4} \]

where W(t) is the assumed probability density function for repair time. The only real difficulty in making this assumption is the apparent lack of the raw data needed before one can choose (and statistically validate) a W. From personal reports and from one study in the literature [Reynolds and Van Kinsbergen, 1975], we have put together the following general description of how repair time is distributed, at least for some systems.

1. The probability of repair within 15 minutes is essentially negligible.

2. The probability of repair within a half hour is something like 0.3 to 0.4.

3. In the vicinity of t = 0.5 hr. the probability density curve rises sharply to its peak, so that the likelihood of repair within 45 minutes is between 0.8 and 0.9.

Notice that it should be relatively simple to obtain a good description of this sort for any particular system. All that is needed is a log of repair times.

There are two known probability distributions which have the right sort of shape to fit our general description of repair time. These are the Beta distribution [Abramowitz and Stegun, 1964; p. 930] and the Weibull distribution [Barlow and Proschan, 1965; ch. 2]. Both of these have two parameters - $(\alpha, \beta)$ in standard notation - which can be used to adjust their precise shape. They differ in that the Beta distribution has W(t) = 0 for $t \ge 1$, while for the Weibull distribution W(t) approaches zero exponentially as $t \to \infty$. Graphs of three such density functions which seem to describe repair time well are shown in figures 4 and 5. Figure 4 shows the density function for the Beta distribution with $\alpha = 7$, $\beta = 5$.
Figure 5 shows two Weibull distributions; the solid curve corresponds to $(\alpha, \beta) = (6, 4)$ and the dashed curve to $(\alpha, \beta) = (4, 3)$. Note that the scale on the horizontal axis can be adjusted to fit longer (or shorter) expected repair times; i.e., "1" can be assumed an arbitrary time unit.

$P_f$ was computed from equation (A4) for 80 different combinations of distribution type (Beta or Weibull) and values of the parameters $\alpha$, $\beta$, F, and k. We observed that in no case did the value calculated differ by more than $6 \times 10^{-4}$ from that calculated (using the appropriate mean value X) from equation (A3). Since this discrepancy is of roughly the same magnitude as the truncation error in numerically computing the integral in (A4), we actually did not discover any difference between results computed from the two formulas. We conclude, therefore, that it is probably valid for all practical purposes to ignore the distribution of repair times and simply use the mean repair time X in equation (A3) to compute $P_f$.

[Figure 4. Beta density.]

[Figure 5. Weibull densities.]

Probability that backup site fails before its copy is ready. Next, what is the probability $P_r$ that the backup site fails even before the backup copy can be gotten ready? This probability is given by

\[ P_r = 1 - \exp(-(D + L + kY)/F). \]

(Again we assume that D + L + kY is a constant.) In our analysis of the remote journaling strategy, we assumed that Y = F, the mean time between failures. Let us also assume the nominal values L = 0.5 hr., D = 0.01 hr. and k = 0.1. With these values, $P_r = 1 - \exp(-(0.51 + 0.1F)/F)$. Sample values are tabulated below.

F (hrs.)   12    16    24    36    48
P_r       .13   .12   .11   .11   .10

Notice that as F becomes large, $P_r$ approaches $1 - \exp(-k)$, which is approximately k. Unless k is very small, $P_r$ is certainly not negligible. And the effect of a failure before the copy is readied could be serious.
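The behavior of $P_r$ under the assumption Y = F can be reproduced with a few lines. This is a sketch with an illustrative function name; it recomputes the sample values for L = 0.5 hr., D = 0.01 hr. and k = 0.1 and prints the large-F limit for comparison.

```python
import math


def copy_not_ready_failure_prob(F, D, L, k, Y):
    """P_r = 1 - exp(-(D + L + kY)/F)."""
    return 1.0 - math.exp(-(D + L + k * Y) / F)


if __name__ == "__main__":
    D, L, k = 0.01, 0.5, 0.1
    # Remote journaling with no updating between failures, i.e. Y = F.
    for F in (12, 16, 24, 36, 48):
        print(F, round(copy_not_ready_failure_prob(F, D, L, k, Y=F), 2))
    # With Y = F the exponent tends to -k as F grows, so P_r tends to 1 - exp(-k).
    print(round(1.0 - math.exp(-k), 3))
```

Because the kY term grows in step with F when Y = F, a more reliable host does not push $P_r$ toward zero; only a smaller k or a smaller Y does.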
Again, the existence of a second backup would help, since it will seldom happen that both backup sites will fail before their copies can be readied.

In the discussion contained in the body of this report, we in fact concluded that for values of k as large as 0.1, a remote journaling strategy with no journaling taking place between failures is not worthwhile. We looked at the possibility that updating the remote journal more frequently might produce a more practical strategy. As the time, Y, that updates have been accumulating decreases, $P_r$ will also decrease. Suppose that F = 24 hrs., L = 0.5 hr., D = 0.01 hr. and k = 0.1. Then $P_r$, as a function of Y, behaves as follows:

Y (hrs.)    1      2      4      8      12     16     24
P_r        .025   .029   .037   .053   .069   .084   .114

Thus when Y is small compared to F, the likelihood that the backup site fails before its copy can be readied is probably within acceptable limits.

Finally, consider the running spares strategy. In that case we assumed L = 0 and Y = 0.1. If we again take k = 0.1, we find that $P_r = 1 - \exp(-0.02/F)$. Here the probability that the backup site fails before the copy is ready is less than 0.2 percent as long as F is greater than ten hours. This small a figure will have a practically negligible effect on calculated availabilities.

Appendix 3
Time to Process the Audit Trail

Elementary analysis. In this appendix we consider the question of how long it really takes to "catch up" when a backlog of updates has accumulated during the time a site is down. In the text, we have used Chandy's expression kY [Chandy et al., 1975], where Y is the length of time the updates have been accumulating, and k = u/b, u being the rate of arrival of updates and b being the average rate at which updates are processed. (We assume that k < 1.) However, it is clear that during the time interval kY more updates are accumulating, and it takes an additional time $k^2 Y$ to process these.
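The successive correction terms can be followed pass by pass: each pass processes the updates that arrived during the previous one. The sketch below simply accumulates those passes numerically; the function name, the tolerance and the pass limit are illustrative choices, not part of the report.

```python
def catch_up_time(Y, k, tol=1e-12, max_passes=10_000):
    """Sum the successive backlog-processing passes kY, k^2 Y, k^3 Y, ...

    Assumes 0 <= k < 1 so that each pass is shorter than the one before.
    """
    if not 0.0 <= k < 1.0:
        raise ValueError("k must satisfy 0 <= k < 1")
    total = 0.0
    backlog_time = Y  # time over which unprocessed updates accumulated
    for _ in range(max_passes):
        pass_time = k * backlog_time  # time needed to process that backlog
        total += pass_time
        if pass_time < tol:
            break
        backlog_time = pass_time  # new updates arrived during this pass
    return total


if __name__ == "__main__":
    # Y = 24 hours of accumulated updates, k = 0.1.
    print(0.1 * 24.0)                 # first estimate kY
    print(catch_up_time(24.0, 0.1))   # with all correction passes included
```

For k of order 0.1 the later passes add only about ten percent to the first estimate kY.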
Continuing to add on these correction terms, we generate the infinite series

\[ (k + k^2 + k^3 + \cdots)Y \]

as a better formula for the catch-up time $T_c$. Computing the sum, we get

\[ T_c = kY/(1 - k). \]

This is a slightly larger number than the first estimate kY, but the difference is a small percentage for the expected range of k values. (Chandy et al. state that "values of k of order 1/10 or less are to be expected.")

Analysis using queueing theory. Even this analysis, however, appears to be too simplistic. The arrival and processing of updates should really be modeled by a single-server queueing system. If the arrivals form a Poisson process (arrival rate u) and processing time is exponentially distributed (with mean processing rate b), the queueing system is one which can be analyzed. In queueing theory, the quantities of interest are $P_n(t)$, the probability that there are n items (updates, in our case) in the queue at time t. After the system has been running for a while, the probabilities $P_n(t)$ will approach equilibrium (or steady state) values $p_n$. Of course, equilibrium is never actually reached if the initial distribution is not the equilibrium one. But it does make sense to describe the catch-up time $T_c$ as the time to reach approximate equilibrium. Fortunately, both the time-dependent and the equilibrium distributions that we need are available in the literature [Kleinrock, 1975], so that we have at hand the information we need to investigate the approach to equilibrium. The equilibrium distribution is given by

\[ p_n = (1 - k)k^n. \]

The time-dependent distribution we are interested in corresponds to having some number i of updates in the queue at time t = 0. That is, we have the initial conditions

\[ P_n(0) = 1 \text{ for } n = i; \qquad P_n(0) = 0 \text{ for } n \ne i. \]

Kleinrock [1975; p.
77] gives the solution of this problem as

\[ P_n(t) = e^{-(u+b)t} \Big[ k^{(n-i)/2} I_{n-i}(at) + k^{(n-i-1)/2} I_{n+i+1}(at) + (1-k)k^n \sum_{j=n+i+2}^{\infty} k^{-j/2} I_j(at) \Big], \]

where $a = 2uk^{-1/2} = 2(ub)^{1/2}$, and $I_j$ (standard notation) is the modified Bessel function of the first kind of order j. In order to study the approach to equilibrium, the following formulas for Bessel functions are needed:

1) As $z \to \infty$, $I_j(z) \sim \{e^z/(2\pi z)^{1/2}\}\{1 - (4j^2 - 1)/(8z) + \cdots\}$.

2) $\sum_{j=-\infty}^{\infty} k^{j/2} I_j(z) = \exp[(z/2)(k^{1/2} + k^{-1/2})]$.

3) $I_{-j}(z) = I_j(z)$.

4) $e^z = I_0(z) + 2I_1(z) + 2I_2(z) + \cdots$

[Abramowitz and Stegun, 1964; p. 374 ff.] Using 2) and 3), we find that

\[ \sum_{j=-\infty}^{\infty} k^{-j/2} I_j(at) = \exp(t(u + b)). \]

Noting that the infinite summation in the expression for $P_n(t)$ contains only a portion of these terms, we use 4) to show that the summation over the negative powers is negligible, and use 1) to estimate the finite number of missing positive powers. That is, for large t we make the approximation

\[ \sum_{j=n+i+2}^{\infty} k^{-j/2} I_j(at) \approx \exp(t(u + b)) - \frac{\exp(at)}{(2\pi at)^{1/2}} \sum_{j=0}^{n+i+1} k^{-j/2} \Big\{ 1 - \frac{4j^2 - 1}{8at} \Big\}. \]

(The $j^2$ term in the asymptotic expression is needed since unfortunately the constant terms cancel in the ultimate expression for $P_0(t) - p_0$.) Furthermore, we notice that, when we substitute the asymptotic formulas for $I_j(at)$ into $P_n(t)$, we obtain products of exponentials $\exp(-(u + b)t)\exp(at)$ which simplify to $\exp(-(u^{1/2} - b^{1/2})^2 t)$. After considerable algebraic manipulation, we obtain an approximate expression for $P_0(t) - p_0$ whose numerator is

\[ \exp(-(u^{1/2} - b^{1/2})^2 t)\; i k^{-i/2} \]

and whose denominator contains the factors $2\pi^{1/2}$, a power of ub, $t^{3/2}$, and a power of $(1 - k^{-1/2})$.

Recall that we have said that the catch-up time $T_c$ should be the time at which the $P_n(t)$, and in particular $P_0(t)$, are "close" to their equilibrium values. This would mean that the right side of the above expression should be small. However, notice that the exponential factor decreases rapidly with t, while the factor $ik^{-i/2}$ is very large if the queue is long at time t = 0.
To a good approximation, then, we can assume that equilibrium is reached when these factors approximately cancel; that is, when

\[ \exp((u^{1/2} - b^{1/2})^2 t) \approx i k^{-i/2}. \]

Taking logarithms, we obtain the following formula for $T_c$:

\[ T_c = \frac{-i \ln k + 2 \ln i}{2(u^{1/2} - b^{1/2})^2}. \]

If the term $2 \ln i$ is neglected, this expression simplifies in an interesting way. Suppose updates have been accumulating for a time period of length Y. Then i = uY, and

\[ T_c = \frac{-uY \ln k}{2b(k^{1/2} - 1)^2} = kY \Big\{ \frac{-\ln k}{2(1 - k^{1/2})^2} \Big\}. \]

The quantity in brackets has the curious property of lying very close to 2.5 for k between 0.04 and 0.15. (For k = 0.01 it is still 2.84, although it grows rapidly thereafter as k decreases.) Notice also that adding on the term $2 \ln i$ will serve to increase the effective $T_c$. On the other hand, taking account of the terms in the denominator of the approximate expression for $P_0(t) - p_0$ (in particular the factor $t^{3/2}$) will serve to decrease it. It seems not unreasonable, then, to claim as we did early in the paper, that $T_c \approx 2kY$.

It should be emphasized that the queueing theory analysis above depends strongly upon the assumption that updates are arriving randomly - i.e., that the arrivals form a Poisson process. If arrivals are instead bunched up during certain time periods, results may be quite different. For example, if a number of updates uY have accumulated and must be processed during a time when no new updates are arriving, then clearly uY/b = kY will be the correct expression for $T_c$. On the other hand, if the backlog of updates must be processed during a time period when a particularly large number of new updates are being entered, then $T_c$ will be greater than the queueing analysis has indicated.

Appendix 4
Sensitivity to Parameter Values

In any model, it is useful to determine how sensitive the output values are to changes in the inputs. Obviously, the inputs are only known approximately or are statistical averages.
If the output changes drastically for a small change in an input value, the model is rather useless for predictive or decision purposes. Chandy et al. [1975] use the elasticity E(f,y), essentially the "percentage change in f caused by a percentage change in y", to investigate the sensitivity of a function f with respect to a parameter y. Formally, E is defined by

\[ E(f, y) = \frac{\partial f}{\partial y} \cdot \frac{y}{f}. \]

We have investigated the elasticity of $U = 1 - A_2$ with respect to all of the input variables. (Working with U instead of $A_2$ simplifies the algebra without changing the conclusion.) We find that for all parameters $|E(U, y)| < 1$. For example, taking y = k,

\[ \frac{\partial U}{\partial k} = \frac{FY + XY - DX - LX}{(F + X + kX)^2}, \]

and

\[ \Big| \frac{\partial U}{\partial k} \cdot \frac{k}{U} \Big| = \Big| \frac{k(FY + XY - DX - LX)}{(F + X + kX)(D + L + kY)} \Big| = \Big| \frac{kFY + kXY - \cdots}{kFY + kXY + \cdots} \Big| < 1. \]

And for y = Y,

\[ \Big| \frac{\partial U}{\partial Y} \cdot \frac{Y}{U} \Big| = \frac{kY}{F + X + kX} \cdot \frac{F + X + kX}{D + L + kY} = \frac{kY}{D + L + kY} < 1. \]

Similar computations show that the elasticities of U with respect to D, L, X, and F are all less than one. Elasticities of U are connected to those of $A_2$ through

\[ \Big| \frac{\partial A_2}{\partial y} \cdot \frac{y}{A_2} \Big| = \Big| \frac{\partial U}{\partial y} \cdot \frac{y}{A_2} \Big| < \Big| \frac{\partial U}{\partial y} \cdot \frac{y}{U} \Big| \]

as long as $A_2 > U$. We may conclude therefore that our model is stable, being relatively insensitive to small changes in parameter values.

UNCLASSIFIED

REPORT DOCUMENTATION PAGE
1. REPORT NUMBER: CAC Document Number 181; CCTC-WAD Document Number 6501
2. GOVT ACCESSION NO.
3. RECIPIENT'S CATALOG NUMBER
4. TITLE (and Subtitle): Research in Network Data Management and Resource Sharing - The Effect of Backup Strategy on Data Base Availability
5. TYPE OF REPORT & PERIOD COVERED: Research Report
6. PERFORMING ORG. REPORT NUMBER: CAC #181
7. AUTHOR(s): Geneva G. Belford, Paul M. Schwartz, and Suzanne Sluizer
8. CONTRACT OR GRANT NUMBER(s): DCA100-75-C-0021
9. PERFORMING ORGANIZATION NAME AND ADDRESS: Center for Advanced Computation, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
10.
PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS
11. CONTROLLING OFFICE NAME AND ADDRESS: Command and Control Technical Center, WWMCCS ADP Directorate, 11440 Isaac Newton Sq., No., Reston, Va. 22090
12. REPORT DATE: February 1, 1976
13. NUMBER OF PAGES: 44
14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)
15. SECURITY CLASS. (of this report): UNCLASSIFIED
15a. DECLASSIFICATION/DOWNGRADING SCHEDULE
16. DISTRIBUTION STATEMENT (of this Report): Copies may be obtained from the National Technical Information Service, Springfield, Virginia 22151
17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report): No restriction on distribution
18. SUPPLEMENTARY NOTES: None
19. KEY WORDS: information system modeling; data base availability; data base backup; distributed data management
20. ABSTRACT: Formulas are developed for the study of improvements in data base availability due to the existence of a backup copy located at an alternate site in a network. Several backup strategies are compared.
DD FORM 1473, 1 JAN 73 (EDITION OF 1 NOV 65 IS OBSOLETE)
UNCLASSIFIED

BIBLIOGRAPHIC DATA SHEET
1. Report No.: UIUC-CAC-DN-76-181
3. Recipient's Accession No.
4. Title and Subtitle: Research in Network Data Management and Resource Sharing - The Effect of Backup Strategy on Data Base Availability
5. Report Date: February 1, 1976
7. Author(s): Geneva G. Belford, Paul M. Schwartz, and Suzanne Sluizer
8. Performing Organization Rept. No.: CAC #181
9. Performing Organization Name and Address: Center for Advanced Computation, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
10. Project/Task/Work Unit No.
11. Contract/Grant No.: DCA 100-75-C-0021
12. Sponsoring Organization Name and Address: Command and Control Technical Center, WWMCCS ADP Directorate, 11440 Isaac Newton Square, North, Reston, Virginia 22090
13. Type of Report & Period Covered: Research
14.
15. Supplementary Notes
16. Abstracts: Formulas are developed for the study of improvements in data base availability due to the existence of a backup copy located at an alternate site in a network. Several backup strategies are compared.
17. Key Words and Document Analysis. 17a. Descriptors: information system modeling; data base availability; data base backup; distributed data management
17b. Identifiers/Open-Ended Terms
17c. COSATI Field/Group
18. Availability Statement: No restriction on distribution. Available from the National Technical Information Service, Springfield, VA 22151
19. Security Class (This Report): UNCLASSIFIED
20. Security Class (This Page): UNCLASSIFIED
21. No. of Pages: 44
22. Price
FORM NTIS-35 (REV. 3-72)